AI Inference Platform / RAG / Multi-Tenant Systems

CampusRAG: Multi-Tenant AI Inference Platform

A university computing-center prototype for secure RAG-based AI services: tenant-isolated document retrieval, OpenAI-compatible inference APIs, usage accounting, request limits, and Prometheus-style monitoring.


Why it matches RRZE / HPC@FAU

Web UI and API for AI inference workflows
RAG-capable environment with tenant separation
Usage accounting and fair-resource controls
Docker-first architecture, Kubernetes- and Slurm-ready
Monitoring-ready service metrics for operations

3

Tenants

separate documents, vector collections, and usage records

5

Core APIs

/upload, /chat, /tenants, /status, and /metrics

RAG

Retrieval

source-grounded answers from tenant-local knowledge

SSO

Ready Design

planned Keycloak/OIDC integration for institutions

Interactive Prototype

Tenant-Isolated RAG Demo

Select tenant

Isolated documents

No documents uploaded for this tenant yet.

Upload tenant document

Choose a .txt file or paste text manually.

No document uploaded in this session.

Query

RAG answer

Choose a tenant and run a query. The response will come from the CampusRAG API route.

source: not queried yet
tenant: medicine
limit remaining: 10

Architecture

Inference Service Layers

01

Web UI

Next.js chat, upload, tenant dashboard

02

FastAPI

/chat, /upload, /usage, /metrics

03

Tenant Layer

separate docs, vectors, limits, accounting

04

RAG Store

ChromaDB or pgvector collections per tenant

05

Model Gateway

OpenAI-compatible, LiteLLM/vLLM-ready

06

Metrics

Prometheus-ready usage and error counters
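
The Tenant Layer and RAG Store layers above can be illustrated with a small in-memory sketch. It is not the CampusRAG implementation: `TenantStore` is a hypothetical class, the tenant name "law" is invented for the example, and keyword overlap stands in for the embedding search that ChromaDB or pgvector would provide. The point is only that every lookup is keyed by tenant, so one tenant's query can never reach another tenant's documents.

```python
from collections import defaultdict


class TenantStore:
    """In-memory stand-in for per-tenant vector collections (hypothetical)."""

    def __init__(self) -> None:
        # One document list per tenant; nothing is shared across tenants.
        self._collections: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def upload(self, tenant_id: str, name: str, text: str) -> None:
        """Store a document in the collection that belongs to tenant_id."""
        self._collections[tenant_id].append((name, text))

    def retrieve(self, tenant_id: str, query: str, k: int = 2) -> list[str]:
        """Return up to k document names from this tenant only, best match first."""
        terms = set(query.lower().split())
        scored = []
        # .get() avoids creating an empty collection for unknown tenants.
        for name, text in self._collections.get(tenant_id, []):
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                scored.append((overlap, name))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [name for _, name in scored[:k]]


store = TenantStore()
store.upload("medicine", "dosage.txt", "insulin dosage guidance for clinics")
store.upload("law", "exam.txt", "exam regulations and dosage of coursework")

print(store.retrieve("medicine", "insulin dosage"))     # only medicine documents
print(store.retrieve("engineering", "insulin dosage"))  # unknown tenant: empty list
```

A real deployment would replace the overlap score with a similarity search inside the tenant's own collection, keeping the same tenant-keyed interface.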

Operations Control Plane

Routing, Access, and GPU Resource Status

Access Control

tenant_id + role check before document retrieval

decision: allowed_tenant_scoped
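
A minimal sketch of such a check, assuming a caller object that carries a tenant and a role. The `Caller` type, the role names, and the two denial strings are hypothetical; only the `allowed_tenant_scoped` decision comes from the panel above.

```python
from dataclasses import dataclass

# Assumed role names; the real platform may define different ones.
ALLOWED_ROLES = {"member", "admin"}


@dataclass(frozen=True)
class Caller:
    user: str
    tenant_id: str
    role: str


def check_access(caller: Caller, requested_tenant: str) -> str:
    """Decide whether the caller may retrieve documents of requested_tenant."""
    # The tenant check runs first so that no retrieval is attempted across tenants.
    if caller.tenant_id != requested_tenant:
        return "denied_tenant_mismatch"
    if caller.role not in ALLOWED_ROLES:
        return "denied_role"
    return "allowed_tenant_scoped"


print(check_access(Caller("alice", "medicine", "member"), "medicine"))
```

Running the check before any retrieval call keeps the retrieval code free of authorization logic.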

Model Routing

OpenAI-compatible LiteLLM-style gateway

route: litellm/clinical-llama

GPU/HPC Resource

Slurm/Kubernetes-ready dispatch layer

gpu-clinical: 2 GPU share

Observability

Prometheus/Grafana-ready labels

latency: 420 ms, queue: 3

Runtime uploads and accounting mutations are persisted through a file-backed service store under data/campusrag-state.json, keeping the API layer ready for a later SQLite or PostgreSQL swap.
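
One way such a file-backed store could work is sketched below. The function names and the `{"tenants": {}}` default shape are assumptions, not the project's actual schema. The write goes to a temporary file first and is then moved into place with `os.replace`, so a crash mid-write does not leave a half-written state file.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read the JSON state file, or return an empty state if it does not exist."""
    if not path.exists():
        return {"tenants": {}}  # assumed default shape
    with path.open("r", encoding="utf-8") as handle:
        return json.load(handle)


def save_state(path: Path, state: dict) -> None:
    """Write the state atomically: temp file in the same directory, then rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2)
        os.replace(tmp_name, path)  # atomic when both paths are on one filesystem
    except BaseException:
        os.unlink(tmp_name)  # do not leave the temp file behind on failure
        raise
```

Because callers only see `load_state` and `save_state`, the two functions could later be reimplemented on top of SQLite or PostgreSQL without touching the API handlers.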

Accounting

Usage and Resource Management

0

Requests

Medicine chat calls this month

0

Documents

tenant-local uploads indexed for RAG

0

Tokens

estimated input and output usage

EUR 0.00

Cost

simple attribution estimate
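
A simple attribution estimate of this kind might look like the sketch below. The per-token rates are placeholders invented for the example, and the four-characters-per-token rule is a rough heuristic rather than a real tokenizer; neither is taken from the platform.

```python
from decimal import Decimal, ROUND_HALF_UP

# Placeholder rates in EUR per 1,000 tokens (assumptions, not real prices).
EUR_PER_1K_INPUT = Decimal("0.0005")
EUR_PER_1K_OUTPUT = Decimal("0.0015")


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about four characters per token, at least 1 if non-empty."""
    return max(1, len(text) // 4) if text else 0


def estimate_cost_eur(input_tokens: int, output_tokens: int) -> Decimal:
    """Attribute a cost to a tenant from its token counters, rounded to cents."""
    cost = (
        Decimal(input_tokens) * EUR_PER_1K_INPUT
        + Decimal(output_tokens) * EUR_PER_1K_OUTPUT
    ) / 1000
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(f"EUR {estimate_cost_eur(0, 0)}")
```

`Decimal` is used instead of floats so that repeated additions across many requests do not accumulate rounding error.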

Operations

Prometheus-Style Metrics

campusrag_requests_total{tenant="medicine"} 0
campusrag_documents_total{tenant="medicine"} 0
campusrag_tokens_total{tenant="medicine"} 0
campusrag_limit_remaining{tenant="medicine"} 10
campusrag_errors_total{tenant="medicine"} 0
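
Lines in this shape can be produced with plain string formatting, as in the sketch below. `render_metrics` and `escape_label` are hypothetical helpers, not the project's code; a production service would more likely use a Prometheus client library. The escaping follows the Prometheus text format rules for label values (backslash, double quote, and newline).

```python
def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(tenant: str, counters: dict[str, int]) -> str:
    """Render one line per counter, labelled with the tenant."""
    label = escape_label(tenant)
    lines = [
        f'campusrag_{name}{{tenant="{label}"}} {value}'
        for name, value in counters.items()
    ]
    return "\n".join(lines) + "\n"


print(render_metrics("medicine", {"requests_total": 0, "limit_remaining": 10}), end="")
```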

Operational behavior

Every request is tagged with tenant_id
Usage counters can feed cost attribution
Rate limits protect shared GPU capacity
Metrics are shaped for Grafana dashboards
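
The per-tenant request limit can be sketched as a simple counter. `RequestQuota` is a hypothetical class, and the default of 10 is only borrowed from the "limit remaining: 10" value shown in the demo; how the real service resets or sizes the limit is not specified here.

```python
class RequestQuota:
    """Fixed per-tenant request budget (sketch; not thread-safe)."""

    def __init__(self, limit: int = 10) -> None:
        self.limit = limit
        self._used: dict[str, int] = {}

    def remaining(self, tenant_id: str) -> int:
        """Requests this tenant may still make."""
        return max(0, self.limit - self._used.get(tenant_id, 0))

    def try_consume(self, tenant_id: str) -> bool:
        """Count one request if budget is left; return False when exhausted."""
        if self.remaining(tenant_id) == 0:
            return False
        self._used[tenant_id] = self._used.get(tenant_id, 0) + 1
        return True


quota = RequestQuota(limit=2)
print(quota.try_consume("medicine"), quota.remaining("medicine"))
```

The value returned by `remaining` is what a gauge such as `campusrag_limit_remaining` would report for each tenant.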

Implementation Plan

From Prototype to Real Service

01

Replace demo retrieval with ChromaDB or pgvector-backed embeddings

02

Route model calls through LiteLLM to Ollama, vLLM, or external APIs

03

Add Keycloak/OIDC SSO with project- and group-based access control

04

Deploy with Docker Compose today, Kubernetes or Slurm workers later

05

Attach Grafana dashboards for request, latency, token, and cost views