AI Inference Platform / RAG / Multi-Tenant Systems
CampusRAG: Multi-Tenant AI Inference Platform
A university computing-center prototype for secure RAG-based AI services: tenant-isolated document retrieval, OpenAI-compatible inference APIs, usage accounting, request limits, and Prometheus-style monitoring.
Why it matches RRZE / HPC@FAU
- Web UI and API for AI inference workflows
- RAG-capable environment with tenant separation
- Usage accounting and fair-resource controls
- Docker-first architecture, Kubernetes and Slurm ready
- Monitoring-ready service metrics for operations
3
Tenants
separate documents, vector collections, and usage records
5
Core APIs
/upload, /chat, /tenants, /status, and /metrics
RAG
Retrieval
source-grounded answers from tenant-local knowledge
SSO
Ready Design
planned Keycloak/OIDC integration for institutions
Interactive Prototype
Tenant-Isolated RAG Demo
Select tenant
No documents uploaded for this tenant yet.
Choose a .txt file or paste text manually.
No document uploaded in this session.
Choose a tenant and run a query. The response will come from the CampusRAG API route.
Architecture
Inference Service Layers
Web UI
Next.js chat, upload, tenant dashboard
FastAPI
/chat, /upload, /usage, /metrics
Tenant Layer
separate docs, vectors, limits, accounting
RAG Store
ChromaDB or pgvector collections per tenant
Model Gateway
OpenAI-compatible, LiteLLM/vLLM-ready
Metrics
Prometheus-ready usage and error counters
Operations Control Plane
Routing, Access, and GPU Resource Status
Access Control
tenant_id + role check before document retrieval
decision: allowed_tenant_scoped
Model Routing
OpenAI-compatible LiteLLM-style gateway
route: litellm/clinical-llama
GPU/HPC Resource
Slurm/Kubernetes-ready dispatch layer
gpu-clinical: 2 GPU share
Observability
Prometheus/Grafana-ready labels
latency: 420 ms, queue: 3
Runtime uploads and accounting mutations are persisted through a file-backed service store under data/campusrag-state.json, keeping the API layer ready for a later SQLite or PostgreSQL swap.
Accounting
Usage and Resource Management
0
Requests
Medicine chat calls this month
0
Documents
tenant-local uploads indexed for RAG
0
Tokens
estimated input and output usage
EUR 0.00
Cost
simple attribution estimate
Operations
Prometheus-Style Metrics
campusrag_requests_total{tenant="medicine"} 0
campusrag_documents_total{tenant="medicine"} 0
campusrag_tokens_total{tenant="medicine"} 0
campusrag_limit_remaining{tenant="medicine"} 10
campusrag_errors_total{tenant="medicine"} 0Operational behavior
- Every request is tagged with tenant_id
- Usage counters can feed cost attribution
- Rate limits protect shared GPU capacity
- Metrics are shaped for Grafana dashboards
Implementation Plan
From Prototype to Real Service
01
Replace demo retrieval with ChromaDB or pgvector-backed embeddings
02
Route model calls through LiteLLM to Ollama, vLLM, or external APIs
03
Add Keycloak/OIDC SSO with project and group based access control
04
Deploy with Docker Compose today, Kubernetes or Slurm workers later
05
Attach Grafana dashboards for request, latency, token, and cost views