QueueGate
QueueGate is an OpenAI-compatible LLM proxy that:
- routes requests to multiple upstream OpenAI-compatible backends (e.g. llama.cpp)
- provides a simple priority queue (user > agent)
- supports streaming and non-streaming responses (no fake streaming)
- supports sticky worker affinity (same chat -> same upstream when possible)
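The sticky-affinity idea above can be sketched as follows. This is an illustrative Python sketch, not QueueGate's actual code: the names `AffinityRouter` and `pick_upstream` are hypothetical. A fresh affinity entry (within the TTL) reuses the same upstream for a chat; new or expired chats are assigned round-robin.

```python
import time

class AffinityRouter:
    """Sketch: same chat id -> same upstream while the affinity entry is fresh."""

    def __init__(self, upstreams, ttl=60):
        self.upstreams = upstreams
        self.ttl = ttl            # seconds, cf. AFFINITY_TTL_SEC
        self._affinity = {}       # chat_id -> (upstream, last_used)
        self._cursor = 0          # round-robin cursor for new/expired chats

    def pick_upstream(self, chat_id, now=None):
        now = time.monotonic() if now is None else now
        entry = self._affinity.get(chat_id)
        if entry and now - entry[1] < self.ttl:
            upstream = entry[0]   # fresh affinity: stick to the same upstream
        else:
            upstream = self.upstreams[self._cursor % len(self.upstreams)]
            self._cursor += 1     # rotate for new or expired chats
        self._affinity[chat_id] = (upstream, now)
        return upstream
```

A real proxy would also weigh current upstream load before falling back to round-robin; the sketch keeps only the affinity logic.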
Quick start
1) Configure
Minimal env:
- LLM_UPSTREAMS (comma-separated URLs), e.g.
  http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions
Optional:
- LLM_MAX_CONCURRENCY (defaults to number of upstreams)
- STICKY_HEADER (default: X-Chat-Id)
- AFFINITY_TTL_SEC (default: 60)
- QUEUE_NOTIFY_USER=auto|always|never (default: auto)
- QUEUE_NOTIFY_MIN_MS (default: 1200)
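The variables above could be loaded roughly like this. The variable names match the README; the defaulting logic and the `load_config` helper are assumptions for illustration, not QueueGate's actual startup code.

```python
import os

def load_config(env=os.environ):
    """Sketch: parse QueueGate's env vars with the documented defaults."""
    upstreams = [u.strip() for u in env["LLM_UPSTREAMS"].split(",") if u.strip()]
    return {
        "upstreams": upstreams,
        # defaults to one in-flight request per upstream
        "max_concurrency": int(env.get("LLM_MAX_CONCURRENCY", len(upstreams))),
        "sticky_header": env.get("STICKY_HEADER", "X-Chat-Id"),
        "affinity_ttl_sec": int(env.get("AFFINITY_TTL_SEC", "60")),
        "queue_notify_user": env.get("QUEUE_NOTIFY_USER", "auto"),
        "queue_notify_min_ms": int(env.get("QUEUE_NOTIFY_MIN_MS", "1200")),
    }
```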
2) Run
uvicorn queuegate_proxy.app:app --host 0.0.0.0 --port 8080
3) Health
GET /healthz
4) OpenAI endpoint
POST /v1/chat/completions
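A request to this endpoint carries the usual OpenAI chat-completions body; sending the sticky header (X-Chat-Id by default) lets the proxy pin the conversation to the same upstream. The header value and model name below are illustrative.

```python
import json

# Headers for the proxy: the X-Chat-Id value is a hypothetical chat key.
headers = {
    "Content-Type": "application/json",
    "X-Chat-Id": "chat-42",   # sticky-affinity key (default STICKY_HEADER)
}

# Standard OpenAI-style chat-completions body; stream=True requests SSE output.
body = json.dumps({
    "model": "default",
    "stream": True,
    "messages": [{"role": "user", "content": "Hello"}],
})
```

Any OpenAI-compatible client can be pointed at the proxy the same way, as long as it forwards the sticky header.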
Notes
- Tool calls are detected and suppressed in streaming output (to prevent leakage).
- This first version is a proxy-only MVP; tool execution can be wired in later.
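The tool-call suppression mentioned above can be sketched as a filter over streamed delta chunks; this is an assumption about the mechanism, not QueueGate's actual implementation. Any delta carrying `tool_calls` is dropped so the raw call never reaches user-visible output.

```python
def filter_stream(chunks):
    """Sketch: drop streamed deltas that contain tool calls."""
    for chunk in chunks:
        delta = chunk.get("choices", [{}])[0].get("delta", {})
        if "tool_calls" in delta:
            continue          # suppress tool-call deltas (prevent leakage)
        yield chunk           # pass plain content deltas through unchanged
```

Once tool execution is wired in, the suppressed deltas would be buffered and dispatched instead of silently discarded.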