QueueGate
QueueGate is an OpenAI-compatible LLM proxy that:
- routes requests to multiple upstream OpenAI-compatible backends (e.g. llama.cpp)
- provides a simple priority queue (user > agent)
- supports streaming and non-streaming responses (no fake streaming)
- supports sticky worker affinity (same chat -> same upstream when possible)
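The sticky-affinity idea above can be sketched as follows. This is an illustrative Python sketch, not QueueGate's actual code: the names `AffinityRouter` and `pick_upstream` are hypothetical. A fresh affinity entry (within the TTL) reuses the same upstream for a chat; new or expired chats are assigned round-robin.

```python
import time

class AffinityRouter:
    """Sketch: same chat id -> same upstream while the affinity entry is fresh."""

    def __init__(self, upstreams, ttl=60):
        self.upstreams = upstreams
        self.ttl = ttl            # seconds, cf. AFFINITY_TTL_SEC
        self._affinity = {}       # chat_id -> (upstream, last_used)
        self._cursor = 0          # round-robin cursor for new/expired chats

    def pick_upstream(self, chat_id, now=None):
        now = time.monotonic() if now is None else now
        entry = self._affinity.get(chat_id)
        if entry and now - entry[1] < self.ttl:
            upstream = entry[0]   # fresh affinity: stick to the same upstream
        else:
            upstream = self.upstreams[self._cursor % len(self.upstreams)]
            self._cursor += 1     # rotate for new or expired chats
        self._affinity[chat_id] = (upstream, now)
        return upstream
```

A real proxy would also weigh current upstream load before falling back to round-robin; the sketch keeps only the affinity logic.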
Quick start
1) Configure
Minimal env:
- LLM_UPSTREAMS (comma-separated URLs), e.g.
  http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions
Optional:
- LLM_MAX_CONCURRENCY (defaults to number of upstreams)
- STICKY_HEADER (default: X-Chat-Id)
- AFFINITY_TTL_SEC (default: 60)
- QUEUE_NOTIFY_USER=auto|always|never (default: auto)
- QUEUE_NOTIFY_MIN_MS (default: 1200)
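The variables above could be loaded roughly like this. The variable names match the README; the defaulting logic and the `load_config` helper are assumptions for illustration, not QueueGate's actual startup code.

```python
import os

def load_config(env=os.environ):
    """Sketch: parse QueueGate's env vars with the documented defaults."""
    upstreams = [u.strip() for u in env["LLM_UPSTREAMS"].split(",") if u.strip()]
    return {
        "upstreams": upstreams,
        # defaults to one in-flight request per upstream
        "max_concurrency": int(env.get("LLM_MAX_CONCURRENCY", len(upstreams))),
        "sticky_header": env.get("STICKY_HEADER", "X-Chat-Id"),
        "affinity_ttl_sec": int(env.get("AFFINITY_TTL_SEC", "60")),
        "queue_notify_user": env.get("QUEUE_NOTIFY_USER", "auto"),
        "queue_notify_min_ms": int(env.get("QUEUE_NOTIFY_MIN_MS", "1200")),
    }
```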
2) Run
uvicorn queuegate_proxy.app:app --host 0.0.0.0 --port 8080
3) Health
GET /healthz
4) OpenAI endpoint
POST /v1/chat/completions
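A request to this endpoint carries the usual OpenAI chat-completions body; sending the sticky header (X-Chat-Id by default) lets the proxy pin the conversation to the same upstream. The header value and model name below are illustrative.

```python
import json

# Headers for the proxy: the X-Chat-Id value is a hypothetical chat key.
headers = {
    "Content-Type": "application/json",
    "X-Chat-Id": "chat-42",   # sticky-affinity key (default STICKY_HEADER)
}

# Standard OpenAI-style chat-completions body; stream=True requests SSE output.
body = json.dumps({
    "model": "default",
    "stream": True,
    "messages": [{"role": "user", "content": "Hello"}],
})
```

Any OpenAI-compatible client can be pointed at the proxy the same way, as long as it forwards the sticky header.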
Notes
- Tool calls are detected and suppressed in streaming output (to prevent leakage).
- This first version is a proxy-only MVP; tool execution can be wired in later.
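The tool-call suppression mentioned above can be sketched as a filter over streamed delta chunks; this is an assumption about the mechanism, not QueueGate's actual implementation. Any delta carrying `tool_calls` is dropped so the raw call never reaches user-visible output.

```python
def filter_stream(chunks):
    """Sketch: drop streamed deltas that contain tool calls."""
    for chunk in chunks:
        delta = chunk.get("choices", [{}])[0].get("delta", {})
        if "tool_calls" in delta:
            continue          # suppress tool-call deltas (prevent leakage)
        yield chunk           # pass plain content deltas through unchanged
```

Once tool execution is wired in, the suppressed deltas would be buffered and dispatched instead of silently discarded.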