# QueueGate

QueueGate is an OpenAI-compatible **LLM proxy** that:

- routes requests to multiple upstream OpenAI-compatible backends (e.g. llama.cpp)
- provides a simple **priority queue** (user requests take precedence over agent requests)
- supports **stream** and **non-stream** responses (no fake streaming)
- supports **sticky worker affinity** (same chat -> same upstream when possible)

## Quick start

### 1) Configure

Minimal env:

- `LLM_UPSTREAMS` (comma-separated URLs), e.g.
  `http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions`

Optional:

- `LLM_MAX_CONCURRENCY` (defaults to the number of upstreams)
- `STICKY_HEADER` (default: `X-Chat-Id`)
- `AFFINITY_TTL_SEC` (default: `60`)
- `QUEUE_NOTIFY_USER` = `auto|always|never` (default: `auto`)
- `QUEUE_NOTIFY_MIN_MS` (default: `1200`)

An example environment is shown under Examples below.

### 2) Run

```bash
uvicorn queuegate_proxy.app:app --host 0.0.0.0 --port 8080
```

### 3) Health

`GET /healthz`

### 4) OpenAI endpoint

`POST /v1/chat/completions` (request examples below)

## Notes

- Tool calls are detected and suppressed in streaming output (to prevent leakage).
- This first version is a **proxy-only MVP**; tool execution can be wired in later.
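## Examples

The snippets below are illustrative sketches, not normative configuration: the hostnames, ports, and two-backend layout are placeholders, while the variable names themselves are the ones documented above.

```bash
# Example environment for a two-backend setup (hostnames/ports are placeholders).
export LLM_UPSTREAMS="http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions"
export LLM_MAX_CONCURRENCY=2      # optional; defaults to the number of upstreams
export STICKY_HEADER="X-Chat-Id"  # optional; shown here with its default value
export AFFINITY_TTL_SEC=60        # optional; sticky-affinity lifetime in seconds
export QUEUE_NOTIFY_USER=auto     # optional; auto|always|never
export QUEUE_NOTIFY_MIN_MS=1200   # optional; queue-notify threshold in milliseconds

uvicorn queuegate_proxy.app:app --host 0.0.0.0 --port 8080
```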
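A basic smoke test, assuming the proxy is running locally on port 8080. The request body follows the standard OpenAI chat-completions shape; the `model` value is a placeholder and is assumed to be passed through to the upstream.

```bash
# Liveness probe
curl -s http://localhost:8080/healthz

# Non-streaming chat completion
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local-model",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```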
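A streaming request that exercises sticky affinity. `chat-42` is a hypothetical chat id; sending it in the `X-Chat-Id` header (or whatever `STICKY_HEADER` is set to) on consecutive requests within `AFFINITY_TTL_SEC` should keep the chat on the same upstream when possible. `curl -N` disables output buffering so streamed chunks appear as they arrive.

```bash
# Streaming chat completion, pinned to one upstream via the sticky header
# ("chat-42" is a hypothetical chat id)
curl -sN http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Chat-Id: chat-42" \
  -d '{
        "model": "local-model",
        "stream": true,
        "messages": [{"role": "user", "content": "Continue our chat."}]
      }'
```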