# QueueGate

QueueGate is an OpenAI-compatible **LLM proxy** that:

- routes requests to multiple upstream OpenAI-compatible backends (e.g. llama.cpp)
- provides a simple **priority queue** (user > agent)
- supports **stream** and **non-stream** responses (no fake streaming)
- supports **sticky worker affinity** (same chat -> same upstream when possible)

## Quick start

### 1) Configure

Minimal env:

- `LLM_UPSTREAMS` (comma-separated URLs)
  - e.g. `http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions`

Recommended (for clients like OpenWebUI):

- `PROXY_MODELS` (comma-separated **virtual model ids** exposed via `GET /v1/models`)
  - e.g. `PROXY_MODELS=ministral-3-14b-reasoning`
- `PROXY_OWNED_BY` (shown in `/v1/models`, default `queuegate`)

Optional:

- `LLM_MAX_CONCURRENCY` (defaults to the number of upstreams)
- `STICKY_HEADER` (default: `X-Chat-Id`)
- `AFFINITY_TTL_SEC` (default: `60`)
- `QUEUE_NOTIFY_USER` = `auto|always|never` (default: `auto`)
- `QUEUE_NOTIFY_MIN_MS` (default: `1200`)

## Chat Memory (RAG) via ToolServer

If you run QueueGate with `TOOLCALL_MODE=execute` and a ToolServer that exposes `memory_query` + `memory_upsert` (backed by Chroma + Meili), QueueGate can keep the upstream context *tiny* by:

- retrieving relevant prior chat snippets (`memory_query`) for the latest user message
- (optionally) truncating the forwarded chat history to only the last N messages
- injecting retrieved memory as a short system/user message
- upserting the latest user+assistant turn back into memory (`memory_upsert`)

Enable with:

- `CHAT_MEMORY_ENABLE=1`
- `TOOLSERVER_URL=http://<host>:<port>`

Tuning:

- `CHAT_MEMORY_TRUNCATE_HISTORY=1` (default: true)
  If true, forwards only system messages + the last `CHAT_MEMORY_KEEP_LAST` user/assistant messages (plus injected memory).
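The history-truncation behaviour described above can be sketched as follows. This is a minimal illustration of the documented semantics (keep system messages, plus the last `CHAT_MEMORY_KEEP_LAST` user/assistant messages); the function name is hypothetical, not QueueGate's actual internals:

```python
# Illustrative sketch of CHAT_MEMORY_TRUNCATE_HISTORY (not QueueGate internals):
# keep every system message, plus the last `keep_last` non-system messages.

def truncate_history(messages: list[dict], keep_last: int = 4) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]
```

With `keep_last=4`, a chat of one system message and six user turns would be forwarded as five messages: the system message plus the last four user turns (before any retrieved memory is injected).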
- `CHAT_MEMORY_KEEP_LAST=4` (default: 4)
- `CHAT_MEMORY_QUERY_K=8` (default: 8)
- `CHAT_MEMORY_INJECT_ROLE=system` (`system|user`)
- `CHAT_MEMORY_HINT=1` (default: true) – adds a short hint that more memory can be queried if needed
- `CHAT_MEMORY_UPSERT=1` (default: true)
- `CHAT_MEMORY_MAX_UPSERT_CHARS=12000` (default: 12000)
- `CHAT_MEMORY_FOR_AGENTS=0` (default: false)

Namespace selection: QueueGate uses (in order) `STICKY_HEADER`, then OpenWebUI chat/conversation headers, then body fields like `chat_id`/`conversation_id`, and finally falls back to the computed `thread_key`.

### 2) Run

```bash
uvicorn queuegate_proxy.app:app --host 0.0.0.0 --port 8080
```

### 3) Health

`GET /healthz`

### 4) OpenAI endpoint

`POST /v1/chat/completions`

### 5) Models list

`GET /v1/models`

## Tool calling

QueueGate supports three modes (set `TOOLCALL_MODE`):

- `execute` (default): the proxy executes tool calls via `TOOLSERVER_URL` and continues until a final answer is produced
- `passthrough`: forward upstream tool calls to the client (or convert `[TOOL_CALLS]` text into `tool_calls` for the client)
- `suppress`: drop tool calls (useful for pure chat backends)

Toolserver settings:

- `TOOLSERVER_URL`, e.g. `http://toolserver:8081`
- `TOOLSERVER_PREFIX` (default `/openapi`)

Extra endpoints:

- `POST /v1/chat/completions` (main; uses `TOOLCALL_MODE`)
- `POST /v1/chat/completions_passthrough` (forced passthrough; intended for clients with their own tools)
- `POST /v1/agent/chat/completions` (agent-priority queue + executed tools)
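The `execute` mode above is, in essence, a tool-call loop: forward the chat upstream, and whenever the model asks for tools, run them against the ToolServer, append the results, and try again. A minimal sketch under stated assumptions (the function name, message shapes, and injected callables are illustrative, not QueueGate's actual implementation):

```python
# Hypothetical sketch of the `execute` mode loop. `call_upstream` and
# `call_tool` stand in for the HTTP calls to the upstream backend and the
# ToolServer; the real proxy also handles streaming, priorities, and affinity.

def execute_loop(call_upstream, call_tool, messages: list[dict],
                 max_rounds: int = 5) -> dict:
    """Forward the chat upstream, executing requested tool calls via the
    toolserver, until the model returns a plain (final) assistant message."""
    for _ in range(max_rounds):
        msg = call_upstream(messages)          # assistant message dict
        calls = msg.get("tool_calls") or []
        if not calls:
            return msg                         # final answer, no tools requested
        messages.append(msg)                   # keep the tool-call request in context
        for tc in calls:
            result = call_tool(tc["function"]["name"], tc["function"]["arguments"])
            messages.append({"role": "tool",
                             "tool_call_id": tc["id"],
                             "content": result})
    raise RuntimeError("tool-call loop did not converge")
```

In `passthrough` mode the first assistant message containing `tool_calls` would instead be returned to the client as-is; in `suppress` mode the `tool_calls` field would simply be dropped.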