QueueGate
QueueGate is an OpenAI-compatible LLM proxy that:
- routes requests to multiple upstream OpenAI-compatible backends (e.g. llama.cpp)
- provides a simple priority queue (user > agent)
- supports streaming and non-streaming responses (no fake streaming)
- supports sticky worker affinity (same chat -> same upstream when possible)
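The sticky-affinity idea can be sketched as follows. This is an illustrative model, not QueueGate's actual code: the class and method names are hypothetical, and it assumes a chat id maps to the same upstream while the mapping is fresh (see `AFFINITY_TTL_SEC` below), with round-robin assignment for new or expired chats.

```python
import time

# Hypothetical sketch: same chat -> same upstream while the entry is fresh.
class AffinityMap:
    def __init__(self, upstreams, ttl_sec=60):
        self.upstreams = upstreams
        self.ttl = ttl_sec
        self._map = {}   # chat_id -> (upstream, last_used_timestamp)
        self._rr = 0     # round-robin cursor for unassigned chats

    def pick_upstream(self, chat_id, now=None):
        now = time.monotonic() if now is None else now
        entry = self._map.get(chat_id)
        if entry is not None and now - entry[1] < self.ttl:
            self._map[chat_id] = (entry[0], now)  # refresh the TTL on use
            return entry[0]
        upstream = self.upstreams[self._rr % len(self.upstreams)]
        self._rr += 1
        self._map[chat_id] = (upstream, now)
        return upstream
```

"When possible" in the feature list maps to the TTL check: once an entry expires, the chat may land on a different upstream.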
Quick start
1) Configure
Minimal env:
- `LLM_UPSTREAMS` (comma-separated URLs), e.g.
  `http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions`
Recommended (for clients like OpenWebUI):
- `PROXY_MODELS` (comma-separated virtual model ids exposed via `GET /v1/models`), e.g.
  `PROXY_MODELS=ministral-3-14b-reasoning`
- `PROXY_OWNED_BY` (shows up in `/v1/models`, default `queuegate`)
Optional:
- `LLM_MAX_CONCURRENCY` (defaults to the number of upstreams)
- `STICKY_HEADER` (default: `X-Chat-Id`)
- `AFFINITY_TTL_SEC` (default: `60`)
- `QUEUE_NOTIFY_USER=auto|always|never` (default: `auto`)
- `QUEUE_NOTIFY_MIN_MS` (default: `1200`)
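As a sketch of how these variables fit together (illustrative only, not QueueGate's actual loader): `LLM_UPSTREAMS` is a comma-separated list, and `LLM_MAX_CONCURRENCY` falls back to the number of upstreams when unset.

```python
import os

# Hypothetical config loader showing the documented defaults.
def load_config(env=os.environ):
    upstreams = [u.strip() for u in env.get("LLM_UPSTREAMS", "").split(",") if u.strip()]
    return {
        "upstreams": upstreams,
        # default concurrency: one in-flight request per upstream
        "max_concurrency": int(env.get("LLM_MAX_CONCURRENCY", len(upstreams) or 1)),
        "sticky_header": env.get("STICKY_HEADER", "X-Chat-Id"),
        "affinity_ttl_sec": int(env.get("AFFINITY_TTL_SEC", "60")),
        "queue_notify_user": env.get("QUEUE_NOTIFY_USER", "auto"),
        "queue_notify_min_ms": int(env.get("QUEUE_NOTIFY_MIN_MS", "1200")),
    }
```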
Chat Memory (RAG) via ToolServer
If you run QueueGate with `TOOLCALL_MODE=execute` and a ToolServer that exposes `memory_query` + `memory_upsert`
(backed by Chroma + Meili), QueueGate can keep the upstream context tiny by:
- retrieving relevant prior chat snippets (`memory_query`) for the latest user message
- (optionally) truncating the forwarded chat history to only the last N messages
- injecting retrieved memory as a short system/user message
- upserting the latest user+assistant turn back into memory (`memory_upsert`)
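The message rewrite in the middle two steps can be sketched like this. The helper and its arguments are hypothetical; in QueueGate the retrieval and upsert themselves go through the ToolServer's `memory_query`/`memory_upsert` endpoints.

```python
# Illustrative sketch: keep system messages, truncate history, inject memory.
def build_forwarded_messages(messages, retrieved_snippets, keep_last=4,
                             inject_role="system"):
    system = [m for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    kept = chat[-keep_last:] if keep_last else chat  # last N user/assistant turns
    injected = []
    if retrieved_snippets:
        injected = [{
            "role": inject_role,
            "content": "Relevant prior conversation:\n" + "\n".join(retrieved_snippets),
        }]
    return system + injected + kept
```

This mirrors the tunables below: `keep_last` corresponds to `CHAT_MEMORY_KEEP_LAST` and `inject_role` to `CHAT_MEMORY_INJECT_ROLE`.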
Enable with:
- `CHAT_MEMORY_ENABLE=1`
- `TOOLSERVER_URL=http://<toolserver-host>:<port>`
Tuning:
- `CHAT_MEMORY_TRUNCATE_HISTORY=1` (default: true) – if true, forwards only system messages + the last `CHAT_MEMORY_KEEP_LAST` user/assistant messages (plus injected memory)
- `CHAT_MEMORY_KEEP_LAST=4` (default: 4)
- `CHAT_MEMORY_QUERY_K=8` (default: 8)
- `CHAT_MEMORY_INJECT_ROLE=system` (`system` | `user`)
- `CHAT_MEMORY_HINT=1` (default: true) – adds a short hint that more memory can be queried if needed
- `CHAT_MEMORY_UPSERT=1` (default: true)
- `CHAT_MEMORY_MAX_UPSERT_CHARS=12000` (default: 12000)
- `CHAT_MEMORY_FOR_AGENTS=0` (default: false)
Namespace selection:
QueueGate uses (in order) `STICKY_HEADER`, then OpenWebUI chat/conversation headers, then body fields like
`chat_id`/`conversation_id`, and finally falls back to the computed `thread_key`.
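That fallback order can be sketched as a first-non-empty chain. This is illustrative, not QueueGate's internals; in particular the OpenWebUI header name used here is an assumption.

```python
# Hypothetical namespace resolution following the documented order.
def resolve_namespace(headers, body, thread_key, sticky_header="X-Chat-Id"):
    candidates = (
        headers.get(sticky_header),
        headers.get("X-OpenWebUI-Chat-Id"),  # assumed OpenWebUI header name
        body.get("chat_id"),
        body.get("conversation_id"),
    )
    for value in candidates:
        if value:
            return value
    return thread_key  # computed fallback
```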
2) Run
`uvicorn queuegate_proxy.app:app --host 0.0.0.0 --port 8080`
3) Health
GET /healthz
4) OpenAI endpoint
POST /v1/chat/completions
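A minimal non-streaming request body might look like the following. The model id here assumes the `PROXY_MODELS` example above; the `X-Chat-Id` header (or whatever `STICKY_HEADER` names) is what lets the proxy keep a chat on one upstream.

```python
import json

# Request headers: sticky header enables worker affinity for this chat.
headers = {
    "Content-Type": "application/json",
    "X-Chat-Id": "demo-chat-1",
}

# OpenAI-compatible chat completion body; set "stream": True for streaming.
payload = {
    "model": "ministral-3-14b-reasoning",
    "stream": False,
    "messages": [{"role": "user", "content": "Hello!"}],
}
body = json.dumps(payload)
```

POST `body` with these headers to `http://<proxy-host>:8080/v1/chat/completions`.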
5) Models list
GET /v1/models
Tool calling
QueueGate supports three modes (set `TOOLCALL_MODE`):
- `execute` (default): the proxy executes tool calls via `TOOLSERVER_URL` and continues until a final answer
- `passthrough`: forward upstream tool calls to the client (or convert `[TOOL_CALLS]` text into `tool_calls` for the client)
- `suppress`: drop `tool_calls` (useful for pure chat backends)
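The three modes amount to a dispatch on what to do with an assistant message that contains tool calls. A sketch, with hypothetical function and return values (not QueueGate's internals):

```python
# Illustrative TOOLCALL_MODE dispatch.
def route_tool_calls(mode, assistant_message):
    tool_calls = assistant_message.get("tool_calls")
    if not tool_calls:
        return ("final", assistant_message)       # nothing to do in any mode
    if mode == "execute":
        # the proxy would run these via TOOLSERVER_URL and loop until done
        return ("execute_and_continue", tool_calls)
    if mode == "passthrough":
        return ("forward_to_client", tool_calls)  # client owns the tools
    if mode == "suppress":
        stripped = {k: v for k, v in assistant_message.items() if k != "tool_calls"}
        return ("final", stripped)                # drop tool calls entirely
    raise ValueError(f"unknown TOOLCALL_MODE: {mode!r}")
```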
Toolserver settings:
- `TOOLSERVER_URL`, e.g. `http://toolserver:8081`
- `TOOLSERVER_PREFIX` (default `/openapi`)
Extra endpoints:
- `POST /v1/chat/completions` (main; uses `TOOLCALL_MODE`)
- `POST /v1/chat/completions_passthrough` (forced passthrough; intended for clients with their own tools)
- `POST /v1/agent/chat/completions` (agent-priority queue + execute tools)