
QueueGate

QueueGate is an OpenAI-compatible LLM proxy that:

  • routes requests to multiple upstream OpenAI-compatible backends (e.g. llama.cpp)
  • provides a simple priority queue (user > agent)
  • supports streaming and non-streaming responses (no fake streaming)
  • supports sticky worker affinity (same chat -> same upstream when possible)

Quick start

1) Configure

Minimal env:

  • LLM_UPSTREAMS (comma-separated URLs)
    • e.g. http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions

Recommended (for clients like OpenWebUI):

  • PROXY_MODELS (comma-separated virtual model ids exposed via GET /v1/models)
    • e.g. PROXY_MODELS=ministral-3-14b-reasoning
  • PROXY_OWNED_BY (shows up in /v1/models, default queuegate)

Optional:

  • LLM_MAX_CONCURRENCY (defaults to number of upstreams)
  • STICKY_HEADER (default: X-Chat-Id)
  • AFFINITY_TTL_SEC (default: 60)
  • QUEUE_NOTIFY_USER=auto|always|never (default: auto)
  • QUEUE_NOTIFY_MIN_MS (default: 1200)
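A plausible minimal environment for a two-backend setup might look like this (hostnames, ports, and the model id are placeholders, not required values):

```sh
# Two llama.cpp-style upstreams behind the proxy
export LLM_UPSTREAMS=http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions
# Virtual model id exposed via GET /v1/models
export PROXY_MODELS=ministral-3-14b-reasoning
# Optional: match concurrency to the number of upstreams (the default)
export LLM_MAX_CONCURRENCY=2
```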

Chat Memory (RAG) via ToolServer

If you run QueueGate with TOOLCALL_MODE=execute and a ToolServer that exposes memory_query + memory_upsert (backed by Chroma + Meili), QueueGate can keep the upstream context tiny by:

  • retrieving relevant prior chat snippets (memory_query) for the latest user message
  • (optionally) truncating the forwarded chat history to only the last N messages
  • injecting retrieved memory as a short system/user message
  • upserting the latest user+assistant turn back into memory (memory_upsert)
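The truncation and injection steps above can be sketched roughly as follows. This is an illustrative Python sketch, not QueueGate's actual implementation; the message shapes follow the standard chat-completions schema.

```python
# Sketch of history truncation + memory injection (illustrative only):
# keep all system messages, inject retrieved memory as a system message,
# then keep only the last `keep_last` user/assistant messages.

def truncate_history(messages, keep_last=4, memory_snippets=None):
    system = [m for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    injected = []
    if memory_snippets:
        injected = [{
            "role": "system",
            "content": "Relevant prior context:\n" + "\n".join(memory_snippets),
        }]
    return system + injected + chat[-keep_last:]
```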

Enable with:

  • CHAT_MEMORY_ENABLE=1
  • TOOLSERVER_URL=http://<toolserver-host>:<port>

Tuning:

  • CHAT_MEMORY_TRUNCATE_HISTORY=1 (default: true)
    If true, forwards only system messages + the last CHAT_MEMORY_KEEP_LAST user/assistant messages (plus injected memory).
  • CHAT_MEMORY_KEEP_LAST=4 (default: 4)
  • CHAT_MEMORY_QUERY_K=8 (default: 8)
  • CHAT_MEMORY_INJECT_ROLE=system (system|user)
  • CHAT_MEMORY_HINT=1 (default: true) adds a short hint that more memory can be queried if needed
  • CHAT_MEMORY_UPSERT=1 (default: true)
  • CHAT_MEMORY_MAX_UPSERT_CHARS=12000 (default: 12000)
  • CHAT_MEMORY_FOR_AGENTS=0 (default: false)

Namespace selection: QueueGate uses (in order) STICKY_HEADER, then OpenWebUI chat/conversation headers, then body fields like chat_id/conversation_id, and finally falls back to the computed thread_key.
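The selection order above can be sketched like this (illustrative, not the exact QueueGate logic; header names other than the sticky header are assumptions):

```python
# Namespace selection sketch: sticky header first, then chat/conversation
# headers, then body fields, finally the computed thread_key fallback.

def select_namespace(headers, body, thread_key, sticky_header="X-Chat-Id"):
    # "X-Conversation-Id" is a hypothetical example of an OpenWebUI-style header
    for key in (sticky_header, "X-Conversation-Id"):
        if headers.get(key):
            return headers[key]
    for field in ("chat_id", "conversation_id"):
        if body.get(field):
            return body[field]
    return thread_key
```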

2) Run

uvicorn queuegate_proxy.app:app --host 0.0.0.0 --port 8080

3) Health

GET /healthz

4) OpenAI endpoint

POST /v1/chat/completions
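A minimal non-streaming request from Python might look like this (assuming the proxy listens on localhost:8080 as in the Run step; the model id must match one of your PROXY_MODELS):

```python
import json
import urllib.request

def build_chat_request(model, user_text, chat_id=None):
    """Build URL, headers, and JSON body for POST /v1/chat/completions."""
    headers = {"Content-Type": "application/json"}
    if chat_id:
        # Sticky-affinity header (default STICKY_HEADER); also used as
        # the chat-memory namespace when chat memory is enabled.
        headers["X-Chat-Id"] = chat_id
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }
    return "http://localhost:8080/v1/chat/completions", headers, body

if __name__ == "__main__":
    url, headers, body = build_chat_request(
        "ministral-3-14b-reasoning", "Hello!", chat_id="demo-1"
    )
    req = urllib.request.Request(url, data=json.dumps(body).encode(), headers=headers)
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```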

5) Model list endpoint

GET /v1/models

Tool calling

QueueGate supports three modes (set TOOLCALL_MODE):

  • execute (default): proxy executes tool calls via TOOLSERVER_URL and continues until final answer
  • passthrough: forward upstream tool calls to the client (or convert [TOOL_CALLS] text into tool_calls for the client)
  • suppress: drop tool_calls (useful for pure chat backends)
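In passthrough mode the client handles upstream tool calls itself; pulling them out of a non-streaming response could look like this (the payload shape follows the standard OpenAI chat-completions schema):

```python
# Extract tool calls from a non-streaming chat-completions response dict.
# Returns an empty list when the assistant replied with plain content.

def extract_tool_calls(response):
    message = response["choices"][0]["message"]
    return message.get("tool_calls") or []
```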

Toolserver settings:

  • TOOLSERVER_URL e.g. http://toolserver:8081
  • TOOLSERVER_PREFIX (default /openapi)

Extra endpoints:

  • POST /v1/chat/completions (main; uses TOOLCALL_MODE)
  • POST /v1/chat/completions_passthrough (forced passthrough; intended for clients with their own tools)
  • POST /v1/agent/chat/completions (agent-priority queue + execute tools)