QueueGate

QueueGate is an OpenAI-compatible LLM proxy that:

  • routes requests to multiple upstream OpenAI-compatible backends (e.g. llama.cpp)
  • provides a simple priority queue (user requests take precedence over agent requests)
  • supports both streaming and non-streaming responses (no fake streaming)
  • supports sticky worker affinity (same chat -> same upstream when possible)

Quick start

1) Configure

Minimal env:

  • LLM_UPSTREAMS (comma-separated URLs)
    • e.g. http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions
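
For example, pointing at two llama.cpp workers (host names illustrative):

export LLM_UPSTREAMS="http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions"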

Optional:

  • LLM_MAX_CONCURRENCY (defaults to the number of upstreams)
  • STICKY_HEADER (default: X-Chat-Id)
  • AFFINITY_TTL_SEC (default: 60)
  • QUEUE_NOTIFY_USER (auto|always|never; default: auto)
  • QUEUE_NOTIFY_MIN_MS (default: 1200)
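
A sketch of the optional settings, spelling out the documented defaults (LLM_MAX_CONCURRENCY is set to 2 here only to match the two-upstream example above):

export LLM_MAX_CONCURRENCY=2        # defaults to the number of upstreams
export STICKY_HEADER="X-Chat-Id"    # header used for sticky worker affinity
export AFFINITY_TTL_SEC=60
export QUEUE_NOTIFY_USER=auto       # auto|always|never
export QUEUE_NOTIFY_MIN_MS=1200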

2) Run

uvicorn queuegate_proxy.app:app --host 0.0.0.0 --port 8080

3) Health

GET /healthz
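
For example, against a proxy started as above:

curl -s http://localhost:8080/healthz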

4) OpenAI endpoint

POST /v1/chat/completions
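
A minimal request sketch; the model name and chat id are illustrative, and the X-Chat-Id header (the default STICKY_HEADER) keeps all requests for one chat on the same upstream when possible:

curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Chat-Id: chat-123" \
  -d '{"model": "llama", "stream": true, "messages": [{"role": "user", "content": "Hello"}]}'

Set "stream": false in the body to get a single JSON response instead of a stream.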

Notes

  • Tool calls are detected and suppressed in streaming output, to prevent tool-call content from leaking into client-visible streams.
  • This first version is a proxy-only MVP; tool execution can be wired in later.