QueueGate
QueueGate is an OpenAI-compatible LLM proxy that:
- routes requests to multiple upstream OpenAI-compatible backends (e.g. llama.cpp)
- provides a simple priority queue (user > agent)
- supports streaming and non-streaming responses (no fake streaming)
- supports sticky worker affinity (same chat -> same upstream when possible)
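The sticky-affinity idea can be sketched as follows. This is an illustrative model, not QueueGate's actual code: the class and method names are hypothetical, and it assumes a chat id maps to the same upstream while the mapping is fresh (see `AFFINITY_TTL_SEC` below), with round-robin assignment for new or expired chats.

```python
import time

# Hypothetical sketch: same chat -> same upstream while the entry is fresh.
class AffinityMap:
    def __init__(self, upstreams, ttl_sec=60):
        self.upstreams = upstreams
        self.ttl = ttl_sec
        self._map = {}   # chat_id -> (upstream, last_used_timestamp)
        self._rr = 0     # round-robin cursor for unassigned chats

    def pick_upstream(self, chat_id, now=None):
        now = time.monotonic() if now is None else now
        entry = self._map.get(chat_id)
        if entry is not None and now - entry[1] < self.ttl:
            self._map[chat_id] = (entry[0], now)  # refresh the TTL on use
            return entry[0]
        upstream = self.upstreams[self._rr % len(self.upstreams)]
        self._rr += 1
        self._map[chat_id] = (upstream, now)
        return upstream
```

"When possible" in the feature list maps to the TTL check: once an entry expires, the chat may land on a different upstream.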
Quick start
1) Configure
Minimal env:
- `LLM_UPSTREAMS` (comma-separated URLs), e.g.
  `http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions`
Recommended (for clients like OpenWebUI):
- `PROXY_MODELS` (comma-separated virtual model ids exposed via `GET /v1/models`), e.g.
  `PROXY_MODELS=ministral-3-14b-reasoning`
- `PROXY_OWNED_BY` (shows up in `/v1/models`, default `queuegate`)
Optional:
- `LLM_MAX_CONCURRENCY` (defaults to the number of upstreams)
- `STICKY_HEADER` (default: `X-Chat-Id`)
- `AFFINITY_TTL_SEC` (default: `60`)
- `QUEUE_NOTIFY_USER=auto|always|never` (default: `auto`)
- `QUEUE_NOTIFY_MIN_MS` (default: `1200`)
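As a sketch of how these variables fit together (illustrative only, not QueueGate's actual loader): `LLM_UPSTREAMS` is a comma-separated list, and `LLM_MAX_CONCURRENCY` falls back to the number of upstreams when unset.

```python
import os

# Hypothetical config loader showing the documented defaults.
def load_config(env=os.environ):
    upstreams = [u.strip() for u in env.get("LLM_UPSTREAMS", "").split(",") if u.strip()]
    return {
        "upstreams": upstreams,
        # default concurrency: one in-flight request per upstream
        "max_concurrency": int(env.get("LLM_MAX_CONCURRENCY", len(upstreams) or 1)),
        "sticky_header": env.get("STICKY_HEADER", "X-Chat-Id"),
        "affinity_ttl_sec": int(env.get("AFFINITY_TTL_SEC", "60")),
        "queue_notify_user": env.get("QUEUE_NOTIFY_USER", "auto"),
        "queue_notify_min_ms": int(env.get("QUEUE_NOTIFY_MIN_MS", "1200")),
    }
```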
Chat Memory (RAG) via ToolServer
If you run QueueGate with `TOOLCALL_MODE=execute` and a ToolServer that exposes `memory_query` + `memory_upsert`
(backed by Chroma + Meili), QueueGate can keep the upstream context tiny by:
- retrieving relevant prior chat snippets (`memory_query`) for the latest user message
- (optionally) truncating the forwarded chat history to only the last N messages
- injecting retrieved memory as a short system/user message
- upserting the latest user+assistant turn back into memory (`memory_upsert`)
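The message rewrite in the middle two steps can be sketched like this. The helper and its arguments are hypothetical; in QueueGate the retrieval and upsert themselves go through the ToolServer's `memory_query`/`memory_upsert` endpoints.

```python
# Illustrative sketch: keep system messages, truncate history, inject memory.
def build_forwarded_messages(messages, retrieved_snippets, keep_last=4,
                             inject_role="system"):
    system = [m for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    kept = chat[-keep_last:] if keep_last else chat  # last N user/assistant turns
    injected = []
    if retrieved_snippets:
        injected = [{
            "role": inject_role,
            "content": "Relevant prior conversation:\n" + "\n".join(retrieved_snippets),
        }]
    return system + injected + kept
```

This mirrors the tunables below: `keep_last` corresponds to `CHAT_MEMORY_KEEP_LAST` and `inject_role` to `CHAT_MEMORY_INJECT_ROLE`.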
Enable with:
- `CHAT_MEMORY_ENABLE=1`
- `TOOLSERVER_URL=http://<toolserver-host>:<port>`
Tuning:
- `CHAT_MEMORY_TRUNCATE_HISTORY=1` (default: true) – if true, forwards only system messages + the last `CHAT_MEMORY_KEEP_LAST` user/assistant messages (plus injected memory)
- `CHAT_MEMORY_KEEP_LAST=4` (default: 4)
- `CHAT_MEMORY_QUERY_K=8` (default: 8)
- `CHAT_MEMORY_INJECT_ROLE=system` (`system` | `user`)
- `CHAT_MEMORY_HINT=1` (default: true) – adds a short hint that more memory can be queried if needed
- `CHAT_MEMORY_UPSERT=1` (default: true)
- `CHAT_MEMORY_MAX_UPSERT_CHARS=12000` (default: 12000)
- `CHAT_MEMORY_FOR_AGENTS=0` (default: false)
Namespace selection:
QueueGate uses (in order) `STICKY_HEADER`, then OpenWebUI chat/conversation headers, then body fields like
`chat_id`/`conversation_id`, and finally falls back to the computed `thread_key`.
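That fallback order can be sketched as a first-non-empty chain. This is illustrative, not QueueGate's internals; in particular the OpenWebUI header name used here is an assumption.

```python
# Hypothetical namespace resolution following the documented order.
def resolve_namespace(headers, body, thread_key, sticky_header="X-Chat-Id"):
    candidates = (
        headers.get(sticky_header),
        headers.get("X-OpenWebUI-Chat-Id"),  # assumed OpenWebUI header name
        body.get("chat_id"),
        body.get("conversation_id"),
    )
    for value in candidates:
        if value:
            return value
    return thread_key  # computed fallback
```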
2) Run
`uvicorn queuegate_proxy.app:app --host 0.0.0.0 --port 8080`
3) Health
GET /healthz
4) OpenAI endpoint
POST /v1/chat/completions
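A minimal non-streaming request body might look like the following. The model id here assumes the `PROXY_MODELS` example above; the `X-Chat-Id` header (or whatever `STICKY_HEADER` names) is what lets the proxy keep a chat on one upstream.

```python
import json

# Request headers: sticky header enables worker affinity for this chat.
headers = {
    "Content-Type": "application/json",
    "X-Chat-Id": "demo-chat-1",
}

# OpenAI-compatible chat completion body; set "stream": True for streaming.
payload = {
    "model": "ministral-3-14b-reasoning",
    "stream": False,
    "messages": [{"role": "user", "content": "Hello!"}],
}
body = json.dumps(payload)
```

POST `body` with these headers to `http://<proxy-host>:8080/v1/chat/completions`.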
5) Models list
GET /v1/models
Tool calling
QueueGate supports three modes (set `TOOLCALL_MODE`):
- `execute` (default): the proxy executes tool calls via `TOOLSERVER_URL` and continues until a final answer
- `passthrough`: forward upstream tool calls to the client (or convert `[TOOL_CALLS]` text into `tool_calls` for the client)
- `suppress`: drop `tool_calls` (useful for pure chat backends)
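The three modes amount to a dispatch on what to do with an assistant message that contains tool calls. A sketch, with hypothetical function and return values (not QueueGate's internals):

```python
# Illustrative TOOLCALL_MODE dispatch.
def route_tool_calls(mode, assistant_message):
    tool_calls = assistant_message.get("tool_calls")
    if not tool_calls:
        return ("final", assistant_message)       # nothing to do in any mode
    if mode == "execute":
        # the proxy would run these via TOOLSERVER_URL and loop until done
        return ("execute_and_continue", tool_calls)
    if mode == "passthrough":
        return ("forward_to_client", tool_calls)  # client owns the tools
    if mode == "suppress":
        stripped = {k: v for k, v in assistant_message.items() if k != "tool_calls"}
        return ("final", stripped)                # drop tool calls entirely
    raise ValueError(f"unknown TOOLCALL_MODE: {mode!r}")
```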
Toolserver settings:
- `TOOLSERVER_URL`, e.g. `http://toolserver:8081`
- `TOOLSERVER_PREFIX` (default `/openapi`)
Extra endpoints:
- `POST /v1/chat/completions` (main; uses `TOOLCALL_MODE`)
- `POST /v1/chat/completions_passthrough` (forced passthrough; intended for clients with their own tools)
- `POST /v1/agent/chat/completions` (agent-priority queue + execute tools)