# QueueGate
QueueGate is an OpenAI-compatible **LLM proxy** that:
- routes requests to multiple upstream OpenAI-compatible backends (e.g. llama.cpp)
- provides a simple **priority queue** (user > agent)
- supports **streaming** and **non-streaming** responses (no fake streaming)
- supports **sticky worker affinity** (same chat -> same upstream when possible)
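
The user-over-agent priority queue can be sketched as a heap keyed by priority class with an arrival counter as tie-breaker, so ordering stays FIFO within each class. This is an illustrative sketch, not QueueGate's actual implementation:

```python
import heapq
import itertools

# Lower number = higher priority: user requests outrank agent requests.
PRIORITY = {"user": 0, "agent": 1}

class RequestQueue:
    """FIFO within each priority class; user requests dequeue before agent requests."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # monotonic tie-breaker preserves arrival order

    def put(self, kind, request):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), request))

    def get(self):
        _, _, request = heapq.heappop(self._heap)
        return request

q = RequestQueue()
q.put("agent", "agent-req-1")
q.put("user", "user-req-1")
print(q.get())  # user-req-1 dequeues first despite arriving later
```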
## Quick start
### 1) Configure
Minimal env:
- `LLM_UPSTREAMS` (comma-separated URLs)
- e.g. `http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions`
Recommended (for clients like OpenWebUI):
- `PROXY_MODELS` (comma-separated **virtual model ids** exposed via `GET /v1/models`)
- e.g. `PROXY_MODELS=ministral-3-14b-reasoning`
- `PROXY_OWNED_BY` (shows up in `/v1/models`, default `queuegate`)
Optional:
- `LLM_MAX_CONCURRENCY` (defaults to the number of upstreams)
- `STICKY_HEADER` (default: `X-Chat-Id`)
- `AFFINITY_TTL_SEC` (default: `60`)
- `QUEUE_NOTIFY_USER` = `auto|always|never` (default: `auto`)
- `QUEUE_NOTIFY_MIN_MS` (default: `1200`)
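
Sticky worker affinity (`STICKY_HEADER` + `AFFINITY_TTL_SEC`) amounts to a key-to-upstream map whose entries expire after a TTL. A minimal sketch of that idea, assuming the sticky key is the `X-Chat-Id` header value (class and method names here are illustrative, not QueueGate's internals):

```python
import time

class AffinityMap:
    """Maps a sticky key (e.g. the X-Chat-Id value) to an upstream URL,
    expiring idle entries after a TTL so chats can be rebalanced."""
    def __init__(self, ttl_sec=60.0):
        self.ttl = ttl_sec
        self._entries = {}  # key -> (upstream, last_used)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        upstream, last_used = entry
        if time.monotonic() - last_used > self.ttl:
            del self._entries[key]  # expired: forget the affinity
            return None
        return upstream

    def set(self, key, upstream):
        self._entries[key] = (upstream, time.monotonic())

m = AffinityMap(ttl_sec=60)
m.set("chat-123", "http://llama0:8000/v1/chat/completions")
print(m.get("chat-123"))  # same chat -> same upstream while the entry is fresh
```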
## Chat Memory (RAG) via ToolServer
If you run QueueGate with `TOOLCALL_MODE=execute` and a ToolServer that exposes `memory_query` + `memory_upsert`
(backed by Chroma + Meili), QueueGate can keep the upstream context *tiny* by:
- retrieving relevant prior chat snippets (`memory_query`) for the latest user message
- (optionally) truncating the forwarded chat history to only the last N messages
- injecting retrieved memory as a short system/user message
- upserting the latest user+assistant turn back into memory (`memory_upsert`)
Enable with:
- `CHAT_MEMORY_ENABLE=1`
- `TOOLSERVER_URL=http://<toolserver-host>:<port>`
Tuning:
- `CHAT_MEMORY_TRUNCATE_HISTORY=1` (default: true)
If true, forwards only system messages + the last `CHAT_MEMORY_KEEP_LAST` user/assistant messages (plus injected memory).
- `CHAT_MEMORY_KEEP_LAST=4` (default: 4)
- `CHAT_MEMORY_QUERY_K=8` (default: 8)
- `CHAT_MEMORY_INJECT_ROLE=system` (`system|user`)
- `CHAT_MEMORY_HINT=1` (default: true) adds a short hint that more memory can be queried if needed
- `CHAT_MEMORY_UPSERT=1` (default: true)
- `CHAT_MEMORY_MAX_UPSERT_CHARS=12000` (default: 12000)
- `CHAT_MEMORY_FOR_AGENTS=0` (default: false)
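
The truncation behavior described for `CHAT_MEMORY_TRUNCATE_HISTORY` / `CHAT_MEMORY_KEEP_LAST` can be sketched as follows (a simplified illustration, not the proxy's actual code; injected memory messages are omitted):

```python
def truncate_history(messages, keep_last=4):
    """Keep every system message plus the last `keep_last`
    user/assistant messages, in their original order."""
    system = [m for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    return system + chat[-keep_last:]

history = (
    [{"role": "system", "content": "You are helpful."}]
    + [{"role": "user", "content": f"q{i}"} for i in range(6)]
)
print(len(truncate_history(history, keep_last=4)))  # 5: 1 system + last 4 turns
```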
Namespace selection:
QueueGate uses (in order) `STICKY_HEADER`, then OpenWebUI chat/conversation headers, then body fields like
`chat_id/conversation_id`, and finally falls back to the computed `thread_key`.
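
That lookup order can be sketched as a first-match-wins chain. Note the OpenWebUI header name below is an assumption for illustration; only `STICKY_HEADER` and the `chat_id`/`conversation_id` body fields come from the text above:

```python
def resolve_namespace(headers, body, thread_key, sticky_header="X-Chat-Id"):
    """Return the memory namespace, checking sources in priority order.
    `headers` is assumed to use lowercase keys."""
    # 1) the configured sticky header
    v = headers.get(sticky_header.lower())
    if v:
        return v
    # 2) OpenWebUI chat/conversation headers (header name assumed)
    v = headers.get("x-openwebui-chat-id")
    if v:
        return v
    # 3) body fields
    for field in ("chat_id", "conversation_id"):
        v = body.get(field)
        if v:
            return v
    # 4) fall back to the computed thread_key
    return thread_key

print(resolve_namespace({}, {"chat_id": "c1"}, "tk-9"))  # c1
```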
### 2) Run
```bash
uvicorn queuegate_proxy.app:app --host 0.0.0.0 --port 8080
```
### 3) Health
`GET /healthz`
### 4) OpenAI endpoint
`POST /v1/chat/completions`
### 5) Models list
`GET /v1/models`
## Tool calling
QueueGate supports three modes (set `TOOLCALL_MODE`):
- `execute` (default): proxy executes tool calls via `TOOLSERVER_URL` and loops until the model returns a final answer
- `passthrough`: forward upstream tool calls to the client (or convert `[TOOL_CALLS]` text into tool_calls for the client)
- `suppress`: drop tool_calls (useful for pure chat backends)
Toolserver settings:
- `TOOLSERVER_URL` e.g. `http://toolserver:8081`
- `TOOLSERVER_PREFIX` (default `/openapi`)
Extra endpoints:
- `POST /v1/chat/completions` (main; uses `TOOLCALL_MODE`)
- `POST /v1/chat/completions_passthrough` (forced passthrough; intended for clients with their own tools)
- `POST /v1/agent/chat/completions` (agent-priority queue + execute tools)
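
The three `TOOLCALL_MODE` behaviors can be sketched as a simple dispatch over the tool calls an upstream returns (an illustrative sketch only; `execute_fn` stands in for the ToolServer round-trip and is not a real QueueGate function):

```python
def handle_tool_calls(mode, tool_calls, execute_fn):
    """Dispatch upstream tool calls per TOOLCALL_MODE:
    execute | passthrough | suppress."""
    if mode == "execute":
        # run each call against the tool server and collect results
        return [execute_fn(call) for call in tool_calls]
    if mode == "passthrough":
        return tool_calls  # forward to the client untouched
    if mode == "suppress":
        return []  # drop tool calls for pure chat backends
    raise ValueError(f"unknown TOOLCALL_MODE: {mode}")

calls = [{"name": "memory_query", "arguments": "{}"}]
print(handle_tool_calls("suppress", calls, None))  # []
```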