# QueueGate
QueueGate is an OpenAI-compatible **LLM proxy** that:
- routes requests to multiple upstream OpenAI-compatible backends (e.g. llama.cpp)
- provides a simple **priority queue** (user > agent)
- supports **streaming** and **non-streaming** responses (no fake streaming)
- supports **sticky worker affinity** (same chat -> same upstream when possible)
## Quick start
### 1) Configure
Minimal env:
- `LLM_UPSTREAMS` (comma-separated URLs)
- e.g. `http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions`
Recommended (for clients like OpenWebUI):
- `PROXY_MODELS` (comma-separated **virtual model ids** exposed via `GET /v1/models`)
- e.g. `PROXY_MODELS=ministral-3-14b-reasoning`
- `PROXY_OWNED_BY` (shows up in `/v1/models`, default `queuegate`)
Optional:
- `LLM_MAX_CONCURRENCY` (defaults to number of upstreams)
- `STICKY_HEADER` (default: `X-Chat-Id`)
- `AFFINITY_TTL_SEC` (default: `60`)
- `QUEUE_NOTIFY_USER` = `auto|always|never` (default: `auto`)
- `QUEUE_NOTIFY_MIN_MS` (default: `1200`)
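Putting the variables above together, a minimal environment might look like this (upstream hostnames and the model id are illustrative):

```shell
# Required: upstream OpenAI-compatible backends (hostnames are examples)
export LLM_UPSTREAMS="http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions"

# Recommended for clients like OpenWebUI: virtual model ids for GET /v1/models
export PROXY_MODELS="ministral-3-14b-reasoning"
export PROXY_OWNED_BY="queuegate"

# Optional tuning (values shown are the documented defaults)
export LLM_MAX_CONCURRENCY=2       # defaults to the number of upstreams
export STICKY_HEADER="X-Chat-Id"
export AFFINITY_TTL_SEC=60
export QUEUE_NOTIFY_USER="auto"    # auto|always|never
export QUEUE_NOTIFY_MIN_MS=1200
```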
## Chat Memory (RAG) via ToolServer
If you run QueueGate with `TOOLCALL_MODE=execute` and a ToolServer that exposes `memory_query` + `memory_upsert`
(backed by Chroma + Meili), QueueGate can keep the upstream context *tiny* by:
- retrieving relevant prior chat snippets (`memory_query`) for the latest user message
- (optionally) truncating the forwarded chat history to only the last N messages
- injecting retrieved memory as a short system/user message
- upserting the latest user+assistant turn back into memory (`memory_upsert`)
Enable with:
- `CHAT_MEMORY_ENABLE=1`
- `TOOLSERVER_URL=http://<toolserver-host>:<port>`
Tuning:
- `CHAT_MEMORY_TRUNCATE_HISTORY=1` (default: true)
  If true, forwards only system messages plus the last `CHAT_MEMORY_KEEP_LAST` user/assistant messages (and the injected memory).
- `CHAT_MEMORY_KEEP_LAST=4` (default: 4)
- `CHAT_MEMORY_QUERY_K=8` (default: 8)
- `CHAT_MEMORY_INJECT_ROLE=system` (`system|user`)
- `CHAT_MEMORY_HINT=1` (default: true) adds a short hint that more memory can be queried if needed
- `CHAT_MEMORY_UPSERT=1` (default: true)
- `CHAT_MEMORY_MAX_UPSERT_CHARS=12000` (default: 12000)
- `CHAT_MEMORY_FOR_AGENTS=0` (default: false)
Namespace selection:
QueueGate uses (in order) `STICKY_HEADER`, then OpenWebUI chat/conversation headers, then body fields like
`chat_id/conversation_id`, and finally falls back to the computed `thread_key`.
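The precedence above can be sketched as follows; this is an illustrative reimplementation, not QueueGate's actual code, and the OpenWebUI header name used here is an assumption:

```python
def pick_memory_namespace(headers: dict, body: dict, thread_key: str) -> str:
    """Sketch of the documented namespace precedence (illustrative only)."""
    # 1) The sticky header (default: X-Chat-Id, configurable via STICKY_HEADER)
    sticky = headers.get("X-Chat-Id")
    if sticky:
        return sticky
    # 2) OpenWebUI chat/conversation headers (header name is a placeholder)
    owui = headers.get("X-OpenWebUI-Chat-Id")
    if owui:
        return owui
    # 3) Body fields such as chat_id / conversation_id
    for field in ("chat_id", "conversation_id"):
        if body.get(field):
            return body[field]
    # 4) Fall back to the computed thread_key
    return thread_key

print(pick_memory_namespace({}, {"chat_id": "abc"}, "tk-1"))  # -> abc
```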
### 2) Run
```bash
uvicorn queuegate_proxy.app:app --host 0.0.0.0 --port 8080
```
### 3) Health
`GET /healthz`
### 4) OpenAI endpoint
`POST /v1/chat/completions`
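A minimal client request might look like this; the model id and chat id are illustrative, and the `X-Chat-Id` header is what enables sticky worker affinity:

```python
import json
from urllib import request

# Illustrative call against a locally running QueueGate (port from the Run step)
payload = {
    "model": "ministral-3-14b-reasoning",  # a virtual id from PROXY_MODELS
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,  # set True for streamed responses
}
req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "X-Chat-Id": "chat-123",  # STICKY_HEADER: same chat -> same upstream
    },
)
# resp = request.urlopen(req)  # uncomment once the proxy is running
```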
### 5) Models list
`GET /v1/models`
## Tool calling
QueueGate supports three modes (set `TOOLCALL_MODE`):
- `execute` (default): proxy executes tool calls via `TOOLSERVER_URL` and continues until final answer
- `passthrough`: forward upstream tool calls to the client (or convert `[TOOL_CALLS]` text into tool_calls for the client)
- `suppress`: drop tool_calls (useful for pure chat backends)
Toolserver settings:
- `TOOLSERVER_URL` e.g. `http://toolserver:8081`
- `TOOLSERVER_PREFIX` (default `/openapi`)
Extra endpoints:
- `POST /v1/chat/completions` (main; uses `TOOLCALL_MODE`)
- `POST /v1/chat/completions_passthrough` (forced passthrough; intended for clients with their own tools)
- `POST /v1/agent/chat/completions` (agent-priority queue + execute tools)
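In `passthrough` mode the client receives OpenAI-style `tool_calls` and is expected to execute them itself. A sketch of the exchange, with an illustrative call id, tool name, and arguments:

```python
import json

# Illustrative assistant message a client might receive in passthrough mode
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_1",  # illustrative id
            "type": "function",
            "function": {
                "name": "memory_query",  # example ToolServer tool
                "arguments": json.dumps({"query": "previous discussion", "k": 8}),
            },
        }
    ],
}

# The client executes the tool, then replies with a `tool`-role message:
tool_result = {
    "role": "tool",
    "tool_call_id": "call_1",
    "content": json.dumps({"snippets": []}),  # illustrative result payload
}
```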