# QueueGate
QueueGate is an OpenAI-compatible **LLM proxy** that:
- routes requests to multiple upstream OpenAI-compatible backends (e.g. llama.cpp)
- provides a simple **priority queue** (user requests are served before agent requests)
- supports **streaming** and **non-streaming** responses (no fake streaming)
- supports **sticky worker affinity** (same chat -> same upstream when possible)
## Quick start
### 1) Configure
Minimal env:
- `LLM_UPSTREAMS` (comma-separated upstream URLs)
  - e.g. `http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions`
Optional:
- `LLM_MAX_CONCURRENCY` (defaults to number of upstreams)
- `STICKY_HEADER` (default: `X-Chat-Id`)
- `AFFINITY_TTL_SEC` (default: `60`)
- `QUEUE_NOTIFY_USER` = `auto|always|never` (default: `auto`)
- `QUEUE_NOTIFY_MIN_MS` (default: `1200`)
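Putting the variables together, a minimal environment for two llama.cpp upstreams might look like this (the `llama0`/`llama1` hostnames come from the example above; the optional values shown are just the documented defaults, not recommendations):

```bash
# Required: route across two llama.cpp backends
export LLM_UPSTREAMS="http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions"

# Optional tuning (values shown are the defaults)
export LLM_MAX_CONCURRENCY=2        # defaults to the number of upstreams
export STICKY_HEADER="X-Chat-Id"    # header used for sticky worker affinity
export AFFINITY_TTL_SEC=60          # how long a chat stays pinned to an upstream
export QUEUE_NOTIFY_USER=auto       # auto | always | never
export QUEUE_NOTIFY_MIN_MS=1200     # min queue wait before notifying the user
```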
### 2) Run
```bash
uvicorn queuegate_proxy.app:app --host 0.0.0.0 --port 8080
```
### 3) Health
`GET /healthz`
### 4) OpenAI endpoint
`POST /v1/chat/completions`
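Requests look like any other OpenAI-style chat completion; the only QueueGate-specific detail is the sticky header (default `X-Chat-Id`), which pins a conversation to one upstream. A minimal client-side sketch, assuming the proxy runs on `localhost:8080` (the model name is a placeholder for whatever your upstreams serve):

```python
import json
import urllib.request

def build_request(chat_id: str, messages: list, stream: bool = False):
    """Build an OpenAI-compatible chat request for QueueGate.

    `X-Chat-Id` is QueueGate's default sticky header (see STICKY_HEADER);
    the base URL and model name are placeholders for your deployment.
    """
    payload = {
        "model": "default",  # placeholder -- use whatever your upstream serves
        "messages": messages,
        "stream": stream,
    }
    headers = {
        "Content-Type": "application/json",
        "X-Chat-Id": chat_id,  # same chat id -> same upstream when possible
    }
    return urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers=headers,
        method="POST",
    )

req = build_request("chat-123", [{"role": "user", "content": "hello"}])
# urllib.request.urlopen(req) would send it once the proxy is running
```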
## Notes
- Tool calls are detected and suppressed in streaming output, so raw tool-call tokens do not leak to the client.
- This first version is a **proxy-only MVP**; tool execution can be wired in later.
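
To illustrate what sticky worker affinity means in practice, here is a small sketch (illustrative only, not QueueGate's actual implementation): each chat id is pinned to an upstream for `AFFINITY_TTL_SEC` seconds, and falls back to round-robin once the pin expires.

```python
import time

class AffinityMap:
    """Illustrative TTL-based chat -> upstream pinning (not the real code)."""

    def __init__(self, upstreams, ttl_sec=60.0):
        self.upstreams = upstreams
        self.ttl_sec = ttl_sec
        self._pins = {}  # chat_id -> (upstream, expiry timestamp)
        self._next = 0   # round-robin cursor for unpinned chats

    def pick(self, chat_id, now=None):
        now = time.monotonic() if now is None else now
        pinned = self._pins.get(chat_id)
        if pinned and pinned[1] > now:
            upstream = pinned[0]  # still within TTL: reuse the same upstream
        else:
            upstream = self.upstreams[self._next % len(self.upstreams)]
            self._next += 1       # new or expired chat: round-robin
        self._pins[chat_id] = (upstream, now + self.ttl_sec)  # refresh TTL
        return upstream
```

Successive requests carrying the same `X-Chat-Id` land on the same upstream as long as they arrive within the TTL; the real proxy additionally has to weigh upstream availability and load.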