update
This commit is contained in:
parent
6f533fc15a
commit
1307e4fd3a
59
README.md
59
README.md
@ -14,6 +14,11 @@ Minimal env:
|
|||||||
- `LLM_UPSTREAMS` (comma-separated URLs)
|
- `LLM_UPSTREAMS` (comma-separated URLs)
|
||||||
- e.g. `http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions`
|
- e.g. `http://llama0:8000/v1/chat/completions,http://llama1:8000/v1/chat/completions`
|
||||||
|
|
||||||
|
Recommended (for clients like OpenWebUI):
|
||||||
|
- `PROXY_MODELS` (comma-separated **virtual model ids** exposed via `GET /v1/models`)
|
||||||
|
- e.g. `PROXY_MODELS=ministral-3-14b-reasoning`
|
||||||
|
- `PROXY_OWNED_BY` (shows up in `/v1/models`, default `queuegate`)
|
||||||
|
|
||||||
Optional:
|
Optional:
|
||||||
- `LLM_MAX_CONCURRENCY` (defaults to number of upstreams)
|
- `LLM_MAX_CONCURRENCY` (defaults to number of upstreams)
|
||||||
- `STICKY_HEADER` (default: `X-Chat-Id`)
|
- `STICKY_HEADER` (default: `X-Chat-Id`)
|
||||||
@ -21,6 +26,34 @@ Optional:
|
|||||||
- `QUEUE_NOTIFY_USER` = `auto|always|never` (default: `auto`)
|
- `QUEUE_NOTIFY_USER` = `auto|always|never` (default: `auto`)
|
||||||
- `QUEUE_NOTIFY_MIN_MS` (default: `1200`)
|
- `QUEUE_NOTIFY_MIN_MS` (default: `1200`)
|
||||||
|
|
||||||
|
## Chat Memory (RAG) via ToolServer
|
||||||
|
|
||||||
|
If you run QueueGate with `TOOLCALL_MODE=execute` and a ToolServer that exposes `memory_query` + `memory_upsert`
|
||||||
|
(backed by Chroma + Meili), QueueGate can keep the upstream context *tiny* by:
|
||||||
|
- retrieving relevant prior chat snippets (`memory_query`) for the latest user message
|
||||||
|
- (optionally) truncating the forwarded chat history to only the last N messages
|
||||||
|
- injecting retrieved memory as a short system/user message
|
||||||
|
- upserting the latest user+assistant turn back into memory (`memory_upsert`)
|
||||||
|
|
||||||
|
Enable with:
|
||||||
|
- `CHAT_MEMORY_ENABLE=1`
|
||||||
|
- `TOOLSERVER_URL=http://<toolserver-host>:<port>`
|
||||||
|
|
||||||
|
Tuning:
|
||||||
|
- `CHAT_MEMORY_TRUNCATE_HISTORY=1` (default: true)
|
||||||
|
If true, forwards only system messages + the last `CHAT_MEMORY_KEEP_LAST` user/assistant messages (plus injected memory).
|
||||||
|
- `CHAT_MEMORY_KEEP_LAST=4` (default: 4)
|
||||||
|
- `CHAT_MEMORY_QUERY_K=8` (default: 8)
|
||||||
|
- `CHAT_MEMORY_INJECT_ROLE=system` (`system|user`)
|
||||||
|
- `CHAT_MEMORY_HINT=1` (default: true) – adds a short hint that more memory can be queried if needed
|
||||||
|
- `CHAT_MEMORY_UPSERT=1` (default: true)
|
||||||
|
- `CHAT_MEMORY_MAX_UPSERT_CHARS=12000` (default: 12000)
|
||||||
|
- `CHAT_MEMORY_FOR_AGENTS=0` (default: false)
|
||||||
|
|
||||||
|
Namespace selection:
|
||||||
|
QueueGate uses (in order) `STICKY_HEADER`, then OpenWebUI chat/conversation headers, then body fields like
|
||||||
|
`chat_id/conversation_id`, and finally falls back to the computed `thread_key`.
|
||||||
|
|
||||||
### 2) Run
|
### 2) Run
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@ -35,6 +68,26 @@ uvicorn queuegate_proxy.app:app --host 0.0.0.0 --port 8080
|
|||||||
|
|
||||||
`POST /v1/chat/completions`
|
`POST /v1/chat/completions`
|
||||||
|
|
||||||
## Notes
|
### 5) Model list endpoint
|
||||||
- Tool calls are detected and suppressed in streaming output (to prevent leakage).
|
|
||||||
- This first version is a **proxy-only MVP**; tool execution can be wired in later.
|
`GET /v1/models`
|
||||||
|
|
||||||
|
### 5) Models list
|
||||||
|
|
||||||
|
`GET /v1/models`
|
||||||
|
|
||||||
|
## Tool calling
|
||||||
|
|
||||||
|
QueueGate supports three modes (set `TOOLCALL_MODE`):
|
||||||
|
- `execute` (default): proxy executes tool calls via `TOOLSERVER_URL` and continues until final answer
|
||||||
|
- `passthrough`: forward upstream tool calls to the client (or convert `[TOOL_CALLS]` text into tool_calls for the client)
|
||||||
|
- `suppress`: drop tool_calls (useful for pure chat backends)
|
||||||
|
|
||||||
|
Toolserver settings:
|
||||||
|
- `TOOLSERVER_URL` e.g. `http://toolserver:8081`
|
||||||
|
- `TOOLSERVER_PREFIX` (default `/openapi`)
|
||||||
|
|
||||||
|
Extra endpoints:
|
||||||
|
- `POST /v1/chat/completions` (main; uses `TOOLCALL_MODE`)
|
||||||
|
- `POST /v1/chat/completions_passthrough` (forced passthrough; intended for clients with their own tools)
|
||||||
|
- `POST /v1/agent/chat/completions` (agent-priority queue + execute tools)
|
||||||
|
|||||||
File diff suppressed because it is too large
Load Diff
Loading…
Reference in New Issue
Block a user