toolserver/README_PDF.md

# PDF endpoints in toolserver

This build adds PDF ingestion, per-page text extraction, page rendering, and optional embedded-image extraction, plus indexing into Chroma and, optionally, Meilisearch.

## Enable / config (env)

- ENABLE_PDF=1
- PDF_STORE_DIR=/data/pdf_store
- PDF_MAX_MB=80
- MEILI_PDF_INDEX=pdf_docs (optional; used by /pdf/{doc_id}/index when add_meili=true)
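
A minimal sketch of how these variables might be read at startup. `PdfSettings` and `load_pdf_settings` are hypothetical names, not part of toolserver. Treating the values listed above as defaults, leaving PDF support off unless `ENABLE_PDF=1`, and counting a megabyte as 1024 * 1024 bytes are all assumptions that this README does not confirm.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PdfSettings:
    """Hypothetical container for the env settings listed above."""
    enabled: bool
    store_dir: str
    max_bytes: int
    meili_index: Optional[str]


def load_pdf_settings(env: Mapping[str, str] = os.environ) -> PdfSettings:
    # Assumption: the feature stays off unless ENABLE_PDF is exactly "1".
    enabled = env.get("ENABLE_PDF", "0") == "1"
    # Assumption: the values shown in this README double as defaults.
    store_dir = env.get("PDF_STORE_DIR", "/data/pdf_store")
    max_mb = int(env.get("PDF_MAX_MB", "80"))
    # MEILI_PDF_INDEX is optional; None means "not configured".
    meili_index = env.get("MEILI_PDF_INDEX") or None
    # Assumption: 1 MB is counted as 1024 * 1024 bytes.
    return PdfSettings(enabled, store_dir, max_mb * 1024 * 1024, meili_index)
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests, for example `load_pdf_settings({"ENABLE_PDF": "1"})`.
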

PyMuPDF is required:
- pip install pymupdf
- import name: `fitz`
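
For orientation, page text, rendering, and embedded-image listing each map onto a small number of PyMuPDF calls. The sketch below is not the toolserver's code: the helper names are hypothetical, and it assumes callers pass 1-based page numbers. Recent PyMuPDF releases can also be imported as `pymupdf`; `fitz` is used here to match the note above.

```python
from typing import Any, Dict, List, Sequence


def blocks_to_records(raw_blocks: Sequence[Sequence[Any]]) -> List[Dict[str, Any]]:
    """Convert PyMuPDF "blocks" tuples into bbox + text records.

    Each tuple from page.get_text("blocks") has the form
    (x0, y0, x1, y1, text, block_no, block_type); block_type 0 is text.
    """
    records = []
    for x0, y0, x1, y1, text, _block_no, block_type in raw_blocks:
        if block_type != 0:  # skip image blocks, if any are present
            continue
        records.append({"bbox": [x0, y0, x1, y1], "text": text.strip()})
    return records


def page_blocks(pdf_path: str, page_number: int) -> List[Dict[str, Any]]:
    import fitz  # PyMuPDF; imported lazily so blocks_to_records has no dependency
    with fitz.open(pdf_path) as doc:
        page = doc[page_number - 1]  # assumption: page_number is 1-based
        return blocks_to_records(page.get_text("blocks"))


def render_page_png(pdf_path: str, page_number: int, dpi: int = 200) -> bytes:
    import fitz
    with fitz.open(pdf_path) as doc:
        pix = doc[page_number - 1].get_pixmap(dpi=dpi)
        return pix.tobytes("png")


def embedded_image_xrefs(pdf_path: str, page_number: int) -> List[int]:
    import fitz
    with fitz.open(pdf_path) as doc:
        # Each entry from get_images() is a tuple whose first item is the xref.
        return [img[0] for img in doc[page_number - 1].get_images(full=True)]
```
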

## Endpoints

- POST /pdf/ingest (multipart/form-data, field: file)
  -> {doc_id, n_pages, sha256}

- GET /pdf/{doc_id}/text?page=1&mode=blocks|text|dict
  -> mode=blocks returns bbox + text for each block

- GET /pdf/{doc_id}/render?page=1&dpi=200
  -> image/png (cached on disk)

- GET /pdf/{doc_id}/images?page=1
  -> list of embedded images (xref ids)

- GET /pdf/{doc_id}/image/{xref}
  -> download embedded image

- POST /pdf/{doc_id}/index
  body: PdfIndexRequest
  -> writes chunks to Chroma and, when add_meili=true, to Meilisearch; extra_text_by_page adds per-page text such as vision captions.
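
As a rough client-side illustration of the endpoints above, the sketch below builds request URLs and an index request body using only the standard library. The base URL `http://localhost:8000`, the helper names, and the shape of `extra_text_by_page` (page number mapped to caption text) are assumptions. `PdfIndexRequest` may also have fields that this README does not name.

```python
import json
from typing import Dict, Optional
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # assumption: wherever toolserver listens


def text_url(doc_id: str, page: int = 1, mode: str = "blocks") -> str:
    """URL for GET /pdf/{doc_id}/text with one of the documented modes."""
    if mode not in ("blocks", "text", "dict"):
        raise ValueError(f"unsupported mode: {mode!r}")
    query = urlencode({"page": page, "mode": mode})
    return f"{BASE_URL}/pdf/{quote(doc_id, safe='')}/text?{query}"


def render_url(doc_id: str, page: int = 1, dpi: int = 200) -> str:
    """URL for GET /pdf/{doc_id}/render, which returns image/png."""
    query = urlencode({"page": page, "dpi": dpi})
    return f"{BASE_URL}/pdf/{quote(doc_id, safe='')}/render?{query}"


def index_body(add_meili: bool = False,
               extra_text_by_page: Optional[Dict[int, str]] = None) -> bytes:
    """JSON body for POST /pdf/{doc_id}/index (only the fields named here)."""
    body: Dict[str, object] = {"add_meili": add_meili}
    if extra_text_by_page:
        # JSON object keys are strings, so page numbers are stringified.
        body["extra_text_by_page"] = {
            str(page): text for page, text in extra_text_by_page.items()
        }
    return json.dumps(body).encode("utf-8")
```

The bytes from `index_body` can be sent with `urllib.request.Request(url, data=..., headers={"Content-Type": "application/json"}, method="POST")`. The ingest upload needs a multipart body, for which a library such as `requests` (`requests.post(url, files={"file": fh})`) is a common choice.
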

added PDF stuff 2026-02-23 15:39:46 +00:00
