# PDF endpoints in toolserver

This build adds PDF ingestion, per-page text extraction, page rendering, optional embedded-image extraction, and indexing into Chroma and, optionally, Meilisearch.

## Enable / config (env)

- `ENABLE_PDF=1`
- `PDF_STORE_DIR=/data/pdf_store`
- `PDF_MAX_MB=80`
- `MEILI_PDF_INDEX=pdf_docs` (optional; used by `/pdf/{doc_id}/index` when `add_meili=true`)

PyMuPDF is required:
- `pip install pymupdf`
- import name: `fitz`
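
As a rough illustration of how the settings above might be consumed, the sketch below reads them from the environment and checks that PyMuPDF is importable. The `PdfSettings` class and the `load_pdf_settings` / `pymupdf_available` helpers are hypothetical names, and the fallback defaults simply mirror the example values listed above; the toolserver's actual parsing may differ.

```python
import importlib.util
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PdfSettings:
    """Hypothetical container for the env settings listed above."""
    enabled: bool
    store_dir: str
    max_bytes: int
    meili_index: Optional[str]


def load_pdf_settings(env: Mapping[str, str] = os.environ) -> PdfSettings:
    # Assumption: "1" means enabled and anything else means disabled.
    enabled = env.get("ENABLE_PDF", "0") == "1"
    # Assumption: defaults mirror the example values from this README.
    store_dir = env.get("PDF_STORE_DIR", "/data/pdf_store")
    max_mb = int(env.get("PDF_MAX_MB", "80"))
    # MEILI_PDF_INDEX is optional, so an unset or empty value becomes None.
    meili_index = env.get("MEILI_PDF_INDEX") or None
    return PdfSettings(enabled, store_dir, max_mb * 1024 * 1024, meili_index)


def pymupdf_available() -> bool:
    # PyMuPDF is installed as "pymupdf" but imported as "fitz".
    return importlib.util.find_spec("fitz") is not None
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise without touching the real environment.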
## Endpoints

- `POST /pdf/ingest` (multipart/form-data, field: `file`)
  -> `{doc_id, n_pages, sha256}`

- `GET /pdf/{doc_id}/text?page=1&mode=blocks|text|dict`
  -> `blocks` output includes bbox + text

- `GET /pdf/{doc_id}/render?page=1&dpi=200`
  -> `image/png` (cached on disk)

- `GET /pdf/{doc_id}/images?page=1`
  -> list of embedded images (xref ids)

- `GET /pdf/{doc_id}/image/{xref}`
  -> downloads the embedded image

- `POST /pdf/{doc_id}/index`
  body: `PdfIndexRequest`
  -> writes chunks to Chroma and, optionally, Meilisearch; supports `extra_text_by_page` for vision captions
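
The endpoint list above can be exercised from a small client. The sketch below only builds the request URLs and the multipart body for `/pdf/ingest`, using the standard library. The base URL `http://localhost:8000` and all helper names are assumptions for illustration, not part of the toolserver.

```python
import uuid
from typing import Tuple
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # assumption: adjust to your deployment
TEXT_MODES = ("blocks", "text", "dict")


def _doc_path(doc_id: str) -> str:
    # Percent-encode the doc_id so it is safe as a single path segment.
    return BASE_URL + "/pdf/" + quote(doc_id, safe="")


def text_url(doc_id: str, page: int = 1, mode: str = "blocks") -> str:
    """URL for GET /pdf/{doc_id}/text."""
    if mode not in TEXT_MODES:
        raise ValueError(f"mode must be one of {TEXT_MODES}")
    return _doc_path(doc_id) + "/text?" + urlencode({"page": page, "mode": mode})


def render_url(doc_id: str, page: int = 1, dpi: int = 200) -> str:
    """URL for GET /pdf/{doc_id}/render."""
    return _doc_path(doc_id) + "/render?" + urlencode({"page": page, "dpi": dpi})


def image_url(doc_id: str, xref: int) -> str:
    """URL for GET /pdf/{doc_id}/image/{xref}."""
    return _doc_path(doc_id) + "/image/" + str(int(xref))


def ingest_body(filename: str, data: bytes) -> Tuple[bytes, str]:
    """Build a multipart/form-data body with a single `file` field.

    Returns (body, content_type) for use with urllib.request.Request.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"
```

To send the upload, pass the body and content type to `urllib.request.Request(BASE_URL + "/pdf/ingest", data=body, headers={"Content-Type": content_type})`. Supplying `data` makes the request a POST.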
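
For the index call, only two request fields are named in this document: `add_meili` and `extra_text_by_page`. A minimal sketch of a request body built from just those two follows. The full `PdfIndexRequest` schema is not shown here, so the key format (page numbers as strings), the sample caption, and the `build_index_body` helper are all assumptions.

```python
import json
from typing import Dict, Optional


def build_index_body(add_meili: bool = False,
                     extra_text_by_page: Optional[Dict[int, str]] = None) -> str:
    """Serialize a hypothetical PdfIndexRequest body as JSON."""
    body: dict = {"add_meili": add_meili}
    if extra_text_by_page:
        # Assumption: pages are keyed by 1-based page number. JSON object
        # keys must be strings, so the integers are converted explicitly.
        body["extra_text_by_page"] = {
            str(page): text for page, text in sorted(extra_text_by_page.items())
        }
    return json.dumps(body)


# Example: attach a vision-model caption to page 3 and also index into Meilisearch.
print(build_index_body(add_meili=True,
                       extra_text_by_page={3: "Bar chart of revenue by quarter"}))
```

Check the field names and types against the server's `PdfIndexRequest` model before relying on this shape.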