Document extract
Hand Han AI a document and it reads it. PDFs, Word files, ODT, RTF, and most office formats are extracted to plain text on your VPS, with no upload to a third party.
What it does
Converts a document to text using the right open-source binary for the format.
| Field | Value |
|---|---|
| Schema name | extract_document |
| Powered by | pdftotext (Poppler), pandoc, LibreOffice headless |
| Installed by | Provisioner, no manual setup |
| API key required | No |
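
As a rough illustration of "the right binary for the format", the sketch below picks a command line from the file extension. The function name `build_extract_command`, the extension-to-tool mapping, and the default output directory are assumptions made for this example; the actual routing inside `extract_document` is not documented here. The command-line flags themselves are standard ones for each tool.

```python
from pathlib import Path
from typing import List

# Hypothetical mapping: which tool handles which extension is an assumption.
# RTF input needs a reasonably recent pandoc.
PANDOC_FORMATS = {".docx", ".odt", ".rtf"}
LIBREOFFICE_FORMATS = {".doc"}  # legacy Word files that pandoc does not read


def build_extract_command(path: str, outdir: str = "/tmp/extract") -> List[str]:
    """Return the command that converts `path` to plain text."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # "-" as the output file makes pdftotext write to stdout.
        return ["pdftotext", "-layout", path, "-"]
    if suffix in PANDOC_FORMATS:
        # pandoc infers the input format from the extension and prints to stdout.
        return ["pandoc", "--to=plain", path]
    if suffix in LIBREOFFICE_FORMATS:
        # LibreOffice writes <name>.txt into outdir instead of printing to stdout.
        return ["soffice", "--headless", "--convert-to", "txt:Text",
                "--outdir", outdir, path]
    raise ValueError(f"unsupported document format: {suffix or '(none)'}")
```

The returned list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`. For the LibreOffice branch the text has to be read back from the output directory afterwards.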
When Han AI uses it
- You forward a PDF in Telegram and ask a question about it.
- A contract, invoice, or supplier deck needs review.
- A long document needs to be summarised or filed into vector memory.
Examples
- “Read this MSA and flag anything non-standard.”
- “What are the payment milestones in the SOW I just sent?”
- “Summarise the appraisal and remember it for next quarter’s review.”
Limits
- Scanned PDFs without an OCR layer return empty text. Han AI falls back to OCR in that case.
- Encrypted or password-protected files are not opened automatically; you need to provide the password in the conversation.
- Very large documents are extracted, but only the relevant chunks are fed to the model.
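
Two of these limits can be sketched in a few lines. The code below is illustrative only: `needs_ocr`, `split_into_chunks`, and `top_chunks` are hypothetical names, the thresholds are arbitrary, and the keyword-overlap ranking is a simple stand-in for whatever relevance method Han AI actually uses, which is not described here.

```python
import re
from typing import List


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """True when extraction yielded almost no visible text (e.g. a scanned PDF)."""
    # \s also matches the form feeds pdftotext emits between pages.
    visible = re.sub(r"\s+", "", extracted_text)
    return len(visible) < min_chars


def split_into_chunks(text: str, max_chars: int = 1000) -> List[str]:
    """Greedily pack paragraphs into chunks of roughly max_chars characters.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def top_chunks(chunks: List[str], question: str, k: int = 3) -> List[str]:
    """Keep the k chunks sharing the most distinct words with the question."""
    words = set(re.findall(r"\w+", question.lower()))

    def score(chunk: str) -> int:
        return len(words & set(re.findall(r"\w+", chunk.lower())))

    return sorted(chunks, key=score, reverse=True)[:k]
```

In this sketch, a caller would run `needs_ocr` on the extracted text first and hand the file to OCR when it returns `True`. Otherwise it would split the text and pass only `top_chunks(...)` to the model.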
Why this stack
Poppler, pandoc, and LibreOffice are the standard open-source tools for document conversion on Linux. They cost nothing and run offline, so your file never leaves your VPS.