Document extract

Hand Han AI a document and it reads it. PDFs, Word files, ODT, RTF, and most office formats are extracted to plain text on the VPS, with no upload to a third party.

What it does

Converts a document to plain text by routing it to the open-source tool suited to its format.

Field	Value
Schema name	`extract_document`
Powered by	`pdftotext` (Poppler), `pandoc`, LibreOffice headless
Installed by	Provisioner, no manual setup
API key required	No
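
As a rough illustration of format-based routing, the sketch below maps a file extension to a command line for one of the three tools listed above. The function name `build_command`, the extension sets, and the output directory are assumptions made for this example; the actual routing rules inside `extract_document` are not documented here.

```python
from pathlib import Path

# Extensions handed to pandoc in this sketch (an assumption, not the real list).
PANDOC_EXTENSIONS = {".docx", ".odt", ".rtf", ".epub", ".html"}


def build_command(path: str, out_dir: str = "/tmp/extract") -> list[str]:
    """Return a command line that converts `path` to plain text.

    Hypothetical helper: it only builds the argument list and runs nothing.
    """
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        # pdftotext writes to stdout when the output file is "-".
        return ["pdftotext", "-layout", path, "-"]
    if ext in PANDOC_EXTENSIONS:
        # pandoc infers the input format from the extension and prints to stdout.
        return ["pandoc", "--to", "plain", path]
    # Everything else goes to LibreOffice headless, which writes a .txt
    # file into out_dir instead of printing to stdout.
    return ["soffice", "--headless", "--convert-to", "txt:Text",
            "--outdir", out_dir, path]
```

The returned list can be passed to `subprocess.run(..., capture_output=True, text=True)`. Note that the LibreOffice branch produces a file on disk, so a caller would read the result from `out_dir` in that case.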

When Han AI uses it

You forward a PDF in Telegram and ask a question about it.
A contract, invoice, or supplier deck needs review.
A long document needs to be summarised or filed into vector memory.

Examples

“Read this MSA and flag anything non-standard.”
“What are the payment milestones in the SOW I just sent?”
“Summarise the appraisal and remember it for next quarter’s review.”

Limits

Scanned PDFs with no embedded text layer yield empty output from direct extraction. Han AI falls back to OCR in that case.
Encrypted or password-protected files are not opened automatically; you need to provide the password in the conversation.
Very large documents are extracted in full, but only the chunks relevant to your question are passed to the model.
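
The three limits above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Han AI's implementation: the helper names, the 20-character threshold, the chunk sizes, and the keyword-overlap scoring are all hypothetical (a real system would more likely rank chunks by vector similarity). The `-upw` flag is the user-password option of Poppler's `pdftotext`.

```python
from typing import Optional


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """True when extraction produced almost no text (likely a scanned PDF).

    The threshold is an assumption chosen for illustration.
    """
    return len("".join(text.split())) < min_chars


def pdftotext_command(path: str, password: Optional[str] = None) -> list[str]:
    """Build a pdftotext command, passing a user password if one was given."""
    cmd = ["pdftotext"]
    if password is not None:
        cmd += ["-upw", password]
    return cmd + [path, "-"]


def split_chunks(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap slightly."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def top_chunks(chunks: list[str], question: str, k: int = 3) -> list[str]:
    """Rank chunks by how many words they share with the question.

    Keyword overlap stands in for whatever relevance scoring is really used.
    """
    terms = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(terms & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]
```

In this sketch, a caller would run the extraction, check `needs_ocr` on the result, and only then decide whether to hand the file to an OCR step. The chunk helpers would apply afterwards, once usable text exists.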

Why this stack

Poppler, pandoc, and LibreOffice are the canonical Linux toolchain for document conversion. They cost nothing and run offline, so your file never leaves your VPS.