Voice messages
Voice notes, audio files, and video notes are all handled. The audio is transcribed on your VPS, then the resulting text flows through the same path as a typed message.
How to use it
Open the Telegram chat with your bot and record a voice note. Send. Han AI replies with text as if you had typed.
The same applies to forwarded audio files and Telegram round video notes. For video notes only the audio is transcribed; visual content is not analysed.
How transcription works
| Step | What happens |
|---|---|
| 1 | The audio file arrives via the Telegram bot. |
| 2 | Your VPS picks the best available transcriber — local whisper.cpp if installed, otherwise the OpenAI Whisper API path. |
| 3 | The transcript is appended to the same conversation history as a normal text turn, at /var/hanai/state/coo-history.jsonl. |
| 4 | The AI COO reads the transcript and replies. |
Local versus API
| Path | When |
|---|---|
Local whisper.cpp | Preferred when installed. Lower latency, no per-minute cost, audio never leaves your VPS. |
| OpenAI Whisper API | Default fallback. Used when local install is absent. |
If you send a high volume of voice notes, ask your operator to install whisper.cpp at /opt/whisper.cpp/. It’s cheaper and keeps the audio on your hardware.
Languages
Telegram audio in any language Whisper supports will transcribe. Bot replies follow the language norm of the conversation so far.
What doesn’t work yet
- Replies as voice. Han AI replies in text. Generating speech back is not in scope today.
- Real-time streaming. Send a complete voice note rather than expecting partial responses while you speak.
Next
- Live mode — what happens after the transcript reaches the COO.
- Where your data lives — where transcripts are stored.