feat: Whisper audio transcription pipeline for audio documents #112
Open
nv78 wants to merge 1 commit into claude/image-document-analysis from
Conversation
When a user uploads an audio file (MP3, M4A, WAV, OGG, FLAC, etc.) as a document, the new audio pipeline:

1. Reads the raw audio bytes from the upload
2. Calls `transcribe_audio()` in the new `services/audio_service.py` module, which invokes OpenAI Whisper (`whisper-1`) with the `verbose_json` response format
3. Prepends a metadata header to the transcript: `[Audio transcript — language: en, duration: 3m 42s]`
4. Stores the full transcript as `document_text` (`media_type='audio'`, `mime_type`)
5. Feeds the transcript into the existing `chunk_document` Ray task so it is split, embedded, and indexed, making the audio fully searchable via RAG

`services/audio_service.py`:

- `transcribe_audio(bytes, filename, language?) -> str`
- Validates file size (`MAX_AUDIO_BYTES`, default 25 MB, the Whisper API limit) and extension against the set of formats Whisper supports
- Passes a prompt hint to Whisper to improve punctuation/paragraph breaks
- Uses `verbose_json` to capture detected language and duration for the header
- Respects the `ENABLE_MULTIMODAL` config flag
- Never raises; returns a placeholder string on failure

`documents/handler.py`:

- New `elif category == "audio"` branch reads bytes, calls `transcribe_audio()`, stores the transcript, and enqueues chunking via a Ray remote task
- Video remains a binary-only stub (PR 4)

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu
Summary
When a user uploads an audio file (MP3, M4A, WAV, OGG, FLAC, WebM, etc.) as a document, the new pipeline transcribes it via OpenAI Whisper and feeds the transcript into the existing RAG chunking + embedding pipeline. The audio becomes fully searchable — ask "what did the speaker say about X?" and it works.
New file: `backend/services/audio_service.py`

- `transcribe_audio(bytes, filename, language?)` — main entry point
- Calls `openai.OpenAI().audio.transcriptions.create()` with `whisper-1` and `response_format="verbose_json"` to capture detected language and duration
- Enforces `MAX_AUDIO_BYTES` (default 25 MB, Whisper's hard limit)
- Supported formats: `flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm`
- Respects the `ENABLE_MULTIMODAL` flag
- The `language` param accepts BCP-47 codes (`"en"`, `"es"`, etc.) for better accuracy when known

Updated: `backend/api_endpoints/documents/handler.py`

- New `elif category == "audio"` branch in `IngestDocumentsHandler`:
  - Calls `transcribe_audio(audio_bytes, filename=filename)`
  - Stores the transcript as `document_text` (`media_type='audio'`, `mime_type`)
  - Enqueues `chunk_document_fn.remote()` for embedding
- Video remains a binary-only stub, to be wired in PR 4
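The handler branch can be pictured as the following sketch. The function name, parameter list, and dependency-injection shape are hypothetical (the real branch lives inside `IngestDocumentsHandler` and calls `chunk_document_fn.remote()` directly); injecting the collaborators here just makes the control flow explicit and testable without Ray or a database.

```python
from typing import Callable


def ingest_audio_document(
    raw: bytes,
    filename: str,
    mime_type: str,
    transcribe: Callable[[bytes, str], str],
    store_document: Callable[..., int],
    enqueue_chunking: Callable[[int], None],
) -> int:
    """Sketch of the `elif category == "audio"` branch with injected dependencies."""
    # 1. Transcribe the raw upload (stands in for transcribe_audio(audio_bytes, filename=filename)).
    transcript = transcribe(raw, filename)
    # 2. Store the transcript as the document's text.
    doc_id = store_document(
        document_text=transcript, media_type="audio", mime_type=mime_type
    )
    # 3. Enqueue chunking/embedding (stands in for chunk_document_fn.remote(doc_id)).
    enqueue_chunking(doc_id)
    return doc_id
```

Because the transcript goes through the same `document_text` + chunking path as any other document, downstream RAG search needs no audio-specific logic.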
Test plan
- `media_type='audio'`, transcript in `document_text`
- `chunks` table populated after the Ray task completes
- `.wav` file → transcription works
- `.pdf` → text path unaffected (no regression)
- `ENABLE_MULTIMODAL=false` → placeholder stored, no Whisper call

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu
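The `ENABLE_MULTIMODAL=false` case in the test plan can be exercised in isolation with a small gate function. This is a stand-in sketch: the function name, the placeholder text, and the keyword arguments are invented for illustration; the real check lives inside `transcribe_audio()` against the config flag.

```python
# Hypothetical placeholder text; the actual string is defined by the service.
PLACEHOLDER = "[Audio transcript unavailable — multimodal ingestion disabled]"


def transcribe_or_placeholder(data, filename, *, enable_multimodal, transcribe):
    """Gate on the config flag before any Whisper call is attempted."""
    if not enable_multimodal:
        return PLACEHOLDER
    return transcribe(data, filename)


def test_disabled_flag_skips_whisper():
    called = []
    out = transcribe_or_placeholder(
        b"\x00",
        "a.mp3",
        enable_multimodal=False,
        transcribe=lambda d, f: called.append(1) or "text",
    )
    assert out == PLACEHOLDER
    assert not called  # Whisper stub never invoked
```

Passing the transcriber as a stub lets the test assert both the stored placeholder and that no API call was made, matching the last bullet of the test plan.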