feat: Whisper audio transcription pipeline for audio documents #112

Open
nv78 wants to merge 1 commit into claude/image-document-analysis from claude/audio-transcription-pipeline

Conversation


nv78 commented Mar 24, 2026

Summary

When a user uploads an audio file (MP3, M4A, WAV, OGG, FLAC, WebM, etc.) as a document, the new pipeline transcribes it via OpenAI Whisper and feeds the transcript into the existing RAG chunking + embedding pipeline. The audio becomes fully searchable — ask "what did the speaker say about X?" and it works.

New file: backend/services/audio_service.py

  • transcribe_audio(bytes, filename, language?) — main entry point
  • Calls openai.OpenAI().audio.transcriptions.create() with whisper-1 and response_format="verbose_json" to capture detected language and duration
  • Prepends a metadata header to the transcript so it surfaces in search:
    [Audio transcript — language: en, duration: 3m 42s]
    
    The speaker begins by discussing...
    
  • Passes a prompt hint to Whisper to improve punctuation and paragraph breaks
  • Validates file size against MAX_AUDIO_BYTES (default 25 MB — Whisper's hard limit)
  • Validates extension against the set Whisper actually supports: flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm
  • Respects ENABLE_MULTIMODAL flag
  • Never raises — returns a placeholder string on failure so the document record is always created
  • language param accepts BCP-47 codes ("en", "es", etc.) for better accuracy when known

Updated: backend/api_endpoints/documents/handler.py

New elif category == "audio" branch in IngestDocumentsHandler:

  1. Reads raw bytes from the upload
  2. Calls transcribe_audio(audio_bytes, filename=filename)
  3. Stores transcript as document_text (media_type='audio', mime_type)
  4. Enqueues chunk_document_fn.remote() for embedding
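The four steps above can be sketched as a self-contained fragment. Here `store_document`, the `_RemoteTask` stub, and `ingest_audio` are hypothetical stand-ins for the real persistence layer, the Ray task, and the handler branch, none of which the PR text shows in full:

```python
def transcribe_audio(audio_bytes: bytes, filename: str) -> str:
    # Stand-in for services.audio_service.transcribe_audio.
    return f"[Audio transcript — language: en, duration: 0m 1s]\n\n(transcript of {filename})"


def store_document(document_text: str, media_type: str, mime_type: str) -> int:
    # Stand-in for the existing document persistence call; returns a fake id.
    return 42


class _RemoteTask:
    # Stand-in for the Ray remote function chunk_document_fn.
    @staticmethod
    def remote(doc_id: int) -> None:
        print(f"enqueued chunking for document {doc_id}")


chunk_document_fn = _RemoteTask()


def ingest_audio(audio_bytes: bytes, filename: str, mime_type: str) -> int:
    """Outline of the `category == "audio"` branch in IngestDocumentsHandler."""
    transcript = transcribe_audio(audio_bytes, filename=filename)  # steps 1-2
    doc_id = store_document(                                       # step 3
        document_text=transcript,
        media_type="audio",
        mime_type=mime_type,
    )
    chunk_document_fn.remote(doc_id)                               # step 4
    return doc_id
```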

Video remains a binary-only stub — to be wired in PR 4.

Test plan

  • Upload an MP3 → document record created with media_type='audio', transcript in document_text
  • Chunks appear in chunks table after Ray task completes
  • Ask a question about spoken content → RAG retrieves relevant transcript chunk, LLM answers correctly
  • Upload a 26 MB audio file → size-limit placeholder stored, no API call made
  • Upload a .wav file → transcription works
  • Upload a .pdf → text path unaffected (no regression)
  • Set ENABLE_MULTIMODAL=false → placeholder stored, no Whisper call
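The size-limit case in the plan above is easy to automate. A pytest-style sketch, using a simplified stand-in that contains only the size guard rather than the real service:

```python
# Sketch of an automated check for the oversized-file case from the test plan.
# `transcribe_audio` here is a stand-in; the real service lives in
# backend/services/audio_service.py.

MAX_AUDIO_BYTES = 25 * 1024 * 1024  # Whisper API hard limit


def transcribe_audio(audio_bytes: bytes, filename: str) -> str:
    if len(audio_bytes) > MAX_AUDIO_BYTES:
        # Placeholder path: no API call is made for oversized files.
        return f"[Audio file too large to transcribe: {filename}]"
    return "transcript"


def test_oversized_audio_gets_placeholder():
    oversized = b"\x00" * (26 * 1024 * 1024)  # 26 MB, just over the limit
    result = transcribe_audio(oversized, "big.mp3")
    assert result.startswith("[Audio file too large")
```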

Depends on: claude/image-document-analysis, claude/add-multimodal-support-QBQca

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu
