feat: Whisper audio transcription pipeline for audio documents #112
Open
nv78 wants to merge 1 commit into claude/image-document-analysis from
Conversation
When a user uploads an audio file (MP3, M4A, WAV, OGG, FLAC, etc.) as a document, the new audio pipeline:

1. Reads the raw audio bytes from the upload
2. Calls `transcribe_audio()` in the new `services/audio_service.py` module, which invokes OpenAI Whisper (`whisper-1`) with the `verbose_json` response format
3. Prepends a metadata header to the transcript: `[Audio transcript — language: en, duration: 3m 42s]`
4. Stores the full transcript as `document_text` (`media_type='audio'`, `mime_type`)
5. Feeds the transcript into the existing `chunk_document` Ray task so it is split, embedded, and indexed, making the audio fully searchable via RAG

`services/audio_service.py`:

- `transcribe_audio(bytes, filename, language?) -> str`
- Validates file size (`MAX_AUDIO_BYTES`, default 25 MB, the Whisper API limit) and extension against the set of formats Whisper supports
- Passes a prompt hint to Whisper to improve punctuation/paragraph breaks
- Uses `verbose_json` to capture detected language and duration for the header
- Respects the `ENABLE_MULTIMODAL` config flag
- Never raises; returns a placeholder string on failure

`documents/handler.py`:

- New `elif category == "audio"` branch reads bytes, calls `transcribe_audio()`, stores the transcript, and enqueues chunking via a Ray remote task
- Video remains a binary-only stub (PR 4)

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu
Summary
When a user uploads an audio file (MP3, M4A, WAV, OGG, FLAC, WebM, etc.) as a document, the new pipeline transcribes it via OpenAI Whisper and feeds the transcript into the existing RAG chunking + embedding pipeline. The audio becomes fully searchable — ask "what did the speaker say about X?" and it works.
New file: `backend/services/audio_service.py`

- `transcribe_audio(bytes, filename, language?)` — main entry point
- Calls `openai.OpenAI().audio.transcriptions.create()` with `whisper-1` and `response_format="verbose_json"` to capture detected language and duration
- Enforces `MAX_AUDIO_BYTES` (default 25 MB, Whisper's hard limit)
- Supported formats: `flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm`
- Respects the `ENABLE_MULTIMODAL` flag
- The `language` param accepts BCP-47 codes (`"en"`, `"es"`, etc.) for better accuracy when known

Updated: `backend/api_endpoints/documents/handler.py`

- New `elif category == "audio"` branch in `IngestDocumentsHandler`:
  - Calls `transcribe_audio(audio_bytes, filename=filename)`
  - Stores the transcript as `document_text` (`media_type='audio'`, `mime_type`)
  - Enqueues `chunk_document_fn.remote()` for embedding
- Video remains a binary-only stub, to be wired in PR 4
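The handler branch can be pictured as the following sketch. The function name, parameter list, and dependency-injection shape are hypothetical (the real branch lives inside `IngestDocumentsHandler` and calls `chunk_document_fn.remote()` directly); injecting the collaborators here just makes the control flow explicit and testable without Ray or a database.

```python
from typing import Callable


def ingest_audio_document(
    raw: bytes,
    filename: str,
    mime_type: str,
    transcribe: Callable[[bytes, str], str],
    store_document: Callable[..., int],
    enqueue_chunking: Callable[[int], None],
) -> int:
    """Sketch of the `elif category == "audio"` branch with injected dependencies."""
    # 1. Transcribe the raw upload (stands in for transcribe_audio(audio_bytes, filename=filename)).
    transcript = transcribe(raw, filename)
    # 2. Store the transcript as the document's text.
    doc_id = store_document(
        document_text=transcript, media_type="audio", mime_type=mime_type
    )
    # 3. Enqueue chunking/embedding (stands in for chunk_document_fn.remote(doc_id)).
    enqueue_chunking(doc_id)
    return doc_id
```

Because the transcript goes through the same `document_text` + chunking path as any other document, downstream RAG search needs no audio-specific logic.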
Test plan
- `media_type='audio'`, transcript in `document_text`
- `chunks` table populated after the Ray task completes
- `.wav` file → transcription works
- `.pdf` → text path unaffected (no regression)
- `ENABLE_MULTIMODAL=false` → placeholder stored, no Whisper call

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu
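The `ENABLE_MULTIMODAL=false` case in the test plan can be exercised in isolation with a small gate function. This is a stand-in sketch: the function name, the placeholder text, and the keyword arguments are invented for illustration; the real check lives inside `transcribe_audio()` against the config flag.

```python
# Hypothetical placeholder text; the actual string is defined by the service.
PLACEHOLDER = "[Audio transcript unavailable — multimodal ingestion disabled]"


def transcribe_or_placeholder(data, filename, *, enable_multimodal, transcribe):
    """Gate on the config flag before any Whisper call is attempted."""
    if not enable_multimodal:
        return PLACEHOLDER
    return transcribe(data, filename)


def test_disabled_flag_skips_whisper():
    called = []
    out = transcribe_or_placeholder(
        b"\x00",
        "a.mp3",
        enable_multimodal=False,
        transcribe=lambda d, f: called.append(1) or "text",
    )
    assert out == PLACEHOLDER
    assert not called  # Whisper stub never invoked
```

Passing the transcriber as a stub lets the test assert both the stored placeholder and that no API call was made, matching the last bullet of the test plan.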