feat: video frame extraction + vision analysis pipeline #113

Open

nv78 wants to merge 1 commit into claude/audio-transcription-pipeline from
Conversation
When a user uploads a video file (MP4, MOV, AVI, MKV, WebM, etc.) as a document, the new pipeline produces a fully searchable structured document:

1. Write video bytes to a secure temp directory
2. Use ffmpeg to extract one JPEG frame every `VIDEO_FRAME_INTERVAL_SECS` (default 30 s), capped at `VIDEO_MAX_FRAMES` (default 20)
3. Call `describe_image()` on each frame (GPT-4o / Claude vision) and tag the description with an MM:SS timestamp
4. Extract the audio track via ffmpeg (`-ar 16000`, mono MP3) and pass it through `transcribe_audio()` (Whisper) for a full spoken transcript
5. Interleave the frame descriptions and audio transcript into a single markdown document, store it as `document_text`, and enqueue chunking

`agents/config.py`:
- `VIDEO_FRAME_INTERVAL_SECS` (default: 30), configurable via env var
- `VIDEO_MAX_FRAMES` (default: 20), a cost/time guardrail
- Both exposed in `get_agent_config()`

`services/video_service.py` (new):
- `describe_video(bytes, filename, mime_type) -> str`
- `_get_duration()`: ffprobe for header metadata
- `_extract_frames()`: ffmpeg `select` filter + `vsync vfr`, returns a list of `(timestamp_secs, path)` tuples
- `_extract_audio_track()`: ffmpeg `-vn` to 16 kHz mono MP3 for Whisper
- Gracefully handles: ffmpeg not on PATH, empty frame list, oversized audio track, any subprocess error
- Never raises: a placeholder string is stored on failure

`documents/handler.py`:
- The video branch now calls `describe_video()`, stores the analysis, and enqueues `chunk_document_fn.remote()`; videos are now fully indexed

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu
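The frame-sampling step (one JPEG every N seconds, capped at a max count) can be sketched as below. The exact filter expression is not shown in the diff; this uses one common `select` + `-vsync vfr` pattern, and `build_frame_cmd` / `extract_frames` are illustrative names, not the PR's:

```python
import shutil
import subprocess
from pathlib import Path


def build_frame_cmd(video_path: str, out_dir: str,
                    interval_secs: int = 30, max_frames: int = 20) -> list[str]:
    """Build an ffmpeg command that keeps one frame every `interval_secs`,
    capped at `max_frames` JPEGs (a sketch of step 2 above)."""
    # isnan(...) selects the very first frame; gte(...) keeps a frame once
    # `interval_secs` have elapsed since the previously selected one.
    select = f"select='isnan(prev_selected_t)+gte(t-prev_selected_t,{interval_secs})'"
    return [
        "ffmpeg", "-i", video_path,
        "-vf", select,
        "-vsync", "vfr",          # only emit the selected frames
        "-frames:v", str(max_frames),
        str(Path(out_dir) / "frame_%04d.jpg"),
    ]


def extract_frames(video_path: str, out_dir: str) -> list[Path]:
    """Run the command; an empty list on missing ffmpeg mirrors the
    graceful-fallback behavior described in the PR."""
    if shutil.which("ffmpeg") is None:
        return []
    subprocess.run(build_frame_cmd(video_path, out_dir),
                   check=True, capture_output=True)
    return sorted(Path(out_dir).glob("frame_*.jpg"))
```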
Summary
Completes the multimodal document pipeline. When a user uploads a video (MP4, MOV, AVI, MKV, WebM, etc.), the system generates a structured, searchable document by combining timestamped visual frame descriptions with a Whisper transcript of the audio track. The result is stored as `document_text` and indexed via the normal RAG chunking pipeline.

Output document format
This document makes the video fully searchable — ask "what revenue number was shown?" or "what did the speaker say about Q3?" and RAG retrieves the right section.
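As a sketch, the interleaved document might be assembled like this. The section headings and helper names (`fmt_timestamp`, `interleave`) are illustrative; the PR's exact layout is not shown in the diff:

```python
def fmt_timestamp(secs: float) -> str:
    """Format seconds as the MM:SS tag attached to each frame description."""
    m, s = divmod(int(secs), 60)
    return f"{m:02d}:{s:02d}"


def interleave(frame_descriptions: list[tuple[float, str]], transcript: str) -> str:
    """Combine timestamped frame descriptions and the audio transcript
    into a single markdown string to store as document_text."""
    lines = ["## Visual frames", ""]
    for t, desc in frame_descriptions:
        lines.append(f"**[{fmt_timestamp(t)}]** {desc}")
    lines += ["", "## Audio transcript", "", transcript]
    return "\n".join(lines)
```

Because the timestamps land in the stored text, a query like "what revenue number was shown?" can retrieve both the frame description and the moment it appeared.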
New file: `backend/services/video_service.py`

- `describe_video(bytes, filename, mime_type)`: main entry point; writes to a temp dir, runs the analysis, cleans up
- `_get_duration()`: ffprobe for header metadata
- `_extract_frames()`: ffmpeg `select` filter + `vsync vfr`; returns `[(timestamp_secs, path)]` capped at `VIDEO_MAX_FRAMES`
- `_extract_audio_track()`: ffmpeg `-vn` to 16 kHz mono MP3 (optimal for Whisper)
- Reuses `describe_image()` and `transcribe_audio()` from existing services
Updated: `backend/agents/config.py`

| Setting | Default |
| --- | --- |
| `VIDEO_FRAME_INTERVAL_SECS` | 30 |
| `VIDEO_MAX_FRAMES` | 20 |
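Both settings are overridable via environment variables with the defaults above. A minimal sketch of how `agents/config.py` might expose them (the diff's exact implementation is not shown, and the dict keys in `get_agent_config()` are assumptions):

```python
import os

# Defaults match the PR; env vars of the same name override them.
VIDEO_FRAME_INTERVAL_SECS = int(os.getenv("VIDEO_FRAME_INTERVAL_SECS", "30"))
VIDEO_MAX_FRAMES = int(os.getenv("VIDEO_MAX_FRAMES", "20"))


def get_agent_config() -> dict:
    """Surface the video settings alongside the rest of the agent config."""
    return {
        "video_frame_interval_secs": VIDEO_FRAME_INTERVAL_SECS,
        "video_max_frames": VIDEO_MAX_FRAMES,
    }
```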
Updated: `backend/api_endpoints/documents/handler.py`

The video branch now calls `describe_video()`, stores the analysis as `document_text`, and enqueues `chunk_document_fn.remote()`; videos are fully indexed.

Requirements
- `ffmpeg` must be on `PATH`. If it is not found, a human-readable placeholder is stored and the document record is still created.
- `OPENAI_API_KEY` is required for both GPT-4o (frame descriptions) and Whisper (audio).

Test plan
- Upload a video: `media_type='video'`, analysis stored in `document_text`
- Rows appear in the `chunks` table after the Ray task completes
- `VIDEO_MAX_FRAMES=3` → only 3 frames extracted

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu
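The "never raises" contract noted in the Requirements (placeholder stored, document record still created) can be sketched as below. `describe_video_safe`, `PLACEHOLDER`, and the injected `run_pipeline` callable are illustrative stand-ins, not the PR's actual names:

```python
import shutil

# Hypothetical placeholder text; the PR only says it is human-readable.
PLACEHOLDER = "[Video analysis unavailable: ffmpeg is not installed on this server.]"


def describe_video_safe(run_pipeline, video_bytes: bytes, filename: str) -> str:
    """Never raises: any failure yields a placeholder string so the
    document record is still created. `run_pipeline` stands in for the
    real frame-extraction / vision / transcription work."""
    if shutil.which("ffmpeg") is None:
        return PLACEHOLDER
    try:
        return run_pipeline(video_bytes, filename)
    except Exception:
        # Covers subprocess errors, empty frame lists, oversized audio, etc.
        return PLACEHOLDER
```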