feat: video frame extraction + vision analysis pipeline#113

Open
nv78 wants to merge 1 commit into claude/audio-transcription-pipeline from claude/video-frame-analysis

Conversation


@nv78 nv78 commented Mar 24, 2026

Summary

Completes the multimodal document pipeline. When a user uploads a video (MP4, MOV, AVI, MKV, WebM, etc.) the system generates a structured, searchable document by combining timestamped visual frame descriptions with a Whisper transcript of the audio track. The result is stored as document_text and indexed via the normal RAG chunking pipeline.

Output document format

[Video analysis — file: demo.mp4, duration: 5m 42s, frame interval: 30s]

## Visual content (key frames)

**[00:00]** A slide titled "Q3 Financial Results" is displayed. The text reads: "Revenue: $4.2M (+18% YoY)..."

**[00:30]** A bar chart comparing monthly sales from January through September...

**[01:00]** A speaker stands at a podium in front of a large screen...

...

## Audio transcript

[Audio transcript — language: en, duration: 5m 42s]

Good afternoon everyone. Today I'm going to walk you through our Q3 results...

This document makes the video fully searchable — ask "what revenue number was shown?" or "what did the speaker say about Q3?" and RAG retrieves the right section.
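The interleaving step that produces this format can be sketched roughly as below. The helper names and signature are illustrative, not the actual code in video_service.py:

```python
def format_timestamp(secs: int) -> str:
    """Render seconds as MM:SS for the frame headings."""
    return f"{secs // 60:02d}:{secs % 60:02d}"


def build_video_document(filename: str, duration_secs: int, interval: int,
                         frames: list[tuple[int, str]], transcript: str) -> str:
    """Combine timestamped frame descriptions and the audio transcript
    into the single markdown document stored as document_text."""
    mins, secs = divmod(duration_secs, 60)
    lines = [
        f"[Video analysis — file: {filename}, duration: {mins}m {secs}s, "
        f"frame interval: {interval}s]",
        "",
        "## Visual content (key frames)",
        "",
    ]
    for ts, description in frames:
        lines.append(f"**[{format_timestamp(ts)}]** {description}")
        lines.append("")
    lines.append("## Audio transcript")
    lines.append("")
    lines.append(transcript)
    return "\n".join(lines)
```

Keeping the visual section ahead of the transcript means RAG chunks carry the timestamp markers, so time-based questions retrieve the right frame description.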

New file: backend/services/video_service.py

  • describe_video(bytes, filename, mime_type) — main entry point; writes to temp dir, runs analysis, cleans up
  • _get_duration() — ffprobe for header metadata
  • _extract_frames() — ffmpeg select filter + vsync vfr; returns [(timestamp_secs, path)] capped at VIDEO_MAX_FRAMES
  • _extract_audio_track() — ffmpeg -vn to 16 kHz mono MP3 (optimal for Whisper)
  • Reuses describe_image() and transcribe_audio() from existing services
  • Handles: ffmpeg not on PATH, empty frame list, oversized audio track, any subprocess error
  • Never raises — placeholder on failure
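The select-filter invocation behind _extract_frames() can be sketched as argv construction. This is a sketch based on the bullets above, not the actual implementation; the expression uses ffmpeg's documented `isnan(prev_selected_t)` idiom to keep the first frame and then one frame per interval:

```python
def build_frame_extract_cmd(video_path: str, out_pattern: str,
                            interval_secs: int) -> list[str]:
    """Build the ffmpeg argv for sampling one frame every interval_secs.

    The select expression keeps the first frame (prev_selected_t is NaN
    before anything is selected) and then any frame at least interval_secs
    after the last kept one; -vsync vfr stops ffmpeg from duplicating
    dropped frames, so output numbering stays dense (frame_0001.jpg, ...).
    """
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", video_path,
        "-vf",
        f"select='isnan(prev_selected_t)+gte(t-prev_selected_t,{interval_secs})'",
        "-vsync", "vfr",
        "-q:v", "2",  # high-quality JPEG output
        out_pattern,
    ]
```

The VIDEO_MAX_FRAMES cap would be applied after extraction by truncating the resulting file list, since select has no built-in frame-count limit.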

Updated: backend/agents/config.py

| Env var | Default | Purpose |
| --- | --- | --- |
| `VIDEO_FRAME_INTERVAL_SECS` | 30 | Seconds between extracted frames |
| `VIDEO_MAX_FRAMES` | 20 | Hard cap on frames per video (cost guardrail) |

Updated: backend/api_endpoints/documents/handler.py

Video branch now calls describe_video(), stores the analysis as document_text, and enqueues chunk_document_fn.remote() — videos are fully indexed.

Requirements

  • ffmpeg must be on PATH. If not found, a human-readable placeholder is stored and the document record is still created.
  • OPENAI_API_KEY for both GPT-4o (frame descriptions) and Whisper (audio).

Test plan

  • Upload an MP4 → document record with media_type='video', analysis in document_text
  • Chunks in chunks table after Ray task
  • Ask "what does frame at 1 minute show?" → RAG retrieves the timestamped description
  • Ask about spoken content → Whisper transcript section retrieved
  • Upload video without audio track → frame descriptions still work, audio section shows placeholder
  • Set VIDEO_MAX_FRAMES=3 → only 3 frames extracted
  • ffmpeg not on PATH → placeholder stored, no crash
  • Upload a PDF → no regression

Depends on: claude/audio-transcription-pipeline, claude/image-document-analysis, claude/add-multimodal-support-QBQca

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu

When a user uploads a video file (MP4, MOV, AVI, MKV, WebM, etc.) as a
document the new pipeline produces a fully searchable structured document:

1. Write video bytes to a secure temp directory
2. Use ffmpeg to extract one JPEG frame every VIDEO_FRAME_INTERVAL_SECS
   (default 30 s), capped at VIDEO_MAX_FRAMES (default 20)
3. Call describe_image() on each frame (GPT-4o / Claude vision) and tag
   the description with a MM:SS timestamp
4. Extract the audio track via ffmpeg (-ar 16000 mono MP3) and pass it
   through transcribe_audio() (Whisper) for a full spoken transcript
5. Interleave the frame descriptions and audio transcript into a single
   markdown document, store it as document_text, and enqueue chunking
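Step 4's ffmpeg invocation can be sketched as argv construction. The flags follow the description above; the audio bitrate is an assumption, not taken from the actual code:

```python
def build_audio_extract_cmd(video_path: str, out_mp3: str) -> list[str]:
    """ffmpeg argv for step 4: strip the video stream and downmix
    the audio to 16 kHz mono MP3 for Whisper."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", video_path,
        "-vn",           # drop the video stream
        "-ar", "16000",  # 16 kHz sample rate
        "-ac", "1",      # mono
        "-b:a", "64k",   # assumed bitrate, to keep the file small for upload
        out_mp3,
    ]
```

A low mono bitrate costs nothing for speech recognition but makes long videos far less likely to trip the oversized-audio guard.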

agents/config.py:
- VIDEO_FRAME_INTERVAL_SECS (default: 30) — configurable via env var
- VIDEO_MAX_FRAMES (default: 20) — cost/time guardrail
- Both exposed in get_agent_config()

services/video_service.py (new):
- describe_video(bytes, filename, mime_type) -> str
- _get_duration(): ffprobe for header metadata
- _extract_frames(): ffmpeg select filter + vsync vfr, returns list of
  (timestamp_secs, path) tuples
- _extract_audio_track(): ffmpeg -vn to 16kHz mono MP3 for Whisper
- Gracefully handles: ffmpeg not on PATH, empty frame list, oversized
  audio track, any subprocess error
- Never raises — placeholder string on failure

documents/handler.py:
- video branch now calls describe_video(), stores the analysis, and
  enqueues chunk_document_fn.remote() — videos are now fully indexed

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu