feat: video frame extraction + vision analysis pipeline#113

Open
nv78 wants to merge 1 commit into claude/audio-transcription-pipeline from claude/video-frame-analysis

Conversation


@nv78 nv78 commented Mar 24, 2026

Summary

Completes the multimodal document pipeline. When a user uploads a video (MP4, MOV, AVI, MKV, WebM, etc.) the system generates a structured, searchable document by combining timestamped visual frame descriptions with a Whisper transcript of the audio track. The result is stored as document_text and indexed via the normal RAG chunking pipeline.

Output document format

[Video analysis — file: demo.mp4, duration: 5m 42s, frame interval: 30s]

## Visual content (key frames)

**[00:00]** A slide titled "Q3 Financial Results" is displayed. The text reads: "Revenue: $4.2M (+18% YoY)..."

**[00:30]** A bar chart comparing monthly sales from January through September...

**[01:00]** A speaker stands at a podium in front of a large screen...

...

## Audio transcript

[Audio transcript — language: en, duration: 5m 42s]

Good afternoon everyone. Today I'm going to walk you through our Q3 results...

This document makes the video fully searchable — ask "what revenue number was shown?" or "what did the speaker say about Q3?" and RAG retrieves the right section.
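The interleaving step that produces this format can be sketched roughly as below. The helper names and signature are illustrative, not the actual code in video_service.py:

```python
def format_timestamp(secs: int) -> str:
    """Render seconds as MM:SS for the frame headings."""
    return f"{secs // 60:02d}:{secs % 60:02d}"


def build_video_document(filename: str, duration_secs: int, interval: int,
                         frames: list[tuple[int, str]], transcript: str) -> str:
    """Combine timestamped frame descriptions and the audio transcript
    into the single markdown document stored as document_text."""
    mins, secs = divmod(duration_secs, 60)
    lines = [
        f"[Video analysis — file: {filename}, duration: {mins}m {secs}s, "
        f"frame interval: {interval}s]",
        "",
        "## Visual content (key frames)",
        "",
    ]
    for ts, description in frames:
        lines.append(f"**[{format_timestamp(ts)}]** {description}")
        lines.append("")
    lines.append("## Audio transcript")
    lines.append("")
    lines.append(transcript)
    return "\n".join(lines)
```

Keeping the visual section ahead of the transcript means RAG chunks carry the timestamp markers, so time-based questions retrieve the right frame description.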

New file: backend/services/video_service.py

  • describe_video(bytes, filename, mime_type) — main entry point; writes to temp dir, runs analysis, cleans up
  • _get_duration() — ffprobe for header metadata
  • _extract_frames() — ffmpeg select filter + vsync vfr; returns [(timestamp_secs, path)] capped at VIDEO_MAX_FRAMES
  • _extract_audio_track() — ffmpeg -vn to 16 kHz mono MP3 (optimal for Whisper)
  • Reuses describe_image() and transcribe_audio() from existing services
  • Handles: ffmpeg not on PATH, empty frame list, oversized audio track, any subprocess error
  • Never raises — placeholder on failure
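The select-filter invocation behind _extract_frames() can be sketched as argv construction. This is a sketch based on the bullets above, not the actual implementation; the expression uses ffmpeg's documented `isnan(prev_selected_t)` idiom to keep the first frame and then one frame per interval:

```python
def build_frame_extract_cmd(video_path: str, out_pattern: str,
                            interval_secs: int) -> list[str]:
    """Build the ffmpeg argv for sampling one frame every interval_secs.

    The select expression keeps the first frame (prev_selected_t is NaN
    before anything is selected) and then any frame at least interval_secs
    after the last kept one; -vsync vfr stops ffmpeg from duplicating
    dropped frames, so output numbering stays dense (frame_0001.jpg, ...).
    """
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", video_path,
        "-vf",
        f"select='isnan(prev_selected_t)+gte(t-prev_selected_t,{interval_secs})'",
        "-vsync", "vfr",
        "-q:v", "2",  # high-quality JPEG output
        out_pattern,
    ]
```

The VIDEO_MAX_FRAMES cap would be applied after extraction by truncating the resulting file list, since select has no built-in frame-count limit.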

Updated: backend/agents/config.py

| Env var | Default | Purpose |
| --- | --- | --- |
| `VIDEO_FRAME_INTERVAL_SECS` | 30 | Seconds between extracted frames |
| `VIDEO_MAX_FRAMES` | 20 | Hard cap on frames per video (cost guardrail) |

Updated: backend/api_endpoints/documents/handler.py

Video branch now calls describe_video(), stores the analysis as document_text, and enqueues chunk_document_fn.remote() — videos are fully indexed.

Requirements

  • ffmpeg must be on PATH. If not found, a human-readable placeholder is stored and the document record is still created.
  • OPENAI_API_KEY for both GPT-4o (frame descriptions) and Whisper (audio).

Test plan

  • Upload an MP4 → document record with media_type='video', analysis in document_text
  • Chunks in chunks table after Ray task
  • Ask "what does frame at 1 minute show?" → RAG retrieves the timestamped description
  • Ask about spoken content → Whisper transcript section retrieved
  • Upload video without audio track → frame descriptions still work, audio section shows placeholder
  • Set VIDEO_MAX_FRAMES=3 → only 3 frames extracted
  • ffmpeg not on PATH → placeholder stored, no crash
  • Upload a PDF → no regression

Depends on: claude/audio-transcription-pipeline, claude/image-document-analysis, claude/add-multimodal-support-QBQca

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu

When a user uploads a video file (MP4, MOV, AVI, MKV, WebM, etc.) as a
document the new pipeline produces a fully searchable structured document:

1. Write video bytes to a secure temp directory
2. Use ffmpeg to extract one JPEG frame every VIDEO_FRAME_INTERVAL_SECS
   (default 30 s), capped at VIDEO_MAX_FRAMES (default 20)
3. Call describe_image() on each frame (GPT-4o / Claude vision) and tag
   the description with a MM:SS timestamp
4. Extract the audio track via ffmpeg (-ar 16000 mono MP3) and pass it
   through transcribe_audio() (Whisper) for a full spoken transcript
5. Interleave the frame descriptions and audio transcript into a single
   markdown document, store it as document_text, and enqueue chunking
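Step 4's ffmpeg invocation can be sketched as argv construction. The flags follow the description above; the audio bitrate is an assumption, not taken from the actual code:

```python
def build_audio_extract_cmd(video_path: str, out_mp3: str) -> list[str]:
    """ffmpeg argv for step 4: strip the video stream and downmix
    the audio to 16 kHz mono MP3 for Whisper."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", video_path,
        "-vn",           # drop the video stream
        "-ar", "16000",  # 16 kHz sample rate
        "-ac", "1",      # mono
        "-b:a", "64k",   # assumed bitrate, to keep the file small for upload
        out_mp3,
    ]
```

A low mono bitrate costs nothing for speech recognition but makes long videos far less likely to trip the oversized-audio guard.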

agents/config.py:
- VIDEO_FRAME_INTERVAL_SECS (default: 30) — configurable via env var
- VIDEO_MAX_FRAMES (default: 20) — cost/time guardrail
- Both exposed in get_agent_config()

services/video_service.py (new):
- describe_video(bytes, filename, mime_type) -> str
- _get_duration(): ffprobe for header metadata
- _extract_frames(): ffmpeg select filter + vsync vfr, returns list of
  (timestamp_secs, path) tuples
- _extract_audio_track(): ffmpeg -vn to 16kHz mono MP3 for Whisper
- Gracefully handles: ffmpeg not on PATH, empty frame list, oversized
  audio track, any subprocess error
- Never raises — placeholder string on failure

documents/handler.py:
- video branch now calls describe_video(), stores the analysis, and
  enqueues chunk_document_fn.remote() — videos are now fully indexed

https://claude.ai/code/session_01C9mHttiQ4ZAaBbQecVV7uu