From ca3d39195899e50f48fd76afa9cbd2815cf5a893 Mon Sep 17 00:00:00 2001 From: Kneesal Date: Thu, 2 Apr 2026 02:16:25 +0000 Subject: [PATCH 1/2] docs(roadmap): add video content vectorization brainstorm and roadmap tickets Scene-level video embeddings for cross-film recommendations, starting with English-only prototype. Uses Gemini 2.5 Flash to describe scenes from extracted frames + transcript, then embeds descriptions via existing text-embedding-3-small pipeline into a separate pgvector scene_embeddings table. Adds: - Requirements doc with phased rollout, storage schema, cost model, and technology research (Spotify RecSys 2025 validates this approach) - Parent feature feat-037 plus 9 sub-tickets (feat-038 through feat-046) covering data audit, scene boundaries, descriptions, embeddings table, backfill worker, visual fusion, recommendation API, pipeline integration, and demo experience frontend - Updates feat-009 blocks to include feat-037 dependency Co-Authored-By: Claude Opus 4.6 (1M context) --- ...ideo-content-vectorization-requirements.md | 261 ++++++++++++++++++ docs/roadmap/README.md | 22 +- .../feat-009-pgvector-embedding-indexing.md | 1 + .../feat-037-video-content-vectorization.md | 215 +++++++++++++++ ...feat-038-video-vectorization-data-audit.md | 83 ++++++ ...feat-039-chapter-based-scene-boundaries.md | 66 +++++ .../feat-040-multimodal-scene-descriptions.md | 83 ++++++ .../feat-041-scene-embeddings-table.md | 82 ++++++ .../feat-042-backfill-worker.md | 66 +++++ .../feat-043-visual-shot-detection-fusion.md | 63 +++++ .../feat-044-recommendation-query-api.md | 87 ++++++ .../feat-045-pipeline-integration.md | 64 +++++ ...eat-046-recommendations-demo-experience.md | 92 ++++++ 13 files changed, 1179 insertions(+), 6 deletions(-) create mode 100644 docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md create mode 100644 docs/roadmap/content-discovery/feat-037-video-content-vectorization.md create mode 100644 
docs/roadmap/content-discovery/feat-038-video-vectorization-data-audit.md create mode 100644 docs/roadmap/content-discovery/feat-039-chapter-based-scene-boundaries.md create mode 100644 docs/roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md create mode 100644 docs/roadmap/content-discovery/feat-041-scene-embeddings-table.md create mode 100644 docs/roadmap/content-discovery/feat-042-backfill-worker.md create mode 100644 docs/roadmap/content-discovery/feat-043-visual-shot-detection-fusion.md create mode 100644 docs/roadmap/content-discovery/feat-044-recommendation-query-api.md create mode 100644 docs/roadmap/content-discovery/feat-045-pipeline-integration.md create mode 100644 docs/roadmap/content-discovery/feat-046-recommendations-demo-experience.md diff --git a/docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md b/docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md new file mode 100644 index 00000000..3fb99fcb --- /dev/null +++ b/docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md @@ -0,0 +1,261 @@ +--- +date: 2026-04-02 +topic: video-content-vectorization +--- + +# Video Content Vectorization for Recommendations + +## Problem Frame + +JesusFilm has 50,000+ unique videos ranging from short clips to feature-length films, each available in up to 1,500 language variants. Current recommendations are purely metadata-driven — "you watched Film X, here it is in 1,500 other languages." There is no way to recommend thematically or visually similar content across different films. + +Existing transcript-based text embeddings (already built in the manager pipeline) capture _what was said_ but miss _what was shown_ — visual setting, actions, emotions, cinematography, and mood. A user watching a contemplative scene of someone walking by water should be recommended other reflective moments from entirely different films, not the same film dubbed in Swahili. 
+ +**Validation needed**: Before full investment, confirm that transcript-only embeddings do not already provide adequate cross-film similarity. A quick test (20-50 seed videos, manual evaluation of transcript embedding recommendations) should establish whether visual scene analysis adds meaningful lift. + +**Catalog composition unknown**: The 50K figure includes all video labels (featureFilm, shortFilm, segment, episode, collection, trailer, behindTheScenes). The ratio of feature-length films to short clips dramatically affects scene count, processing time, and cost. A data audit (see R0) is prerequisite to finalizing the approach. + +## Rollout Strategy + +**Phase 1 — English prototype (this scope)**: Process all English-language videos only. Prove recommendation quality, validate the pipeline, and establish cost baseline. This is the fundable proof of concept. + +**Phase 2 — Full catalog (future, funding-dependent)**: If Phase 1 demonstrates value, expand to all 50K+ videos across all languages. Phase 2 is explicitly out of scope for this requirements doc. + +All requirements below are scoped to Phase 1 (English videos only) unless stated otherwise. + +## Requirements + +- R0. **Data audit (prerequisite)**: Before committing to the pipeline, query the CMS to determine: (a) video count by label type and duration distribution for English-language videos, (b) how many have existing chapter/scene metadata from the enrichment pipeline, (c) whether the Video → VideoVariant model provides implicit deduplication or whether separate Video records exist for the same content in different languages. +- R1. **Scene segmentation**: Break videos into meaningful narrative scenes with precise start/end timestamps. + - R1a. **Transcript-based segmentation**: Extend the existing `chapters.ts` service output (which already produces titles, start/end timestamps, and summaries via LLM) as the baseline for scene boundaries. 
For short clips that are a single scene, chapter output may be sufficient without further segmentation. + - R1b. **Visual shot detection + fusion**: For feature-length films, augment transcript-based boundaries with visual shot detection to produce more accurate narrative scene boundaries. This is a research-heavy component — evaluate libraries and approaches during planning. +- R2. **Scene content description**: For each scene, generate a rich multimodal description capturing visual setting, objects, actions, characters, emotional tone, and mood by feeding representative frames + transcript to a multimodal LLM. Note: this requires a new multimodal LLM client — the existing OpenRouter `embeddings.ts` is text-only and cannot send images. +- R3. **Scene embedding and storage**: Embed each scene description using the existing text embedding pipeline (`text-embedding-3-small`, 1536 dims) and store in a **separate `scene_embeddings` table** in pgvector with full traceability back to source video and scene. +- R4. **Cross-film recommendation**: Given a scene or video, find visually and thematically similar scenes from _different_ films using vector similarity. Deduplication across language variants uses the Video → VideoVariant parent relationship (embed once per Video, not per variant). This scope includes the vector similarity query capability; the recommendation UI (how results are surfaced in web/mobile) is a separate feature. +- R5. **Backfill worker**: A dedicated worker service to process the English video catalog. Must be resumable/idempotent. Must include: + - Configurable batch size and rate limits + - Cost tracking per video and cumulative + - Automatic pause if cost exceeds a configurable threshold + - Dry-run mode that estimates cost without calling LLMs +- R6. **Incremental pipeline integration**: After backfill, scene vectorization becomes a required step in the existing manager enrichment workflow for new English video uploads. 
Note: unlike existing parallel steps (translate, chapters, metadata, embeddings) which all consume transcript text, scene vectorization needs video frame access via muxAssetId — it runs as an independent branch, not a simple addition to the existing parallel group. +- R7. **Existing scene metadata**: Where videos already have chapter output from the enrichment pipeline, use it as the starting point for segmentation rather than re-detecting from scratch. + +## Storage Schema + +Scene embeddings are stored in a dedicated pgvector table with full traceability to source video and scene boundaries: + +```sql +CREATE TABLE scene_embeddings ( + id SERIAL PRIMARY KEY, + + -- Traceability: which video and scene + video_id INTEGER NOT NULL, -- FK to Strapi video record + core_id TEXT, -- video.coreId for cross-reference + mux_asset_id TEXT NOT NULL, -- which Mux asset frames came from + playback_id TEXT NOT NULL, -- for Mux thumbnail URL construction + + -- Scene boundaries + scene_index INTEGER NOT NULL, -- 0-based order within the video + start_seconds FLOAT NOT NULL, + end_seconds FLOAT, -- NULL for final scene (extends to end) + + -- Content (for debugging, tracing, and quality review) + description TEXT NOT NULL, -- LLM-generated scene description + chapter_title TEXT, -- from chapters.ts if available + frame_count INTEGER, -- how many frames were sent to LLM + + -- The embedding + embedding vector(1536) NOT NULL, + model TEXT NOT NULL DEFAULT 'text-embedding-3-small', + + -- Phase tracking + language TEXT NOT NULL DEFAULT 'en', -- which language transcript was used + + -- Metadata + created_at TIMESTAMPTZ DEFAULT NOW(), + + -- Uniqueness: one embedding per scene per video + UNIQUE(video_id, scene_index) +); + +-- HNSW index for fast similarity search +CREATE INDEX scene_embeddings_hnsw + ON scene_embeddings USING hnsw (embedding vector_cosine_ops); + +-- Lookup by video (for "find scenes in this video" and deduplication) +CREATE INDEX scene_embeddings_video_id ON 
scene_embeddings(video_id); + +-- Phase filtering (English prototype vs full catalog) +CREATE INDEX scene_embeddings_language ON scene_embeddings(language); +``` + +**How to trace an embedding back to its source:** + +- `video_id` → Strapi Video record (title, slug, label, description) +- `video_id` → Video.variants → VideoVariant records (language-specific playback) +- `mux_asset_id` / `playback_id` → Mux asset (for re-extracting frames) +- `scene_index` + `start_seconds` / `end_seconds` → exact moment in the video +- `description` → what the LLM "saw" in this scene (stored for inspection) +- `chapter_title` → link to chapters.ts output if it was the scene source + +**Recommendation query pattern:** + +```sql +-- Find similar scenes from DIFFERENT videos +SELECT se.video_id, se.scene_index, se.description, se.start_seconds, + 1 - (se.embedding <=> $1) AS similarity +FROM scene_embeddings se +WHERE se.video_id != $2 -- exclude current video + AND se.language = 'en' -- Phase 1: English only +ORDER BY se.embedding <=> $1 +LIMIT 10; +``` + +**Why this schema:** + +- **Separate from `video_embeddings`** (feat-009): Different columns (timestamps, description) and different query patterns (scene similarity vs. transcript keyword search). Separate tables let feat-009 proceed as-is. +- **`video_id` as dedup key**: Language variants are VideoVariants under the same Video parent. Embedding once per Video and filtering by `video_id !=` gives implicit cross-variant deduplication. +- **`language` column**: Enables Phase 1 (English only) filtering and future Phase 2 expansion without schema changes. +- **`description` stored**: Enables quality review, debugging, and re-embedding with a different model without re-running the LLM. + +## Rough Cost Model + +**Phase 1 (English only) — order-of-magnitude estimates. Refine after R0 data audit.** + +English subset is likely a fraction of the 50K total. 
Assuming ~5K-10K English videos: + +- Short clips (~80%): 8K × 2 scenes = ~16K scene descriptions +- Feature films (~20%): 2K × 75 scenes = ~150K scene descriptions +- **Total: ~166K multimodal LLM calls** + +At Gemini 2.5 Flash pricing (~$0.15/1M input tokens, ~$0.60/1M output tokens): + +- Per scene: ~3 frames (thumbnails) + transcript chunk ≈ ~2K tokens input, ~500 tokens output +- **Total input: ~332M tokens → ~$50** +- **Total output: ~83M tokens → ~$50** +- **Embedding cost**: 166K × text-embedding-3-small ≈ ~$3 +- **Phase 1 rough total: ~$100-$300** + +**Full catalog estimate (Phase 2, for future funding request):** + +- ~830K scene descriptions → ~$500-$1,500 + +Compare: Twelve Labs Embed at ~$0.03/min × estimated 500K+ total minutes = **$15K+** + +## Success Criteria + +- Recommendations surface genuinely different films/clips based on visual and thematic similarity, not just metadata overlap +- **Measurable quality bar**: Curate 50-100 seed videos with human-labeled "expected similar" results. Scene embeddings must surface at least 3 relevant cross-film results in top 10 for 80% of seed videos, outperforming transcript-only embeddings on the same evaluation set. +- Feature-length films are segmented into meaningful narrative scenes (not raw shot cuts) +- The backfill worker can process the English catalog without manual intervention (resumable on failure, cost-capped) +- New English uploads are automatically scene-vectorized as part of the enrichment pipeline +- Language variants of the same content are deduplicated in recommendation results +- **Phase gate**: Phase 1 results are evaluated before requesting Phase 2 funding + +## Scope Boundaries + +- **Phase 1 only**: English-language videos. Other languages are Phase 2, out of scope. +- **Not building a user-facing search UI** — this is the recommendation engine layer. Search (feat-010) is a separate concern. +- **Not replacing transcript embeddings** — scene embeddings complement them. 
Both live in pgvector in separate tables. +- **Hybrid approach**: Start with LLM-generated scene descriptions embedded as text vectors (ships faster, reuses existing infra). Native video embedding models (Twelve Labs, Gemini video embeddings) are a future upgrade path, not in scope now. +- **Not building the recommendation UI** — this provides the vector similarity query capability. How recommendations are surfaced in web/mobile is a separate feature. + +## Key Decisions + +- **English-first phased rollout**: Prototype with all English videos (~$100-$300 estimated cost). Prove value before investing in full 50K+ catalog. Phase 2 is a separate funding decision. +- **LLM descriptions over native video embeddings**: At scale, native video embedding APIs (Twelve Labs at ~$15K+) are 10-30x more expensive than LLM scene descriptions (~$500-$1,500 full catalog). LLM descriptions reuse existing infrastructure (text-embedding-3-small + pgvector) and provide good quality. Can upgrade selectively later. +- **Scene-level granularity**: Embeddings are per-scene, not per-frame or per-video. Short clips may be 1-3 scenes; feature films 50-200. This is the right unit for recommendations. +- **Build on existing chapters pipeline**: The `chapters.ts` service already produces transcript-based scene segmentation with timestamps. R1 extends this with visual shot detection for feature films rather than building scene detection from scratch. +- **Separate `scene_embeddings` table**: Scene embeddings have different columns (start/end timestamps, description text) and query patterns than transcript chunk embeddings. Separate tables let feat-009 proceed as-is and keep query logic clean. Resolve before feat-009 starts Apr 7. +- **Hybrid storage: pgvector + lightweight metadata**: Scene data lives in the `scene_embeddings` table with full traceability columns (video_id, mux_asset_id, timestamps, description) rather than as a Strapi content type. 
Keeps it lean for prototype; can promote to CMS entity later if human-in-the-loop editing is needed. +- **Backfill worker separate from manager**: The one-time catalog processing runs as a dedicated worker service (can scale independently, doesn't block the manager pipeline). Can reuse the same workflow code/libraries. New uploads use the integrated manager pipeline step. +- **Deduplication via Video → VideoVariant model**: Scene detection and embedding runs once per Video entity (the parent), not per VideoVariant. Recommendations filter by unique Video ID. Confirm during data audit (R0) that language variants are modeled as VideoVariants, not separate Video records. + +## Dependencies / Assumptions + +- **pgvector must be deployed first** (feat-009, scheduled Apr 7, 14-day duration → ~Apr 21) — R3, R4, R6 are blocked. R0, R1, R2, R5 scaffolding can proceed in parallel. +- **Existing chapters pipeline** in manager is working and produces scene-like segmentation +- **Mux thumbnail API** provides frame extraction at specific timestamps via `image.mux.com/{PLAYBACK_ID}/thumbnail.jpg?time=N` — confirm during planning +- **New multimodal LLM client needed** — existing OpenRouter client is text-only; R2 requires sending images alongside text +- **Railway worker constraints** — need to confirm Railway supports long-lived worker processes or design backfill as queue-based with short-lived jobs. Existing `railway.toml` has `restartPolicyMaxRetries: 3` which may not suit multi-day processing. + +## Outstanding Questions + +### Deferred to Planning + +- [Affects R0][Data audit] Query CMS for English video count by label, duration distribution, and chapter metadata coverage. This gates the pipeline sizing. +- [Affects R1b][Needs research] Which visual scene detection libraries work best for narrative film content? PySceneDetect handles shot boundaries; evaluate options for combining with transcript-based scene detection. 
+- [Affects R2][Needs research] Which multimodal LLM gives best scene descriptions for the cost? Gemini 2.5 Flash vs GPT-4o vs others — benchmark quality and pricing at scale. +- [Affects R2][Technical] How many representative frames per scene should be sampled for description? 1 keyframe vs 3-5 frames affects description quality and API cost. +- [Affects R5][Technical] Backfill worker architecture — queue-based (process videos from a job queue) or single long-lived process? Depends on Railway constraints. +- [Affects R5][Needs research] Confirm Mux thumbnail API works for arbitrary timestamps and returns sufficient resolution for multimodal LLM input. +- [Affects R4][Technical] How will scene similarity interact with feat-010 semantic search API? Different query pattern (find similar scenes vs. keyword search). + +## Visual Embedding Technology Research + +**Researched Apr 2, 2026. Use to inform feat-040 (scene descriptions) model selection.** + +### Approach Comparison + +| Approach | Est. Cost (50K videos) | Quality | Infra Complexity | +| ------------------------------------------ | ---------------------- | ----------------------------- | ------------------------- | +| **Gemini 2.5 Flash describe + text-embed** | **$150-300** | **High (narrative + visual)** | **Low (reuses existing)** | +| Gemini Embedding 2 (direct video embed) | $2,000-5,000 | High (native multimodal) | Medium (new index) | +| Twelve Labs Embed (Marengo 3.0) | $10,000+ | Highest (purpose-built) | Medium (new index) | +| CLIP/SigLIP local | ~$0 (compute only) | Medium (visual only) | Medium (new index + GPU) | +| GPT-4o describe + text-embed | $1,200-2,400 | High | Low | + +### Recommended: Gemini 2.5 Flash "Describe then Embed" + +- **Image input**: Accepts multiple images + text per request. ~1,290 tokens per image ≈ $0.000039/image. +- **At 3 frames/scene × 166K scenes (English)**: ~$58 in image tokens + ~$50 output tokens = **~$100-$300 total**. 
+- **Quality**: Strong at visual description, emotional tone, settings, actions. Best cost/quality ratio by a wide margin. +- **Why not GPT-4o**: 8x more expensive ($2.50/1M input vs $0.30/1M). Comparable quality. +- **Why not Claude**: Haiku is 3-4x more expensive, Sonnet 10x. Not justified at scale for scene description. + +### Why Not CLIP/SigLIP Directly? + +CLIP/SigLIP produce embeddings directly from images (512-1152 dims) in a shared text-image space. Strengths: zero marginal cost, text-to-image search works. But: + +- Embeddings capture "what's in this image" not narrative meaning. Will find "beach scene" but miss "baptism at a river" vs "family swimming at a lake." +- **Incompatible vector space** with text-embedding-3-small — cannot mix in the same pgvector index. +- For ministry content requiring semantic nuance, CLIP alone is insufficient. + +### Future Upgrade Path: Gemini Embedding 2 + +Google's multimodal embedding model (public preview, Mar 2026): + +- 3072 dims (Matryoshka down to 768). Can target 1536 to match existing space. +- Accepts text, image, video, audio in one unified embedding space. +- **Video constraint**: max 80-120 seconds per clip → fits our scene-based approach. +- Pricing: ~$0.00079/frame. At 1fps for 60s scenes ≈ $0.047/scene. +- **When to adopt**: Once out of preview and pricing stabilizes. Store as a second signal in a separate column, combine scores at query time. + +### Mux Thumbnail API (Confirmed) + +- **URL**: `https://image.mux.com/{PLAYBACK_ID}/thumbnail.{png|jpg|webp}?time={SECONDS}` +- **Resolution**: Defaults to original video resolution. Supports `?width=512&height=512` for LLM-friendly sizes. +- **Rate limit**: 1 unique thumbnail per 10 seconds of video duration per asset. A 60-min film supports 360 thumbnails — plenty for 3 frames × 20 scenes. +- **Cost**: Included in Mux standard pricing. No per-thumbnail charge. +- **CDN cached**: Repeated requests for the same timestamp are free. 
+ +## Roadmap Tickets + +This brainstorm produced the following roadmap features in `docs/roadmap/content-discovery/`: + +| ID | Feature | Days | Start | Depends on | +| ------------------------------------------------------------------------------------ | ----------------------------------- | ---- | ------ | ---------------------------- | +| [feat-037](../roadmap/content-discovery/feat-037-video-content-vectorization.md) | Parent: Video Content Vectorization | 42 | Apr 21 | feat-009, feat-031 | +| [feat-038](../roadmap/content-discovery/feat-038-video-vectorization-data-audit.md) | Data Audit | 3 | Apr 21 | feat-037 | +| [feat-039](../roadmap/content-discovery/feat-039-chapter-based-scene-boundaries.md) | Chapter-Based Scene Boundaries | 7 | Apr 24 | feat-038 | +| [feat-040](../roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md) | Multimodal Scene Descriptions | 10 | May 1 | feat-039 | +| [feat-041](../roadmap/content-discovery/feat-041-scene-embeddings-table.md) | Scene Embeddings Table + Indexing | 7 | May 11 | feat-009, feat-040 | +| [feat-042](../roadmap/content-discovery/feat-042-backfill-worker.md) | English Backfill Worker | 10 | May 18 | feat-038, feat-040, feat-041 | +| [feat-043](../roadmap/content-discovery/feat-043-visual-shot-detection-fusion.md) | Visual Shot Detection Fusion (P2) | 10 | May 28 | feat-039 | +| [feat-044](../roadmap/content-discovery/feat-044-recommendation-query-api.md) | Recommendation Query API | 7 | May 28 | feat-041, feat-042 | +| [feat-045](../roadmap/content-discovery/feat-045-pipeline-integration.md) | Pipeline Integration | 7 | Jun 4 | feat-041, feat-042 | +| [feat-046](../roadmap/content-discovery/feat-046-recommendations-demo-experience.md) | Recommendations Demo Experience | 7 | Jun 4 | feat-044 | + +## Next Steps + +→ `/ce:plan` for structured implementation planning (R0 data audit is first planning task). 
diff --git a/docs/roadmap/README.md b/docs/roadmap/README.md index 3a72c5c4..3ff3813e 100644 --- a/docs/roadmap/README.md +++ b/docs/roadmap/README.md @@ -30,12 +30,22 @@ Build trusted, scalable AI capabilities that help people discover gospel content ### Content Discovery -| ID | Feature | Owner | Priority | Start | Days | Status | -| --------------------------------------------------------------------- | ------------------------------------- | ----- | -------- | ------ | ---- | ----------- | -| [feat-009](content-discovery/feat-009-pgvector-embedding-indexing.md) | pgvector Setup and Embedding Indexing | nisal | P0 | Apr 7 | 14 | not-started | -| [feat-010](content-discovery/feat-010-semantic-search-api.md) | Semantic Search API | nisal | P0 | Apr 14 | 21 | not-started | -| [feat-011](content-discovery/feat-011-search-ui-web.md) | Search UI — Web | urim | P0 | Apr 14 | 21 | not-started | -| [feat-012](content-discovery/feat-012-search-ui-mobile.md) | Search UI — Mobile | urim | P0 | Apr 14 | 21 | not-started | +| ID | Feature | Owner | Priority | Start | Days | Status | +| ------------------------------------------------------------------------- | ------------------------------------- | ----- | -------- | ------ | ---- | ----------- | +| [feat-009](content-discovery/feat-009-pgvector-embedding-indexing.md) | pgvector Setup and Embedding Indexing | nisal | P0 | Apr 7 | 14 | not-started | +| [feat-010](content-discovery/feat-010-semantic-search-api.md) | Semantic Search API | nisal | P0 | Apr 14 | 21 | not-started | +| [feat-011](content-discovery/feat-011-search-ui-web.md) | Search UI — Web | urim | P0 | Apr 14 | 21 | not-started | +| [feat-012](content-discovery/feat-012-search-ui-mobile.md) | Search UI — Mobile | urim | P0 | Apr 14 | 21 | not-started | +| [feat-037](content-discovery/feat-037-video-content-vectorization.md) | Video Content Vectorization for Recs | nisal | P1 | Apr 21 | 42 | not-started | +| 
[feat-038](content-discovery/feat-038-video-vectorization-data-audit.md) | Vectorization — Data Audit | nisal | P1 | Apr 21 | 3 | not-started | +| [feat-039](content-discovery/feat-039-chapter-based-scene-boundaries.md) | Vectorization — Scene Boundaries | nisal | P1 | Apr 24 | 7 | not-started | +| [feat-040](content-discovery/feat-040-multimodal-scene-descriptions.md) | Vectorization — Scene Descriptions | nisal | P1 | May 1 | 10 | not-started | +| [feat-041](content-discovery/feat-041-scene-embeddings-table.md) | Vectorization — Embeddings Table | nisal | P1 | May 11 | 7 | not-started | +| [feat-042](content-discovery/feat-042-backfill-worker.md) | Vectorization — English Backfill | nisal | P1 | May 18 | 10 | not-started | +| [feat-043](content-discovery/feat-043-visual-shot-detection-fusion.md) | Vectorization — Visual Shot Fusion | nisal | P2 | May 28 | 10 | not-started | +| [feat-044](content-discovery/feat-044-recommendation-query-api.md) | Vectorization — Recommendation API | nisal | P1 | May 28 | 7 | not-started | +| [feat-045](content-discovery/feat-045-pipeline-integration.md) | Vectorization — Pipeline Integration | nisal | P1 | Jun 4 | 7 | not-started | +| [feat-046](content-discovery/feat-046-recommendations-demo-experience.md) | Vectorization — Recommendations Demo | nisal | P1 | Jun 4 | 7 | not-started | ### Topic Experiences diff --git a/docs/roadmap/content-discovery/feat-009-pgvector-embedding-indexing.md b/docs/roadmap/content-discovery/feat-009-pgvector-embedding-indexing.md index 541111de..09c4640d 100644 --- a/docs/roadmap/content-discovery/feat-009-pgvector-embedding-indexing.md +++ b/docs/roadmap/content-discovery/feat-009-pgvector-embedding-indexing.md @@ -10,6 +10,7 @@ depends_on: - "feat-002" blocks: - "feat-010" + - "feat-037" tags: - "cms" - "pgvector" diff --git a/docs/roadmap/content-discovery/feat-037-video-content-vectorization.md b/docs/roadmap/content-discovery/feat-037-video-content-vectorization.md new file mode 100644 index 
00000000..70901205 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-037-video-content-vectorization.md @@ -0,0 +1,215 @@ +--- +id: "feat-037" +title: "Video Content Vectorization for Recommendations" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-04-21" +duration: 42 +depends_on: + - "feat-009" + - "feat-031" +blocks: + - "feat-038" +tags: + - "cms" + - "pgvector" + - "ai-pipeline" + - "search" + - "manager" +--- + +## Problem + +Current recommendations are metadata-driven — "you watched Film X, here it is in 1,500 other languages." Transcript embeddings (feat-009/010) capture what was said, but miss what was shown. Visual scene embeddings enable cross-film recommendations based on visual setting, actions, emotional tone, and mood. + +**Phase 1 (this feature)**: All English-language videos. Prove recommendation quality at ~$100-$300 estimated cost. Phase 2 (full 50K+ catalog) is a separate funding decision. + +## Entry Points — Read These First + +1. `apps/manager/src/services/chapters.ts` — existing scene-like segmentation: `Chapter { title, startSeconds, endSeconds, summary }`. This is the baseline for R1a. +2. `apps/manager/src/services/embeddings.ts` — existing text embedding pipeline using `text-embedding-3-small` (1536 dims). Scene descriptions will be embedded through the same model. +3. `apps/manager/src/workflows/videoEnrichment.ts` — enrichment workflow with parallel steps. R6 adds scene vectorization as a new branch. +4. `apps/manager/src/services/storage.ts` — S3 artifact storage pattern (`{assetId}/{type}.json`). +5. `apps/cms/src/api/video/content-types/video/schema.json` — Video content type with `coreId`, `label` enum, `variants` relation. +6. `apps/cms/src/api/video-variant/content-types/video-variant/schema.json` — VideoVariant with `language` and `muxVideo` relations. +7. `apps/cms/src/api/mux-video/content-types/mux-video/schema.json` — MuxVideo with `assetId` and `playbackId` for frame extraction. +8. 
`docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md` — full requirements doc with storage schema, cost model, and rollout strategy. + +## Grep These + +- `chapters` in `apps/manager/src/` — existing chapter/scene segmentation +- `getOpenrouter` in `apps/manager/src/` — AI model client (text-only; needs multimodal extension) +- `text-embedding-3-small` in `apps/manager/src/` — embedding model +- `strapi.db.connection.raw` in `apps/cms/src/` — raw SQL patterns for pgvector +- `muxAssetId` in `apps/manager/src/` — Mux asset references for frame extraction +- `playbackId` in `apps/cms/src/` — Mux playback IDs for thumbnail URLs +- `label` in `apps/cms/src/api/video/` — video type enum (featureFilm, shortFilm, etc.) + +## What To Build + +### R0. Data Audit (first task) + +Query CMS to determine English video landscape: + +```sql +-- Video count by label type +SELECT label, COUNT(*) FROM videos GROUP BY label; + +-- Duration distribution +SELECT label, + COUNT(*) as count, + AVG(duration) as avg_duration, + MAX(duration) as max_duration +FROM videos v +JOIN video_variants vv ON vv.video_id = v.id +JOIN languages l ON vv.language_id = l.id +WHERE l.bcp47 = 'en' +GROUP BY label; + +-- Chapter metadata coverage +SELECT COUNT(DISTINCT ej.mux_asset_id) +FROM enrichment_jobs ej +WHERE ej.step_statuses->>'chapters' = 'completed'; +``` + +### R1. 
Scene Segmentation + +**R1a — Transcript-based (extend chapters.ts)**: + +- For each English video, use existing chapter output as scene boundaries +- Short clips (single chapter) → treat as one scene +- Store chapter boundaries as scene candidates + +**R1b — Visual fusion (feature films only)**: + +- Extract frames at chapter boundaries using Mux thumbnail API: `https://image.mux.com/{PLAYBACK_ID}/thumbnail.jpg?time={SECONDS}` +- Feed frame sequences + transcript to multimodal LLM to refine/merge chapter boundaries into narrative scenes +- Research: evaluate PySceneDetect for shot boundary detection to augment chapter-based boundaries + +### R2. Scene Content Description + +New service: `apps/manager/src/services/sceneDescription.ts` + +```typescript +type SceneDescription = { + sceneIndex: number + startSeconds: number + endSeconds: number | null + description: string // LLM-generated rich description + chapterTitle: string | null + frameCount: number +} + +export async function describeScene( + playbackId: string, + startSeconds: number, + endSeconds: number | null, + transcript: string, + chapterTitle: string | null, +): Promise<SceneDescription> +``` + +- Extract 3 representative frames via Mux thumbnail API at scene start, midpoint, and end +- Send frames + transcript chunk to multimodal LLM (Gemini 2.5 Flash via OpenRouter or direct API) +- Prompt: describe visual setting, objects, actions, characters, emotional tone, mood +- **Requires new multimodal client** — existing OpenRouter client is text-only + +### R3.
Scene Embedding + Storage + +Create `scene_embeddings` table via bootstrap SQL (same pattern as feat-009): + +```sql +CREATE TABLE IF NOT EXISTS scene_embeddings ( + id SERIAL PRIMARY KEY, + video_id INTEGER NOT NULL, + core_id TEXT, + mux_asset_id TEXT NOT NULL, + playback_id TEXT NOT NULL, + scene_index INTEGER NOT NULL, + start_seconds FLOAT NOT NULL, + end_seconds FLOAT, + description TEXT NOT NULL, + chapter_title TEXT, + frame_count INTEGER, + embedding vector(1536) NOT NULL, + model TEXT NOT NULL DEFAULT 'text-embedding-3-small', + language TEXT NOT NULL DEFAULT 'en', + created_at TIMESTAMPTZ DEFAULT NOW(), + UNIQUE(video_id, scene_index) +); + +CREATE INDEX IF NOT EXISTS scene_embeddings_hnsw + ON scene_embeddings USING hnsw (embedding vector_cosine_ops); +CREATE INDEX IF NOT EXISTS scene_embeddings_video_id + ON scene_embeddings(video_id); +CREATE INDEX IF NOT EXISTS scene_embeddings_language + ON scene_embeddings(language); +``` + +Indexing service: `apps/cms/src/api/scene-embedding/services/indexer.ts` + +```typescript +export async function indexSceneEmbeddings( + videoId: number, + scenes: SceneDescription[], + embeddings: number[][], + meta: { + coreId: string + muxAssetId: string + playbackId: string + language: string + }, +): Promise<{ scenesIndexed: number }> +``` + +### R4. Cross-film Recommendation Query + +```sql +SELECT se.video_id, se.scene_index, se.description, se.start_seconds, + 1 - (se.embedding <=> $1) AS similarity +FROM scene_embeddings se +WHERE se.video_id != $2 + AND se.language = 'en' +ORDER BY se.embedding <=> $1 +LIMIT 10; +``` + +Expose as CMS service or API endpoint for web/mobile consumption. + +### R5. 
Backfill Worker + +Dedicated Railway service (or separate entry point in manager) for one-time English catalog processing: + +- Queue-based: iterate English videos, process each through R1 → R2 → R3 +- Resumable: track processed video IDs, skip on restart +- Cost controls: configurable batch size, rate limits, cost tracking per video, auto-pause at threshold +- Dry-run mode: estimate cost without LLM calls + +### R6. Pipeline Integration + +Add scene vectorization to `videoEnrichment.ts` as an independent branch: + +- Runs after transcription completes (needs transcript) +- Also needs muxAssetId/playbackId (for frames) — different input than other parallel steps +- Triggers R1a → R2 → R3 for the new video + +## Constraints + +- **English only** — filter by language in all queries and processing. `language` column enables future expansion. +- **Separate table from `video_embeddings`** — different columns, different query patterns. Do not extend feat-009's table. +- **Do NOT use a Strapi content type** for scene embeddings — pgvector columns don't work with Strapi ORM. Use raw SQL (same pattern as feat-009). +- **Embed once per Video, not per VideoVariant** — language variants share visual content. Dedup by `video_id`. +- **Cost cap** — backfill worker must auto-pause if cumulative cost exceeds configurable threshold. +- **Mux thumbnail API** for frame extraction — do not download full videos. Confirm API supports arbitrary timestamps during planning. + +## Verification + +1. **Data audit complete**: know English video count by label, duration distribution, chapter coverage +2. **Scene segmentation**: sample 10 feature films, verify scene boundaries align with narrative scenes (not just shot cuts) +3. **Scene descriptions**: sample 20 scenes, verify descriptions capture visual content, not just transcript paraphrasing +4. **Embeddings indexed**: `SELECT COUNT(*) FROM scene_embeddings WHERE language = 'en'` matches expected scene count +5. 
**Recommendation quality**: for 50 seed videos, top-10 similar scenes include at least 3 relevant cross-film results for 80% of seeds +6. **Deduplication**: recommendations never surface the same video (different variant) as the input +7. **Cost tracking**: backfill worker logs cumulative cost, stays within budget +8. **Pipeline integration**: upload a new English video → scene embeddings appear in `scene_embeddings` table automatically diff --git a/docs/roadmap/content-discovery/feat-038-video-vectorization-data-audit.md b/docs/roadmap/content-discovery/feat-038-video-vectorization-data-audit.md new file mode 100644 index 00000000..c354dade --- /dev/null +++ b/docs/roadmap/content-discovery/feat-038-video-vectorization-data-audit.md @@ -0,0 +1,83 @@ +--- +id: "feat-038" +title: "Video Vectorization — Data Audit" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-04-21" +duration: 3 +depends_on: + - "feat-037" +blocks: + - "feat-039" + - "feat-042" +tags: + - "cms" + - "pgvector" +--- + +## Problem + +Before building the scene vectorization pipeline, we need to know the shape of the English video catalog: how many videos by type, duration distribution, and existing chapter coverage. This gates all downstream sizing, cost estimates, and architecture decisions. + +## Entry Points — Read These First + +1. `apps/cms/src/api/video/content-types/video/schema.json` — Video schema with `label` enum +2. `apps/cms/src/api/video-variant/content-types/video-variant/schema.json` — VideoVariant with language relation +3. `apps/cms/src/api/enrichment-job/content-types/enrichment-job/schema.json` — tracks chapter completion status +4. 
`docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md` — R0 requirements + +## Grep These + +- `label` in `apps/cms/src/api/video/` — video type enum values +- `bcp47` in `apps/cms/src/` — language code field for filtering English + +## What To Build + +Run diagnostic queries against the CMS database: + +```sql +-- English video count by label +SELECT v.label, COUNT(*) as count +FROM videos v +JOIN video_variants vv ON vv.video_id = v.id +JOIN languages l ON vv.language_id = l.id +WHERE l.bcp47 = 'en' +GROUP BY v.label ORDER BY count DESC; + +-- Duration distribution for English videos +SELECT v.label, + COUNT(*) as count, + ROUND(AVG(vv.duration)) as avg_duration_sec, + MAX(vv.duration) as max_duration_sec +FROM videos v +JOIN video_variants vv ON vv.video_id = v.id +JOIN languages l ON vv.language_id = l.id +WHERE l.bcp47 = 'en' +GROUP BY v.label; + +-- Chapter metadata coverage +SELECT COUNT(DISTINCT ej.mux_asset_id) +FROM enrichment_jobs ej +WHERE ej.step_statuses->>'chapters' = 'completed'; + +-- Confirm Video → VideoVariant dedup model +SELECT v.id, COUNT(vv.id) as variant_count +FROM videos v +JOIN video_variants vv ON vv.video_id = v.id +GROUP BY v.id ORDER BY variant_count DESC LIMIT 10; +``` + +Deliverable: update the brainstorm doc cost model with actual numbers. Confirm or revise the ~$100-$300 Phase 1 estimate. 
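+
+The audit numbers feed the cost model directly. As a minimal sketch (all rates, token counts, and the function name are placeholder assumptions to be replaced with real numbers from the queries above, not measured values), the Phase 1 estimate reduces to:
+
+```typescript
+// Rough backfill cost estimator. Every input is an assumption until the
+// audit queries above supply real values.
+type CostInputs = {
+  videoCount: number          // English videos from the label-count query
+  avgScenesPerVideo: number   // from chapter coverage / duration distribution
+  llmTokensPerScene: number   // frames + transcript prompt + description output
+  llmCostPerMTok: number      // USD per 1M tokens for the multimodal model
+  embedTokensPerScene: number // description text sent to text-embedding-3-small
+  embedCostPerMTok: number    // USD per 1M embedding tokens
+}
+
+export function estimateBackfillCost(c: CostInputs): number {
+  const scenes = c.videoCount * c.avgScenesPerVideo
+  const llmCost = ((scenes * c.llmTokensPerScene) / 1_000_000) * c.llmCostPerMTok
+  const embedCost = ((scenes * c.embedTokensPerScene) / 1_000_000) * c.embedCostPerMTok
+  return Number((llmCost + embedCost).toFixed(2))
+}
+```
+
+Running this with the audit's actual counts makes it obvious whether the ~$100-$300 estimate holds or needs revising — embedding cost is typically negligible next to the multimodal LLM cost.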
+ +## Constraints + +- Read-only queries — do not modify production data +- Use `strapi.db.connection.raw()` pattern or direct DB access + +## Verification + +- Know exact English video count by label type +- Know duration distribution (what % are short clips vs feature films) +- Know chapter coverage (what % already have scene-like metadata) +- Cost model in brainstorm doc updated with real numbers diff --git a/docs/roadmap/content-discovery/feat-039-chapter-based-scene-boundaries.md b/docs/roadmap/content-discovery/feat-039-chapter-based-scene-boundaries.md new file mode 100644 index 00000000..b1c26746 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-039-chapter-based-scene-boundaries.md @@ -0,0 +1,66 @@ +--- +id: "feat-039" +title: "Video Vectorization — Chapter-Based Scene Boundaries" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-04-24" +duration: 7 +depends_on: + - "feat-038" +blocks: + - "feat-040" +tags: + - "manager" + - "ai-pipeline" +--- + +## Problem + +The existing `chapters.ts` service produces transcript-based scene segmentation (title, startSeconds, endSeconds, summary). This output needs to be formalized as "scene boundaries" that downstream steps (description, embedding) consume. For short clips that are a single chapter, the chapter IS the scene. + +## Entry Points — Read These First + +1. `apps/manager/src/services/chapters.ts` — `Chapter { title, startSeconds, endSeconds, summary }` type and generation logic +2. `apps/manager/src/services/storage.ts` — artifact storage/retrieval pattern +3. 
`apps/manager/src/workflows/videoEnrichment.ts` — where the chapters step runs
+
+## Grep These
+
+- `Chapter` in `apps/manager/src/services/chapters.ts` — existing type definition
+- `chapters` in `apps/manager/src/workflows/` — how chapters are invoked
+
+## What To Build
+
+New service: `apps/manager/src/services/sceneBoundaries.ts`
+
+```typescript
+type SceneBoundary = {
+  sceneIndex: number
+  startSeconds: number
+  endSeconds: number | null
+  chapterTitle: string | null
+  transcriptChunk: string
+}
+
+export async function extractSceneBoundaries(
+  assetId: string,
+  chapters: Chapter[],
+  transcript: string,
+): Promise<SceneBoundary[]>
+```
+
+- Map each chapter to a SceneBoundary with its corresponding transcript chunk
+- Single-chapter videos → one scene
+- Store as `{assetId}/scene-boundaries.json` artifact
+
+## Constraints
+
+- Do not modify `chapters.ts` — consume its output, don't change it
+- Keep the SceneBoundary type simple — visual fusion (feat-043) will extend it later
+
+## Verification
+
+- Process 10 English videos with existing chapters → scene boundaries match chapter structure
+- Short clips produce 1-3 scenes, feature films produce 20-100+
+- Artifact stored successfully in S3
diff --git a/docs/roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md b/docs/roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md
new file mode 100644
index 00000000..4f0554d8
--- /dev/null
+++ b/docs/roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md
@@ -0,0 +1,83 @@
+---
+id: "feat-040"
+title: "Video Vectorization — Multimodal Scene Descriptions"
+owner: "nisal"
+priority: "P1"
+status: "not-started"
+start_date: "2026-05-01"
+duration: 10
+depends_on:
+  - "feat-039"
+blocks:
+  - "feat-041"
+  - "feat-042"
+tags:
+  - "manager"
+  - "ai-pipeline"
+---
+
+## Problem
+
+Each scene needs a rich description capturing visual setting, objects, actions, emotional tone, and mood.
This requires a new multimodal LLM client (the existing OpenRouter client is text-only) that can send video frames alongside transcript text.
+
+## Entry Points — Read These First
+
+1. `apps/manager/src/lib/openrouter.ts` — existing AI client (text-only)
+2. `apps/manager/src/services/chapters.ts` — example of LLM prompting pattern
+3. `apps/manager/src/services/sceneBoundaries.ts` — scene boundary input (from feat-039)
+4. `apps/cms/src/api/mux-video/content-types/mux-video/schema.json` — `playbackId` for Mux thumbnail URLs
+
+## Grep These
+
+- `getOpenrouter` in `apps/manager/src/` — existing AI client usage
+- `playbackId` in `apps/manager/src/` — Mux playback ID references
+
+## What To Build
+
+1. **Multimodal LLM client** — extend or add a client that supports sending images + text. Gemini 2.5 Flash recommended for cost/quality.
+
+2. **Frame extraction utility**:
+
+   ```typescript
+   export async function extractFrames(
+     playbackId: string,
+     timestamps: number[],
+   ): Promise<Buffer[]> // frame images (Buffer assumed; could also return URLs)
+   ```
+
+   Uses Mux thumbnail API: `https://image.mux.com/{PLAYBACK_ID}/thumbnail.jpg?time={SECONDS}`
+
+3.
**Scene description service**: `apps/manager/src/services/sceneDescription.ts`
+
+   ```typescript
+   type SceneDescription = {
+     sceneIndex: number
+     startSeconds: number
+     endSeconds: number | null
+     description: string
+     chapterTitle: string | null
+     frameCount: number
+   }
+
+   export async function describeScene(
+     playbackId: string,
+     boundary: SceneBoundary,
+   ): Promise<SceneDescription>
+   ```
+
+   - Extract 3 frames (start, mid, end of scene)
+   - Send frames + transcript chunk to multimodal LLM
+   - Prompt for: visual setting, objects, actions, characters, emotional tone, mood
+   - Store as `{assetId}/scene-descriptions.json` artifact
+
+## Constraints
+
+- Confirm Mux thumbnail API works for arbitrary timestamps and returns sufficient resolution
+- Rate limit LLM calls — respect provider limits
+- Log token usage per call for cost tracking
+
+## Verification
+
+- Sample 20 scenes: descriptions capture visual content, not just transcript paraphrasing
+- Mux thumbnail extraction works for timestamps throughout a video
+- Token usage logged accurately
diff --git a/docs/roadmap/content-discovery/feat-041-scene-embeddings-table.md b/docs/roadmap/content-discovery/feat-041-scene-embeddings-table.md
new file mode 100644
index 00000000..5a86639a
--- /dev/null
+++ b/docs/roadmap/content-discovery/feat-041-scene-embeddings-table.md
@@ -0,0 +1,82 @@
+---
+id: "feat-041"
+title: "Video Vectorization — Scene Embeddings Table + Indexing"
+owner: "nisal"
+priority: "P1"
+status: "not-started"
+start_date: "2026-05-11"
+duration: 7
+depends_on:
+  - "feat-009"
+  - "feat-040"
+blocks:
+  - "feat-042"
+  - "feat-044"
+tags:
+  - "cms"
+  - "pgvector"
+---
+
+## Problem
+
+Scene descriptions need to be embedded and stored in pgvector for similarity queries. This requires a new `scene_embeddings` table (separate from feat-009's `video_embeddings`) and an indexing service.
+
+## Entry Points — Read These First
+
+1.
`apps/cms/src/bootstrap.ts` or `apps/cms/src/index.ts` — where pgvector extension and tables are created (feat-009 pattern) +2. `apps/manager/src/services/embeddings.ts` — existing text embedding pipeline +3. `docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md` — full schema in Storage Schema section + +## Grep These + +- `video_embeddings` in `apps/cms/src/` — feat-009 table creation pattern to follow +- `strapi.db.connection.raw` in `apps/cms/src/` — raw SQL execution pattern + +## What To Build + +1. **Bootstrap SQL** — add to CMS bootstrap alongside feat-009's table: + + ```sql + CREATE TABLE IF NOT EXISTS scene_embeddings ( + id SERIAL PRIMARY KEY, + video_id INTEGER NOT NULL, + core_id TEXT, + mux_asset_id TEXT NOT NULL, + playback_id TEXT NOT NULL, + scene_index INTEGER NOT NULL, + start_seconds FLOAT NOT NULL, + end_seconds FLOAT, + description TEXT NOT NULL, + chapter_title TEXT, + frame_count INTEGER, + embedding vector(1536) NOT NULL, + model TEXT NOT NULL DEFAULT 'text-embedding-3-small', + language TEXT NOT NULL DEFAULT 'en', + created_at TIMESTAMPTZ DEFAULT NOW(), + UNIQUE(video_id, scene_index) + ); + + CREATE INDEX IF NOT EXISTS scene_embeddings_hnsw + ON scene_embeddings USING hnsw (embedding vector_cosine_ops); + CREATE INDEX IF NOT EXISTS scene_embeddings_video_id + ON scene_embeddings(video_id); + CREATE INDEX IF NOT EXISTS scene_embeddings_language + ON scene_embeddings(language); + ``` + +2. 
**Indexing service**: `apps/cms/src/api/scene-embedding/services/indexer.ts` + - Accept scene descriptions + embeddings + video metadata + - Upsert rows (delete existing for video_id + insert within transaction) + - Return count indexed + +## Constraints + +- Follow exact same pattern as feat-009 for raw SQL in Strapi +- HNSW index, not IVFFlat +- Table name may need adjustment based on Strapi's actual `videos` table name + +## Verification + +- `\d scene_embeddings` shows table with vector(1536) column +- Insert test data → HNSW index used in EXPLAIN ANALYZE of similarity query +- Upsert is idempotent — re-indexing same video replaces rows diff --git a/docs/roadmap/content-discovery/feat-042-backfill-worker.md b/docs/roadmap/content-discovery/feat-042-backfill-worker.md new file mode 100644 index 00000000..234b1c83 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-042-backfill-worker.md @@ -0,0 +1,66 @@ +--- +id: "feat-042" +title: "Video Vectorization — English Backfill Worker" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-05-18" +duration: 10 +depends_on: + - "feat-038" + - "feat-040" + - "feat-041" +blocks: + - "feat-044" +tags: + - "manager" + - "ai-pipeline" + - "infrastructure" +--- + +## Problem + +The full English video catalog needs to be processed through the scene vectorization pipeline (boundaries → descriptions → embeddings → indexing). This is a one-time batch job that must be resumable, cost-tracked, and safe to run against production. + +## Entry Points — Read These First + +1. `apps/manager/src/workflows/videoEnrichment.ts` — existing workflow pattern +2. `apps/manager/src/services/sceneBoundaries.ts` — scene boundary extraction (feat-039) +3. `apps/manager/src/services/sceneDescription.ts` — scene description generation (feat-040) +4. `apps/cms/src/api/scene-embedding/services/indexer.ts` — embedding indexer (feat-041) +5. 
`apps/manager/railway.toml` — Railway service configuration + +## Grep These + +- `restartPolicyType` in `apps/manager/` — Railway restart configuration +- `enrichment-job` in `apps/cms/src/api/` — job tracking pattern + +## What To Build + +Dedicated entry point (separate Railway service or manager CLI command) that: + +1. **Fetches English video queue** — all Videos with English variants, ordered by label (feature films first for early quality signal) +2. **Tracks progress** — store processed video IDs to resume on restart. Use enrichment job pattern or simple DB table. +3. **Per-video pipeline**: scene boundaries → scene descriptions → embed descriptions → index in pgvector +4. **Cost controls**: + - Configurable batch size (default: 100 videos per run) + - Rate limiting (requests per minute to LLM provider) + - Cumulative cost tracking (log tokens used, compute running total) + - Auto-pause at configurable cost threshold (default: $500) +5. **Dry-run mode** — process N videos through boundary extraction only, estimate total LLM cost without making calls +6. 
**Logging** — structured JSON logs: video ID, label, scene count, tokens used, cost, duration per video + +## Constraints + +- Must be resumable — crashing mid-batch loses no completed work +- Must not block the manager pipeline for new uploads +- Railway worker constraints: design as queue-based with configurable batch sizes rather than assuming infinite runtime +- English only: filter by language throughout + +## Verification + +- Dry-run mode reports accurate cost estimate for full English catalog +- Process 100 English videos end-to-end → embeddings appear in `scene_embeddings` +- Kill worker mid-batch, restart → picks up where it left off +- Cost tracking matches actual API billing within 10% +- `SELECT COUNT(*) FROM scene_embeddings WHERE language = 'en'` grows as expected diff --git a/docs/roadmap/content-discovery/feat-043-visual-shot-detection-fusion.md b/docs/roadmap/content-discovery/feat-043-visual-shot-detection-fusion.md new file mode 100644 index 00000000..7643ffe6 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-043-visual-shot-detection-fusion.md @@ -0,0 +1,63 @@ +--- +id: "feat-043" +title: "Video Vectorization — Visual Shot Detection Fusion" +owner: "nisal" +priority: "P2" +status: "not-started" +start_date: "2026-05-28" +duration: 10 +depends_on: + - "feat-039" +tags: + - "manager" + - "ai-pipeline" +--- + +## Problem + +Transcript-based chapter boundaries (feat-039) work well for short clips but may miss visual scene transitions in feature films where a narrative scene contains many camera cuts. Combining visual shot detection with transcript analysis produces more accurate scene boundaries for longer content. + +## Entry Points — Read These First + +1. `apps/manager/src/services/sceneBoundaries.ts` — existing chapter-based boundaries (feat-039) +2. 
`apps/manager/src/services/sceneDescription.ts` — consumer of scene boundaries (feat-040)
+
+## Grep These
+
+- `SceneBoundary` in `apps/manager/src/` — type to extend
+- `chapters` in `apps/manager/src/services/` — existing segmentation
+
+## What To Build
+
+1. **Research phase** — evaluate scene detection approaches:
+   - PySceneDetect (Python, may need microservice or WASM)
+   - Mux frame sampling + LLM-based scene change detection
+   - FFmpeg scene detection filter (`-vf "select=gt(scene\,0.3)"`)
+
+2. **Visual boundary detector**:
+
+   ```typescript
+   export async function detectVisualBoundaries(
+     playbackId: string,
+     duration: number,
+   ): Promise<number[]> // timestamps of visual scene changes
+   ```
+
+3. **Fusion logic** — merge visual boundaries with chapter-based boundaries:
+   - If visual and chapter boundaries align (within N seconds), keep the chapter boundary
+   - If a visual boundary exists between chapter boundaries, consider splitting
+   - Use LLM to decide: "given this transcript segment, does a scene change at timestamp T make narrative sense?"
+
+4.
**Update `extractSceneBoundaries`** to optionally use fusion for feature-length videos + +## Constraints + +- This is P2 — only needed if chapter-based boundaries prove insufficient for feature films +- Do not break existing chapter-based flow; fusion is an optional enhancement +- May require Python tooling (PySceneDetect) — evaluate Node.js alternatives first + +## Verification + +- Compare scene boundaries with and without fusion for 10 feature films +- Fusion boundaries align better with narrative scene changes (manual review) +- No regression for short clips (still use chapter-based only) diff --git a/docs/roadmap/content-discovery/feat-044-recommendation-query-api.md b/docs/roadmap/content-discovery/feat-044-recommendation-query-api.md new file mode 100644 index 00000000..2882f252 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-044-recommendation-query-api.md @@ -0,0 +1,87 @@ +--- +id: "feat-044" +title: "Video Vectorization — Recommendation Query API" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-05-28" +duration: 7 +depends_on: + - "feat-041" + - "feat-042" +blocks: + - "feat-046" +tags: + - "cms" + - "pgvector" + - "graphql" +--- + +## Problem + +With scene embeddings indexed, we need a queryable API that returns similar scenes from different videos. This is the core recommendation capability that the demo frontend (feat-046) and future recommendation UI will consume. + +## Entry Points — Read These First + +1. `apps/cms/src/api/scene-embedding/services/indexer.ts` — scene embedding storage (feat-041) +2. `apps/cms/src/api/core-sync/services/` — raw SQL patterns in Strapi services +3. 
`docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md` — recommendation query in Storage Schema section
+
+## Grep These
+
+- `strapi.db.connection.raw` in `apps/cms/src/` — raw SQL execution
+- `scene_embeddings` in `apps/cms/src/` — table references
+- `register` in `apps/cms/src/api/` — custom route/controller registration pattern
+
+## What To Build
+
+1. **Recommendation service**: `apps/cms/src/api/scene-embedding/services/recommender.ts`
+
+   ```typescript
+   type SceneRecommendation = {
+     videoId: number
+     sceneIndex: number
+     description: string
+     startSeconds: number
+     endSeconds: number | null
+     similarity: number // 0-1
+   }
+
+   export async function getRecommendations(
+     videoId: number,
+     sceneIndex?: number, // specific scene, or aggregate across all scenes
+     limit?: number, // default 10
+   ): Promise<SceneRecommendation[]>
+   ```
+
+2. **Query logic**:
+
+   ```sql
+   -- For a specific scene
+   SELECT se.video_id, se.scene_index, se.description, se.start_seconds, se.end_seconds,
+          1 - (se.embedding <=> $1) AS similarity
+   FROM scene_embeddings se
+   WHERE se.video_id != $2
+     AND se.language = 'en'
+   ORDER BY se.embedding <=> $1
+   LIMIT $3;
+   ```
+
+   For whole-video recommendations: average similarity across all scenes of the input video, or take the top scene match per candidate video.
+
+3. **Custom API route**: `GET /api/scene-embeddings/recommendations?videoId=X&sceneIndex=Y&limit=10`
+
+4.
**GraphQL integration** (if applicable): expose as custom query resolver + +## Constraints + +- Filter `video_id != input` to never recommend the same video +- English only for Phase 1 (`language = 'en'`) +- Response must include enough metadata (videoId, timestamps, description) for the frontend to render + +## Verification + +- Query with a known video → returns different videos with >0.5 similarity +- Never returns the input video in results +- Response time <500ms for top-10 query +- Results are plausibly similar (manual spot-check) diff --git a/docs/roadmap/content-discovery/feat-045-pipeline-integration.md b/docs/roadmap/content-discovery/feat-045-pipeline-integration.md new file mode 100644 index 00000000..0784fe66 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-045-pipeline-integration.md @@ -0,0 +1,64 @@ +--- +id: "feat-045" +title: "Video Vectorization — Pipeline Integration" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-06-04" +duration: 7 +depends_on: + - "feat-041" + - "feat-042" +tags: + - "manager" + - "ai-pipeline" +--- + +## Problem + +After backfill, new English video uploads need to be automatically scene-vectorized as part of the enrichment workflow. Unlike existing parallel steps that consume transcript text, scene vectorization needs video frame access — it's an independent branch. + +## Entry Points — Read These First + +1. `apps/manager/src/workflows/videoEnrichment.ts` — existing enrichment workflow with parallel steps +2. `apps/manager/src/services/sceneBoundaries.ts` — scene boundary extraction +3. `apps/manager/src/services/sceneDescription.ts` — scene description generation +4. 
`apps/cms/src/api/scene-embedding/services/indexer.ts` — embedding indexer + +## Grep These + +- `"use step"` in `apps/manager/src/workflows/` — workflow step pattern +- `transcribe` in `apps/manager/src/workflows/` — step dependency pattern +- `muxAssetId` in `apps/manager/src/workflows/` — where asset IDs are available + +## What To Build + +Add scene vectorization as a new branch in `videoEnrichment.ts`: + +``` +transcribe +├── [existing parallel] translate, chapters, metadata, embeddings +└── [new branch] sceneVectorize + ├── extractSceneBoundaries (needs transcript + chapters output) + ├── describeScenes (needs playbackId for frames + boundaries) + ├── embedDescriptions (needs descriptions) + └── indexSceneEmbeddings (needs embeddings + video metadata) +``` + +- Runs after both transcription AND chapters complete (needs both) +- Uses `muxAssetId` / `playbackId` from job context for frame extraction +- English-only gate: skip for non-English primary language videos +- Updates enrichment job status with `sceneVectorization` step tracking + +## Constraints + +- Do not block existing parallel steps — scene vectorization runs independently +- Failure in scene vectorization should not fail the overall enrichment job +- English-only check: skip step if video's primary language is not English + +## Verification + +- Upload a new English video → enrichment completes → scene embeddings appear in `scene_embeddings` +- Upload a non-English video → scene vectorization step is skipped +- Scene vectorization failure does not block transcript/translation/chapters from completing +- Enrichment job status shows sceneVectorization step status diff --git a/docs/roadmap/content-discovery/feat-046-recommendations-demo-experience.md b/docs/roadmap/content-discovery/feat-046-recommendations-demo-experience.md new file mode 100644 index 00000000..69aa82f5 --- /dev/null +++ b/docs/roadmap/content-discovery/feat-046-recommendations-demo-experience.md @@ -0,0 +1,92 @@ +--- +id: 
"feat-046" +title: "Video Vectorization — Recommendations Demo Experience" +owner: "nisal" +priority: "P1" +status: "not-started" +start_date: "2026-06-04" +duration: 7 +depends_on: + - "feat-044" +blocks: [] +tags: + - "web" + - "cms" + - "graphql" +--- + +## Problem + +We need a demo frontend to prove the recommendation engine works and to present results for Phase 2 funding decisions. This renders as an Experience on the existing `[slug]/[locale]` route, showing a video with its scene-similar recommendations from other films. + +## Entry Points — Read These First + +1. `apps/web/src/app/[slug]/[locale]/page.tsx` — experience page route (slug + locale) +2. `apps/web/src/app/[slug]/page.tsx` — experience page route (slug only) +3. `apps/web/src/components/sections/index.tsx` — `SectionRenderer` maps block `__typename` to components +4. `apps/web/src/lib/content.ts` — `getWatchExperience()` fetches experience data via GraphQL +5. `apps/cms/src/api/scene-embedding/services/recommender.ts` — recommendation query API (feat-044) + +## Grep These + +- `SectionRenderer` in `apps/web/src/components/` — block type mapping +- `__typename` in `apps/web/src/components/sections/` — how block types are resolved +- `getWatchExperience` in `apps/web/src/lib/` — experience data fetching +- `ExperienceSectionRenderer` in `apps/web/src/` — section rendering pipeline + +## What To Build + +### 1. CMS: Recommendations Block Type + +Add a new block type to the Experience content type in Strapi: + +- **Block name**: `ComponentBlocksVideoRecommendations` +- **Fields**: + - `sourceVideo` — relation to Video (the video to get recommendations for) + - `title` — text (e.g., "Scenes like this") + - `limit` — integer (default 10, max recommendations to show) + +### 2. GraphQL: Expose Recommendations + +Extend the Experience GraphQL query to include the new block type. The block fetches recommendations at render time via the recommendation API (feat-044). + +### 3. 
Web: Recommendations Section Component + +New component: `apps/web/src/components/sections/VideoRecommendations.tsx` + +```typescript +// Renders a grid/carousel of recommended scenes from other videos +// Each card shows: +// - Mux thumbnail at the scene's start timestamp +// - Scene description (truncated) +// - Source video title +// - Similarity score (optional, for demo purposes) +// - Click → navigates to that video at the scene timestamp +``` + +### 4. Register in SectionRenderer + +Add `ComponentBlocksVideoRecommendations` → `VideoRecommendations` mapping in `SectionRenderer`. + +### 5. Create Demo Experience in CMS + +Create an Experience with slug (e.g., `recommendations-demo`) containing: + +- A VideoHero block with a source video +- A VideoRecommendations block for that video +- Accessible at `/recommendations-demo/en` + +## Constraints + +- Use existing Experience / SectionRenderer pattern — do not create custom routes +- Thumbnails via Mux: `https://image.mux.com/{PLAYBACK_ID}/thumbnail.jpg?time={START_SECONDS}` +- Demo purpose: optimize for clarity and showcasing results, not production polish +- Server Component by default (Next.js App Router convention) + +## Verification + +- Navigate to `/recommendations-demo/en` → see source video + grid of recommended scenes +- Recommendations are from different videos (not the same film) +- Each recommendation card shows thumbnail, description, and source video title +- Clicking a recommendation navigates to the video (or plays from scene timestamp) +- Page loads in <3s with recommendations visible From 6f60c495b9b8122d7f3cd994246497de255a30f4 Mon Sep 17 00:00:00 2001 From: Kneesal Date: Thu, 2 Apr 2026 10:07:44 +0000 Subject: [PATCH 2/2] docs(roadmap): refine video vectorization approach - Phase 1: en/es/fr to verify locale dedup - Use video segments via Gemini, not still frames - Extract: felt needs/themes, bible verses, content, tone, demographics - Locale-aware queries, no cross-locale bleed - No human 
tags, pure vector similarity for Phase 1 - Cost model ~$600-$900 for video segment approach - Schema: themes[], bible_verses[], demographics[] Co-Authored-By: Claude Opus 4.6 (1M context) --- ...ideo-content-vectorization-requirements.md | 110 ++++++++++++------ .../feat-037-video-content-vectorization.md | 58 ++++++--- ...feat-038-video-vectorization-data-audit.md | 51 ++++++-- .../feat-040-multimodal-scene-descriptions.md | 85 ++++++++++---- .../feat-041-scene-embeddings-table.md | 6 +- .../feat-042-backfill-worker.md | 16 +-- .../feat-044-recommendation-query-api.md | 23 +++- .../feat-045-pipeline-integration.md | 10 +- ...eat-046-recommendations-demo-experience.md | 5 +- 9 files changed, 253 insertions(+), 111 deletions(-) diff --git a/docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md b/docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md index 3fb99fcb..74541036 100644 --- a/docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md +++ b/docs/brainstorms/2026-04-02-video-content-vectorization-requirements.md @@ -17,27 +17,41 @@ Existing transcript-based text embeddings (already built in the manager pipeline ## Rollout Strategy -**Phase 1 — English prototype (this scope)**: Process all English-language videos only. Prove recommendation quality, validate the pipeline, and establish cost baseline. This is the fundable proof of concept. +**Phase 1 — English / Spanish / French prototype (this scope)**: Process videos in three languages: English, Spanish, and French. Three languages are required to verify that recommendations never bleed across locales — a user watching in English must not be recommended the same film dubbed in Spanish. This also exercises the Video → VideoVariant deduplication model under real multilingual conditions. Prove recommendation quality, validate the pipeline, and establish cost baseline. This is the fundable proof of concept. 
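+
+The locale-isolation requirement above is concrete enough to sketch. A minimal post-filter (the type shape and function name are illustrative assumptions mirroring the `scene_embeddings` columns, not an existing API) shows the two guards every recommendation path must enforce — never the source video in any variant, never a different locale:
+
+```typescript
+type Candidate = { videoId: number; language: string; similarity: number }
+
+// Locale-aware guard: drop the source video itself (any language variant
+// shares the same videoId) and drop cross-locale candidates, then rank
+// by similarity. This mirrors what the SQL enforces with
+// `video_id != $2 AND language = $3`.
+export function filterRecommendations(
+  sourceVideoId: number,
+  locale: string,
+  candidates: Candidate[],
+): Candidate[] {
+  return candidates
+    .filter((c) => c.videoId !== sourceVideoId && c.language === locale)
+    .sort((a, b) => b.similarity - a.similarity)
+}
+```
+
+Verifying this behavior across English, Spanish, and French seeds is exactly what the three-language prototype exists to do.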
**Phase 2 — Full catalog (future, funding-dependent)**: If Phase 1 demonstrates value, expand to all 50K+ videos across all languages. Phase 2 is explicitly out of scope for this requirements doc. -All requirements below are scoped to Phase 1 (English videos only) unless stated otherwise. +All requirements below are scoped to Phase 1 (English, Spanish, French) unless stated otherwise. ## Requirements -- R0. **Data audit (prerequisite)**: Before committing to the pipeline, query the CMS to determine: (a) video count by label type and duration distribution for English-language videos, (b) how many have existing chapter/scene metadata from the enrichment pipeline, (c) whether the Video → VideoVariant model provides implicit deduplication or whether separate Video records exist for the same content in different languages. +- R0. **Data audit (prerequisite)**: Before committing to the pipeline, query the CMS to determine: (a) video count by label type and duration distribution for English, Spanish, and French videos, (b) how many have existing chapter/scene metadata from the enrichment pipeline, (c) whether the Video → VideoVariant model provides implicit deduplication or whether separate Video records exist for the same content in different languages. **Critical**: confirm that the same film in English, Spanish, and French share a single Video parent with separate VideoVariant records — if not, the dedup strategy must be revised. - R1. **Scene segmentation**: Break videos into meaningful narrative scenes with precise start/end timestamps. - R1a. **Transcript-based segmentation**: Extend the existing `chapters.ts` service output (which already produces titles, start/end timestamps, and summaries via LLM) as the baseline for scene boundaries. For short clips that are a single scene, chapter output may be sufficient without further segmentation. - R1b. 
**Visual shot detection + fusion**: For feature-length films, augment transcript-based boundaries with visual shot detection to produce more accurate narrative scene boundaries. This is a research-heavy component — evaluate libraries and approaches during planning. -- R2. **Scene content description**: For each scene, generate a rich multimodal description capturing visual setting, objects, actions, characters, emotional tone, and mood by feeding representative frames + transcript to a multimodal LLM. Note: this requires a new multimodal LLM client — the existing OpenRouter `embeddings.ts` is text-only and cannot send images. +- R2. **Scene analysis**: For each scene, feed the **actual video segment** (not still frames) + transcript + CMS metadata to a multimodal LLM to extract structured signals. The LLM receives the moving video clip via Mux and the transcript chunk from the chapters pipeline. + **Inputs** (what the LLM receives): + - Video segment (actual moving video via Gemini video input, not stills) + - Transcript text for the scene (from chapters pipeline) + - CMS metadata for the parent video (existing bible verse references, video label/type) + **Extracted signals** (what the LLM outputs — ordered by importance): + - **Felt needs/themes** (MOST IMPORTANT): the human need the scene addresses — forgiveness, hope, grief, loneliness, identity, redemption, fear, belonging, purpose, healing, doubt, courage, etc. Two completely different scenes addressing the same felt need should recommend each other. This is the primary signal for ministry content. + - **Bible verses**: scripture references relevant to the scene. Sourced from CMS metadata where available + LLM-identified additional references. E.g., a scene about forgiveness → Matthew 6:14-15, Ephesians 4:32. 
+ - **Content**: narrative summary — what is happening, the dialogue, the message being communicated + - **Emotional tone**: contemplative, joyful, grieving, urgent, peaceful, hopeful, sorrowful + - **Demographics** (where extractable): target audience signals — age group (children, youth, young adult, adult, elderly), life stage (student, parent, married, widowed, incarcerated), cultural context. Not every scene will have clear demographic signals — extract only when evident from the content. + All extracted signals are concatenated into a single text block for embedding, with felt needs/themes weighted by appearing first and repeated. Structured fields (themes, verses, demographics) are also stored as arrays for filtering and display. + Note: this requires a new multimodal LLM client — the existing OpenRouter `embeddings.ts` is text-only and cannot process video. Gemini 2.5 Flash accepts video input natively (up to ~1hr clips). - R3. **Scene embedding and storage**: Embed each scene description using the existing text embedding pipeline (`text-embedding-3-small`, 1536 dims) and store in a **separate `scene_embeddings` table** in pgvector with full traceability back to source video and scene. -- R4. **Cross-film recommendation**: Given a scene or video, find visually and thematically similar scenes from _different_ films using vector similarity. Deduplication across language variants uses the Video → VideoVariant parent relationship (embed once per Video, not per variant). This scope includes the vector similarity query capability; the recommendation UI (how results are surfaced in web/mobile) is a separate feature. -- R5. **Backfill worker**: A dedicated worker service to process the English video catalog. Must be resumable/idempotent. Must include: +- R4. **Cross-film recommendation**: Given a scene or video, find visually and thematically similar scenes from _different_ films using vector similarity. 
Deduplication across language variants uses the Video → VideoVariant parent relationship (embed once per Video, not per variant). Recommendations are filtered by locale — a user's locale determines which language results they see. **No human tags**: existing CMS tags are unreliable; all semantic signal comes from LLM-generated scene descriptions. This scope includes the vector similarity query capability; the recommendation UI (how results are surfaced in web/mobile) is a separate feature. +- R4a. **Locale-aware filtering**: The recommendation query accepts a `language` parameter and only returns scenes from videos that have a variant in that language. A user watching in Spanish sees recommendations for videos available in Spanish, regardless of which language variant was used for scene analysis. +- R4b. **User-driven scoring (future)**: Recommendation ranking will eventually incorporate user feedback signals (clicks, watch time, explicit ratings). Phase 1 prototype uses pure vector similarity. The feedback loop is explicitly out of scope for Phase 1 but the API should be designed to accept an optional re-ranking parameter for future use. +- R5. **Backfill worker**: A dedicated worker service to process the English, Spanish, and French video catalog. Must be resumable/idempotent. Must include: - Configurable batch size and rate limits - Cost tracking per video and cumulative - Automatic pause if cost exceeds a configurable threshold - Dry-run mode that estimates cost without calling LLMs -- R6. **Incremental pipeline integration**: After backfill, scene vectorization becomes a required step in the existing manager enrichment workflow for new English video uploads. Note: unlike existing parallel steps (translate, chapters, metadata, embeddings) which all consume transcript text, scene vectorization needs video frame access via muxAssetId — it runs as an independent branch, not a simple addition to the existing parallel group. +- R6. 
**Incremental pipeline integration**: After backfill, scene vectorization becomes a required step in the existing manager enrichment workflow for new video uploads in supported languages (en, es, fr). Note: unlike existing parallel steps (translate, chapters, metadata, embeddings) which all consume transcript text, scene vectorization needs video frame access via muxAssetId — it runs as an independent branch, not a simple addition to the existing parallel group. - R7. **Existing scene metadata**: Where videos already have chapter output from the enrichment pipeline, use it as the starting point for segmentation rather than re-detecting from scratch. ## Storage Schema @@ -59,10 +73,12 @@ CREATE TABLE scene_embeddings ( start_seconds FLOAT NOT NULL, end_seconds FLOAT, -- NULL for final scene (extends to end) - -- Content (for debugging, tracing, and quality review) - description TEXT NOT NULL, -- LLM-generated scene description + -- Extracted signals (for embedding, filtering, and display) + description TEXT NOT NULL, -- concatenated extraction (all signals) — this is what gets embedded + themes TEXT[] DEFAULT '{}', -- felt needs/themes: {"forgiveness","redemption","grief","hope"} + bible_verses TEXT[] DEFAULT '{}', -- {"Matthew 6:14-15","Ephesians 4:32"} + demographics TEXT[] DEFAULT '{}', -- {"youth","student"} — empty if not extractable chapter_title TEXT, -- from chapters.ts if available - frame_count INTEGER, -- how many frames were sent to LLM -- The embedding embedding vector(1536) NOT NULL, @@ -93,20 +109,27 @@ CREATE INDEX scene_embeddings_language ON scene_embeddings(language); - `video_id` → Strapi Video record (title, slug, label, description) - `video_id` → Video.variants → VideoVariant records (language-specific playback) -- `mux_asset_id` / `playback_id` → Mux asset (for re-extracting frames) +- `mux_asset_id` / `playback_id` → Mux asset (for replaying the video segment) - `scene_index` + `start_seconds` / `end_seconds` → exact moment in the video -- 
`description` → what the LLM "saw" in this scene (stored for inspection) +- `description` → concatenated LLM extraction (themes, verses, content, tone, demographics) +- `themes` → felt needs/themes as structured array (for filtering and display) +- `bible_verses` → scripture references as structured array +- `demographics` → target audience signals as structured array (may be empty) - `chapter_title` → link to chapters.ts output if it was the scene source **Recommendation query pattern:** ```sql --- Find similar scenes from DIFFERENT videos +-- Find similar scenes from DIFFERENT videos, locale-aware +-- $3 = user's locale (en, es, fr). Only return videos that have a variant in the user's language. SELECT se.video_id, se.scene_index, se.description, se.start_seconds, 1 - (se.embedding <=> $1) AS similarity FROM scene_embeddings se +JOIN video_variants vv ON vv.video_id = se.video_id +JOIN languages l ON vv.language_id = l.id WHERE se.video_id != $2 -- exclude current video - AND se.language = 'en' -- Phase 1: English only + AND l.bcp47 = $3 -- only videos available in user's locale + AND se.language IN ('en', 'es', 'fr') -- Phase 1 languages ORDER BY se.embedding <=> $1 LIMIT 10; ``` @@ -115,68 +138,85 @@ LIMIT 10; - **Separate from `video_embeddings`** (feat-009): Different columns (timestamps, description) and different query patterns (scene similarity vs. transcript keyword search). Separate tables let feat-009 proceed as-is. - **`video_id` as dedup key**: Language variants are VideoVariants under the same Video parent. Embedding once per Video and filtering by `video_id !=` gives implicit cross-variant deduplication. -- **`language` column**: Enables Phase 1 (English only) filtering and future Phase 2 expansion without schema changes. +- **`language` column**: Enables Phase 1 (en, es, fr) filtering and future Phase 2 expansion without schema changes. 
- **`description` stored**: Enables quality review, debugging, and re-embedding with a different model without re-running the LLM. ## Rough Cost Model -**Phase 1 (English only) — order-of-magnitude estimates. Refine after R0 data audit.** +**Phase 1 (English + Spanish + French) — order-of-magnitude estimates. Refine after R0 data audit.** -English subset is likely a fraction of the 50K total. Assuming ~5K-10K English videos: +The three-language subset is likely a fraction of the 50K total. Note: scene analysis runs once per unique Video entity (not per variant), so if the same film exists in all three languages, it's processed once. The cost multiplier depends on how many Videos are unique to each language vs shared. + +Assuming ~5K-10K unique Videos with en/es/fr variants: - Short clips (~80%): 8K × 2 scenes = ~16K scene descriptions - Feature films (~20%): 2K × 75 scenes = ~150K scene descriptions -- **Total: ~166K multimodal LLM calls** +- **Total: ~166K multimodal LLM calls** (per unique Video, not per variant) +- If es/fr add ~30% unique Videos not in English: **~216K total calls** At Gemini 2.5 Flash pricing (~$0.15/1M input tokens, ~$0.60/1M output tokens): -- Per scene: ~3 frames (thumbnails) + transcript chunk ≈ ~2K tokens input, ~500 tokens output -- **Total input: ~332M tokens → ~$50** -- **Total output: ~83M tokens → ~$50** -- **Embedding cost**: 166K × text-embedding-3-small ≈ ~$3 -- **Phase 1 rough total: ~$100-$300** +- Per scene: video segment (~30-120s avg) + transcript chunk + metadata +- Gemini 2.5 Flash video input: ~260 tokens/second of video. 
Avg scene ~60s = ~15,600 video tokens + ~500 transcript tokens + ~200 metadata tokens ≈ ~16.3K input tokens, ~800 output tokens (structured extraction)
+- **Total input: 216K × 16.3K = ~3.5B tokens → ~$525**
+- **Total output: 216K × 800 = ~173M tokens → ~$104**
+- **Embedding cost**: 216K × text-embedding-3-small ≈ ~$4
+- **Phase 1 rough total: ~$600-$900**
+- Note: per-scene input tokens are ~8x the still-frames approach, which puts the Phase 1 total at roughly 4-5x the stills estimate (~$130-$400). The tradeoff is significantly better extraction quality — the LLM sees motion, pacing, transitions, and full context rather than 3 snapshots.

**Full catalog estimate (Phase 2, for future funding request):**

-- ~830K scene descriptions → ~$500-$1,500
+- ~830K scenes → ~$2,000-$4,000

-Compare: Twelve Labs Embed at ~$0.03/min × estimated 500K+ total minutes = **$15K+**
+Compare: Twelve Labs Embed at ~$0.03/min × estimated 500K+ total minutes = **$15K+**. Our approach is still 4-8x cheaper than Twelve Labs while extracting richer structured signals (themes, verses, demographics).

## Success Criteria

- Recommendations surface genuinely different films/clips based on visual and thematic similarity, not just metadata overlap
- **Measurable quality bar**: Curate 50-100 seed videos with human-labeled "expected similar" results. Scene embeddings must surface at least 3 relevant cross-film results in top 10 for 80% of seed videos, outperforming transcript-only embeddings on the same evaluation set.
- Feature-length films are segmented into meaningful narrative scenes (not raw shot cuts) -- The backfill worker can process the English catalog without manual intervention (resumable on failure, cost-capped) -- New English uploads are automatically scene-vectorized as part of the enrichment pipeline +- The backfill worker can process the en/es/fr catalog without manual intervention (resumable on failure, cost-capped) +- New uploads in supported languages (en, es, fr) are automatically scene-vectorized as part of the enrichment pipeline +- **No locale bleed**: A user watching in Spanish never sees recommendations for the same video in English or French. Verified by testing seed videos across all three locales. +- **No human tags**: All semantic signal comes from LLM-generated scene descriptions. Existing CMS tags are not used for similarity or filtering. - Language variants of the same content are deduplicated in recommendation results +- **Scoring is pure vector similarity** for Phase 1. User-driven feedback loop (clicks, watch time, ratings) is a Phase 2 concern — but the API accepts an optional re-ranking parameter to prepare for it. - **Phase gate**: Phase 1 results are evaluated before requesting Phase 2 funding ## Scope Boundaries -- **Phase 1 only**: English-language videos. Other languages are Phase 2, out of scope. +- **Phase 1 only**: English, Spanish, and French videos. Other languages are Phase 2, out of scope. - **Not building a user-facing search UI** — this is the recommendation engine layer. Search (feat-010) is a separate concern. - **Not replacing transcript embeddings** — scene embeddings complement them. Both live in pgvector in separate tables. - **Hybrid approach**: Start with LLM-generated scene descriptions embedded as text vectors (ships faster, reuses existing infra). Native video embedding models (Twelve Labs, Gemini video embeddings) are a future upgrade path, not in scope now. 
- **Not building the recommendation UI** — this provides the vector similarity query capability. How recommendations are surfaced in web/mobile is a separate feature. +- **No human tags for similarity** — existing CMS tags are unreliable. All semantic signal comes from LLM-generated scene descriptions. If tags improve, they can be incorporated later. +- **No user feedback loop in Phase 1** — scoring is pure vector similarity. User-driven re-ranking (implicit and explicit signals) is a future enhancement. The API should be structured to accept re-ranking parameters but no feedback infrastructure is built. ## Key Decisions -- **English-first phased rollout**: Prototype with all English videos (~$100-$300 estimated cost). Prove value before investing in full 50K+ catalog. Phase 2 is a separate funding decision. -- **LLM descriptions over native video embeddings**: At scale, native video embedding APIs (Twelve Labs at ~$15K+) are 10-30x more expensive than LLM scene descriptions (~$500-$1,500 full catalog). LLM descriptions reuse existing infrastructure (text-embedding-3-small + pgvector) and provide good quality. Can upgrade selectively later. +- **Three-language prototype (en/es/fr)**: Process English, Spanish, and French videos (~$600-$900 estimated cost). Three languages are the minimum to prove locale-aware deduplication actually works — you can't verify "no locale bleed" with one language. Prove value before investing in full 50K+ catalog. Phase 2 is a separate funding decision. +- **Actual video segments, not still frames**: Send the moving video clip to Gemini 2.5 Flash, not extracted keyframes. This follows the Netflix/YouTube approach where temporal signals (motion, pacing, transitions) carry meaning that stills miss. Costs ~4-5x more per scene than the stills approach but produces significantly better theme/need extraction. 
This is the same direction the industry has moved — YouTube's content understanding uses frame sequences via video transformers, Netflix uses temporal segments via LSTMs. +- **Felt needs/themes are the primary signal**: For ministry content, thematic similarity matters more than visual similarity. Two completely different scenes about forgiveness should recommend each other. The LLM prompt prioritizes felt needs/themes extraction, and themes appear first in the concatenated description to weight them in the embedding. +- **LLM structured extraction over native video embeddings**: Native video embedding APIs (Twelve Labs at ~$15K+) produce opaque vectors. Our approach extracts human-readable structured signals (themes, verses, demographics, content) that can be inspected, filtered, and displayed — plus the embedding. Full catalog at ~$2K-$4K vs Twelve Labs ~$15K+. - **Scene-level granularity**: Embeddings are per-scene, not per-frame or per-video. Short clips may be 1-3 scenes; feature films 50-200. This is the right unit for recommendations. - **Build on existing chapters pipeline**: The `chapters.ts` service already produces transcript-based scene segmentation with timestamps. R1 extends this with visual shot detection for feature films rather than building scene detection from scratch. -- **Separate `scene_embeddings` table**: Scene embeddings have different columns (start/end timestamps, description text) and query patterns than transcript chunk embeddings. Separate tables let feat-009 proceed as-is and keep query logic clean. Resolve before feat-009 starts Apr 7. -- **Hybrid storage: pgvector + lightweight metadata**: Scene data lives in the `scene_embeddings` table with full traceability columns (video_id, mux_asset_id, timestamps, description) rather than as a Strapi content type. Keeps it lean for prototype; can promote to CMS entity later if human-in-the-loop editing is needed. 
+- **Bible verses from metadata + LLM**: CMS metadata provides existing verse references where available. The LLM identifies additional relevant scripture from scene context. Both are stored in the `bible_verses` array. +- **Demographics where extractable**: Target audience signals (age group, life stage, cultural context) are extracted when evident from the content. Not every scene will have clear demographic signals — the field may be empty. Stored as a structured array for optional filtering. +- **Separate `scene_embeddings` table**: Scene embeddings have different columns (timestamps, themes, verses, demographics) and query patterns than transcript chunk embeddings. Separate tables let feat-009 proceed as-is and keep query logic clean. +- **Hybrid storage: pgvector + lightweight metadata**: Scene data lives in the `scene_embeddings` table with full traceability columns rather than as a Strapi content type. Keeps it lean for prototype; can promote to CMS entity later if human-in-the-loop editing is needed. - **Backfill worker separate from manager**: The one-time catalog processing runs as a dedicated worker service (can scale independently, doesn't block the manager pipeline). Can reuse the same workflow code/libraries. New uploads use the integrated manager pipeline step. - **Deduplication via Video → VideoVariant model**: Scene detection and embedding runs once per Video entity (the parent), not per VideoVariant. Recommendations filter by unique Video ID. Confirm during data audit (R0) that language variants are modeled as VideoVariants, not separate Video records. +- **No human tags for similarity**: Existing CMS tags are unreliable. All semantic signal comes from LLM extraction against the actual video + transcript. If tags improve, they can be incorporated later. +- **Pure vector similarity for Phase 1 scoring**: No user feedback loop, no click-through weighting, no personalization. Get the prototype working first. 
The recommendation API accepts an optional `rerank` parameter (no-op in Phase 1) so the interface is ready for user-driven scoring in Phase 2. ## Dependencies / Assumptions - **pgvector must be deployed first** (feat-009, scheduled Apr 7, 14-day duration → ~Apr 21) — R3, R4, R6 are blocked. R0, R1, R2, R5 scaffolding can proceed in parallel. - **Existing chapters pipeline** in manager is working and produces scene-like segmentation -- **Mux thumbnail API** provides frame extraction at specific timestamps via `image.mux.com/{PLAYBACK_ID}/thumbnail.jpg?time=N` — confirm during planning -- **New multimodal LLM client needed** — existing OpenRouter client is text-only; R2 requires sending images alongside text +- **Gemini 2.5 Flash video input**: Accepts video natively (up to ~1hr). Scene segments are passed as video input alongside transcript text and CMS metadata. Confirm during planning: how to pass a Mux video URL directly to Gemini vs downloading the segment first. +- **Mux video segment access**: Need to confirm how to extract a video segment (start/end timestamps) from Mux for Gemini input. Options: (a) Mux clip API, (b) download full video and trim, (c) pass Mux stream URL with timestamp params. The thumbnail API (`image.mux.com/{PLAYBACK_ID}/thumbnail.jpg?time=N`) is still useful for recommendation card display but NOT for scene analysis input. +- **New multimodal LLM client needed** — existing OpenRouter client is text-only; R2 requires sending video + text to Gemini - **Railway worker constraints** — need to confirm Railway supports long-lived worker processes or design backfill as queue-based with short-lived jobs. Existing `railway.toml` has `restartPolicyMaxRetries: 3` which may not suit multi-day processing. 
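The per-scene token arithmetic in the cost model above maps directly onto the dry-run mode required of the backfill worker (R5). A minimal sketch, assuming only the constants stated in this doc (~260 video tokens/sec, ~$0.15/$0.60 per 1M input/output tokens); the function name and defaults are illustrative, not an existing service:

```typescript
// Illustrative dry-run estimator built from the cost-model constants in this doc.
// Prices are per 1M tokens; video footage is ~260 tokens/second on Gemini 2.5 Flash.
const INPUT_PRICE_PER_M = 0.15
const OUTPUT_PRICE_PER_M = 0.6
const VIDEO_TOKENS_PER_SEC = 260

function estimateSceneCostUSD(
  sceneSeconds: number,
  transcriptTokens = 500, // avg transcript chunk, per the model above
  metadataTokens = 200, // CMS metadata payload
  outputTokens = 800, // structured extraction
): number {
  const inputTokens = sceneSeconds * VIDEO_TOKENS_PER_SEC + transcriptTokens + metadataTokens
  return (inputTokens * INPUT_PRICE_PER_M + outputTokens * OUTPUT_PRICE_PER_M) / 1_000_000
}

// Phase 1 ballpark: ~216K scenes averaging ~60s each
console.log((216_000 * estimateSceneCostUSD(60)).toFixed(0)) // "632", in line with ~$525 input + ~$104 output
```

The same arithmetic can back the worker's cumulative cost tracking and auto-pause threshold.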
## Outstanding Questions @@ -250,7 +290,7 @@ This brainstorm produced the following roadmap features in `docs/roadmap/content | [feat-039](../roadmap/content-discovery/feat-039-chapter-based-scene-boundaries.md) | Chapter-Based Scene Boundaries | 7 | Apr 24 | feat-038 | | [feat-040](../roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md) | Multimodal Scene Descriptions | 10 | May 1 | feat-039 | | [feat-041](../roadmap/content-discovery/feat-041-scene-embeddings-table.md) | Scene Embeddings Table + Indexing | 7 | May 11 | feat-009, feat-040 | -| [feat-042](../roadmap/content-discovery/feat-042-backfill-worker.md) | English Backfill Worker | 10 | May 18 | feat-038, feat-040, feat-041 | +| [feat-042](../roadmap/content-discovery/feat-042-backfill-worker.md) | Phase 1 Backfill Worker (en/es/fr) | 10 | May 18 | feat-038, feat-040, feat-041 | | [feat-043](../roadmap/content-discovery/feat-043-visual-shot-detection-fusion.md) | Visual Shot Detection Fusion (P2) | 10 | May 28 | feat-039 | | [feat-044](../roadmap/content-discovery/feat-044-recommendation-query-api.md) | Recommendation Query API | 7 | May 28 | feat-041, feat-042 | | [feat-045](../roadmap/content-discovery/feat-045-pipeline-integration.md) | Pipeline Integration | 7 | Jun 4 | feat-041, feat-042 | diff --git a/docs/roadmap/content-discovery/feat-037-video-content-vectorization.md b/docs/roadmap/content-discovery/feat-037-video-content-vectorization.md index 70901205..657e54c0 100644 --- a/docs/roadmap/content-discovery/feat-037-video-content-vectorization.md +++ b/docs/roadmap/content-discovery/feat-037-video-content-vectorization.md @@ -23,7 +23,7 @@ tags: Current recommendations are metadata-driven — "you watched Film X, here it is in 1,500 other languages." Transcript embeddings (feat-009/010) capture what was said, but miss what was shown. Visual scene embeddings enable cross-film recommendations based on visual setting, actions, emotional tone, and mood. 
-**Phase 1 (this feature)**: All English-language videos. Prove recommendation quality at ~$100-$300 estimated cost. Phase 2 (full 50K+ catalog) is a separate funding decision.
+**Phase 1 (this feature)**: English, Spanish, and French videos. Three languages are required to verify locale-aware deduplication — a user watching in Spanish must never see the same film recommended in English. Prove recommendation quality at ~$600-$900 estimated cost (video-segment analysis). Phase 2 (full 50K+ catalog) is a separate funding decision.

## Entry Points — Read These First

@@ -92,28 +92,38 @@ WHERE ej.step_statuses->>'chapters' = 'completed';

New service: `apps/manager/src/services/sceneDescription.ts`

```typescript
-type SceneDescription = {
+type SceneAnalysis = {
  sceneIndex: number
  startSeconds: number
  endSeconds: number | null
-  description: string // LLM-generated rich description
+  description: string // concatenated extraction (all signals) — this is what gets embedded
+  themes: string[] // felt needs: ["forgiveness", "redemption", "grief", "hope"]
+  bibleVerses: string[] // ["Matthew 6:14-15", "Ephesians 4:32"]
+  demographics: string[] // ["youth", "student"] — empty if not extractable
  chapterTitle: string | null
-  frameCount: number
}

-export async function describeScene(
+export async function analyzeScene(
+  muxAssetId: string,
  playbackId: string,
  startSeconds: number,
  endSeconds: number | null,
  transcript: string,
+  metadata: { bibleVerses?: string[]; videoLabel: string },
  chapterTitle: string | null,
-): Promise<SceneDescription[]>
+): Promise<SceneAnalysis[]>
```

-- Extract 3 representative frames via Mux thumbnail API at scene start, midpoint, and end
-- Send frames + transcript chunk to multimodal LLM (Gemini 2.5 Flash via OpenRouter or direct API)
-- Prompt: describe visual setting, objects, actions, characters, emotional tone, mood
-- **Requires new multimodal client** — existing OpenRouter client is text-only
+- Send **actual video segment** (not stills) to Gemini 2.5 Flash via its native video input, alongside
transcript chunk and CMS metadata +- LLM extracts structured signals (ordered by importance): + 1. **Felt needs/themes** (MOST IMPORTANT): forgiveness, hope, grief, loneliness, identity, redemption, belonging, purpose, healing, doubt, courage + 2. **Bible verses**: from CMS metadata where available + LLM-identified additional references + 3. **Content**: narrative summary, dialogue, message being communicated + 4. **Emotional tone**: contemplative, joyful, grieving, urgent, peaceful, hopeful + 5. **Demographics** (where extractable): age group, life stage, cultural context +- `description` concatenates all signals into a single text block for embedding, with themes/needs weighted first +- Structured fields stored as arrays for filtering and display +- **Requires new multimodal client** — existing OpenRouter client is text-only and cannot process video ### R3. Scene Embedding + Storage @@ -129,9 +139,11 @@ CREATE TABLE IF NOT EXISTS scene_embeddings ( scene_index INTEGER NOT NULL, start_seconds FLOAT NOT NULL, end_seconds FLOAT, - description TEXT NOT NULL, + description TEXT NOT NULL, -- concatenated extraction (all signals) — embedded + themes TEXT[] DEFAULT '{}', -- felt needs: {"forgiveness","redemption","grief"} + bible_verses TEXT[] DEFAULT '{}', -- {"Matthew 6:14-15","Ephesians 4:32"} + demographics TEXT[] DEFAULT '{}', -- {"youth","student"} — may be empty chapter_title TEXT, - frame_count INTEGER, embedding vector(1536) NOT NULL, model TEXT NOT NULL DEFAULT 'text-embedding-3-small', language TEXT NOT NULL DEFAULT 'en', @@ -166,16 +178,20 @@ export async function indexSceneEmbeddings( ### R4. 
Cross-film Recommendation Query ```sql +-- Locale-aware: only return videos available in the user's language SELECT se.video_id, se.scene_index, se.description, se.start_seconds, 1 - (se.embedding <=> $1) AS similarity FROM scene_embeddings se +JOIN video_variants vv ON vv.video_id = se.video_id +JOIN languages l ON vv.language_id = l.id WHERE se.video_id != $2 - AND se.language = 'en' + AND l.bcp47 = $3 -- user's locale + AND se.language IN ('en', 'es', 'fr') ORDER BY se.embedding <=> $1 LIMIT 10; ``` -Expose as CMS service or API endpoint for web/mobile consumption. +Expose as CMS service or API endpoint for web/mobile consumption. API accepts optional `rerank` parameter (no-op in Phase 1, reserved for user-driven scoring). ### R5. Backfill Worker @@ -196,7 +212,10 @@ Add scene vectorization to `videoEnrichment.ts` as an independent branch: ## Constraints -- **English only** — filter by language in all queries and processing. `language` column enables future expansion. +- **Phase 1 languages: en, es, fr** — filter by language in all queries and processing. `language` column enables future expansion. +- **No locale bleed** — recommendations are locale-aware. A user's locale determines which results they see. Never recommend the same video in a different language. +- **No human tags** — existing CMS tags are unreliable. All semantic signal comes from LLM-generated scene descriptions only. +- **Pure vector similarity scoring** — no user feedback loop in Phase 1. API accepts optional `rerank` parameter (no-op) to prepare for user-driven scoring in Phase 2. - **Separate table from `video_embeddings`** — different columns, different query patterns. Do not extend feat-009's table. - **Do NOT use a Strapi content type** for scene embeddings — pgvector columns don't work with Strapi ORM. Use raw SQL (same pattern as feat-009). - **Embed once per Video, not per VideoVariant** — language variants share visual content. Dedup by `video_id`. 
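The locale-aware filter and the Phase 1 no-op `rerank` parameter can be captured in a thin query wrapper. A sketch only: table and column names come from the `scene_embeddings` schema in this doc, while the request shape and function name are hypothetical.

```typescript
// Hypothetical wrapper for the R4 recommendation query. `rerank` is accepted
// but deliberately ignored in Phase 1 (reserved for user-driven scoring).
type RecommendationRequest = {
  queryEmbedding: number[] // 1536-dim scene vector
  excludeVideoId: string // the video the user is currently watching
  locale: 'en' | 'es' | 'fr' // Phase 1 locales only
  limit?: number
  rerank?: boolean // no-op in Phase 1
}

function buildRecommendationQuery(req: RecommendationRequest): {
  sql: string
  values: [string, string, string, number]
} {
  // pgvector accepts the query vector as a '[v1,v2,...]' literal bound as text
  const vectorLiteral = `[${req.queryEmbedding.join(',')}]`
  const sql = `
    SELECT se.video_id, se.scene_index, se.description, se.start_seconds,
           1 - (se.embedding <=> $1) AS similarity
    FROM scene_embeddings se
    JOIN video_variants vv ON vv.video_id = se.video_id
    JOIN languages l ON vv.language_id = l.id
    WHERE se.video_id != $2
      AND l.bcp47 = $3
      AND se.language IN ('en', 'es', 'fr')
    ORDER BY se.embedding <=> $1
    LIMIT $4`
  return { sql, values: [vectorLiteral, req.excludeVideoId, req.locale, req.limit ?? 10] }
}
```

Keeping the SQL behind one builder makes the "never the same `video_id`" and "never a different locale" guarantees testable without a database.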
@@ -208,8 +227,9 @@ Add scene vectorization to `videoEnrichment.ts` as an independent branch:

## Verification Checklist

-1. **Data audit complete**: know English video count by label, duration distribution, chapter coverage
+1. **Data audit complete**: know en/es/fr video count by label, duration distribution, chapter coverage
2. **Scene segmentation**: sample 10 feature films, verify scene boundaries align with narrative scenes (not just shot cuts)
3. **Scene descriptions**: sample 20 scenes, verify descriptions capture visual content, not just transcript paraphrasing
-4. **Embeddings indexed**: `SELECT COUNT(*) FROM scene_embeddings WHERE language = 'en'` matches expected scene count
+4. **Embeddings indexed**: `SELECT COUNT(*) FROM scene_embeddings WHERE language IN ('en', 'es', 'fr')` matches expected scene count
5. **Recommendation quality**: for 50 seed videos, top-10 similar scenes include at least 3 relevant cross-film results for 80% of seeds
-6. **Deduplication**: recommendations never surface the same video (different variant) as the input
-7. **Cost tracking**: backfill worker logs cumulative cost, stays within budget
-8. **Pipeline integration**: upload a new English video → scene embeddings appear in `scene_embeddings` table automatically
+6. **No locale bleed**: query recommendations for a Spanish video with locale=es → results are all videos with Spanish variants. Repeat for en and fr. No cross-locale contamination.
+7. **Deduplication**: recommendations never surface the same video (different variant) as the input
+8. **Cost tracking**: backfill worker logs cumulative cost, stays within budget
+9.
**Pipeline integration**: upload a new video in en/es/fr → scene embeddings appear in `scene_embeddings` table automatically diff --git a/docs/roadmap/content-discovery/feat-038-video-vectorization-data-audit.md b/docs/roadmap/content-discovery/feat-038-video-vectorization-data-audit.md index c354dade..ba943422 100644 --- a/docs/roadmap/content-discovery/feat-038-video-vectorization-data-audit.md +++ b/docs/roadmap/content-discovery/feat-038-video-vectorization-data-audit.md @@ -18,7 +18,7 @@ tags: ## Problem -Before building the scene vectorization pipeline, we need to know the shape of the English video catalog: how many videos by type, duration distribution, and existing chapter coverage. This gates all downstream sizing, cost estimates, and architecture decisions. +Before building the scene vectorization pipeline, we need to know the shape of the English, Spanish, and French video catalog: how many videos by type, duration distribution, existing chapter coverage, and critically — whether language variants share a Video parent (dedup model). This gates all downstream sizing, cost estimates, and architecture decisions. 
## Entry Points — Read These First @@ -37,15 +37,22 @@ Before building the scene vectorization pipeline, we need to know the shape of t Run diagnostic queries against the CMS database: ```sql --- English video count by label -SELECT v.label, COUNT(*) as count +-- Video count by label for Phase 1 languages (en, es, fr) +SELECT v.label, l.bcp47, COUNT(*) as count FROM videos v JOIN video_variants vv ON vv.video_id = v.id JOIN languages l ON vv.language_id = l.id -WHERE l.bcp47 = 'en' -GROUP BY v.label ORDER BY count DESC; +WHERE l.bcp47 IN ('en', 'es', 'fr') +GROUP BY v.label, l.bcp47 ORDER BY v.label, l.bcp47; --- Duration distribution for English videos +-- Unique Video count (deduped across languages) — this is what we actually process +SELECT COUNT(DISTINCT v.id) as unique_videos +FROM videos v +JOIN video_variants vv ON vv.video_id = v.id +JOIN languages l ON vv.language_id = l.id +WHERE l.bcp47 IN ('en', 'es', 'fr'); + +-- Duration distribution for Phase 1 languages SELECT v.label, COUNT(*) as count, ROUND(AVG(vv.duration)) as avg_duration_sec, @@ -53,7 +60,7 @@ SELECT v.label, FROM videos v JOIN video_variants vv ON vv.video_id = v.id JOIN languages l ON vv.language_id = l.id -WHERE l.bcp47 = 'en' +WHERE l.bcp47 IN ('en', 'es', 'fr') GROUP BY v.label; -- Chapter metadata coverage @@ -61,14 +68,32 @@ SELECT COUNT(DISTINCT ej.mux_asset_id) FROM enrichment_jobs ej WHERE ej.step_statuses->>'chapters' = 'completed'; --- Confirm Video → VideoVariant dedup model -SELECT v.id, COUNT(vv.id) as variant_count +-- CRITICAL: Confirm Video → VideoVariant dedup model +-- Do en/es/fr variants of the same film share a Video parent? 
+SELECT v.id, v.label, + COUNT(vv.id) as variant_count, + ARRAY_AGG(DISTINCT l.bcp47) as languages FROM videos v JOIN video_variants vv ON vv.video_id = v.id -GROUP BY v.id ORDER BY variant_count DESC LIMIT 10; +JOIN languages l ON vv.language_id = l.id +WHERE l.bcp47 IN ('en', 'es', 'fr') +GROUP BY v.id, v.label +HAVING COUNT(DISTINCT l.bcp47) > 1 +ORDER BY variant_count DESC LIMIT 20; + +-- How many Videos have variants in multiple Phase 1 languages? +-- (high overlap = dedup model works, low overlap = mostly unique per language) +SELECT multi_lang_count, COUNT(*) as video_count FROM ( + SELECT v.id, COUNT(DISTINCT l.bcp47) as multi_lang_count + FROM videos v + JOIN video_variants vv ON vv.video_id = v.id + JOIN languages l ON vv.language_id = l.id + WHERE l.bcp47 IN ('en', 'es', 'fr') + GROUP BY v.id +) sub GROUP BY multi_lang_count ORDER BY multi_lang_count; ``` -Deliverable: update the brainstorm doc cost model with actual numbers. Confirm or revise the ~$100-$300 Phase 1 estimate. +Deliverable: update the brainstorm doc cost model with actual numbers. Confirm or revise the ~$130-$400 Phase 1 estimate. **If the dedup model is broken (same film = separate Video records per language), flag immediately — the entire dedup strategy must be revised.** ## Constraints @@ -77,7 +102,9 @@ Deliverable: update the brainstorm doc cost model with actual numbers. 
Confirm o ## Verification -- Know exact English video count by label type +- Know exact video count by label type for en, es, fr +- Know how many unique Video entities span multiple Phase 1 languages (dedup model validation) - Know duration distribution (what % are short clips vs feature films) - Know chapter coverage (what % already have scene-like metadata) - Cost model in brainstorm doc updated with real numbers +- **Dedup model confirmed or red-flagged**: en/es/fr variants of the same film share a Video parent diff --git a/docs/roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md b/docs/roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md index 4f0554d8..84fc8504 100644 --- a/docs/roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md +++ b/docs/roadmap/content-discovery/feat-040-multimodal-scene-descriptions.md @@ -1,6 +1,6 @@ --- id: "feat-040" -title: "Video Vectorization — Multimodal Scene Descriptions" +title: "Video Vectorization — Multimodal Scene Analysis" owner: "nisal" priority: "P1" status: "not-started" @@ -18,66 +18,101 @@ tags: ## Problem -Each scene needs a rich description capturing visual setting, objects, actions, emotional tone, and mood. This requires a new multimodal LLM client (existing OpenRouter client is text-only) that can send video frames alongside transcript text. +Each scene needs structured signal extraction that drives recommendation quality. For ministry content, **felt needs/themes** are the most important signal — two completely different scenes addressing forgiveness should recommend each other. This requires a new multimodal LLM client that can process actual video segments (not still frames) alongside transcript text and CMS metadata. The existing OpenRouter client is text-only and cannot handle video input. + +**Approach**: Following the direction Netflix and YouTube have moved — process actual video (motion, pacing, transitions) rather than keyframes. 
Gemini 2.5 Flash accepts video input natively and extracts richer signals from moving content than stills alone. ## Entry Points — Read These First 1. `apps/manager/src/lib/openrouter.ts` — existing AI client (text-only) 2. `apps/manager/src/services/chapters.ts` — example of LLM prompting pattern 3. `apps/manager/src/services/sceneBoundaries.ts` — scene boundary input (from feat-039) -4. `apps/cms/src/api/mux-video/content-types/mux-video/schema.json` — `playbackId` for Mux thumbnail URLs +4. `apps/cms/src/api/mux-video/content-types/mux-video/schema.json` — `assetId` and `playbackId` for Mux video access ## Grep These - `getOpenrouter` in `apps/manager/src/` — existing AI client usage +- `muxAssetId` in `apps/manager/src/` — Mux asset references for video access - `playbackId` in `apps/manager/src/` — Mux playback ID references ## What To Build -1. **Multimodal LLM client** — extend or add a client that supports sending images + text. Gemini 2.5 Flash recommended for cost/quality. +1. **Multimodal LLM client** — new client that supports sending video + text to Gemini 2.5 Flash. Must handle video input (not images). Evaluate: pass Mux stream URL directly vs download segment and upload. -2. **Frame extraction utility**: +2. **Video segment access utility**: ```typescript - export async function extractFrames( + // Get a video segment from Mux for Gemini input + export async function getVideoSegment( + muxAssetId: string, playbackId: string, - timestamps: number[], - ): Promise + startSeconds: number, + endSeconds: number | null, + ): Promise // format TBD: URL, Buffer, or file path ``` - Uses Mux thumbnail API: `https://image.mux.com/{PLAYBACK_ID}/thumbnail.jpg?time={SECONDS}` + Research during planning: Mux clip API, signed URL with range params, or download-and-trim. -3. **Scene description service**: `apps/manager/src/services/sceneDescription.ts` +3. 
**Scene analysis service**: `apps/manager/src/services/sceneAnalysis.ts` ```typescript - type SceneDescription = { + type SceneAnalysis = { sceneIndex: number startSeconds: number endSeconds: number | null - description: string + description: string // concatenated extraction — this is what gets embedded + themes: string[] // felt needs: ["forgiveness", "redemption", "grief", "hope"] + bibleVerses: string[] // ["Matthew 6:14-15", "Ephesians 4:32"] + demographics: string[] // ["youth", "student"] — empty if not extractable chapterTitle: string | null - frameCount: number } - export async function describeScene( + export async function analyzeScene( + muxAssetId: string, playbackId: string, boundary: SceneBoundary, - ): Promise + transcript: string, + metadata: { bibleVerses?: string[]; videoLabel: string }, + ): Promise + ``` + + **Inputs to LLM**: + - Actual video segment (moving video, not stills) + - Transcript text for the scene + - CMS metadata (existing bible verse references, video label/type) + + **LLM extracts** (ordered by importance for the embedding): + 1. **Felt needs/themes** (MOST IMPORTANT): forgiveness, hope, grief, loneliness, identity, redemption, belonging, purpose, healing, doubt, courage, fear, etc. + 2. **Bible verses**: from CMS metadata where available + LLM-identified additional references + 3. **Content**: narrative summary, dialogue, message being communicated + 4. **Emotional tone**: contemplative, joyful, grieving, urgent, peaceful, hopeful, sorrowful + 5. **Demographics** (where extractable): age group (children, youth, young adult, adult, elderly), life stage (student, parent, married, widowed, incarcerated), cultural context + + **Embedding construction**: `description` concatenates all signals into a single text block, with themes/needs appearing first to weight them in the embedding. Example: + + ``` + Themes: forgiveness, guilt, reconciliation. + Bible verses: Matthew 6:14-15, Ephesians 4:32. 
+ Content: A father confronts his estranged son after years apart. The son asks for forgiveness... + Tone: sorrowful, hopeful. + Demographics: adult, parent. ``` - - Extract 3 frames (start, mid, end of scene) - - Send frames + transcript chunk to multimodal LLM - - Prompt for: visual setting, objects, actions, characters, emotional tone, mood - - Store as `{assetId}/scene-descriptions.json` artifact + - Store as `{assetId}/scene-analysis.json` artifact ## Constraints -- Confirm Mux thumbnail API works for arbitrary timestamps and returns sufficient resolution -- Rate limit LLM calls — respect provider limits -- Log token usage per call for cost tracking +- **Video segments, not stills** — send actual moving video to Gemini, not extracted keyframes +- Confirm Mux video segment access method during planning (clip API, signed URLs, or download) +- Rate limit LLM calls — respect Gemini provider limits +- Log token usage per call for cost tracking (video tokens are ~260/second) +- Demographics are optional — extract only when evident, leave empty otherwise ## Verification -- Sample 20 scenes: descriptions capture visual content, not just transcript paraphrasing -- Mux thumbnail extraction works for timestamps throughout a video -- Token usage logged accurately +- Sample 20 scenes: extraction captures felt needs/themes, not just transcript paraphrasing +- Themes/needs are meaningful ministry categories (not generic like "good" or "interesting") +- Bible verses are relevant to the scene's actual themes (spot-check 20 scenes) +- Two visually different scenes about the same felt need (e.g., forgiveness) produce similar embeddings +- Demographics extracted where clearly applicable (youth scene → "youth"), empty where ambiguous +- Token usage logged accurately — video token counts match expected ~260 tokens/second diff --git a/docs/roadmap/content-discovery/feat-041-scene-embeddings-table.md b/docs/roadmap/content-discovery/feat-041-scene-embeddings-table.md index 
5a86639a..6f909227 100644 --- a/docs/roadmap/content-discovery/feat-041-scene-embeddings-table.md +++ b/docs/roadmap/content-discovery/feat-041-scene-embeddings-table.md @@ -46,9 +46,11 @@ Scene descriptions need to be embedded and stored in pgvector for similarity que scene_index INTEGER NOT NULL, start_seconds FLOAT NOT NULL, end_seconds FLOAT, - description TEXT NOT NULL, + description TEXT NOT NULL, -- concatenated extraction (all signals) — embedded + themes TEXT[] DEFAULT '{}', -- felt needs: {"forgiveness","redemption","grief"} + bible_verses TEXT[] DEFAULT '{}', -- {"Matthew 6:14-15","Ephesians 4:32"} + demographics TEXT[] DEFAULT '{}', -- {"youth","student"} — may be empty chapter_title TEXT, - frame_count INTEGER, embedding vector(1536) NOT NULL, model TEXT NOT NULL DEFAULT 'text-embedding-3-small', language TEXT NOT NULL DEFAULT 'en', diff --git a/docs/roadmap/content-discovery/feat-042-backfill-worker.md b/docs/roadmap/content-discovery/feat-042-backfill-worker.md index 234b1c83..ab4bcb0c 100644 --- a/docs/roadmap/content-discovery/feat-042-backfill-worker.md +++ b/docs/roadmap/content-discovery/feat-042-backfill-worker.md @@ -1,6 +1,6 @@ --- id: "feat-042" -title: "Video Vectorization — English Backfill Worker" +title: "Video Vectorization — Phase 1 Backfill Worker (en/es/fr)" owner: "nisal" priority: "P1" status: "not-started" @@ -20,7 +20,7 @@ tags: ## Problem -The full English video catalog needs to be processed through the scene vectorization pipeline (boundaries → descriptions → embeddings → indexing). This is a one-time batch job that must be resumable, cost-tracked, and safe to run against production. +The English, Spanish, and French video catalog needs to be processed through the scene vectorization pipeline (boundaries → descriptions → embeddings → indexing). Processing runs once per unique Video entity (not per variant). This is a one-time batch job that must be resumable, cost-tracked, and safe to run against production. 
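The resumable, cost-capped batch behavior described above can be sketched as below — a minimal TypeScript sketch where `runBackfill`, the in-memory `processedIds` set, and a flat per-video cost are illustrative assumptions standing in for the real DB-backed progress table and per-step pricing:

```typescript
type BackfillResult = { processed: number[]; skipped: number[]; spentUsd: number }

// Resumable, budget-capped backfill loop over unique Video ids.
// `processedIds` is assumed to be persisted elsewhere so it survives restarts.
function runBackfill(
  queue: number[], // unique Video ids, feature films first
  processedIds: Set<number>, // progress record: survives a crash/restart
  budgetUsd: number,
  costPerVideoUsd: number, // stand-in for real per-step cost tracking
): BackfillResult {
  const result: BackfillResult = { processed: [], skipped: [], spentUsd: 0 }
  for (const videoId of queue) {
    if (processedIds.has(videoId)) {
      result.skipped.push(videoId) // resume: completed in a previous run
      continue
    }
    if (result.spentUsd + costPerVideoUsd > budgetUsd) break // hard cost cap
    // ...real work here: scene boundaries → descriptions → embed → index...
    processedIds.add(videoId) // record progress so a mid-batch crash loses nothing
    result.processed.push(videoId)
    result.spentUsd += costPerVideoUsd
  }
  return result
}
```

The key design point matches the constraints below: progress is recorded per completed video, so killing the worker mid-batch and restarting re-enters the loop with the same queue and skips everything already done.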
## Entry Points — Read These First @@ -39,7 +39,7 @@ The full English video catalog needs to be processed through the scene vectoriza Dedicated entry point (separate Railway service or manager CLI command) that: -1. **Fetches English video queue** — all Videos with English variants, ordered by label (feature films first for early quality signal) +1. **Fetches Phase 1 video queue** — all unique Videos with en/es/fr variants, ordered by label (feature films first for early quality signal). Dedup: process each Video once regardless of how many language variants it has. 2. **Tracks progress** — store processed video IDs to resume on restart. Use enrichment job pattern or simple DB table. 3. **Per-video pipeline**: scene boundaries → scene descriptions → embed descriptions → index in pgvector 4. **Cost controls**: @@ -55,12 +55,14 @@ Dedicated entry point (separate Railway service or manager CLI command) that: - Must be resumable — crashing mid-batch loses no completed work - Must not block the manager pipeline for new uploads - Railway worker constraints: design as queue-based with configurable batch sizes rather than assuming infinite runtime -- English only: filter by language throughout +- Phase 1 languages only (en, es, fr): filter by language throughout +- Process once per Video entity, store `language` column as the transcript language used for description ## Verification -- Dry-run mode reports accurate cost estimate for full English catalog -- Process 100 English videos end-to-end → embeddings appear in `scene_embeddings` +- Dry-run mode reports accurate cost estimate for full en/es/fr catalog +- Process 100 videos end-to-end → embeddings appear in `scene_embeddings` - Kill worker mid-batch, restart → picks up where it left off - Cost tracking matches actual API billing within 10% -- `SELECT COUNT(*) FROM scene_embeddings WHERE language = 'en'` grows as expected +- `SELECT COUNT(*) FROM scene_embeddings WHERE language IN ('en', 'es', 'fr')` grows as expected +- 
No duplicate processing: a Video with en+es+fr variants is processed once diff --git a/docs/roadmap/content-discovery/feat-044-recommendation-query-api.md b/docs/roadmap/content-discovery/feat-044-recommendation-query-api.md index 2882f252..db094681 100644 --- a/docs/roadmap/content-discovery/feat-044-recommendation-query-api.md +++ b/docs/roadmap/content-discovery/feat-044-recommendation-query-api.md @@ -49,39 +49,50 @@ With scene embeddings indexed, we need a queryable API that returns similar scen export async function getRecommendations( videoId: number, + locale: string, // user's locale — filters results to videos available in this language sceneIndex?: number, // specific scene, or aggregate across all scenes limit?: number, // default 10 + rerank?: string, // no-op in Phase 1, reserved for future user-driven scoring ): Promise ``` 2. **Query logic**: ```sql - -- For a specific scene + -- For a specific scene, locale-aware + -- $3 = user's locale (en, es, fr). Only return videos that have a variant in user's language. SELECT se.video_id, se.scene_index, se.description, se.start_seconds, se.end_seconds, 1 - (se.embedding <=> $1) AS similarity FROM scene_embeddings se + JOIN video_variants vv ON vv.video_id = se.video_id + JOIN languages l ON vv.language_id = l.id WHERE se.video_id != $2 - AND se.language = 'en' + AND l.bcp47 = $3 -- locale-aware: only videos available in user's language + AND se.language IN ('en', 'es', 'fr') -- Phase 1 languages ORDER BY se.embedding <=> $1 - LIMIT $3; + LIMIT $4; ``` For whole-video recommendations: average similarity across all scenes of the input video, or take top scene match per candidate video. -3. **Custom API route**: `GET /api/scene-embeddings/recommendations?videoId=X&sceneIndex=Y&limit=10` +3. **Custom API route**: `GET /api/scene-embeddings/recommendations?videoId=X&locale=en&sceneIndex=Y&limit=10&rerank=` 4. 
**GraphQL integration** (if applicable): expose as custom query resolver ## Constraints - Filter `video_id != input` to never recommend the same video -- English only for Phase 1 (`language = 'en'`) +- **Locale-aware**: `locale` parameter is required. Only return videos with a variant in the requested language. +- Phase 1 languages: en, es, fr +- **No human tags for similarity** — all semantic signal is from LLM scene descriptions +- **Pure vector similarity scoring** — `rerank` parameter accepted but is a no-op in Phase 1. Designed to accept user-driven scoring signals in Phase 2. - Response must include enough metadata (videoId, timestamps, description) for the frontend to render ## Verification -- Query with a known video → returns different videos with >0.5 similarity +- Query with a known video + locale=en → returns different videos with >0.5 similarity, all with English variants +- Query same video + locale=es → results are all videos with Spanish variants (different result set) +- **No locale bleed**: query with locale=es never returns a video that only exists in English - Never returns the input video in results - Response time <500ms for top-10 query - Results are plausibly similar (manual spot-check) diff --git a/docs/roadmap/content-discovery/feat-045-pipeline-integration.md b/docs/roadmap/content-discovery/feat-045-pipeline-integration.md index 0784fe66..61757b45 100644 --- a/docs/roadmap/content-discovery/feat-045-pipeline-integration.md +++ b/docs/roadmap/content-discovery/feat-045-pipeline-integration.md @@ -16,7 +16,7 @@ tags: ## Problem -After backfill, new English video uploads need to be automatically scene-vectorized as part of the enrichment workflow. Unlike existing parallel steps that consume transcript text, scene vectorization needs video frame access — it's an independent branch. +After backfill, new video uploads in supported languages (en, es, fr) need to be automatically scene-vectorized as part of the enrichment workflow. 
Unlike existing parallel steps that consume transcript text, scene vectorization needs video segment access — it's an independent branch. ## Entry Points — Read These First @@ -47,18 +47,20 @@ transcribe - Runs after both transcription AND chapters complete (needs both) - Uses `muxAssetId` / `playbackId` from job context for frame extraction -English-only gate: skip for non-English primary language videos +Phase 1 language gate: skip for videos not in en/es/fr +- Process once per Video entity (not per variant) — check if Video already has scene embeddings before processing - Updates enrichment job status with `sceneVectorization` step tracking ## Constraints - Do not block existing parallel steps — scene vectorization runs independently - Failure in scene vectorization should not fail the overall enrichment job -English-only check: skip step if video's primary language is not English +Phase 1 language check: skip step if video's language is not in (en, es, fr) ## Verification - Upload a new English video → enrichment completes → scene embeddings appear in `scene_embeddings` -Upload a non-English video → scene vectorization step is skipped +Upload a new Spanish video → enrichment completes → scene embeddings appear +Upload a video in an unsupported language (e.g., Japanese) → scene vectorization step is skipped - Scene vectorization failure does not block transcript/translation/chapters from completing - Enrichment job status shows sceneVectorization step status diff --git a/docs/roadmap/content-discovery/feat-046-recommendations-demo-experience.md b/docs/roadmap/content-discovery/feat-046-recommendations-demo-experience.md index 69aa82f5..7b174c28 100644 --- a/docs/roadmap/content-discovery/feat-046-recommendations-demo-experience.md +++ b/docs/roadmap/content-discovery/feat-046-recommendations-demo-experience.md @@ -86,7 +86,10 @@ Create an Experience with slug (e.g., `recommendations-demo`) containing: ## Verification - Navigate to
`/recommendations-demo/en` → see source video + grid of recommended scenes -- Recommendations are from different videos (not the same film) +- Navigate to `/recommendations-demo/es` → recommendations are all videos with Spanish variants +- Navigate to `/recommendations-demo/fr` → recommendations are all videos with French variants +- **No locale bleed**: `/recommendations-demo/es` never shows a video that only exists in English +- Recommendations are from different videos (not the same film in a different language) - Each recommendation card shows thumbnail, description, and source video title - Clicking a recommendation navigates to the video (or plays from scene timestamp) - Page loads in <3s with recommendations visible
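The demo page's request to the feat-044 endpoint can be sketched as below — a minimal TypeScript sketch; `recommendationsUrl` and `baseUrl` are hypothetical helpers for illustration, while the route and parameter names follow the feat-044 draft:

```typescript
// Build the recommendations request for a given seed video and user locale.
// Locale is required (locale-aware filtering); sceneIndex is optional.
function recommendationsUrl(
  baseUrl: string,
  params: { videoId: number; locale: "en" | "es" | "fr"; sceneIndex?: number; limit?: number },
): string {
  const qs = new URLSearchParams({
    videoId: String(params.videoId),
    locale: params.locale, // drives the no-locale-bleed filter server-side
  })
  if (params.sceneIndex !== undefined) qs.set("sceneIndex", String(params.sceneIndex))
  qs.set("limit", String(params.limit ?? 10))
  return `${baseUrl}/api/scene-embeddings/recommendations?${qs.toString()}`
}
```

Usage from the demo route: `/recommendations-demo/es` would call `recommendationsUrl(base, { videoId, locale: "es" })`, so every rendered card is guaranteed to have a Spanish variant.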