YouTube Video Subtitle Generation Algorithm

This algorithm has been created using Claude Opus 4.5.

Executive Summary

This document outlines the algorithm and architecture for the subsync CLI application's core feature: generating Netflix-compliant subtitles from YouTube videos. The system extracts audio from YouTube videos, transcribes the speech using AI, processes the transcription to meet professional subtitle standards, and outputs YouTube-compatible subtitle files.

1. System Overview
2. Architecture
3. Detailed Algorithm
4. Netflix Compliance Rules
5. YouTube Format Compatibility
6. Third-Party Tools & Services
7. Data Structures
8. Error Handling
9. Future Extensibility: Translation
10. Visual Diagrams

1. System Overview

Purpose

Transform YouTube video audio into professional-quality subtitles that:

Are accurately transcribed in the video's spoken language
Meet Netflix's Timed Text Style Guide requirements
Are formatted for direct upload to YouTube
Support future translation capabilities

Input

YouTube video URL (public videos)
Target language code (language of the audio)
Optional: output path, quality settings

Output

SRT subtitle file (primary format - universal compatibility)
Optional: VTT file (WebVTT format)

2. Architecture

High-Level System Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              SUBSYNC CLI                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌────────────┐│
│  │   URL        │    │   Audio      │    │              │    │  Subtitle  ││
│  │   Handler    │───▶│   Extractor  │───▶│  Transcriber │───▶│  Processor ││
│  │              │    │   (yt-dlp)   │    │  (Whisper)   │    │  (Netflix) ││
│  └──────────────┘    └──────────────┘    └──────────────┘    └─────┬──────┘│
│                                                                      │       │
│                                                                      ▼       │
│                      ┌──────────────────────────────────────────────────────┤
│                      │                                                       │
│                      │  ┌────────────┐    ┌────────────┐    ┌────────────┐ │
│                      │  │  Subtitle  │    │    SRT     │    │    VTT     │ │
│                      └─▶│   Writer   │───▶│   Output   │    │   Output   │ │
│                         │            │    │            │    │ (optional) │ │
│                         └────────────┘    └────────────┘    └────────────┘ │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────────┤
│  │                    FUTURE: Translation Module                             │
│  │  ┌────────────┐    ┌────────────┐    ┌────────────┐                     │
│  │  │  Google    │    │   DeepL    │    │   OpenAI   │                     │
│  │  │ Translate  │    │    API     │    │    API     │                     │
│  │  └────────────┘    └────────────┘    └────────────┘                     │
│  └──────────────────────────────────────────────────────────────────────────┤
└─────────────────────────────────────────────────────────────────────────────┘

Module Responsibilities

Module	Responsibility
URL Handler	Validate YouTube URLs, extract video ID
Audio Extractor	Download audio stream using yt-dlp
Transcriber	Convert speech to text with timestamps
Subtitle Processor	Apply Netflix timing/formatting rules
Subtitle Writer	Generate output files (SRT, VTT)
Translation Module	(Future) Translate subtitles to other languages

3. Detailed Algorithm

Step 1: Input Validation

INPUT: youtube_url, language_code, [output_path], [quality]

PROCESS:
  1. Parse URL to extract video ID
     - Supported formats:
       • https://www.youtube.com/watch?v=VIDEO_ID
       • https://youtu.be/VIDEO_ID
       • https://www.youtube.com/embed/VIDEO_ID
       • https://www.youtube.com/v/VIDEO_ID

  2. Validate language code against Whisper supported languages
     - List of 99+ supported languages
     - Auto-detection available if not specified

  3. Validate output path (create if needed)

OUTPUT: Validated parameters object
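
The URL parsing in step 1 can be sketched with the standard library alone. The function name `extract_video_id` is hypothetical, and the 11-character ID check is an assumption based on the IDs YouTube currently issues, so placeholder IDs shorter than that are rejected.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumption: YouTube video IDs are 11 characters from this alphabet.
_VIDEO_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")


def extract_video_id(url: str) -> str:
    """Return the video ID from a supported YouTube URL, or raise ValueError."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    candidate = None

    if host == "youtu.be":
        # https://youtu.be/VIDEO_ID
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com":
        if parsed.path == "/watch":
            # https://www.youtube.com/watch?v=VIDEO_ID
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        else:
            # https://www.youtube.com/embed/VIDEO_ID or /v/VIDEO_ID
            parts = parsed.path.strip("/").split("/")
            if len(parts) >= 2 and parts[0] in {"embed", "v"}:
                candidate = parts[1]

    if not candidate or not _VIDEO_ID.match(candidate):
        raise ValueError(f"Unsupported or malformed YouTube URL: {url!r}")
    return candidate
```

Raising `ValueError` keeps the sketch dependency-free; a real CLI would likely map it to a user-facing error message.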

Step 2: Video Metadata Extraction

INPUT: video_id

PROCESS (using yt-dlp):
  1. Extract video information without downloading
     - Title (for output filename)
     - Duration (for progress estimation)
     - Availability status
     - Video ID confirmation

  2. Check for restrictions:
     - Age restriction
     - Region locking
     - Private/unlisted status
     - Live stream (not supported)

OUTPUT: VideoMetadata {
  id: str,
  title: str,
  duration: float,
  uploader: str,
  upload_date: str
}

ERRORS:
  - VideoUnavailable: Video is private or deleted
  - AgeRestricted: Video requires age verification
  - LiveStream: Live streams not supported
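
A minimal sketch of this step, assuming yt-dlp's Python API (`YoutubeDL.extract_info` with `download=False`) and the `is_live` and `age_limit` fields of its info dictionary. The exception classes mirror the ERRORS list above; treating `age_limit >= 18` as a hard failure is an assumption of this sketch, and the helper names are hypothetical.

```python
from dataclasses import dataclass


class VideoUnavailable(Exception):
    """Video is private or deleted."""


class AgeRestricted(Exception):
    """Video requires age verification."""


class LiveStream(Exception):
    """Live streams are not supported."""


@dataclass
class VideoMetadata:
    id: str
    title: str
    duration: float
    uploader: str
    upload_date: str


def fetch_info(url: str) -> dict:
    """Ask yt-dlp for metadata only; no media is downloaded."""
    import yt_dlp  # imported lazily so the pure helper below runs without it

    try:
        with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
            return ydl.extract_info(url, download=False)
    except yt_dlp.utils.DownloadError as exc:
        raise VideoUnavailable(str(exc)) from exc


def metadata_from_info(info: dict) -> VideoMetadata:
    """Apply the restriction checks and build a VideoMetadata record."""
    if info.get("is_live"):
        raise LiveStream("Live streams not supported")
    if (info.get("age_limit") or 0) >= 18:
        raise AgeRestricted("Video requires age verification")
    return VideoMetadata(
        id=info["id"],
        title=info.get("title") or "",
        duration=float(info.get("duration") or 0.0),
        uploader=info.get("uploader") or "",
        upload_date=info.get("upload_date") or "",
    )
```

Keeping the network call in `fetch_info` and the checks in `metadata_from_info` lets the checks be tested against a plain dictionary.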

Step 3: Audio Extraction

INPUT: VideoMetadata, temp_directory

PROCESS (using yt-dlp):
  1. Configure extraction options (yt-dlp Python API names):
     - format: "bestaudio/best"
     - postprocessors: FFmpegExtractAudio
         preferredcodec: "wav"   # Lossless PCM, avoids a second lossy encode
     - CLI equivalent: -f bestaudio/best -x --audio-format wav

  2. Download and extract audio
     - Show progress callback
     - Handle interruptions gracefully

  3. Verify audio file integrity
     - Check file exists
     - Validate duration matches metadata

OUTPUT: audio_file_path (WAV format, 16kHz mono optimal)

CLEANUP: Register for deletion on completion/error
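
The extraction options might be assembled as below. This is a sketch assuming yt-dlp's Python API, where audio extraction is configured through the `FFmpegExtractAudio` postprocessor; `build_ydl_opts` and `download_audio` are hypothetical names, and the output template is an assumption.

```python
from pathlib import Path


def build_ydl_opts(temp_dir: Path, progress_hook=None) -> dict:
    """Build yt-dlp options that download the best audio stream and convert it to WAV."""
    opts = {
        "format": "bestaudio/best",
        # Assumed naming scheme: <video id>.<extension> inside the temp directory.
        "outtmpl": str(temp_dir / "%(id)s.%(ext)s"),
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"},
        ],
        "quiet": True,
    }
    if progress_hook is not None:
        # yt-dlp calls each hook with a status dictionary during download.
        opts["progress_hooks"] = [progress_hook]
    return opts


def download_audio(url: str, temp_dir: Path) -> Path:
    """Download and extract audio, then verify that the WAV file exists."""
    import yt_dlp  # imported lazily so build_ydl_opts runs without it

    with yt_dlp.YoutubeDL(build_ydl_opts(temp_dir)) as ydl:
        info = ydl.extract_info(url, download=True)
    audio_path = temp_dir / f"{info['id']}.wav"
    if not audio_path.exists():
        raise FileNotFoundError(audio_path)
    return audio_path
```

Calling `download_audio` inside a `tempfile.TemporaryDirectory()` context is one way to get the cleanup on completion or error that this step requires.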

Step 4: Speech Transcription

INPUT: audio_file_path, language_code

PROCESS (using OpenAI Whisper):
  1. Select model based on quality setting:
     - "turbo": Fast, good accuracy (recommended)
     - "large-v3": Best accuracy, slower
     - "medium": Balanced
     - "small"/"base"/"tiny": Faster, less accurate

  2. Load model (cache for reuse)

  3. Transcribe with options:
     - language: specified or auto-detect
     - task: "transcribe" (not translate)
     - word_timestamps: True (granular timing)
     - verbose: False

  4. Extract segments with timing:
     - Each segment has start, end, text
     - Word-level timestamps for precise alignment

OUTPUT: TranscriptionResult {
  language: str,
  segments: [
    {
      id: int,
      start: float,      # seconds
      end: float,        # seconds
      text: str,
      words: [{word, start, end}, ...]
    },
    ...
  ]
}
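
A possible shape for this step, assuming the `openai-whisper` package (`whisper.load_model` and `model.transcribe` with `word_timestamps=True`). The function names are hypothetical, and `to_transcription_result` only reshapes Whisper's raw dictionary into the structure shown above.

```python
from functools import lru_cache
from typing import Optional


@lru_cache(maxsize=1)
def load_model(name: str):
    """Load a Whisper model once and reuse it across calls."""
    import whisper  # imported lazily; requires openai-whisper and FFmpeg

    return whisper.load_model(name)


def transcribe(audio_path: str, language: Optional[str] = None,
               model_name: str = "turbo") -> dict:
    """Run Whisper with the options listed in step 3."""
    raw = load_model(model_name).transcribe(
        audio_path,
        language=language,      # None lets Whisper detect the language
        task="transcribe",      # keep the spoken language, do not translate
        word_timestamps=True,
        verbose=False,
    )
    return to_transcription_result(raw)


def to_transcription_result(raw: dict) -> dict:
    """Keep only the fields the later steps use and strip padding spaces."""
    segments = []
    for seg in raw.get("segments", []):
        segments.append({
            "id": seg["id"],
            "start": float(seg["start"]),
            "end": float(seg["end"]),
            "text": seg["text"].strip(),
            "words": [
                {"word": w["word"].strip(),
                 "start": float(w["start"]),
                 "end": float(w["end"])}
                for w in seg.get("words", [])
            ],
        })
    return {"language": raw.get("language", ""), "segments": segments}
```

Segments without a `words` key come through with an empty list, which matches the fallback to segment timestamps described under Graceful Degradation.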

Step 5: Netflix Compliance Processing

INPUT: TranscriptionResult

PROCESS:
  FOR each segment in transcription:

    1. TIMING VALIDATION:
       ┌─────────────────────────────────────────────┐
       │ Check duration = end - start                 │
       │                                              │
       │ IF duration < 833ms (5/6 second):           │
       │   - Extend end time if no conflict          │
       │   - Minimum: 833ms                          │
       │                                              │
       │ IF duration > 7000ms (7 seconds):           │
       │   - Split into multiple segments            │
       │   - Find natural break points               │
       │                                              │
       │ ENSURE gap from previous >= 83ms (2 frames) │
       └─────────────────────────────────────────────┘

    2. TEXT SEGMENTATION:
       ┌─────────────────────────────────────────────┐
       │ Count characters in text                     │
       │                                              │
       │ IF chars <= 42:                             │
       │   - Single line, no split needed            │
       │                                              │
       │ IF 42 < chars <= 84:                        │
       │   - Split into 2 lines                      │
       │   - Apply line break rules (see below)      │
       │                                              │
       │ IF chars > 84:                              │
       │   - Split into multiple subtitle events     │
       │   - Use word timestamps for timing          │
       └─────────────────────────────────────────────┘

    3. LINE BREAK RULES (in priority order):
       ┌─────────────────────────────────────────────┐
       │ PREFER breaking:                            │
       │   ✓ After punctuation marks (. , ! ? :)    │
       │   ✓ Before conjunctions (and, but, or)     │
       │   ✓ Before prepositions (in, on, at, to)   │
       │                                              │
       │ AVOID breaking:                             │
       │   ✗ Between article and noun               │
       │   ✗ Between adjective and noun             │
       │   ✗ Between first and last name            │
       │   ✗ Between verb and subject pronoun       │
       │   ✗ Between verb and auxiliary             │
       └─────────────────────────────────────────────┘

    4. READING SPEED VALIDATION:
       ┌─────────────────────────────────────────────┐
       │ Calculate CPS = total_chars / duration_secs │
       │                                              │
       │ limit = 17 (children's) or 20 (adult) CPS   │
       │                                             │
       │ IF CPS > limit:                             │
       │   Option A: Extend duration (if space)      │
       │   Option B: Flag for manual review          │
       └─────────────────────────────────────────────┘

    5. CREATE SUBTITLE EVENT:
       - Assign sequential index (1, 2, 3...)
       - Format times as HH:MM:SS,mmm
       - Store processed text with line breaks

OUTPUT: List[Subtitle] ready for output
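
For the "chars > 84" branch, one simple approach is to pack words greedily into events of at most 84 characters and take each event's timing from its first and last word. This sketch ignores the natural-break-point preference, and `split_into_events` is a hypothetical name.

```python
MAX_EVENT_CHARS = 84  # 2 lines x 42 characters


def _close(chunk: list) -> dict:
    """Turn a run of words into one event timed by its first and last word."""
    return {
        "start": chunk[0]["start"],
        "end": chunk[-1]["end"],
        "text": " ".join(w["word"] for w in chunk),
    }


def split_into_events(words: list, max_chars: int = MAX_EVENT_CHARS) -> list:
    """Greedily group timestamped words into events of at most max_chars."""
    events, current, length = [], [], 0
    for w in words:
        token = w["word"].strip()
        if not token:
            continue
        # A joining space is needed for every word except the first in a chunk.
        added = len(token) + (1 if current else 0)
        if current and length + added > max_chars:
            events.append(_close(current))
            current, length = [], 0
            added = len(token)
        current.append({**w, "word": token})
        length += added
    if current:
        events.append(_close(current))
    return events
```

A single word longer than `max_chars` still becomes its own event, so callers that need a hard guarantee should check the resulting lengths.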

Step 6: Output Generation

INPUT: List[Subtitle], output_path, format

PROCESS:

  FOR SRT FORMAT:
    ┌────────────────────────────────────────┐
    │ 1                                       │
    │ 00:00:00,000 --> 00:00:02,500          │
    │ Hello everyone and welcome              │
    │ to today's video.                       │
    │                                         │
    │ 2                                       │
    │ 00:00:02,600 --> 00:00:05,000          │
    │ Today we're going to discuss           │
    │ something very important.               │
    └────────────────────────────────────────┘

    - Index number
    - Timestamp line: START --> END
    - Text (1-2 lines)
    - Blank line separator

  FOR VTT FORMAT (optional):
    ┌────────────────────────────────────────┐
    │ WEBVTT                                  │
    │                                         │
    │ 00:00:00.000 --> 00:00:02.500          │
    │ Hello everyone and welcome              │
    │ to today's video.                       │
    │                                         │
    │ 00:00:02.600 --> 00:00:05.000          │
    │ Today we're going to discuss           │
    │ something very important.               │
    └────────────────────────────────────────┘

    - Header: "WEBVTT"
    - Timestamp uses . instead of ,
    - Cue identifiers (index numbers) are optional and omitted here

OUTPUT: Subtitle file(s) written to disk
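
The two layouts differ only in the header, the index line, and the millisecond separator, so a shared timestamp formatter covers both. This is a sketch with hypothetical function names; each subtitle is assumed to be a dictionary with `start` and `end` in seconds and a `lines` list.

```python
def format_timestamp(seconds: float, separator: str = ",") -> str:
    """Format seconds as HH:MM:SS<sep>mmm (',' for SRT, '.' for VTT)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def render_srt(subtitles: list) -> str:
    """Render subtitles as SRT text with 1-based index numbers."""
    blocks = []
    for index, sub in enumerate(subtitles, start=1):
        start = format_timestamp(sub["start"])
        end = format_timestamp(sub["end"])
        blocks.append(f"{index}\n{start} --> {end}\n" + "\n".join(sub["lines"]))
    return "\n\n".join(blocks) + "\n"


def render_vtt(subtitles: list) -> str:
    """Render subtitles as WebVTT text without cue identifiers."""
    blocks = ["WEBVTT"]
    for sub in subtitles:
        start = format_timestamp(sub["start"], ".")
        end = format_timestamp(sub["end"], ".")
        blocks.append(f"{start} --> {end}\n" + "\n".join(sub["lines"]))
    return "\n\n".join(blocks) + "\n"
```

When writing the result to disk, opening the file with `encoding="utf-8"` matches the encoding listed in the typography rules.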

4. Netflix Compliance Rules

Timing Requirements

Rule	Value	Notes
Minimum Duration	833ms (5/6 sec)	20 frames at 24fps
Maximum Duration	7000ms (7 sec)	Per subtitle event
Minimum Gap	83ms (2 frames)	Between consecutive subtitles
Frame Rate Reference	24fps	Adjust for other frame rates
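
The timing rules can be applied in a single pass over events sorted by start time. The sketch below works in integer milliseconds to avoid floating-point drift; the `start_ms`/`end_ms` keys and the `flags` list are assumptions of this example. Events that are too long are only flagged here, since splitting needs word timestamps.

```python
MIN_DURATION_MS = 833   # 5/6 second, 20 frames at 24 fps
MAX_DURATION_MS = 7000
MIN_GAP_MS = 83         # 2 frames at 24 fps


def enforce_timing(events: list) -> list:
    """Return copies of events (sorted by start_ms) with timing rules applied."""
    fixed = []
    for i, ev in enumerate(events):
        start, end = ev["start_ms"], ev["end_ms"]
        # Extend events that are shorter than the minimum duration.
        if end - start < MIN_DURATION_MS:
            end = start + MIN_DURATION_MS
        # Never run closer than the minimum gap to the next event.
        if i + 1 < len(events):
            end = min(end, events[i + 1]["start_ms"] - MIN_GAP_MS)
        duration = end - start
        flags = []
        if duration < MIN_DURATION_MS:
            flags.append("too_short")   # could not extend without a conflict
        if duration > MAX_DURATION_MS:
            flags.append("too_long")    # needs splitting at a natural break
        fixed.append({**ev, "end_ms": end, "flags": flags})
    return fixed
```

Only out-times are moved, so the start of each subtitle stays aligned with the speech onset reported by the transcriber.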

Character & Line Limits

Rule	Value
Max Characters/Line	42
Max Lines/Subtitle	2
Max Total Characters	84 (42 × 2)
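
For text between 43 and 84 characters, a break point can be chosen by scoring every word boundary. The sketch below covers only the "prefer" rules from Step 5 (punctuation, then conjunctions, then prepositions) and uses line balance as a tie-breaker; it is a word-list heuristic with no grammatical analysis, so the "avoid" rules are not enforced. `break_lines` is a hypothetical name.

```python
MAX_LINE = 42
PUNCTUATION = ".,!?:"
CONJUNCTIONS = {"and", "but", "or"}
PREPOSITIONS = {"in", "on", "at", "to"}


def break_lines(text: str) -> list:
    """Split text into one or two lines of at most 42 characters each."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) <= MAX_LINE:
        return [text]
    words = text.split(" ")
    best = None
    for i in range(1, len(words)):
        top, bottom = " ".join(words[:i]), " ".join(words[i:])
        if len(top) > MAX_LINE or len(bottom) > MAX_LINE:
            continue
        if words[i - 1][-1] in PUNCTUATION:
            priority = 0            # break after punctuation
        elif words[i].lower() in CONJUNCTIONS:
            priority = 1            # break before a conjunction
        elif words[i].lower() in PREPOSITIONS:
            priority = 2            # break before a preposition
        else:
            priority = 3            # any other word boundary
        # Lower is better: rule priority first, then how balanced the lines are.
        score = (priority, abs(len(top) - len(bottom)))
        if best is None or score < best[0]:
            best = (score, [top, bottom])
    if best is None:
        raise ValueError("text does not fit in two lines of 42 characters")
    return best[1]
```

Because "to" is matched as a preposition even when it marks an infinitive, results from this heuristic would still benefit from the manual review path described elsewhere in this document.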

Reading Speed

Content Type	Max CPS
Adult Programs	20 characters/second
Children's Programs	17 characters/second
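
As a worked example of the reading-speed check, the snippet below computes CPS for one subtitle. Counting spaces but not the line break is an assumption of this sketch, and the function names are hypothetical.

```python
ADULT_MAX_CPS = 20
CHILDREN_MAX_CPS = 17


def chars_per_second(text: str, duration_ms: int) -> float:
    """Characters per second, counting spaces but not line breaks."""
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    return len(text.replace("\n", "")) / (duration_ms / 1000)


def exceeds_reading_speed(text: str, duration_ms: int, children: bool = False) -> bool:
    """True when the subtitle is faster than the limit for its audience."""
    limit = CHILDREN_MAX_CPS if children else ADULT_MAX_CPS
    return chars_per_second(text, duration_ms) > limit


subtitle = "Hello everyone and welcome\nto today's video."
print(chars_per_second(subtitle, 2500))                       # 43 chars / 2.5 s = 17.2
print(exceeds_reading_speed(subtitle, 2500))                  # False (limit 20)
print(exceeds_reading_speed(subtitle, 2500, children=True))   # True (limit 17)
```

The same subtitle passes the adult limit and fails the children's limit, which is why the limit has to be selected before the check runs.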

Positioning

Center justified
Bottom of screen (default)
Top positioning when avoiding on-screen text

Typography

Element	Specification
Font	Arial (proportionalSansSerif)
Color	White
Encoding	UTF-8

5. YouTube Format Compatibility

Supported Upload Formats

Format	Extension	Recommended
SubRip	.srt	✅ Primary
WebVTT	.vtt	✅ Alternative
SubViewer	.sbv, .sub	Supported
MPsub	.mpsub	Supported
LRC	.lrc	Supported
SAMI	.sami	Advanced
TTML	.ttml, .dfxp	Advanced
SCC	.scc	Broadcast

SRT Format Specification

[SEQUENCE_NUMBER]
[START_TIME] --> [END_TIME]
[SUBTITLE_TEXT]

[BLANK_LINE]

Time Format: HH:MM:SS,mmm

Hours: 00-99
Minutes: 00-59
Seconds: 00-59
Milliseconds: 000-999
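
The field ranges above translate directly into a regular expression. The following sketch validates an SRT timestamp and converts it to milliseconds; `parse_srt_time` is a hypothetical helper name.

```python
import re

# HH: 00-99, MM: 00-59, SS: 00-59, mmm: 000-999
_SRT_TIME = re.compile(r"^(\d{2}):([0-5]\d):([0-5]\d),(\d{3})$")


def parse_srt_time(value: str) -> int:
    """Convert an HH:MM:SS,mmm timestamp to milliseconds."""
    match = _SRT_TIME.match(value)
    if match is None:
        raise ValueError(f"Invalid SRT timestamp: {value!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
```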

Upload Process

Go to YouTube Studio
Select video → Subtitles
Add Language → Upload file
Choose "With timing" option
Select .srt file
Review and publish

6. Third-Party Tools & Services

Required Dependencies

yt-dlp

Purpose: YouTube audio extraction
License: Unlicense
Installation: pip install yt-dlp or uv add yt-dlp
Key Features:
- Audio-only download (-x / extract_audio)
- Format selection (-f bestaudio)
- Python API for embedding
- Handles age-restricted content (with cookies)

OpenAI Whisper

Purpose: Speech-to-text transcription
License: MIT
Installation: pip install openai-whisper or uv add openai-whisper
Models:

Model	Parameters	VRAM	Speed	Accuracy
tiny	39M	~1GB	~10x	Basic
base	74M	~1GB	~7x	Good
small	244M	~2GB	~4x	Better
medium	769M	~5GB	~2x	Very Good
large-v3	1550M	~10GB	1x	Best
turbo	809M	~6GB	~8x	Very Good

FFmpeg

Purpose: Audio processing (required by both yt-dlp and Whisper)
License: GPL/LGPL
Installation:
- macOS: brew install ffmpeg
- Ubuntu: apt install ffmpeg
- Windows: choco install ffmpeg

Optional Dependencies

Package	Purpose
`pysrt`	SRT file manipulation
`webvtt-py`	WebVTT handling
`rich`	CLI progress display

7. Data Structures

Core Data Models

VideoMetadata:
  ├── id: str                    # YouTube video ID
  ├── title: str                 # Video title
  ├── duration: float            # Duration in seconds
  ├── uploader: str              # Channel name
  └── upload_date: str           # YYYYMMDD format

TranscriptionSegment:
  ├── id: int                    # Segment index
  ├── start: float               # Start time (seconds)
  ├── end: float                 # End time (seconds)
  ├── text: str                  # Transcribed text
  └── words: List[Word]          # Word-level timestamps
      └── Word:
          ├── word: str
          ├── start: float
          └── end: float

TranscriptionResult:
  ├── language: str              # Detected/specified language
  ├── duration: float            # Total audio duration
  └── segments: List[TranscriptionSegment]

Subtitle:
  ├── index: int                 # Sequential number (1-based)
  ├── start_time: timedelta      # Start timestamp
  ├── end_time: timedelta        # End timestamp
  ├── text: str                  # Original text
  ├── lines: List[str]           # Formatted lines (1-2)
  ├── char_count: int            # Total characters
  ├── duration_ms: int           # Duration in milliseconds
  └── cps: float                 # Characters per second

SubtitleFile:
  ├── format: str                # "srt" or "vtt"
  ├── language: str              # Language code
  ├── video_id: str              # Source video ID
  └── subtitles: List[Subtitle]
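
In Python these models map naturally onto dataclasses. The sketch below covers the transcription and subtitle types; exposing `char_count`, `duration_ms`, and `cps` as computed properties instead of stored fields is a design choice of this example, not something the models above prescribe.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List


@dataclass
class Word:
    word: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class TranscriptionSegment:
    id: int
    start: float
    end: float
    text: str
    words: List[Word] = field(default_factory=list)


@dataclass
class Subtitle:
    index: int             # sequential number (1-based)
    start_time: timedelta
    end_time: timedelta
    text: str              # original text
    lines: List[str]       # formatted lines (1-2)

    @property
    def char_count(self) -> int:
        return sum(len(line) for line in self.lines)

    @property
    def duration_ms(self) -> int:
        return int((self.end_time - self.start_time) / timedelta(milliseconds=1))

    @property
    def cps(self) -> float:
        return self.char_count / (self.duration_ms / 1000)
```

Deriving the three values on access keeps them consistent when timing or line breaks are adjusted later in the pipeline.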

8. Error Handling

Error Categories

Category	Examples	Handling
Network	Connection timeout, DNS failure	Retry with backoff
YouTube	Video unavailable, private, age-restricted	User-friendly message
Transcription	Model load failure, CUDA error	Fallback to CPU/smaller model
Processing	Invalid timestamps, encoding errors	Log and skip segment
File System	Permission denied, disk full	Clear error message

Retry Strategy

Network Operations:
  - Max retries: 3
  - Backoff: Exponential (1s, 2s, 4s)
  - Timeout: 30 seconds

Model Loading:
  - Fallback chain: GPU → CPU
  - Model fallback: large-v3 → turbo → medium
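
The network retry policy can be sketched as a small helper. The name `retry`, the default exception types, and the injectable `sleep` parameter (useful for testing) are assumptions of this example; the 30-second timeout would be set on the underlying network call and is not shown.

```python
import time


def retry(operation, max_retries: int = 3, base_delay: float = 1.0,
          retry_on: tuple = (ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying with exponential backoff (1s, 2s, 4s)."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)
```

With the defaults, an operation is attempted up to four times in total: the initial call plus three retries.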

Graceful Degradation

If word timestamps fail → use segment timestamps
If language detection fails → prompt user
If CPS exceeds limit → flag for review (don't block)

9. Future Extensibility: Translation

Translation Module Interface

Protocol: ITranslator
  ├── translate(text: str, source: str, target: str) -> str
  ├── translate_batch(texts: List[str], source: str, target: str) -> List[str]
  ├── supports_language(lang: str) -> bool
  └── get_supported_languages() -> List[str]
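
Expressed in Python, this interface could be a `typing.Protocol`, so providers only need matching methods and no shared base class. `UppercaseTranslator` below is a toy stand-in used purely to show conformance; it is not one of the planned providers.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class ITranslator(Protocol):
    def translate(self, text: str, source: str, target: str) -> str: ...

    def translate_batch(self, texts: List[str], source: str, target: str) -> List[str]: ...

    def supports_language(self, lang: str) -> bool: ...

    def get_supported_languages(self) -> List[str]: ...


class UppercaseTranslator:
    """Toy provider: 'translates' by upper-casing the text."""

    def translate(self, text: str, source: str, target: str) -> str:
        return text.upper()

    def translate_batch(self, texts: List[str], source: str, target: str) -> List[str]:
        return [self.translate(t, source, target) for t in texts]

    def supports_language(self, lang: str) -> bool:
        return lang in self.get_supported_languages()

    def get_supported_languages(self) -> List[str]:
        return ["en"]
```

Note that `isinstance` checks against a runtime-checkable protocol only confirm that the methods exist, not that their signatures match.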

Planned Translation Providers

Provider	API	Cost Model
Google Translate	Cloud Translation API	Per character
DeepL	DeepL API	Per character
OpenAI	GPT-4 API	Per token
Local	NLLB, MarianMT	Free (compute)

Translation Workflow

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Source    │    │ Translation │    │  Translated │
│  Subtitles  │───▶│   Service   │───▶│  Subtitles  │
│   (SRT)     │    │             │    │   (SRT)     │
└─────────────┘    └─────────────┘    └─────────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │   Timing    │
                   │ Preservation│
                   │  (no sync   │
                   │   changes)  │
                   └─────────────┘

Translation Considerations

Preserve Timing: Translation must maintain original timestamps
Character Limits: Translated text must fit within 42 chars/line
Reading Speed: May need adjustment for target language
Cultural Adaptation: Numbers, dates, cultural references
Language-Specific Rules: Different Netflix style guides per language

10. Visual Diagrams

Process Flow Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                           SUBTITLE GENERATION FLOW                           │
└─────────────────────────────────────────────────────────────────────────────┘

    ┌───────────┐
    │  START    │
    └─────┬─────┘
          │
          ▼
    ┌───────────────┐     ┌─────────────────┐
    │ Parse YouTube │     │ Invalid URL?    │
    │     URL       │────▶│ Show Error      │───▶ END
    └───────┬───────┘     └─────────────────┘
            │ Valid
            ▼
    ┌───────────────┐     ┌─────────────────┐
    │ Extract Video │     │ Video Private/  │
    │   Metadata    │────▶│ Unavailable?    │───▶ END
    └───────┬───────┘     └─────────────────┘
            │ Available
            ▼
    ┌───────────────┐
    │ Download      │
    │ Audio Stream  │
    │ (yt-dlp)      │
    └───────┬───────┘
            │
            ▼
    ┌───────────────┐
    │ Transcribe    │
    │ Audio         │
    │ (Whisper)     │
    └───────┬───────┘
            │
            ▼
    ┌───────────────┐
    │ Process       │
    │ Segments      │◀───────────────────┐
    └───────┬───────┘                    │
            │                             │
            ▼                             │
    ┌───────────────┐                    │
    │ Apply Netflix │     ┌──────────┐   │
    │ Timing Rules  │────▶│ Duration │   │
    └───────┬───────┘     │ Invalid? │───┘
            │             └──────────┘
            │ Valid         Adjust
            ▼
    ┌───────────────┐
    │ Format Text   │
    │ (Line Breaks) │
    └───────┬───────┘
            │
            ▼
    ┌───────────────┐     ┌─────────────────┐
    │ Check Reading │     │ CPS > 20?       │
    │ Speed (CPS)   │────▶│ Flag/Adjust     │
    └───────┬───────┘     └────────┬────────┘
            │                       │
            ▼◀──────────────────────┘
    ┌───────────────┐
    │ Generate SRT  │
    │ Output File   │
    └───────┬───────┘
            │
            ▼
    ┌───────────────┐
    │ Cleanup Temp  │
    │ Files         │
    └───────┬───────┘
            │
            ▼
    ┌───────────┐
    │    END    │
    └───────────┘

State Machine: Subtitle Event Processing

┌─────────────────────────────────────────────────────────────────┐
│                    SUBTITLE EVENT STATE MACHINE                  │
└─────────────────────────────────────────────────────────────────┘

                    ┌──────────────┐
                    │   RAW        │
                    │  SEGMENT     │
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
              ▼            ▼            ▼
       ┌──────────┐ ┌──────────┐ ┌──────────┐
       │ TOO      │ │ VALID    │ │ TOO      │
       │ SHORT    │ │ DURATION │ │ LONG     │
       │ (<833ms) │ │          │ │ (>7000ms)│
       └────┬─────┘ └────┬─────┘ └────┬─────┘
            │            │            │
            ▼            │            ▼
       ┌──────────┐      │      ┌──────────┐
       │ EXTEND   │      │      │ SPLIT    │
       │ END TIME │      │      │ SEGMENT  │
       └────┬─────┘      │      └────┬─────┘
            │            │            │
            └────────────┼────────────┘
                         │
                         ▼
                  ┌──────────────┐
                  │   TIMING     │
                  │   VALIDATED  │
                  └──────┬───────┘
                         │
              ┌──────────┼──────────┐
              │          │          │
              ▼          ▼          ▼
       ┌──────────┐┌──────────┐┌──────────┐
       │  ≤42     ││  43-84   ││   >84    │
       │  CHARS   ││  CHARS   ││  CHARS   │
       └────┬─────┘└────┬─────┘└────┬─────┘
            │           │           │
            │           ▼           ▼
            │    ┌──────────┐ ┌──────────┐
            │    │  SPLIT   │ │  CREATE  │
            │    │  2 LINES │ │ MULTIPLE │
            │    └────┬─────┘ │  EVENTS  │
            │         │       └────┬─────┘
            │         │            │
            └─────────┴────────────┘
                      │
                      ▼
               ┌──────────────┐
               │   COMPLETE   │
               │   SUBTITLE   │
               └──────────────┘

CLI Command Structure

Usage: subsync generate <youtube_url> [OPTIONS]

Arguments:
  youtube_url          YouTube video URL (required)

Options:
  -l, --language TEXT  Audio language code (default: auto-detect)
  -o, --output PATH    Output file path (default: ./video_title.srt)
  -f, --format TEXT    Output format: srt, vtt (default: srt)
  -q, --quality TEXT   Model quality: fast, balanced, best (default: balanced)
  --children           Apply stricter reading speed for children's content
  -v, --verbose        Show detailed progress
  --help               Show this help message

Examples:
  subsync generate "https://youtube.com/watch?v=abc123" -l en
  subsync generate "https://youtu.be/abc123" -o subtitles.srt
  subsync generate "https://youtube.com/watch?v=abc123" -q best --verbose
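
The command structure above could be wired up with `argparse` from the standard library, as in this sketch. The choice of `argparse` and the name `build_parser` are assumptions; the options and defaults are taken from the usage text. How the `fast`, `balanced`, and `best` quality levels map onto Whisper model names is left open here because this document does not define it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the 'subsync generate' command-line parser."""
    parser = argparse.ArgumentParser(prog="subsync")
    commands = parser.add_subparsers(dest="command", required=True)

    gen = commands.add_parser("generate", help="Generate subtitles for a YouTube video")
    gen.add_argument("youtube_url", help="YouTube video URL")
    gen.add_argument("-l", "--language", default=None,
                     help="Audio language code (default: auto-detect)")
    gen.add_argument("-o", "--output", default=None,
                     help="Output file path (default: ./video_title.srt)")
    gen.add_argument("-f", "--format", choices=["srt", "vtt"], default="srt",
                     help="Output format")
    gen.add_argument("-q", "--quality", choices=["fast", "balanced", "best"],
                     default="balanced", help="Model quality")
    gen.add_argument("--children", action="store_true",
                     help="Apply stricter reading speed for children's content")
    gen.add_argument("-v", "--verbose", action="store_true",
                     help="Show detailed progress")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```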

Appendix A: Language Codes

Whisper supports 99+ languages. Common codes:

Code	Language	Code	Language
en	English	es	Spanish
fr	French	de	German
it	Italian	pt	Portuguese
ru	Russian	ja	Japanese
ko	Korean	zh	Chinese
ar	Arabic	hi	Hindi

Full list: https://github.com/openai/whisper/blob/main/whisper/tokenizer.py

Appendix B: Netflix Style Guide References

Appendix C: YouTube Subtitle Documentation

Document History

Version	Date	Changes
1.0	2025-12-04	Initial algorithm design

This document serves as the architectural blueprint for the subsync subtitle generation feature. Implementation should follow these specifications to ensure Netflix compliance and YouTube compatibility.
