This document describes the system architecture, component responsibilities, and data flow of Transcript Create.
- High-Level Architecture
- Component Responsibilities
- Data Flow
- Database Schema
- Technology Stack
- Design Decisions
Transcript Create follows a microservices-inspired architecture with clear separation of concerns:
┌─────────────────────────────────────────────────────────────────┐
│ Frontend (React) │
│ Vite + React + Tailwind + TypeScript │
│ http://localhost:5173 │
└─────────────────────┬───────────────────────────────────────────┘
│ HTTP/REST
│
┌─────────────────────▼───────────────────────────────────────────┐
│ API Server (FastAPI) │
│ http://localhost:8000 │
│ ┌──────────────┬─────────────┬───────────┬──────────────┐ │
│ │ Auth Routes │ Job Routes │ Video │ Search │ │
│ │ │ │ Routes │ Routes │ │
│ ├──────────────┼─────────────┼───────────┼──────────────┤ │
│ │ Billing │ Admin │ Favorites │ Export │ │
│ │ Routes │ Routes │ Routes │ Routes │ │
│ └──────────────┴─────────────┴───────────┴──────────────┘ │
└─────────────────────┬───────────────────────────────────────────┘
│
├─────────────────┐
│ │
┌───────────▼──────┐ ┌───────▼──────────┐
│ PostgreSQL │ │ OpenSearch │
│ (Primary DB) │ │ (Optional) │
│ Queue + Store │ │ Search Index │
└──────────────────┘ └──────────────────┘
▲
│
┌───────────┴──────────┐
│ Worker Process │
│ (Python) │
│ │
│ ┌────────────────┐ │
│ │ Job Processor │ │
│ │ Video Pipeline │ │
│ │ Whisper Model │ │
│ │ Diarization │ │
│ └────────────────┘ │
└──────────────────────┘
│
┌───────────▼──────────┐
│ File Storage │
│ /data/<video_uuid>/ │
│ (Audio + Metadata) │
└──────────────────────┘
Location: frontend/
Purpose: User interface for interacting with the transcription service
Key Features:
- Search interface with filtering and highlighting
- Video transcript viewer with timestamp deep links
- Export menu (SRT, VTT, JSON, PDF)
- Authentication (OAuth login)
- Billing/upgrade flow with Stripe
- Admin dashboard (for authorized users)
Tech Stack:
- Vite for fast development and optimized builds
- React 18 with hooks
- TailwindCSS for styling
- React Router for navigation
- Axios for API calls
- TypeScript for type safety
Location: app/
Purpose: RESTful API for managing jobs, videos, transcripts, and user operations
Responsibilities:
- Handle HTTP requests from frontend
- Authentication and authorization (OAuth, session management)
- Job creation and status tracking
- Video metadata and transcript retrieval
- Search orchestration (PostgreSQL FTS or OpenSearch)
- Export generation (SRT, VTT, JSON, PDF)
- Billing integration (Stripe webhooks)
- Admin operations and analytics
Key Modules:
- app/routes/: Modular router definitions
  - auth.py: OAuth login (Google, Twitch), session management
  - jobs.py: Job creation and status
  - videos.py: Video metadata and transcript access
  - search.py: Full-text search across transcripts
  - exports.py: Export format generation
  - billing.py: Stripe Checkout and Customer Portal
  - admin.py: Admin-only analytics and user management
  - favorites.py: User favorites management
  - events.py: Event logging for analytics
- app/crud.py: Database operations with SQLAlchemy
- app/db.py: Database connection management
- app/schemas.py: Pydantic models for request/response validation
- app/settings.py: Configuration via environment variables
API Patterns:
- RESTful endpoints with standard HTTP methods
- JSON request/response bodies
- Session-based authentication with cookies
- Error handling with structured JSON responses
- Health check endpoints for load balancers
Location: worker/
Purpose: Background processing of transcription jobs
Responsibilities:
- Poll database for pending videos (using FOR UPDATE SKIP LOCKED)
- Download audio from YouTube (yt-dlp)
- Transcode to 16 kHz mono WAV (ffmpeg)
- Chunk long audio for memory management
- Run Whisper transcription (GPU-accelerated)
- Optional speaker diarization (pyannote.audio)
- Merge and align segments
- Update database with results
- Clean up temporary files
Key Modules:
- worker/loop.py: Main worker loop with job polling
- worker/pipeline.py: Orchestrates the full processing pipeline
- worker/audio.py: Audio download and preprocessing
- worker/whisper_runner.py: Whisper model loading and transcription
- worker/diarize.py: Speaker diarization and alignment
State Transitions:
pending → downloading → transcoding → transcribing → completed
↘ failed
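The transitions above can be expressed as a small lookup table. The following is a minimal sketch, not the worker's actual code: the `VideoState` enum and `can_transition` helper are hypothetical names, and it assumes that any non-terminal state may move to `failed`, which the diagram suggests but does not state outright.

```python
from enum import Enum


class VideoState(str, Enum):
    PENDING = "pending"
    DOWNLOADING = "downloading"
    TRANSCODING = "transcoding"
    TRANSCRIBING = "transcribing"
    COMPLETED = "completed"
    FAILED = "failed"


# Forward edges taken from the diagram. Assumption: every non-terminal
# state may also transition to FAILED.
_NEXT = {
    VideoState.PENDING: VideoState.DOWNLOADING,
    VideoState.DOWNLOADING: VideoState.TRANSCODING,
    VideoState.TRANSCODING: VideoState.TRANSCRIBING,
    VideoState.TRANSCRIBING: VideoState.COMPLETED,
}


def can_transition(current: VideoState, target: VideoState) -> bool:
    """Return True if moving from `current` to `target` is allowed."""
    if current not in _NEXT:  # completed and failed are terminal
        return False
    return target is VideoState.FAILED or target is _NEXT[current]
```

A guard like this could run before each state update so that a bug cannot move a video backwards or skip a stage.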
Processing Pipeline:
- Expand Jobs: Convert channel URLs to individual video entries
- Select Video: Pick one pending video with row-level locking
- Download: Use yt-dlp to fetch audio and metadata
- Transcode: Convert to 16 kHz mono WAV
- Chunk: Split long audio into manageable segments
- Transcribe: Run Whisper on each chunk
- Diarize (optional): Identify speakers and align with transcript
- Persist: Save transcript and segments to database
- Cleanup: Remove temporary files (if configured)
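As an illustration of the Transcode step, the sketch below builds an ffmpeg command line that produces 16 kHz mono WAV. The function name `build_transcode_command` is hypothetical and the real worker may pass additional options; only the standard ffmpeg flags `-y`, `-i`, `-ac`, and `-ar` are used.

```python
from pathlib import Path


def build_transcode_command(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg argv that converts `src` to 16 kHz mono audio at `dst`."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", str(src),  # input: the audio fetched by yt-dlp
        "-ac", "1",      # one audio channel (mono)
        "-ar", "16000",  # 16 kHz sample rate
        str(dst),        # the .wav extension selects the WAV container
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; passing an argument list rather than a shell string avoids quoting problems with unusual file names.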
Location: SQL schema in sql/schema.sql, migrations in alembic/
Purpose: Primary data store and job queue
Core Tables:
- jobs: Job metadata and status
- videos: Video information, processing state, YouTube metadata
- transcripts: Transcript metadata (model, language, duration)
- segments: Individual transcript segments with timestamps
- youtube_captions: YouTube auto-captions for comparison
- youtube_caption_segments: Individual YouTube caption segments
- users: User accounts (OAuth)
- sessions: Session management
- favorites: User-favorited videos
- events: Analytics event log
Queue Mechanism:
- Workers use SELECT ... FOR UPDATE SKIP LOCKED to claim rows without blocking one another
- Multiple workers can process different videos simultaneously
- No external queue service needed (RabbitMQ, Redis, etc.)
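A minimal sketch of how a worker might claim one video with this mechanism is shown below. It is an assumption-laden illustration rather than the project's code: the `id` column, the `claim_next_video` helper, and the choice to flip the state inside the same statement are all hypothetical. The cursor is any DB-API style cursor (for example one from psycopg), and the caller is expected to commit the transaction.

```python
from typing import Any, Optional

# Assumption: videos has an `id` primary key plus the `state` and
# `created_at` columns used in the worker's polling query.
CLAIM_SQL = """
UPDATE videos
SET state = 'downloading'
WHERE id = (
    SELECT id FROM videos
    WHERE state = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""


def claim_next_video(cursor: Any) -> Optional[Any]:
    """Claim the oldest pending video, or return None if none is free.

    Rows locked by another worker are skipped instead of waited on.
    """
    cursor.execute(CLAIM_SQL)
    row = cursor.fetchone()
    return row[0] if row else None
```

Because locked rows are skipped, two workers running this at the same moment receive different videos, or one of them receives `None` and sleeps until the next poll.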
Search Indexes:
- PostgreSQL Full-Text Search (GIN indexes)
- Triggers for automatic index updates
- Optional OpenSearch for advanced features
Location: /data/<video_uuid>/ (Docker volume or local directory)
Purpose: Store downloaded audio and temporary files
Contents:
- audio.opus or audio.m4a: Original downloaded audio
- audio.wav: Transcoded 16 kHz mono WAV
- audio_chunk_*.wav: Chunked audio segments
- Metadata files from yt-dlp
Lifecycle:
- Created during video processing
- Optionally cleaned up after successful transcription (configurable)
- Persistent across container restarts (via Docker volumes)
Purpose: Advanced search capabilities with highlighting and ranking
Features:
- Synonym expansion
- N-gram matching for partial words
- Relevance scoring
- Fast highlighting
- Aggregations and faceting
Integration:
- Indexer script: scripts/opensearch_indexer.py
- Search backend setting: SEARCH_BACKEND=opensearch
- Falls back to PostgreSQL FTS if OpenSearch is unavailable
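The fallback behavior can be sketched as a small selection function. This is illustrative only: `pick_search_backend`, the `"postgres"` label, and the injected health check are hypothetical names, and the real API may detect availability differently.

```python
from typing import Callable


def pick_search_backend(configured: str, opensearch_is_up: Callable[[], bool]) -> str:
    """Return the backend to use for a search request.

    `configured` is the value of SEARCH_BACKEND. OpenSearch is used only
    when it is both requested and reachable; otherwise PostgreSQL FTS.
    """
    if configured == "opensearch" and opensearch_is_up():
        return "opensearch"
    return "postgres"
```

Passing the health check in as a callable keeps the decision easy to test without a running OpenSearch cluster.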
1. User submits YouTube URL via frontend
└─> POST /jobs { url, kind }
2. API validates URL and creates job record
└─> INSERT INTO jobs (url, kind, state='pending')
└─> Returns job_id to frontend
3. Frontend polls job status
└─> GET /jobs/{job_id}
└─> Returns job state + video count
4. Worker expands job into videos (for channels)
└─> Run yt-dlp --flat-playlist
└─> INSERT INTO videos (job_id, youtube_id, title, ...)
└─> UPDATE jobs SET state='expanded'
1. Worker polls for pending video
└─> SELECT * FROM videos
WHERE state='pending'
ORDER BY created_at
FOR UPDATE SKIP LOCKED LIMIT 1
2. Download audio
└─> yt-dlp downloads to /data/{video_uuid}/
└─> UPDATE videos SET state='downloading'
3. Transcode audio
└─> ffmpeg converts to 16 kHz mono WAV
└─> UPDATE videos SET state='transcoding'
4. Transcribe audio
└─> Chunk audio into segments
└─> Load Whisper model
└─> Transcribe each chunk
└─> UPDATE videos SET state='transcribing'
5. Diarize (optional)
└─> Run pyannote.audio
└─> Align speakers with transcript segments
6. Save results
└─> INSERT INTO transcripts (video_id, model, language, ...)
└─> INSERT INTO segments (transcript_id, start_ms, end_ms, text, ...)
└─> UPDATE videos SET state='completed'
7. Cleanup
└─> Remove temporary files (if CLEANUP_* enabled)
1. User enters search query in frontend
└─> GET /search?q=query&source=native
2. API routes to appropriate search backend
3a. PostgreSQL FTS:
└─> SELECT * FROM segments
WHERE search_vector @@ plainto_tsquery('query')
ORDER BY ts_rank(...)
3b. OpenSearch:
└─> POST /_search
└─> Multi-field query with highlighting
└─> Boosting, synonyms, n-grams
4. API groups results by video
└─> Returns grouped hits with timestamps and highlights
5. Frontend displays results
└─> Video list with matching segments
└─> Click segment → deep link to timestamp
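Step 4 of the search flow, grouping hits by video, might look like the sketch below. The hit shape (`video_id`, `start_ms`, `text`) and the function name are assumptions for illustration; the actual response schema is defined in the API code.

```python
from collections import defaultdict


def group_hits_by_video(hits: list[dict]) -> list[dict]:
    """Group segment-level hits by video.

    Videos keep the order in which they first appear in `hits` (assumed
    to be rank order); segments inside a video are sorted by start time.
    """
    grouped: dict = defaultdict(list)
    for hit in hits:
        grouped[hit["video_id"]].append(
            {"start_ms": hit["start_ms"], "text": hit["text"]}
        )
    return [
        {"video_id": video_id, "segments": sorted(segs, key=lambda s: s["start_ms"])}
        for video_id, segs in grouped.items()
    ]
```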
1. User clicks export button
└─> GET /videos/{id}/transcript.srt
2. API checks quotas
└─> Query user plan and daily export count
└─> Return 402 if quota exceeded
3. Generate export format
└─> Fetch segments from database
└─> Format as SRT/VTT/JSON/PDF
└─> Log export event
4. Return file
└─> Content-Disposition: attachment
└─> Frontend triggers download
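The "Format as SRT" step in the export flow can be illustrated with a short sketch. The segment fields `start_ms`, `end_ms`, and `text` follow the columns named in the processing flow; the helper names `srt_timestamp` and `to_srt` are hypothetical and the project's exporter may differ.

```python
def srt_timestamp(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as SRT cues numbered from 1, separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start_ms"])
        end = srt_timestamp(seg["end_ms"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text']}\n")
    return "\n".join(blocks)
```

VTT output would follow the same pattern with a `WEBVTT` header and a period instead of a comma before the milliseconds.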
jobs (1) ─────< (N) videos
│
│ (1)
▼
transcripts (1) ─────< (N) segments
│
│ (1)
▼
(N) youtube_captions ─────< (N) youtube_caption_segments
users (1) ─────< (N) sessions
(1) ─────< (N) favorites ────> (1) videos
(1) ─────< (N) events
Job States:
- pending: Newly created, awaiting expansion
- expanded: Videos created, ready for processing
- processing: One or more videos are being processed
- completed: All videos completed
- failed: Job failed during expansion
Video States:
- pending: Awaiting processing
- downloading: Downloading audio
- transcoding: Converting audio format
- transcribing: Running Whisper
- completed: Successfully transcribed
- failed: Processing failed
- Python 3.11+: Primary language
- FastAPI: High-performance async web framework
- SQLAlchemy: ORM and query builder
- Alembic: Database migrations
- Pydantic: Data validation and settings management
- psycopg: PostgreSQL adapter (psycopg3)
- uvicorn: ASGI server
- OpenAI Whisper: Speech recognition model
- faster-whisper: Optimized Whisper inference (CTranslate2)
- PyTorch: Deep learning framework
- pyannote.audio: Speaker diarization (optional)
- ROCm / CUDA: GPU acceleration
- yt-dlp: YouTube video/audio download
- ffmpeg: Audio/video transcoding
- NumPy: Numerical operations
- React 18: UI library
- Vite: Build tool and dev server
- TypeScript: Type-safe JavaScript
- TailwindCSS: Utility-first CSS framework
- React Router: Client-side routing
- Axios: HTTP client
- PostgreSQL 15+: Primary database
- OpenSearch (optional): Search engine
- Docker: Containerization
- Docker Compose: Local orchestration
- Kubernetes: Production deployment
- Prometheus: Metrics collection
- Grafana: Metrics visualization
- pytest: Python testing framework
- Playwright: E2E browser testing
- Vitest: Frontend unit testing
- ruff: Fast Python linter
- black: Python code formatter
- isort: Import sorting
- mypy: Static type checking
- ESLint: JavaScript/TypeScript linting
- Prettier: Frontend code formatting
- pre-commit: Git hooks for quality checks
Pros:
- Native async/await support for high concurrency
- Automatic OpenAPI documentation
- Built-in data validation with Pydantic
- High performance (on par with Node.js)
- Type hints enable better IDE support
Cons:
- Smaller ecosystem than Flask/Django
- Async patterns have a learning curve
Pros:
- No additional service to manage (simpler ops)
- ACID transactions ensure reliability
- FOR UPDATE SKIP LOCKED lets workers claim rows without blocking each other
- Query jobs and queue with same tool
- Backup/restore includes queue state
Cons:
- Not designed for high-throughput queues
- Polling overhead (mitigated with reasonable intervals)
Alternative Considered: RabbitMQ, Redis, Celery
- Would add operational complexity
- Overkill for moderate job volumes
- Our workload: long-running tasks, not high throughput
Problem: Large Whisper models (large-v3) consume significant VRAM. Long videos (1+ hours) can exceed VRAM limits.
Solution: Split audio into chunks (default 15 minutes), transcribe independently, merge with time offsets.
Trade-offs:
- Pro: Predictable memory usage, parallel processing potential
- Con: Potential discontinuity at chunk boundaries (context overlap between chunks is a planned mitigation)
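The "merge with time offsets" step can be sketched as follows. This is a simplified illustration that assumes fixed-length chunks with no overlap and per-chunk segments whose times are relative to the start of their chunk; `merge_chunk_segments` is a hypothetical name.

```python
CHUNK_MS = 15 * 60 * 1000  # default chunk length of 15 minutes


def merge_chunk_segments(chunks: list[list[dict]], chunk_ms: int = CHUNK_MS) -> list[dict]:
    """Shift each chunk's segments onto the timeline of the full audio.

    `chunks[i]` holds the segments transcribed from the i-th chunk, with
    start_ms and end_ms measured from the beginning of that chunk.
    """
    merged = []
    for chunk_index, segments in enumerate(chunks):
        offset = chunk_index * chunk_ms
        for seg in segments:
            merged.append(
                {
                    "start_ms": seg["start_ms"] + offset,
                    "end_ms": seg["end_ms"] + offset,
                    "text": seg["text"],
                }
            )
    return merged
```

If the last chunk is shorter than the others the offsets are still correct, since each offset depends only on the lengths of the chunks before it.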
Decision: Make OpenSearch optional, default to PostgreSQL FTS
Rationale:
- PostgreSQL FTS sufficient for basic use cases
- OpenSearch adds operational overhead (JVM, memory, management)
- Users can opt-in for advanced features (highlighting, synonyms, ranking)
When to use OpenSearch:
- Large corpus (100K+ documents)
- Advanced search features needed
- Performance requirements exceed PostgreSQL FTS
Context: Project initially developed on AMD hardware
Implementation:
- Dockerfile uses ROCm base image
- GPU_DEVICE_PREFERENCE=hip,cuda tries ROCm first
- Falls back gracefully to CUDA or CPU
Future: Maintain both GPU backends equally, select via build variant
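The preference-then-fallback logic can be sketched as below. The function `choose_device` and the injected `is_available` probe are hypothetical; the worker's real detection code is not shown in this document and may rely on PyTorch or another runtime check.

```python
from typing import Callable


def choose_device(preference: str, is_available: Callable[[str], bool]) -> str:
    """Pick the first available backend from a comma-separated preference list.

    `preference` mirrors GPU_DEVICE_PREFERENCE, e.g. "hip,cuda".
    Falls back to "cpu" when no listed backend is available.
    """
    for name in (part.strip() for part in preference.split(",")):
        if name and is_available(name):
            return name
    return "cpu"
```

For example, `choose_device("hip,cuda", probe)` returns `"cuda"` on a machine where the probe reports only CUDA as available.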
Decision: Server-side sessions over JWT
Rationale:
- Easier session revocation (logout, security incidents)
- Smaller cookies (just session ID, not full JWT payload)
- OAuth tokens stored server-side (more secure)
Trade-off: Requires database lookup per request (mitigated by connection pooling)
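To make the revocation argument concrete, here is a minimal in-memory sketch of server-side sessions. It is a stand-in for the sessions table, not the project's implementation: `SessionStore` and its methods are hypothetical, and a real store would persist rows and enforce expiry.

```python
import secrets
from typing import Optional


class SessionStore:
    """In-memory stand-in for the sessions table (illustration only)."""

    def __init__(self) -> None:
        self._sessions: dict[str, str] = {}

    def create(self, user_id: str) -> str:
        """Create a session and return the opaque ID stored in the cookie."""
        session_id = secrets.token_urlsafe(32)
        self._sessions[session_id] = user_id
        return session_id

    def lookup(self, session_id: str) -> Optional[str]:
        """Resolve a cookie value to a user ID; None if unknown or revoked."""
        return self._sessions.get(session_id)

    def revoke(self, session_id: str) -> None:
        """Invalidate a session immediately, e.g. on logout."""
        self._sessions.pop(session_id, None)
```

Revocation takes effect on the very next request because every request performs a lookup, which is the per-request cost the trade-off above refers to.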
- Read Code Guidelines: code-guidelines.md
- Learn Testing Practices: testing.md
- Understand Release Process: release-process.md
- Explore Codebase: Start with app/main.py and worker/loop.py
If you have questions about the architecture:
- Open an issue with the question label
- Check existing issues and discussions
- Review inline code comments for implementation details