This document describes the monitoring infrastructure for the Transcript Create application.
The monitoring stack consists of:
- Prometheus: Time-series database for metrics collection
- Grafana: Visualization and dashboarding platform
- Application Metrics: Custom metrics exposed by API and Worker services
# Start all services including Prometheus and Grafana
docker compose up -d
# Check that monitoring services are running
docker compose ps prometheus grafana
- Grafana: http://localhost:3000
  - Username: admin
  - Password: admin (change on first login)
- Prometheus: http://localhost:9090
Three dashboards are automatically provisioned:
- Overview (transcript-overview)
  - Service health status
  - Request rates
  - Job and video statistics
  - Queue depth
- API Performance (transcript-api)
  - Request rate by endpoint
  - Response time percentiles (p50, p95, p99)
  - Error rates
  - Success rate
  - Concurrent requests
  - Search query rate
- Transcription Pipeline (transcript-pipeline)
  - Transcription duration
  - Pipeline stage durations (download, transcode, diarization)
  - Video queue status
  - Processing rates
  - Chunks per video
  - Model load times
- http_requests_total (counter): Total HTTP requests by method, endpoint, and status
- http_request_duration_seconds (histogram): Request latency distribution
- http_requests_in_flight (gauge): Current concurrent requests

- jobs_created_total (counter): Jobs created by type (single/channel)
- jobs_completed_total (counter): Successfully completed jobs
- jobs_failed_total (counter): Failed jobs
- videos_transcribed_total (counter): Successfully transcribed videos
- search_queries_total (counter): Search queries by backend (postgres/opensearch)
- exports_total (counter): Exports by format (srt/vtt/json/pdf)

- db_connections_active (gauge): Active database connections
- db_query_duration_seconds (histogram): Query duration distribution
- db_errors_total (counter): Database errors by type

- transcription_duration_seconds (histogram): Total transcription time by model
- download_duration_seconds (histogram): Audio download time
- transcode_duration_seconds (histogram): Audio transcoding time
- diarization_duration_seconds (histogram): Speaker diarization time

- videos_pending (gauge): Videos waiting to be processed
- videos_in_progress (gauge): Videos currently being processed by state
- videos_processed_total (counter): Completed/failed video count

- whisper_model_load_seconds (histogram): Model loading time by model and backend
- whisper_chunk_transcription_seconds (histogram): Per-chunk transcription time
- chunk_count (histogram): Number of audio chunks per video

- gpu_memory_used_bytes (gauge): GPU memory in use by device
- gpu_memory_total_bytes (gauge): Total GPU memory by device

These metrics track yt-dlp operations for audio download, metadata fetch, and caption retrieval:
- ytdlp_operation_duration_seconds (histogram): Duration of yt-dlp operations by operation type and client strategy
  - Labels: operation (download, metadata, captions), client (web_safari, ios, android, tv, direct, default)
  - Buckets: 1s, 2s, 5s, 10s, 20s, 30s, 60s, 120s, 180s, 300s, 600s
- ytdlp_operation_attempts_total (counter): Total operation attempts by result
  - Labels: operation, client, result (success, failure)
- ytdlp_operation_errors_total (counter): Failed operations by error classification
  - Labels: operation, client, error_class (network, throttle, auth, token, not_found, timeout, unknown)
- ytdlp_token_usage_total (counter): Operations tracked by PO token presence
  - Labels: operation, has_token (true, false)
- youtube_circuit_breaker_state (gauge): Circuit breaker state
  - Labels: name (youtube_download, youtube_metadata)
  - Values: 0=closed, 1=half_open, 2=open
- youtube_circuit_breaker_transitions_total (counter): State transitions
  - Labels: name, from_state, to_state
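The numeric encoding of youtube_circuit_breaker_state (0=closed, 1=half_open, 2=open) can be mirrored in code when interpreting scraped values. The following is a minimal sketch; the CircuitState enum and the describe_state helper are hypothetical names introduced for illustration and are not taken from the application.

```python
from enum import IntEnum


class CircuitState(IntEnum):
    """Hypothetical mirror of the documented gauge values."""

    CLOSED = 0
    HALF_OPEN = 1
    OPEN = 2


def describe_state(value: float) -> str:
    """Translate a scraped gauge value into its lowercase state name.

    Raises ValueError for values outside the documented 0..2 range.
    """
    return CircuitState(int(value)).name.lower()


if __name__ == "__main__":
    for raw in (0.0, 1.0, 2.0):
        print(raw, describe_state(raw))
```

Prometheus reports gauge samples as floats, which is why the helper converts to int before the lookup.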
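Success rate by client strategy (both sides are aggregated by client; without aggregation the division only pairs series with identical labels, including result, and always yields 1):
sum by (client) (rate(ytdlp_operation_attempts_total{result="success"}[5m]))
/
sum by (client) (rate(ytdlp_operation_attempts_total[5m]))
95th percentile download duration by client:
histogram_quantile(0.95,
  sum by (le, client) (rate(ytdlp_operation_duration_seconds_bucket{operation="download"}[5m]))
)
Error rate by classification:
sum by (error_class) (rate(ytdlp_operation_errors_total[5m]))
Token usage percentage:
100 * sum(rate(ytdlp_token_usage_total{has_token="true"}[5m]))
/
sum(rate(ytdlp_token_usage_total[5m]))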
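To make the arithmetic behind these ratio queries concrete, here is a simplified sketch of how a per-second rate and a success ratio follow from raw counter samples. The function names and sample values are illustrative assumptions; the real PromQL rate() function additionally handles counter resets and extrapolates to the window boundaries, which this sketch does not.

```python
def per_second_rate(first: float, last: float, window_seconds: float) -> float:
    """Average per-second increase of a counter over a window (no reset handling)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (last - first) / window_seconds


def success_ratio(success_rate: float, total_rate: float) -> float:
    """Share of successful attempts; returns 0.0 when there was no traffic."""
    return success_rate / total_rate if total_rate > 0 else 0.0


if __name__ == "__main__":
    # Made-up samples: over a 5-minute window, 270 of 300 attempts succeeded.
    ok = per_second_rate(1000, 1270, 300)
    total = per_second_rate(5000, 5300, 300)
    print(f"success ratio: {success_ratio(ok, total):.2f}")
```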
Alerts are defined in /config/prometheus/alerts.yml:
- HighErrorRate: API error rate >5% for 5 minutes
- SlowResponseTime: p95 latency >1s for 5 minutes
- APIServiceDown: API service unavailable for 2 minutes
- NoJobsCompleted: No jobs completed in 1 hour
- WorkerServiceDown: Worker service unavailable for 2 minutes
- HighJobFailureRate: Job failure rate >20% for 10 minutes
- JobsStuckInQueue: >50 videos pending for 30 minutes
- HighDatabaseErrors: Database error rate >1/sec for 5 minutes
Recommended alerting thresholds for YouTube ingestion operations:
- HighYtdlpErrorRate: yt-dlp operation error rate >10% for 10 minutes
  sum(rate(ytdlp_operation_attempts_total{result="failure"}[10m])) / sum(rate(ytdlp_operation_attempts_total[10m])) > 0.1
- SlowYtdlpOperations: p95 operation duration >120s for 15 minutes
  histogram_quantile(0.95, rate(ytdlp_operation_duration_seconds_bucket[15m])) > 120
- CircuitBreakerOpen: Circuit breaker has been open for 5 minutes
  youtube_circuit_breaker_state == 2
- HighThrottlingRate: YouTube throttling errors >5/min for 10 minutes
  rate(ytdlp_operation_errors_total{error_class="throttle"}[10m]) > 0.083
- TokenFailures: PO token errors increasing
  rate(ytdlp_operation_errors_total{error_class="token"}[5m]) > 0
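Because rate() returns a per-second value, thresholds stated per minute have to be divided by 60 before they appear in an expression. The sketch below shows where the 0.083 in the HighThrottlingRate expression comes from; the helper name is an illustrative assumption.

```python
def per_minute_to_per_second(events_per_minute: float) -> float:
    """Convert a per-minute threshold into the per-second unit used by rate()."""
    return events_per_minute / 60.0


if __name__ == "__main__":
    # 5 throttling errors per minute, as in the HighThrottlingRate alert
    threshold = per_minute_to_per_second(5)
    print(f"{threshold:.3f}")  # prints 0.083
```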
To receive alert notifications:
- Add Alertmanager to docker-compose.yml:
  alertmanager:
    image: prom/alertmanager:v0.27.0
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    volumes:
      - ./config/alertmanager:/etc/alertmanager
    ports:
      - "9093:9093"
- Create /config/alertmanager/alertmanager.yml:
  global:
    slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'
  route:
    receiver: 'slack-notifications'
    group_by: ['alertname', 'severity']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 3h
  receivers:
    - name: 'slack-notifications'
      slack_configs:
        - channel: '#alerts'
          title: '{{ .GroupLabels.alertname }}'
          text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- Uncomment the Alertmanager target in config/prometheus/prometheus.yml
- Define metric in app/metrics.py:
  from prometheus_client import Counter
  # Counter with two labels; each distinct label combination is a separate series
  my_metric = Counter(
      "my_metric_total",
      "Description of my metric",
      ["label1", "label2"],
  )
- Use metric in your code:
  from app.metrics import my_metric
  # Increment the series identified by this label combination
  my_metric.labels(label1="value1", label2="value2").inc()

- Define metric in worker/metrics.py:
  from prometheus_client import Histogram
  # Bucket upper bounds are in seconds
  my_worker_metric = Histogram(
      "my_worker_metric_seconds",
      "Description of my metric",
      buckets=(1, 5, 10, 30, 60, 120),
  )
- Use metric in your code:
  import time
  from worker.metrics import my_worker_metric
  # perf_counter() is monotonic, so the measured duration cannot go negative
  start = time.perf_counter()
  # ... do work ...
  duration = time.perf_counter() - start
  my_worker_metric.observe(duration)

- Metric Names: Use snake_case and descriptive names
- Labels: Keep cardinality low (<100 unique combinations)
- Units: Include units in metric names (e.g., _seconds, _bytes, _total)
- Metric Types:
  - Counter: Monotonically increasing values (requests, errors)
  - Gauge: Values that can go up/down (queue size, memory)
  - Histogram: Distribution of values (duration, size)
  - Summary: Similar to histogram but with quantiles
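The cardinality guideline can be checked before a metric is added by multiplying the number of possible values of each label. The sketch below applies this to the labels documented for ytdlp_operation_attempts_total; the worst_case_series helper is a hypothetical name, and the result is an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def worst_case_series(label_values: dict[str, list[str]]) -> int:
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(len(values) for values in label_values.values())


# Label values as documented for ytdlp_operation_attempts_total
attempt_labels = {
    "operation": ["download", "metadata", "captions"],
    "client": ["web_safari", "ios", "android", "tv", "direct", "default"],
    "result": ["success", "failure"],
}

if __name__ == "__main__":
    print(worst_case_series(attempt_labels))  # 3 * 6 * 2 = 36
```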
- Check service health:
  # API metrics endpoint
  curl http://localhost:8000/metrics
  # Worker metrics endpoint
  curl http://localhost:8001/metrics
- Check Prometheus targets:
  - Visit http://localhost:9090/targets
  - Ensure all targets are "UP"
- Check container logs:
  docker compose logs api worker prometheus grafana

Prometheus stores metrics in memory and on disk. To reduce memory:
- Decrease retention period in config/prometheus/prometheus.yml:
  storage:
    tsdb:
      retention.time: 15d # Default is 30d
      retention.size: 5GB # Default is 10GB
- Reduce scrape frequency:
  global:
    scrape_interval: 30s # Default is 15s

- Check Grafana logs:
  docker compose logs grafana
- Verify datasource:
  - Go to Configuration → Data Sources
  - Test the Prometheus connection
- Re-import dashboard:
  - Go to Dashboards → Import
  - Upload JSON file from config/grafana/dashboards/

Metrics collection overhead is typically <1% CPU and <100MB RAM.
To verify:
# Check resource usage
docker stats api worker prometheus grafana
If overhead is high:
- Reduce scrape frequency
- Decrease histogram bucket count
- Remove unused metrics
The ingestion metrics classify errors to help diagnose issues:
1. throttle errors (429, "too many requests")
   - Cause: YouTube rate limiting
   - Symptoms: High ytdlp_operation_errors_total{error_class="throttle"}
   - Remediation:
     - Circuit breaker will automatically back off
     - Increase YTDLP_BACKOFF_MAX_DELAY to slow retry rate
     - Enable PO tokens if not already active (PO_TOKEN_USE_FOR_AUDIO=true)
     - Reduce concurrent worker instances
2. token errors (invalid/expired PO tokens)
   - Cause: PO tokens expired or rejected by YouTube
   - Symptoms: High ytdlp_operation_errors_total{error_class="token"}
   - Remediation:
     - Check PO token provider availability
     - Verify PO_TOKEN_PROVIDER_URL is accessible
     - Check token expiry with po_token_failures_total metric
     - Review logs for token invalidation events
3. auth errors (403, "sign in required", "bot detected")
   - Cause: YouTube requiring authentication or detecting automated access
   - Symptoms: High ytdlp_operation_errors_total{error_class="auth"}
   - Remediation:
     - Enable PO tokens (required for most flows now)
     - Configure cookies file via YTDLP_COOKIES_PATH
     - Try different client strategies (ios, android as fallbacks)
     - Add delays: increase YTDLP_BACKOFF_BASE_DELAY
4. not_found errors (404, unavailable, private)
   - Cause: Video is deleted, private, or region-locked
   - Symptoms: High ytdlp_operation_errors_total{error_class="not_found"}
   - Remediation:
     - These are expected and not retried automatically
     - Mark jobs as failed in application logic
     - No infrastructure changes needed
5. network errors (connection issues, timeouts)
   - Cause: Network connectivity problems
   - Symptoms: High ytdlp_operation_errors_total{error_class="network"}
   - Remediation:
     - Check network connectivity to YouTube
     - Verify DNS resolution
     - Increase YTDLP_REQUEST_TIMEOUT if timeouts are frequent
     - Check firewall rules
6. timeout errors (operation exceeded timeout)
   - Cause: Large files or slow connection
   - Symptoms: High ytdlp_operation_duration_seconds and timeout errors
   - Remediation:
     - Increase YTDLP_REQUEST_TIMEOUT (default 120s)
     - Check bandwidth availability
     - Consider chunking or streaming approaches
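The classification above can be pictured as an ordered set of keyword rules applied to an error message. The sketch below is purely illustrative: the RULES table, the classify_error function, the exact substrings, and the matching order are all assumptions derived from the descriptions in this list, not the worker's actual implementation.

```python
# Keyword rules drawn from the error descriptions above. The substrings and
# their ordering are assumptions for illustration, not the worker's real logic.
RULES = [
    ("throttle", ("429", "too many requests")),
    ("token", ("po token",)),
    ("auth", ("403", "sign in required", "bot detected")),
    ("not_found", ("404", "unavailable", "private")),
    ("timeout", ("timed out", "timeout")),
    ("network", ("connection", "dns")),
]


def classify_error(message: str) -> str:
    """Return the first error class whose keywords appear in the message."""
    text = message.lower()
    for error_class, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return error_class
    return "unknown"


if __name__ == "__main__":
    for msg in ("HTTP Error 429: Too Many Requests", "Video unavailable"):
        print(msg, "->", classify_error(msg))
```

Plain substring matching is deliberately naive here; a numeric fragment such as "429" could also appear in unrelated text, so a production classifier would more likely inspect status codes or exception types directly.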
Compare success rates across different client strategies:
# Query Prometheus for client performance (aggregation uses the form "sum by (client) (...)")
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (client) (rate(ytdlp_operation_attempts_total{result="success"}[5m]))'
If specific clients are failing frequently:
- Disable underperforming clients: YTDLP_CLIENTS_DISABLED=Android,iOS
- Reorder client priority: YTDLP_CLIENT_ORDER=web_safari,tv,iOS,Android
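The JSON returned by the Prometheus query API can be reduced to a per-client ranking before deciding which clients to disable or reorder. The sketch below assumes the query aggregates by client, so that each sample carries a client label; the function names and the sample payload are illustrative, while the response shape (status, data.result, and a [timestamp, "value"] pair per sample) follows the Prometheus HTTP API for instant vector queries.

```python
def rates_by_client(response: dict) -> dict[str, float]:
    """Map each client label to its sample value from an instant-vector response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    rates = {}
    for sample in response["data"]["result"]:
        client = sample["metric"].get("client", "unknown")
        _timestamp, value = sample["value"]  # value arrives as a string
        rates[client] = float(value)
    return rates


def lowest_first(rates: dict[str, float]) -> list[str]:
    """Client names ordered from lowest to highest rate."""
    return sorted(rates, key=rates.get)


# Made-up payload in the shape returned by /api/v1/query
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"client": "web_safari"}, "value": [1700000000.0, "0.8"]},
            {"metric": {"client": "ios"}, "value": [1700000000.0, "0.2"]},
        ],
    },
}

if __name__ == "__main__":
    print(lowest_first(rates_by_client(sample_response)))
```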
All ingestion operations log structured fields. Query logs with (--no-log-prefix strips the service-name prefix so that jq receives plain JSON lines):
# Find slow downloads (>60s)
docker compose logs --no-log-prefix worker | jq 'select(.duration_seconds > 60 and .operation == "download")'
# Find operations without tokens
docker compose logs --no-log-prefix worker | jq 'select(.has_token == false and .operation != "captions")'
# Group errors by classification
docker compose logs --no-log-prefix worker | jq 'select(.error_class) | .error_class' | sort | uniq -c

# Backup
docker compose stop prometheus
tar -czf prometheus-backup.tar.gz -C "$(docker volume inspect --format '{{ .Mountpoint }}' transcript-create_prometheus-data)" .
docker compose start prometheus
# Restore
docker compose stop prometheus
tar -xzf prometheus-backup.tar.gz -C "$(docker volume inspect --format '{{ .Mountpoint }}' transcript-create_prometheus-data)"
docker compose start prometheus

Dashboards are version-controlled in config/grafana/dashboards/ and automatically provisioned.
To export a modified dashboard:
- Go to Dashboard Settings → JSON Model
- Copy JSON
- Save to config/grafana/dashboards/
To send metrics to Grafana Cloud:
- Add remote write to config/prometheus/prometheus.yml:
  remote_write:
    - url: https://prometheus-us-central1.grafana.net/api/prom/push
      basic_auth:
        username: YOUR_INSTANCE_ID
        password: YOUR_API_KEY

To send metrics to Datadog:
- Add Datadog exporter to docker-compose.yml
- Configure Prometheus to scrape Datadog exporter
- Datadog will pull metrics automatically
- Change default credentials:
  - Grafana admin password
  - Add authentication to Prometheus
- Network isolation:
  # In docker-compose.yml
  networks:
    monitoring:
      internal: true
  # Add network to monitoring services
  prometheus:
    networks:
      - monitoring
- TLS/HTTPS:
  - Use reverse proxy (nginx, traefik) for HTTPS
  - Configure certificate for Grafana
- Access control:
  - Limit Grafana users and permissions
  - Use Grafana RBAC for team access