Bite-Size Reader

Async Telegram bot that summarizes web articles and YouTube videos into structured JSON. Articles go through a multi-provider scraper chain (Scrapling, self-hosted Firecrawl, Playwright, Crawlee, direct HTML) and are summarized via OpenRouter. For YouTube videos, the bot downloads the video (1080p) and extracts the transcript. Forwarded channel posts can be summarized as well. Every request returns a strict JSON summary, and artifacts are stored in SQLite.

🚀 New to Bite-Size Reader? Start with the 5-Minute Quickstart Tutorial

❓ Have Questions? Check the FAQ or Troubleshooting Guide

📚 All Documentation → Documentation Hub

Architecture overview

flowchart LR
  subgraph TelegramBot
    TGClient[TelegramClient] --> MsgHandler[MessageHandler]
    MsgHandler --> AccessController
    MsgHandler --> CallbackHandler
    CallbackHandler --> CallbackRegistry[CallbackActionRegistry]
    CallbackRegistry --> CallbackActions[CallbackActionService]
    AccessController --> MessageRouter
    MessageRouter --> CommandProcessor
    MessageRouter --> URLHandler
    URLHandler --> URLBatchPolicy[URLBatchPolicyService]
    URLHandler --> URLAwaitingState[URLAwaitingStateStore]
    MessageRouter --> ForwardProcessor
    MessageRouter --> MessagePersistence
    LifecycleMgr[TelegramLifecycleManager] -.-> TGClient
    LifecycleMgr -.-> URLHandler
  end

  subgraph URLPipeline[URL processing pipeline]
    URLHandler --> URLProcessor
    URLProcessor --> ContentExtractor
    ContentExtractor --> ScraperChain[ScraperChain]
    ScraperChain -->|primary| Scrapling[Scrapling]
    ScraperChain -->|secondary| Firecrawl[(Firecrawl /scrape)]
    ScraperChain -->|tertiary| Playwright[Playwright]
    ScraperChain -->|quaternary| Crawlee[Crawlee]
    ScraperChain -->|last_resort| DirectHTML[Direct HTML]
    URLProcessor --> ContentChunker
    URLProcessor --> LLMSummarizer
    LLMSummarizer --> OpenRouter[(OpenRouter Chat Completions)]
  end

  subgraph DigestPipeline[Channel Digest]
    Scheduler[APScheduler] --> DigestService
    CommandProcessor -.->|/digest| DigestService
    DigestService --> ChannelReader
    ChannelReader --> UserbotClient[Userbot Client]
    DigestService --> DigestAnalyzer
    DigestAnalyzer --> OpenRouter
    DigestService --> DigestFormatter
    DigestFormatter --> TGClient
    CommandProcessor -.->|/init_session| SessionInit[Session Init + Mini App]
    SessionInit --> UserbotClient
  end

  subgraph OptionalServices[Optional services]
    Redis[(Redis)] -.-> ContentExtractor
    Redis -.-> LLMSummarizer
    Redis -.-> MobileAPI
    ChromaDB[(ChromaDB)] -.-> SearchService
    MCPServer[MCP Server] -.-> SQLite
    MCPServer -.-> SearchService
  end

  ForwardProcessor --> LLMSummarizer
  LLMSummarizer -.->| optional | WebSearch[WebSearchAgent]
  WebSearch -.-> Firecrawl
  ContentExtractor --> SQLite[(SQLite)]
  MessagePersistence --> SQLite
  LLMSummarizer --> SQLite
  DigestService --> SQLite
  MessageRouter --> ResponseFormatter
  ResponseFormatter --> TGClient
  TGClient -->| Replies | Telegram
  Telegram -->| Updates | TGClient
  UserbotClient -->| Read channels | Telegram
  ResponseFormatter --> Logs[(Structured + audit logs)]

  subgraph MobileAPI[Mobile API]
    FastAPI[FastAPI + JWT] --> SQLite
    FastAPI --> SearchService[SearchService]
    FastAPI --> DigestFacade
    DigestFacade --> DigestAPIService
    DigestAPIService --> SQLite
    FastAPI --> SystemMaint[SystemMaintenanceService]
    SystemMaint --> SQLite
    SystemMaint -.-> Redis
  end

The bot ingests updates via a lightweight TelegramClient, normalizes them through MessageHandler, and hands them to MessageRouter/CallbackHandler flows. CallbackHandler delegates action execution through CallbackActionRegistry + CallbackActionService, and URLHandler delegates URL policy/state concerns through URLBatchPolicyService + URLAwaitingStateStore before invoking URLProcessor. TelegramLifecycleManager owns startup/shutdown orchestration of background tasks and warmups. The channel digest subsystem uses a separate UserbotClient (authenticated as a real Telegram user) to read channel histories, analyzes posts via LLM, and delivers formatted digests on a schedule or via /digest.

For the mobile API, routers are transport-focused and delegate infrastructure orchestration to dedicated services (DigestFacade, SystemMaintenanceService) rather than performing DB/Redis/file operations inline. ResponseFormatter centralizes Telegram replies and audit logging while all artifacts land in SQLite.

Quick start

🚀 5-Minute Setup: Follow the Quickstart Tutorial for step-by-step Docker setup.

Manual Setup:

Copy .env.example to .env and fill required secrets
Build and run with Docker
See DEPLOYMENT.md for full setup, deployment, and update instructions

Common Use Cases

I want to...

Goal	How	Documentation
Summarize web articles	Send URL to Telegram bot	Quickstart Tutorial
Summarize YouTube videos	Send YouTube URL (transcript extracted)	Configure YouTube
Search past summaries	`/search <query>` command	FAQ § Search
Get real-time context	Enable web search enrichment	Enable Web Search
Speed up responses	Enable Redis caching	Setup Redis
Build mobile app	Use Mobile API (JWT auth)	MOBILE_API_SPEC.md
Use web interface	Open Carbon web UI on `/web`	Frontend Web Guide
Integrate with AI agents	Use MCP server	MCP Server Guide
Reduce API costs	Use free models, caching	FAQ § Cost Optimization
Self-host privately	Docker deployment	DEPLOYMENT.md

Docker

If you updated dependencies in pyproject.toml, generate lock files first: make lock-uv.
Build: docker build -t bite-size-reader .
Run: docker run --env-file .env -v $(pwd)/data:/data --name bsr bite-size-reader

Commands and usage

You can simply send a URL (or several URLs) or forward a channel post -- commands are optional.

Summarization

Command	Description
`/help`, `/start`	Show help and usage
`/summarize <URL>`	Summarize a URL immediately
`/summarize`	Bot asks for a URL in the next message
`/summarize_all <URLs>`	Summarize multiple URLs without confirmation
`/cancel`	Cancel pending summarize prompt or multi-link confirmation

When a message contains multiple URLs, the bot asks "Process N links?"; reply "yes" or "no". Each link gets its own correlation ID, and links are processed sequentially.
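
As a rough illustration of the per-link correlation IDs, the sketch below extracts URLs from a message and pairs each with a fresh ID. The regex, the `plan_batch` name, the duplicate filtering, and the use of UUIDs are assumptions made for the example, not the bot's actual implementation.

```python
import re
import uuid

# Deliberately simple pattern; the bot's own URL detection may differ.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def plan_batch(message_text: str) -> list[tuple[str, str]]:
    """Return (correlation_id, url) pairs in message order, skipping duplicates."""
    seen: set[str] = set()
    plan: list[tuple[str, str]] = []
    for match in URL_PATTERN.finditer(message_text):
        url = match.group(0).rstrip(".,;:!?)")  # drop trailing punctuation
        if url in seen:
            continue
        seen.add(url)
        plan.append((uuid.uuid4().hex, url))
    return plan


plan = plan_batch("Read https://a.example/x and https://b.example/y.")
for correlation_id, url in plan:  # sequential, one ID per link
    print(correlation_id, url)
```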

Content Management

Command	Description
`/unread [limit] [topic]`	Show unread articles, optionally filtered by topic
`/read <request_id>`	Mark an article as read

Search

Command	Description
`/search <query>`	Search summaries by keyword
`/find`, `/findweb`, `/findonline`	Search using Firecrawl web search
`/finddb`, `/findlocal`	Search local database only

Admin

Command	Description
`/dbinfo`	Show database statistics
`/dbverify`	Verify database integrity

Channel Digest

Command	Description
`/init_session`	Initialize userbot session via Mini App OTP/2FA flow
`/digest`	Generate a digest of subscribed channels now
`/channels`	List currently subscribed channels
`/subscribe @channel`	Subscribe to a Telegram channel for digests
`/unsubscribe @channel`	Unsubscribe from a channel

Integrations

Command	Description
`/sync_karakeep`	Trigger Karakeep bookmark sync

Environment

✅ Required (Essential for Basic Functionality)

API_ID=...                          # Telegram API ID (from https://my.telegram.org/apps)
API_HASH=...                        # Telegram API hash
BOT_TOKEN=...                       # Telegram bot token (from @BotFather)
ALLOWED_USER_IDS=123456789          # Comma-separated Telegram user IDs (your ID)
FIRECRAWL_API_KEY=...               # Firecrawl API key (optional -- only for cloud Firecrawl or web search)
OPENROUTER_API_KEY=...              # OpenRouter API key (or use OPENAI_API_KEY/ANTHROPIC_API_KEY)
OPENROUTER_MODEL=deepseek/deepseek-v3.2  # Primary LLM model
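
ALLOWED_USER_IDS is a comma-separated list, so it has to be parsed before access checks can use it. The following is a minimal sketch of such parsing; `parse_allowed_user_ids` is a hypothetical helper, and the project's own config loader may validate differently.

```python
def parse_allowed_user_ids(raw: str) -> frozenset[int]:
    """Parse a comma-separated list of Telegram user IDs into a set of ints."""
    ids: set[int] = set()
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate trailing commas and stray spaces
        try:
            user_id = int(part)
        except ValueError:
            raise ValueError(f"not a numeric Telegram user ID: {part!r}") from None
        if user_id <= 0:
            raise ValueError(f"Telegram user IDs are positive: {part!r}")
        ids.add(user_id)
    if not ids:
        raise ValueError("ALLOWED_USER_IDS must contain at least one ID")
    return frozenset(ids)


allowed = parse_allowed_user_ids("123456789, 987654321,")
print(123456789 in allowed)  # prints True
```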

🔧 Optional (Enable Features as Needed)

Subsystem	Key Variables	When to Enable
YouTube	`YOUTUBE_DOWNLOAD_ENABLED=true` `YOUTUBE_PREFERRED_QUALITY=1080p` `YOUTUBE_STORAGE_PATH=/data/videos`	Summarize YouTube videos
Web Search	`WEB_SEARCH_ENABLED=false` `WEB_SEARCH_MAX_QUERIES=3`	Add real-time context to summaries
Redis	`REDIS_ENABLED=true` `REDIS_URL` or `REDIS_HOST`/`REDIS_PORT`	Cache responses, speed up bot
Draft Streaming	`SUMMARY_STREAMING_ENABLED=true` `SUMMARY_STREAMING_MODE=section` `TELEGRAM_DRAFT_STREAMING_ENABLED=true`	Live section previews during OpenRouter summaries
Scraper Chain	`SCRAPER_ENABLED=true` `SCRAPER_PROFILE=balanced` `SCRAPER_BROWSER_ENABLED=true` `SCRAPER_PROVIDER_ORDER=[...]`	Control article extraction fallback behavior and tuning
ChromaDB	`CHROMA_HOST=http://localhost:8000` `CHROMA_AUTH_TOKEN`	Semantic search
Embeddings	`EMBEDDING_PROVIDER=local` `GEMINI_API_KEY` `GEMINI_EMBEDDING_DIMENSIONS=768`	Switch embedding provider (local/Gemini)
MCP Server	`MCP_ENABLED=false` `MCP_TRANSPORT=stdio` `MCP_PORT=8200`	AI agent integration (Claude Desktop / optional Docker `mcp` profile)
Mobile API	`JWT_SECRET_KEY` `ALLOWED_CLIENT_IDS` `API_RATE_LIMIT_*`	Build mobile clients
Karakeep	`KARAKEEP_ENABLED=false` `KARAKEEP_API_URL` `KARAKEEP_API_KEY`	Bookmark sync
Channel Digest	`DIGEST_ENABLED=true` `API_BASE_URL=http://localhost:8000`	Scheduled channel digests

⚙️ Advanced (Fine-Tuning)

Category	Key Variables	Purpose
Runtime	`DB_PATH=/data/app.db` `LOG_LEVEL=INFO` `DEBUG_PAYLOADS=0` `MAX_CONCURRENT_CALLS=4`	Performance tuning
LLM Providers	`LLM_PROVIDER=openrouter` `OPENAI_API_KEY` `ANTHROPIC_API_KEY`	Switch LLM providers
Fallbacks	`OPENROUTER_FALLBACK_MODELS=...` `OPENAI_FALLBACK_MODELS=...`	Model fallback chains

📖 Full Reference: environment_variables.md (250+ variables documented)

❓ Configuration Help: FAQ § Configuration | TROUBLESHOOTING § Configuration

⚠️ Breaking Rename: scraper legacy variables SCRAPLING_* and SCRAPER_DIRECT_HTTP_ENABLED are no longer accepted; startup fails fast with replacement hints.
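
A fail-fast check of this kind can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, `SCRAPLING_TIMEOUT` is an invented example of a `SCRAPLING_*` variable, and the actual replacement hints printed at startup are not reproduced here.

```python
from typing import Mapping

LEGACY_PREFIXES = ("SCRAPLING_",)
LEGACY_NAMES = frozenset({"SCRAPER_DIRECT_HTTP_ENABLED"})


def find_legacy_scraper_vars(env: Mapping[str, str]) -> list[str]:
    """Return the names of legacy scraper variables present in env, sorted."""
    return sorted(
        name
        for name in env
        if name in LEGACY_NAMES or name.startswith(LEGACY_PREFIXES)
    )


def assert_no_legacy_scraper_vars(env: Mapping[str, str]) -> None:
    """Raise at startup instead of silently ignoring renamed settings."""
    legacy = find_legacy_scraper_vars(env)
    if legacy:
        raise RuntimeError(
            "Legacy scraper variables are no longer accepted: "
            + ", ".join(legacy)
            + ". See environment_variables.md for the replacements."
        )


# SCRAPLING_TIMEOUT is a made-up name used only to exercise the prefix match.
sample_env = {
    "SCRAPLING_TIMEOUT": "30",
    "SCRAPER_ENABLED": "true",
    "SCRAPER_DIRECT_HTTP_ENABLED": "1",
}
print(find_legacy_scraper_vars(sample_env))
```

In a real entrypoint the check would typically receive `os.environ` before any other configuration is loaded.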

Performance Tips

Speed up summarization:

⚡ Use faster models: qwen/qwen3-max (faster than DeepSeek), google/gemini-2.0-flash-001:free (free)
🔄 Enable Redis caching: Cache repeated URLs, reduce API calls
📦 Increase concurrency: MAX_CONCURRENT_CALLS=5 (default: 4)
🎯 Disable optional features: Set WEB_SEARCH_ENABLED=false, SUMMARY_TWO_PASS_ENABLED=false
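
To show what a concurrency cap such as MAX_CONCURRENT_CALLS means in practice, here is a minimal asyncio sketch that bounds in-flight work with a semaphore. How the bot applies the setting internally is not described here, so `run_limited` and `measure_peak` are hypothetical helpers used only for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")


async def run_limited(
    jobs: Sequence[Callable[[], Awaitable[T]]], limit: int = 4
) -> list[T]:
    """Run all jobs, allowing at most `limit` of them to be in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def measure_peak(job_count: int, limit: int) -> int:
    """Return the highest number of jobs that ran concurrently."""
    active = 0
    peak = 0

    async def job() -> None:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for an LLM or scraper call
        active -= 1

    await run_limited([job] * job_count, limit=limit)
    return peak


print(asyncio.run(measure_peak(job_count=10, limit=4)))  # prints 4
```

Raising the limit helps only while the upstream APIs and their rate limits can absorb the extra parallel calls.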

Reduce costs:

💰 Use free models: google/gemini-2.0-flash-001:free, deepseek/deepseek-r1:free (via OpenRouter)
🔄 Enable caching: Avoid re-processing same URLs
🎛 Adjust token limits: MAX_CONTENT_LENGTH_TOKENS=30000 (default: 50000)
📊 Monitor usage: Track costs at OpenRouter Dashboard

Optimize storage:

🧹 Auto-cleanup YouTube: YOUTUBE_CLEANUP_AFTER_DAYS=7 (delete old videos)
📏 Set storage limits: YOUTUBE_MAX_STORAGE_GB=10
💾 Database maintenance: Periodic VACUUM and index rebuilding

See detailed optimization guide: How to Optimize Performance | FAQ § Performance

Repository layout

app/
  adapters/
    content/     -- Multi-provider scraper chain, content chunking, LLM summarization, web search context
      scraper/   -- Protocol, chain, factory, providers (Scrapling, Firecrawl, Playwright, Crawlee, direct HTML)
    youtube/     -- YouTube video download and transcript extraction
    external/    -- Response formatting helpers shared by adapters
    karakeep/    -- Karakeep bookmark sync
    llm/         -- Provider-agnostic LLM abstraction
    openrouter/  -- OpenRouter client, payload shaping, error handling
    telegram/    -- Telegram client, message routing, access control, persistence, command_handlers/
  agents/        -- Multi-agent system (extraction, summarization, validation, web search)
  api/           -- Mobile API (FastAPI, JWT auth, sync endpoints)
    models/      -- Pydantic request/response models
    routers/     -- Route handlers (auth, summaries, sync, collections, health, system)
    services/    -- API business logic
  application/   -- Application layer (DTOs, use cases)
  cli/           -- CLI tools (summary runner, search, MCP server, migrations, Chroma backfill)
  config/        -- Configuration modules
  core/          -- URL normalization, JSON contract, logging, language helpers
  db/            -- SQLite schema, migrations, audit logging helpers
  di/            -- Dependency injection
  domain/        -- Domain models and services (DDD patterns)
  infrastructure/ -- Persistence layer, event bus, vector store
    cache/       -- Cache layer (Redis)
    messaging/   -- Messaging infrastructure
  mcp/           -- MCP server for AI agent access
  models/        -- Pydantic-style models (Telegram entities, LLM config)
  observability/ -- Metrics, tracing, telemetry
  prompts/       -- LLM prompt templates (en/ru, including web search analysis)
  security/      -- Security utilities
  types/         -- Type definitions
  utils/         -- Validation and helper utilities
bot.py           -- Entrypoint wiring config, DB, and Telegram bot
web/             -- Carbon web interface (React + TypeScript + Vite)
SPEC.md          -- Full technical specification

YouTube video support

The bot automatically detects YouTube URLs and processes them differently from regular web articles.

Supported URL formats: Standard watch, short (youtu.be), shorts, live, embed, mobile (m.youtube.com), YouTube Music, legacy /v/.

Processing workflow:

Extract video ID from URL (handles query parameters in any order)
Extract transcript via youtube-transcript-api (prefers manual, falls back to auto-generated)
Download video in configured quality (default 1080p) via yt-dlp
Download subtitles, metadata (JSON), and thumbnail
Generate summary from transcript using LLM
Store video metadata, file paths, and transcript in database
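
The first step of the workflow, extracting the video ID from any of the supported URL formats, might look like the sketch below. It uses only the standard library; `extract_video_id` is a hypothetical name, and the 11-character ID check is an assumption based on how YouTube IDs look today rather than a documented guarantee.

```python
from __future__ import annotations

import re
from urllib.parse import parse_qs, urlparse

_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")  # assumed ID shape
_PATH_PREFIXES = ("shorts", "live", "embed", "v")
_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "music.youtube.com"}


def extract_video_id(url: str) -> str | None:
    """Return the video ID for supported YouTube URL shapes, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    segments = [s for s in parsed.path.split("/") if s]
    candidate = None
    if host in ("youtu.be", "www.youtu.be"):
        candidate = segments[0] if segments else None
    elif host in _HOSTS:
        if segments and segments[0] == "watch":
            # parse_qs makes the position of v= in the query irrelevant
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        elif len(segments) >= 2 and segments[0] in _PATH_PREFIXES:
            candidate = segments[1]
    if candidate and _ID.match(candidate):
        return candidate
    return None


print(extract_video_id("https://www.youtube.com/watch?t=42&v=dQw4w9WgXcQ"))
# prints "dQw4w9WgXcQ"
```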

Storage management: Videos stored in /data/videos, auto-cleanup of old videos, size limits per-video and total, deduplication via URL hash.
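
Deduplication via URL hash can be pictured with the following sketch. The normalization rules (lowercased host, no fragment, no trailing slash) and the choice of SHA-256 are assumptions for illustration; the project's URL normalization in app/core may apply different rules.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def url_hash(url: str) -> str:
    """Hash a lightly normalized URL so equivalent links map to one key."""
    parts = urlsplit(url.strip())
    normalized = urlunsplit(
        (
            parts.scheme.lower(),
            parts.netloc.lower(),
            parts.path.rstrip("/") or "/",
            parts.query,
            "",  # fragments never reach the server, so drop them
        )
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


same = url_hash("HTTPS://Example.com/a/#frag") == url_hash("https://example.com/a")
print(same)  # prints True
```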

Requirements: ffmpeg (included in Docker image), yt-dlp, youtube-transcript-api.

Web search enrichment (optional)

When WEB_SEARCH_ENABLED=true, the bot enriches article summaries with current web context:

LLM analyzes content to identify knowledge gaps (unfamiliar entities, recent events, claims needing verification)
If search would help, LLM extracts targeted search queries (max 3)
Firecrawl Search API retrieves relevant web results
Search context is injected into the summarization prompt
Final summary benefits from up-to-date information beyond LLM training cutoff

Only about 30-40% of articles trigger a search; self-contained content is skipped. Enrichment adds one extra LLM call for analysis, plus 1-3 Firecrawl search calls when a search is triggered. The feature is opt-in to keep costs under control.
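
The enrichment steps above can be sketched as a small function with injected callables. Everything named here is hypothetical: `propose_queries` stands in for the LLM gap analysis, `search` stands in for the Firecrawl Search API, and the prompt layout is invented for the example.

```python
from typing import Callable, Sequence

MAX_QUERIES = 3  # mirrors WEB_SEARCH_MAX_QUERIES=3


def build_prompt_with_context(
    article_text: str,
    propose_queries: Callable[[str], Sequence[str]],
    search: Callable[[str], Sequence[str]],
    max_queries: int = MAX_QUERIES,
) -> str:
    """Append web search snippets to the article text when queries are proposed."""
    queries = list(propose_queries(article_text))[:max_queries]
    if not queries:
        return article_text  # self-contained content: no search calls
    snippets: list[str] = []
    for query in queries:
        snippets.extend(search(query))
    if not snippets:
        return article_text
    context = "\n".join(f"- {snippet}" for snippet in snippets)
    return f"{article_text}\n\nWeb context:\n{context}"


search_calls: list[str] = []


def fake_propose(text: str) -> list[str]:
    return ["q1", "q2", "q3", "q4", "q5"]  # more than the cap on purpose


def fake_search(query: str) -> list[str]:
    search_calls.append(query)
    return [f"result for {query}"]


prompt = build_prompt_with_context("Article body.", fake_propose, fake_search)
print(len(search_calls))  # prints 3
```

Capping the query list before searching is what bounds the extra cost per article.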

Mobile API

FastAPI-based REST API for mobile clients with Telegram-based JWT authentication, summary retrieval, and sync endpoints. See docs/MOBILE_API_SPEC.md for details.

Carbon Web Interface (V1)

Standalone React + IBM Carbon web UI is available in web/ and served by FastAPI on:

/web
/web/* (SPA routes)

Static assets are published under /static/web/*.

Core routes:

/web/library
/web/library/:id
/web/articles
/web/search
/web/submit
/web/collections
/web/collections/:id
/web/digest
/web/preferences

Local development

cd web
npm install
npm run dev
npm run check:static

Optional web env vars:

VITE_API_BASE_URL (default: same-origin API)
VITE_TELEGRAM_BOT_USERNAME (required for Telegram Login Widget in JWT mode)
VITE_ROUTER_BASENAME (default: /web)

Frontend architecture and auth details: FRONTEND.md.

MCP Server

Model Context Protocol server that exposes articles and search to external AI agents (OpenClaw, Claude Desktop). Provides 17 tools and 13 resources for searching, retrieving, and exploring stored summaries. Runs as a dedicated Docker container with SSE transport or standalone via stdio. See docs/mcp_server.md.

Redis caching

Optional Redis layer used for caching Firecrawl and LLM responses, API rate limiting, sync locks, and distributed locking of background tasks. The bot degrades gracefully when Redis is unavailable. Enable it with REDIS_ENABLED=true.
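
Graceful degradation here means a cache failure should never fail the request itself. The sketch below shows one way to express that, using a hypothetical `Cache` protocol and in-memory stand-ins instead of a real Redis client; it does not reflect the project's cache module.

```python
from typing import Callable, Optional, Protocol


class Cache(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


def cached_call(cache: Optional[Cache], key: str, compute: Callable[[], str]) -> str:
    """Return a cached value when possible; fall back to compute() on any cache error."""
    if cache is not None:
        try:
            hit = cache.get(key)
            if hit is not None:
                return hit
        except Exception:
            cache = None  # backend unavailable: behave as if caching were off
    value = compute()
    if cache is not None:
        try:
            cache.set(key, value)
        except Exception:
            pass  # a failed write must not fail the request
    return value


class DictCache:
    """In-memory stand-in for a working cache backend."""

    def __init__(self) -> None:
        self.data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self.data.get(key)

    def set(self, key: str, value: str) -> None:
        self.data[key] = value


class BrokenCache:
    """Stand-in for an unreachable backend."""

    def get(self, key: str) -> Optional[str]:
        raise ConnectionError("cache backend down")

    def set(self, key: str, value: str) -> None:
        raise ConnectionError("cache backend down")


print(cached_call(BrokenCache(), "url:abc", lambda: "fresh summary"))
# prints "fresh summary"
```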

Karakeep integration

Syncs bookmarks from Karakeep (self-hosted bookmark manager) into the summarization pipeline. Use /sync_karakeep to trigger manually or enable KARAKEEP_AUTO_SYNC_ENABLED=true for periodic sync.

Local CLI summary runner

With the same environment variables exported (Firecrawl + OpenRouter keys, DB path, etc.), run python -m app.cli.summary --url https://example.com/article.
Pass full message text instead of --url to mimic Telegram input, e.g. python -m app.cli.summary "/summarize https://example.com".
The CLI loads environment variables from .env in your current directory (or project root) automatically; override with --env-file path/to/.env if needed.
Add --accept-multiple to auto-confirm when multiple URLs are supplied, --json-path summary.json to write the final JSON to disk, and --log-level DEBUG for verbose traces.
The CLI generates stub Telegram credentials automatically, so no real bot token is required for local runs.

Errors and correlation IDs

All user-visible errors include Error ID: <cid>, which matches the correlation ID in the logs and in the DB field requests.correlation_id.
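
A minimal sketch of tracing an error back to its request row: pull the ID out of the reply text, then look it up by requests.correlation_id. Only the table and column name come from this README; the other columns, the helper names, and the sample data are hypothetical, and the example runs against an in-memory database rather than the real schema.

```python
import re
import sqlite3
from typing import Optional

ERROR_ID = re.compile(r"Error ID:\s*(\S+)")


def extract_error_id(reply_text: str) -> Optional[str]:
    """Return the correlation ID from a bot reply, if one is present."""
    match = ERROR_ID.search(reply_text)
    return match.group(1) if match else None


def find_request(conn: sqlite3.Connection, correlation_id: str):
    """Look up the request row that carries the given correlation ID."""
    cursor = conn.execute(
        "SELECT id, correlation_id, status FROM requests WHERE correlation_id = ?",
        (correlation_id,),
    )
    return cursor.fetchone()


# Toy table: only correlation_id is taken from the README, the rest is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests (id INTEGER PRIMARY KEY, correlation_id TEXT, status TEXT)"
)
conn.execute(
    "INSERT INTO requests (correlation_id, status) VALUES (?, ?)",
    ("abc123", "error"),
)

cid = extract_error_id("Could not fetch the article.\nError ID: abc123")
print(find_request(conn, cid))  # prints (1, 'abc123', 'error')
```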

Dev tooling

Install dev deps: pip install -r requirements.txt -r requirements-dev.txt
Format: make format (ruff format + isort)
Lint: make lint (ruff)
Type-check: make type (mypy)
Web static checks: cd web && npm run check:static
Web unit tests: cd web && npm run test
Pre-commit: run pre-commit install; hooks then run automatically on every commit
Optional: pip install loguru to enable Loguru-based JSON logging with stdlib bridging

Pre-commit hooks

Hooks run in this order to minimize churn: Ruff (check with --fix, format), isort (profile=black), mypy, plus standard hooks. If a first run modifies files, stage the changes and run again.

Local environment

Create venv: make venv (or run scripts/create_venv.sh)
Activate: source .venv/bin/activate
Install deps: pip install -r requirements.txt -r requirements-dev.txt

Dependency management

Source of truth: pyproject.toml ([project] deps + [project.optional-dependencies].dev).
Locked requirements are generated to requirements.txt and requirements-dev.txt.
With uv (recommended):
- Install: curl -Ls https://astral.sh/uv/install.sh | sh
- Lock: make lock-uv
Regenerate locks after changing dependencies in pyproject.toml.

CI

GitHub Actions workflow .github/workflows/ci.yml enforces:

Lockfile freshness (rebuilds from pyproject.toml and checks diff)
Lint (ruff), format check (ruff format, isort), type check (mypy)
Unit tests with coverage (pytest, 80% threshold)
Frontend jobs: frontend-build, web-build, web-test, web-static-check
Docker image build on every push/PR; optional push to GHCR when PUBLISH_DOCKER repository variable is set to true (non-PR events)
OpenAPI spec validation, code complexity (radon)
Codecov coverage reporting
Integration tests
Security checks: Bandit (SAST), pip-audit + Safety (dependency vulns)
Secrets scanning: Gitleaks on workspace and full history (history only on push)
PR summary automation

Docker publishing (optional)

Enable publishing to GitHub Container Registry (GHCR):
- In repository settings -> Variables, add PUBLISH_DOCKER=true.
- Ensure workflow permissions include packages: write (already configured).
- Images are tagged as:
  - ghcr.io/<owner>/<repo>:latest (on main)
  - ghcr.io/<owner>/<repo>:<git-sha>

Automated lockfile PRs

Workflow .github/workflows/update-locks.yml watches pyproject.toml and opens a PR to refresh requirements*.txt using uv.
Auto-merge is enabled for that PR; once CI passes, GitHub will automatically merge it.
You can also trigger it manually from the Actions tab.

Documentation

📚 Documentation Hub: docs/README.md - All docs organized by audience and task

Essential Guides

Document	Description	Audience
Quickstart Tutorial	Get first summary in 5 minutes	Users
FAQ	Frequently asked questions	All
TROUBLESHOOTING.md	Debugging guide with correlation IDs	All
DEPLOYMENT.md	Setup and deployment guide	Operators
environment_variables.md	Complete config reference (250+ vars)	All

Technical Documentation

Document	Description	Audience
SPEC.md	Full technical specification (canonical)	Developers
CLAUDE.md	AI assistant codebase guide	AI Assistants, Developers
HEXAGONAL_ARCHITECTURE_QUICKSTART.md	Architecture patterns	Developers
multi_agent_architecture.md	Multi-agent LLM pipeline	Developers
ADRs	Architecture decision records	Developers

Integration Guides

Document	Description	Audience
MOBILE_API_SPEC.md	REST API specification	Integrators
FRONTEND.md	Carbon web architecture and workflows	Frontend Developers, Integrators
mcp_server.md	MCP server (AI agents)	Integrators
claude_code_hooks.md	Development safety hooks	Developers

Version History

Document	Description
CHANGELOG.md	Version history and release notes

Notes

Dependencies include Pyrogram; if using PyroTGFork, align installation accordingly.
Bot commands are registered on startup for private chats.
Python 3.13+ is required. Dependencies include scikit-learn for text processing and, optionally, uvloop for async performance.
