A powerful command-line tool for generating word clouds from Telegram chat exports. Analyze your chat history, filter by author and time periods, and create beautiful visualizations of your conversations.
- 🚀 Efficient Processing: Handles large Telegram export files (multi-GB) with streaming parser
- 🎯 Smart Filtering: Filter by author, time periods, and part-of-speech types
- 🌍 Russian Language Support: Advanced Russian morphological analysis with pymorphy3
- 🎙️ Voice & Video Analysis: Transcribe voice messages and video notes using local Whisper models
- 💾 Smart Caching: Cache transcription results to avoid re-processing
- 📊 Multiple Outputs: Generate word clouds, frequency tables, and detailed reports
- 🎨 Customizable: Various word cloud themes and styles
- ⚡ Fast: Memory-efficient processing with progress tracking
- Python 3.8 or higher
- Telegram export data in JSON format
- For voice/video transcription: FFmpeg (see installation instructions below)
# Clone the repository
git clone https://github.com/brzvsk/telegram-wordcloud-cli.git
cd telegram-wordcloud-cli
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install the package
pip install -e .
# Optional: install contributor tools
pip install -e ".[dev]"
# Optional: enable voice/video transcription support
pip install -e ".[transcription]"
# Optional: run the module directly
python -m telegram_wordcloud_cli --help
# Install directly from the repository
pip install git+https://github.com/brzvsk/telegram-wordcloud-cli.git
# Or include optional transcription support
pip install "telegram-wordcloud-cli[transcription] @ git+https://github.com/brzvsk/telegram-wordcloud-cli.git"
To enable voice message and video note transcription:
macOS (with Homebrew):
brew install ffmpeg
Ubuntu/Debian:
sudo apt update
sudo apt install ffmpeg
Windows: Download from https://ffmpeg.org/download.html
# Test that faster-whisper and ffmpeg work
python -c "import faster_whisper; print('Whisper OK')"
ffmpeg -version
Choose the right model for your needs:
| Model | Size | Speed | Accuracy | Memory | Best For |
|---|---|---|---|---|---|
| `tiny` | 39MB | Fastest | Basic | ~1GB | Quick testing, large datasets |
| `base` | 74MB | Fast | Good | ~1GB | General use (default) |
| `small` | 244MB | Medium | Better | ~2GB | Balanced speed/accuracy |
| `medium` | 769MB | Slower | Very Good | ~5GB | High accuracy needed |
| `large-v3` | 1550MB | Slowest | Best | ~10GB | Maximum accuracy |
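As a rough aid for choosing a model, the memory column of the table can be turned into a small lookup. This is only a sketch: the helper name `largest_model_that_fits` is hypothetical and not part of the tool, and the memory figures are the approximate values from the table, not measurements.

```python
# Approximate figures copied from the model table: (name, download size in MB, memory in GB).
# The list is ordered from least to most accurate.
MODELS = [
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large-v3", 1550, 10),
]


def largest_model_that_fits(available_gb: float) -> str:
    """Return the most accurate model whose approximate memory need fits (hypothetical helper)."""
    fitting = [name for name, _size_mb, memory_gb in MODELS if memory_gb <= available_gb]
    if not fitting:
        raise ValueError(f"No model fits in {available_gb} GB of memory")
    return fitting[-1]


print(largest_model_that_fits(4))   # small
print(largest_model_that_fits(16))  # large-v3
```

The returned name is what you would pass to `--model-name`.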
- Open Telegram app
- Go to Settings > Privacy & Security > Data Export
- Select "Chats" and export as JSON
- Wait for the export to complete and download the archive
# Basic usage - analyze all your messages
tg-wordcloud -i path/to/result.json -o my_wordcloud
# Filter by author and time period
tg-wordcloud -i result.json -a "Your Name" -p 2024 -o my_2024_words
# Include all word types (nouns, adjectives, verbs) and exclude custom words
tg-wordcloud -i result.json -w all -e exclude_words.txt -o comprehensive
# Analyze voice messages and text with Russian language hint
tg-wordcloud -i result.json --language ru -o with_voice
# Only analyze voice and video messages (no text)
tg-wordcloud -i result.json --no-text -o voice_only
usage: tg-wordcloud [-h] -i INPUT [-o OUTPUT] [-e EXCLUDE_WORDS] [-p PERIOD]
[-a AUTHOR] [-w {nouns,adjectives,verbs,all}]
[--include-text] [--no-text] [--include-voice] [--no-voice]
[--include-video] [--no-video]
[--skip-transcription] [--transcriber {local}]
[--model-name MODEL_NAME] [--language LANGUAGE]
[--cache-dir CACHE_DIR] [--no-cache]
[--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Generate word clouds from Telegram export data
required arguments:
-i, --input INPUT Path to Telegram JSON export file
optional arguments:
-h, --help Show this help message and exit
-o, --output OUTPUT Output prefix for generated files (default: output)
-e, --exclude-words EXCLUDE_WORDS
Path to text file containing words to exclude (one per line)
-p, --period PERIOD Time period filter (examples: "2024", "2023-2024", "Jan-2024")
-a, --author AUTHOR Filter by author name (case-insensitive, comma-separated)
-w, --word-types {nouns,adjectives,verbs,all}
Part-of-speech types to include (default: nouns)
--include-text Include text messages (enabled by default)
--no-text Exclude text messages
--include-voice Include voice messages via transcription (enabled by default)
--no-voice Exclude voice messages
--include-video Include video messages via transcription (enabled by default)
--no-video Exclude video messages
--skip-transcription Skip transcription (only process cached transcripts)
--transcriber {local} Transcription backend to use (default: local)
--model-name MODEL_NAME
Whisper model name (tiny, base, small, medium, large-v3) (default: base)
--language LANGUAGE Language hint for transcription (e.g., ru, en)
--cache-dir CACHE_DIR Directory for transcription cache (default: .transcription_cache)
--no-cache Disable transcription caching
--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Set logging level (default: INFO)
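The paired `--include-*` / `--no-*` flags are all enabled by default, which can be confusing at first. The sketch below shows one way such pairs could be declared with `argparse` so that both flags write to the same destination. It is an illustration only and makes no claim about how the tool's own parser is written; `build_parser` is a hypothetical name and only a few of the options are reproduced.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser covering a subset of the options listed above."""
    parser = argparse.ArgumentParser(prog="tg-wordcloud")
    parser.add_argument("-i", "--input", required=True)
    parser.add_argument(
        "-w", "--word-types",
        choices=["nouns", "adjectives", "verbs", "all"],
        default="nouns",
    )
    # Each content type gets two flags sharing one destination, defaulting to True.
    for kind in ("text", "voice", "video"):
        parser.add_argument(
            f"--include-{kind}", dest=f"include_{kind}",
            action="store_true", default=True,
        )
        parser.add_argument(
            f"--no-{kind}", dest=f"include_{kind}", action="store_false",
        )
    return parser


args = build_parser().parse_args(["-i", "result.json", "--no-text"])
print(args.include_text, args.include_voice, args.include_video)  # False True True
```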
# Generate word cloud from all messages
tg-wordcloud -i telegram_export.json -o my_messages
# This creates:
# - my_messages.png (word cloud image)
# - my_messages.csv (frequency table)
# - my_messages.txt (detailed report)
# Analyze messages from a specific year
tg-wordcloud -i export.json -p 2024 -o messages_2024
# Analyze a date range
tg-wordcloud -i export.json -p "2023-2024" -o messages_recent
# Analyze specific month
tg-wordcloud -i export.json -p "Jan-2024" -o messages_january
tg-wordcloud -i export.json -p "2024-01" -o messages_january_alt
# Relative periods
tg-wordcloud -i export.json -p "this-year" -o messages_this_year
tg-wordcloud -i export.json -p "last-year" -o messages_last_year
# Filter by your messages only
tg-wordcloud -i export.json -a "Your Name" -o my_words
# Filter by multiple authors
tg-wordcloud -i export.json -a "Alice,Bob,Charlie" -o group_words
# Only nouns (default)
tg-wordcloud -i export.json -w nouns -o nouns_only
# Only adjectives
tg-wordcloud -i export.json -w adjectives -o descriptive_words
# Only verbs
tg-wordcloud -i export.json -w verbs -o action_words
# All word types
tg-wordcloud -i export.json -w all -o all_words
Create a file exclude_words.txt:
привет
спасибо
пожалуйста
хорошо
tg-wordcloud -i export.json -e exclude_words.txt -o filtered_words
# Analyze all content including voice and video messages
tg-wordcloud -i export.json --language ru -o complete_analysis
# Use a smaller/faster Whisper model
tg-wordcloud -i export.json --model-name tiny -o fast_transcription
# Use a more accurate Whisper model
tg-wordcloud -i export.json --model-name large-v3 -o accurate_transcription
# Only analyze voice messages (skip text and video)
tg-wordcloud -i export.json --no-text --no-video -o voice_only
# Skip transcription, only use cached results
tg-wordcloud -i export.json --skip-transcription -o cached_only
# Disable caching
tg-wordcloud -i export.json --no-cache -o no_cache
# Custom cache directory
tg-wordcloud -i export.json --cache-dir /path/to/cache -o custom_cache
# Analyze your messages from 2024, include all word types, with custom exclusions
tg-wordcloud -i export.json -a "Your Name" -p 2024 -w all -e custom_stops.txt -o analysis_2024
# Analyze voice messages from specific author with custom model
tg-wordcloud -i export.json -a "Friend Name" --model-name medium --language en -o friend_voice
# Debug mode for troubleshooting
tg-wordcloud -i export.json -o debug_output --log-level DEBUG
# Minimal output (only essential info)
tg-wordcloud -i export.json -o quiet_output --log-level WARNING
The tool generates multiple output files:
- High-resolution word cloud visualization
- Automatic font selection for Russian text support
- Customizable colors and layout
- Ranked list of words with frequencies
- Includes rank, word, frequency, and percentage columns
- Perfect for further analysis in Excel or other tools
- Comprehensive statistics about your text
- Top 50 most frequent words
- Processing summary and metadata
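For further analysis outside a spreadsheet, the frequency table can be read with Python's standard `csv` module. The sketch below assumes the header row uses the names `rank`, `word`, `frequency`, and `percentage`; the exact header spelling is an assumption, so check the first line of your own CSV. The sample rows are invented for illustration, and `top_words` is a hypothetical helper.

```python
import csv
import io

# Invented sample standing in for a generated frequency table; header names are assumed.
SAMPLE = """rank,word,frequency,percentage
1,время,120,4.8
2,работа,95,3.8
3,дом,60,2.4
"""


def top_words(csv_file, n=10):
    """Return the n most frequent (word, frequency) pairs from an open CSV file."""
    reader = csv.DictReader(csv_file)
    rows = [(row["word"], int(row["frequency"])) for row in reader]
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]


print(top_words(io.StringIO(SAMPLE), n=2))  # [('время', 120), ('работа', 95)]
```

For a real file, replace the `io.StringIO` object with `open("my_messages.csv", newline="", encoding="utf-8")`.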
The tool supports various time period formats:
| Format | Example | Description |
|---|---|---|
| Year | `2024` | All messages from 2024 |
| Year Range | `2023-2024` | Messages from 2023 to 2024 |
| Month (YYYY-MM) | `2024-01` | January 2024 |
| Month (Mon-YYYY) | `Jan-2024` | January 2024 |
| Month (Full) | `January 2024` | January 2024 |
| Specific Date | `2024-01-15` | January 15, 2024 |
| Relative | `this-year` | Current year |
| Relative | `last-year` | Previous year |
| Quarter | `Q1-2024` | First quarter of 2024 |
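To make the formats concrete, here is a minimal sketch of how period strings like these could be mapped to inclusive start and end dates. It is not the tool's actual parser: `parse_period` and `month_range` are hypothetical names, and treating both ends of the range as inclusive is an assumption.

```python
import calendar
import re
from datetime import date

MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
# Accept both full names ("january") and three-letter abbreviations ("jan").
MONTHS = {}
for number, name in enumerate(MONTH_NAMES, start=1):
    MONTHS[name] = number
    MONTHS[name[:3]] = number


def month_range(year, month):
    """First and last day of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def parse_period(text, today=None):
    """Map a period string to an inclusive (start, end) pair of dates."""
    today = today or date.today()
    s = text.strip().lower()
    if s == "this-year":
        return date(today.year, 1, 1), date(today.year, 12, 31)
    if s == "last-year":
        return date(today.year - 1, 1, 1), date(today.year - 1, 12, 31)
    m = re.fullmatch(r"(\d{4})", s)
    if m:
        return date(int(m[1]), 1, 1), date(int(m[1]), 12, 31)
    m = re.fullmatch(r"(\d{4})-(\d{4})", s)
    if m:
        return date(int(m[1]), 1, 1), date(int(m[2]), 12, 31)
    m = re.fullmatch(r"(\d{4})-(\d{2})", s)
    if m:
        return month_range(int(m[1]), int(m[2]))
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", s)
    if m:
        day = date(int(m[1]), int(m[2]), int(m[3]))
        return day, day
    m = re.fullmatch(r"q([1-4])-(\d{4})", s)
    if m:
        first_month = 3 * (int(m[1]) - 1) + 1
        year = int(m[2])
        return date(year, first_month, 1), month_range(year, first_month + 2)[1]
    m = re.fullmatch(r"([a-z]+)[- ](\d{4})", s)
    if m and m[1] in MONTHS:
        return month_range(int(m[2]), MONTHS[m[1]])
    raise ValueError(f"Unrecognized period: {text!r}")


print(parse_period("Q1-2024"))
print(parse_period("Jan-2024"))
```

Passing `today` explicitly keeps the relative formats (`this-year`, `last-year`) reproducible in tests.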
- Files over 100MB automatically use streaming parser
- Memory usage stays low regardless of file size
- Processing time scales linearly with file size
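The idea behind streaming can be sketched with the standard library alone: read the export in chunks and yield one message at a time instead of loading the whole file. This is an illustration under assumptions, not the tool's parser. `iter_messages` is a hypothetical name, the export is assumed to hold a top-level `"messages"` array of objects, and the first occurrence of the text `"messages"` is assumed to be that key. A dedicated streaming JSON library such as ijson would be the more usual choice in practice.

```python
import io
import json


def iter_messages(stream, chunk_size=65536):
    """Yield message dicts one by one from a text stream holding a Telegram-style export."""
    decoder = json.JSONDecoder()
    marker = '"messages"'
    buffer = ""

    # Phase 1: skip ahead to the opening bracket of the messages array.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return  # no messages array found
        buffer += chunk
        pos = buffer.find(marker)
        if pos == -1:
            buffer = buffer[-len(marker):]  # keep a tail in case the marker straddles chunks
            continue
        start = buffer.find("[", pos + len(marker))
        if start == -1:
            buffer = buffer[pos:]  # marker seen, bracket not yet read
            continue
        buffer = buffer[start + 1:]
        break

    # Phase 2: decode one object at a time, reading more input when an object is incomplete.
    # Re-parsing after each read is simple but not optimal for very large single messages.
    while True:
        buffer = buffer.lstrip(" \t\r\n,")
        if buffer.startswith("]"):
            return  # end of the messages array
        try:
            message, end = decoder.raw_decode(buffer)
        except json.JSONDecodeError:
            chunk = stream.read(chunk_size)
            if not chunk:
                return  # truncated or incomplete export
            buffer += chunk
            continue
        yield message
        buffer = buffer[end:]


SAMPLE = (
    '{"name": "Chat", "type": "personal_chat", "messages": ['
    '{"id": 1, "from": "Alice", "text": "hi"}, '
    '{"id": 2, "from": "Bob", "text": "see [you], later"}]}'
)

# A tiny chunk size forces messages to span several reads.
for message in iter_messages(io.StringIO(SAMPLE), chunk_size=8):
    print(message["id"], message["from"])
```

Because only the current chunk and one partially read message are held in the buffer, memory use depends on the largest message rather than on the file size.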
- Use specific time periods to reduce processing time
- Filter by author early to reduce text processing load
- Use `--log-level WARNING` for faster processing (less output)
- Model Selection: Use `--model-name tiny` for fastest transcription, `large-v3` for best accuracy
- Caching: Transcription results are cached automatically - subsequent runs are much faster
- Language Hints: Use `--language ru` or `--language en` to improve accuracy and speed
- Memory: Whisper models require 1-8GB RAM depending on model size
- GPU: CUDA-capable GPUs dramatically accelerate transcription (requires separate CUDA setup)
- Processing Time: Expect ~10-30% of audio duration for transcription time (varies by model/hardware)
- Text-only: 1000 messages/second
- With transcription: 1-5 minutes of audio per minute of processing (model-dependent)
- Cache hits: Instant processing of previously transcribed files
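Why cache hits are effectively instant can be shown with a small content-addressed cache. This is a sketch under assumptions, not the tool's cache format: the key derivation (SHA-256 of the file bytes plus model name and language), the JSON entry layout, and the names `cache_key` and `cached_transcribe` are all hypothetical. A fake transcriber stands in for Whisper so the example runs without any model.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(media_path, model_name, language):
    """Hash the file contents together with the settings that affect the transcript."""
    digest = hashlib.sha256()
    with media_path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    digest.update(f"|{model_name}|{language or ''}".encode("utf-8"))
    return digest.hexdigest()


def cached_transcribe(media_path, model_name, language, cache_dir, transcribe):
    """Return a cached transcript if present, otherwise transcribe and store the result."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / (cache_key(media_path, model_name, language) + ".json")
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["text"]
    text = transcribe(media_path)
    entry.write_text(json.dumps({"text": text}, ensure_ascii=False), encoding="utf-8")
    return text


calls = []


def fake_transcribe(path):
    """Stand-in for a real transcription backend; records each invocation."""
    calls.append(path)
    return "привет мир"


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    audio = tmp_path / "voice.ogg"
    audio.write_bytes(b"fake audio bytes")
    cache = tmp_path / ".transcription_cache"
    first = cached_transcribe(audio, "base", "ru", cache, fake_transcribe)
    second = cached_transcribe(audio, "base", "ru", cache, fake_transcribe)

print(first == second, len(calls))  # True 1
```

Including the model name and language in the key means that switching models re-transcribes a file instead of reusing a transcript produced with different settings.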
- Check if author name matches exactly (case-insensitive)
- Verify time period format
- Try broader filters (e.g., `-w all` instead of `-w nouns`)
- If using only voice/video, ensure transcription is working (check logs)
- Use absolute paths for input files
- Check file permissions
- Verify JSON export is complete
- For voice/video files, ensure the export includes media files
- The tool automatically finds system fonts
- On Linux, install the `fonts-dejavu` package
- On Windows, ensure Arial or similar fonts are available
- Streaming parser should handle files up to 10GB+
- If issues persist, try filtering by time period first
- Close other memory-intensive applications
pip install faster-whisper
- Install FFmpeg (see installation instructions above)
- Verify FFmpeg is in PATH: `ffmpeg -version`
- On Windows, restart terminal after FFmpeg installation
- Use smaller models: `--model-name tiny` or `--model-name base`
- Enable GPU acceleration (requires CUDA setup)
- Process smaller time periods or specific authors
- Check if audio files exist in the export
- Try different language hints: `--language ru`, `--language en`
- Check log output with `--log-level DEBUG`
- Clear cache: manually delete the `.transcription_cache` directory
- Use `--no-cache` to disable caching temporarily
- Check disk space for cache directory
- Close other memory-intensive applications
tg-wordcloud -i export.json -o debug --log-level DEBUG
This provides detailed logging including:
- File parsing progress
- Filtering statistics
- Text processing details
- Error traces
See CONTRIBUTING.md for contribution guidelines and expectations.
git clone https://github.com/brzvsk/telegram-wordcloud-cli.git
cd telegram-wordcloud-cli
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Run tests
pytest
# Type checking
mypy telegram_wordcloud_cli/
# Code formatting
black telegram_wordcloud_cli/ main.py
isort telegram_wordcloud_cli/ main.py
This project is licensed under the MIT License - see the LICENSE file for details.
- Requires significant computational resources (CPU/RAM)
- Transcription accuracy depends on audio quality and language
- Supported formats: OGG, M4A, MP4, WAV (via FFmpeg)
- Internet connection required for initial model download
- Processing time scales with audio duration
- Russian language processing optimized, other languages supported but may need tuning
- Very large exports (>10GB) may require streaming mode and adequate disk space
- Telegram must include media files in export (not just metadata)
- pymorphy3 for Russian morphological analysis
- wordcloud for word cloud generation
- NLTK for text processing utilities
- faster-whisper for efficient speech transcription
- OpenAI Whisper for the underlying speech recognition models
- Telegram for providing data export functionality