Telegram Word Cloud CLI

A powerful command-line tool for generating word clouds from Telegram chat exports. Analyze your chat history, filter by author and time periods, and create beautiful visualizations of your conversations.

Features

🚀 Efficient Processing: Handles large (multi-GB) Telegram export files with a streaming parser
🎯 Smart Filtering: Filter by author, time periods, and part-of-speech types
🌍 Russian Language Support: Advanced Russian morphological analysis with pymorphy3
🎙️ Voice & Video Analysis: Transcribe voice messages and video notes using local Whisper models
💾 Smart Caching: Cache transcription results to avoid re-processing
📊 Multiple Outputs: Generate word clouds, frequency tables, and detailed reports
🎨 Customizable: Various word cloud themes and styles
⚡ Fast: Memory-efficient processing with progress tracking

Installation

Prerequisites

Python 3.8 or higher
Telegram export data in JSON format
For voice/video transcription: FFmpeg (see installation instructions below)

Install from Source

# Clone the repository
git clone https://github.com/brzvsk/telegram-wordcloud-cli.git
cd telegram-wordcloud-cli

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install the package
pip install -e .

# Optional: install contributor tools
pip install -e ".[dev]"

# Optional: enable voice/video transcription support
pip install -e ".[transcription]"

# Optional: run the module directly
python -m telegram_wordcloud_cli --help

Install as Package

# Install directly from the repository
pip install git+https://github.com/brzvsk/telegram-wordcloud-cli.git

# Or include optional transcription support
pip install "telegram-wordcloud-cli[transcription] @ git+https://github.com/brzvsk/telegram-wordcloud-cli.git"

Voice & Video Transcription Setup

To enable voice message and video note transcription:

1. Install FFmpeg

macOS (with Homebrew):

brew install ffmpeg

Ubuntu/Debian:

sudo apt update
sudo apt install ffmpeg

Windows: Download a build from https://ffmpeg.org/download.html and add its bin directory to PATH

2. Verify Installation

# Test that faster-whisper and ffmpeg work
python -c "import faster_whisper; print('Whisper OK')"
ffmpeg -version
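
The same check can be scripted. The sketch below is illustrative and not part of the tool: `check_transcription_setup` is a hypothetical helper that only uses the standard library to see whether the `faster_whisper` module is importable and whether an `ffmpeg` executable is on PATH.

```python
import importlib.util
import shutil


def check_transcription_setup():
    """Report whether the optional transcription dependencies are discoverable."""
    return {
        # find_spec returns None when the module cannot be located
        "faster_whisper": importlib.util.find_spec("faster_whisper") is not None,
        # shutil.which returns None when the executable is not on PATH
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, found in check_transcription_setup().items():
        print(f"{name}: {'OK' if found else 'missing'}")
```

A "missing" entry only means the dependency was not found in the current environment; it does not verify that transcription itself works.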

3. Whisper Model Selection

Choose the right model for your needs:

| Model | Parameters | Speed | Accuracy | Memory | Best For |
| --- | --- | --- | --- | --- | --- |
| `tiny` | 39M | Fastest | Basic | ~1GB | Quick testing, large datasets |
| `base` | 74M | Fast | Good | ~1GB | General use (default) |
| `small` | 244M | Medium | Better | ~2GB | Balanced speed/accuracy |
| `medium` | 769M | Slower | Very Good | ~5GB | High accuracy needed |
| `large-v3` | 1550M | Slowest | Best | ~10GB | Maximum accuracy |

Quick Start

1. Get Your Telegram Data

Open Telegram Desktop
Go to Settings > Advanced > Export Telegram data
Select the chats to include and choose "Machine-readable JSON" as the format
Wait for the export to finish, then locate result.json in the export folder

2. Generate Your First Word Cloud

# Basic usage - analyze all your messages
tg-wordcloud -i path/to/result.json -o my_wordcloud

# Filter by author and time period
tg-wordcloud -i result.json -a "Your Name" -p 2024 -o my_2024_words

# Include all word types (nouns, adjectives, verbs) and exclude custom words
tg-wordcloud -i result.json -w all -e exclude_words.txt -o comprehensive

# Analyze voice messages and text with Russian language hint
tg-wordcloud -i result.json --language ru -o with_voice

# Only analyze voice and video messages (no text)
tg-wordcloud -i result.json --no-text -o voice_only

Command Line Options

usage: tg-wordcloud [-h] -i INPUT [-o OUTPUT] [-e EXCLUDE_WORDS] [-p PERIOD] 
                    [-a AUTHOR] [-w {nouns,adjectives,verbs,all}] 
                    [--include-text] [--no-text] [--include-voice] [--no-voice]
                    [--include-video] [--no-video]
                    [--skip-transcription] [--transcriber {local}]
                    [--model-name MODEL_NAME] [--language LANGUAGE]
                    [--cache-dir CACHE_DIR] [--no-cache]
                    [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Generate word clouds from Telegram export data

required arguments:
  -i, --input INPUT     Path to Telegram JSON export file

optional arguments:
  -h, --help            Show this help message and exit
  -o, --output OUTPUT   Output prefix for generated files (default: output)
  -e, --exclude-words EXCLUDE_WORDS
                        Path to text file containing words to exclude (one per line)
  -p, --period PERIOD   Time period filter (examples: "2024", "2023-2024", "Jan-2024")
  -a, --author AUTHOR   Filter by author name (case-insensitive, comma-separated)
  -w, --word-types {nouns,adjectives,verbs,all}
                        Part-of-speech types to include (default: nouns)
  --include-text        Include text messages (enabled by default)
  --no-text             Exclude text messages
  --include-voice       Include voice messages via transcription (enabled by default)
  --no-voice            Exclude voice messages
  --include-video       Include video messages via transcription (enabled by default)
  --no-video            Exclude video messages
  --skip-transcription  Skip transcription (only process cached transcripts)
  --transcriber {local} Transcription backend to use (default: local)
  --model-name MODEL_NAME
                        Whisper model name (tiny, base, small, medium, large-v3) (default: base)
  --language LANGUAGE   Language hint for transcription (e.g., ru, en)
  --cache-dir CACHE_DIR Directory for transcription cache (default: .transcription_cache)
  --no-cache            Disable transcription caching
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set logging level (default: INFO)
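
The paired `--include-*` / `--no-*` switches above can be modelled with two flags that write to the same destination. This is a minimal sketch of that pattern with `argparse`, not the tool's actual parser; `build_parser` and the destination names `text`, `voice`, and `video` are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser with include/exclude flag pairs that default to 'included'."""
    parser = argparse.ArgumentParser(prog="tg-wordcloud-sketch")
    for kind in ("text", "voice", "video"):
        # Both flags share one destination, so the last one given wins.
        parser.add_argument(f"--include-{kind}", dest=kind,
                            action="store_true", default=True)
        parser.add_argument(f"--no-{kind}", dest=kind, action="store_false")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--no-text", "--no-video"])
    print(args.text, args.voice, args.video)
```

With this layout every message type is on unless its `--no-*` flag is passed, which matches the "enabled by default" wording in the help text.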

Usage Examples

Basic Word Cloud Generation

# Generate word cloud from all messages
tg-wordcloud -i telegram_export.json -o my_messages

# This creates:
# - my_messages.png (word cloud image)
# - my_messages.csv (frequency table)
# - my_messages.txt (detailed report)

Time Period Filtering

# Analyze messages from a specific year
tg-wordcloud -i export.json -p 2024 -o messages_2024

# Analyze a date range
tg-wordcloud -i export.json -p "2023-2024" -o messages_recent

# Analyze specific month
tg-wordcloud -i export.json -p "Jan-2024" -o messages_january
tg-wordcloud -i export.json -p "2024-01" -o messages_january_alt

# Relative periods
tg-wordcloud -i export.json -p "this-year" -o messages_this_year
tg-wordcloud -i export.json -p "last-year" -o messages_last_year

Author Filtering

# Filter by your messages only
tg-wordcloud -i export.json -a "Your Name" -o my_words

# Filter by multiple authors
tg-wordcloud -i export.json -a "Alice,Bob,Charlie" -o group_words
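
For readers curious how a comma-separated, case-insensitive author filter might behave, here is a small sketch. It assumes each message is a dict with a `from` field holding the sender's display name, as in Telegram's JSON export, and that names are compared in full; `parse_authors` and `by_author` are hypothetical names, and the tool's real matching rules may differ.

```python
def parse_authors(raw):
    """Split a comma-separated author value into a set of case-folded names."""
    return {name.strip().casefold() for name in raw.split(",") if name.strip()}


def by_author(messages, raw_authors):
    """Yield messages whose sender matches one of the requested authors."""
    wanted = parse_authors(raw_authors)
    for message in messages:
        sender = message.get("from") or ""  # the field may be missing or null
        if sender.casefold() in wanted:
            yield message


if __name__ == "__main__":
    sample = [{"from": "Alice", "text": "hi"}, {"from": "Dave", "text": "yo"}]
    print(list(by_author(sample, "alice, BOB")))
```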

Part-of-Speech Filtering

# Only nouns (default)
tg-wordcloud -i export.json -w nouns -o nouns_only

# Only adjectives
tg-wordcloud -i export.json -w adjectives -o descriptive_words

# Only verbs
tg-wordcloud -i export.json -w verbs -o action_words

# All word types
tg-wordcloud -i export.json -w all -o all_words
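
Part-of-speech filtering for Russian can be sketched on top of pymorphy3, whose `MorphAnalyzer.parse` returns candidate parses with a `normal_form` and a `tag.POS`. The mapping from the CLI's word-type names to POS tags below is an assumption, as are the helper names; the analyzer is passed in as a function so the filtering logic can be tried without pymorphy3 installed.

```python
# Assumed mapping from word-type names to OpenCorpora POS tags reported by
# pymorphy3; the tool's real mapping may differ.
POS_GROUPS = {
    "nouns": {"NOUN"},
    "adjectives": {"ADJF", "ADJS"},
    "verbs": {"VERB", "INFN"},
}
POS_GROUPS["all"] = set().union(*POS_GROUPS.values())


def make_pymorphy_analyzer():
    """Return a function word -> (lemma, POS) backed by pymorphy3."""
    import pymorphy3  # imported lazily so the rest of the sketch runs without it

    morph = pymorphy3.MorphAnalyzer()

    def analyze(word):
        best = morph.parse(word)[0]  # first (highest-scored) parse
        return best.normal_form, best.tag.POS

    return analyze


def lemmas_of_type(words, word_type, analyze):
    """Keep the lemmas of words whose part of speech belongs to word_type."""
    allowed = POS_GROUPS[word_type]
    result = []
    for word in words:
        lemma, pos = analyze(word)
        if pos in allowed:
            result.append(lemma)
    return result
```

With pymorphy3 installed, `lemmas_of_type(words, "nouns", make_pymorphy_analyzer())` would apply the real analyzer. Counting lemmas rather than surface forms keeps inflected variants of the same word together.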

Custom Word Exclusion

Create a file exclude_words.txt:

привет
спасибо
пожалуйста
хорошо

tg-wordcloud -i export.json -e exclude_words.txt -o filtered_words
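
A one-word-per-line exclusion file is simple to load into a set. The sketch below shows one plausible way to do it; `load_exclusions` and `drop_excluded` are hypothetical helpers, and the case-insensitive comparison is an assumption rather than documented behaviour.

```python
from pathlib import Path


def load_exclusions(path):
    """Read one word per line, ignoring blank lines and surrounding whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().casefold() for line in lines if line.strip()}


def drop_excluded(words, excluded):
    """Remove excluded words, comparing case-insensitively."""
    return [word for word in words if word.casefold() not in excluded]
```

Reading the file as UTF-8 matters here because the sample list contains Cyrillic words.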

Voice & Video Message Examples

# Analyze all content including voice and video messages
tg-wordcloud -i export.json --language ru -o complete_analysis

# Use a smaller/faster Whisper model
tg-wordcloud -i export.json --model-name tiny -o fast_transcription

# Use a more accurate Whisper model
tg-wordcloud -i export.json --model-name large-v3 -o accurate_transcription

# Only analyze voice messages (skip text and video)
tg-wordcloud -i export.json --no-text --no-video -o voice_only

# Skip transcription, only use cached results
tg-wordcloud -i export.json --skip-transcription -o cached_only

# Disable caching
tg-wordcloud -i export.json --no-cache -o no_cache

# Custom cache directory
tg-wordcloud -i export.json --cache-dir /path/to/cache -o custom_cache

Advanced Examples

# Analyze your messages from 2024, include all word types, with custom exclusions
tg-wordcloud -i export.json -a "Your Name" -p 2024 -w all -e custom_stops.txt -o analysis_2024

# Analyze voice messages from specific author with custom model
tg-wordcloud -i export.json -a "Friend Name" --model-name medium --language en -o friend_voice

# Debug mode for troubleshooting
tg-wordcloud -i export.json -o debug_output --log-level DEBUG

# Minimal output (only essential info)
tg-wordcloud -i export.json -o quiet_output --log-level WARNING

Output Files

The tool generates multiple output files:

Word Cloud Image (`output.png`)

High-resolution word cloud visualization
Automatic font selection for Russian text support
Customizable colors and layout

Frequency Table (`output.csv`)

Ranked list of words with frequencies
Includes rank, word, frequency, and percentage columns
Perfect for further analysis in Excel or other tools
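
To make the table layout concrete, the sketch below builds rank, word, frequency, and percentage rows from a list of words. The exact column names and rounding used by the tool are not documented here, so the header and the two-decimal rounding are assumptions.

```python
import csv
import io
from collections import Counter


def frequency_rows(words):
    """Return (rank, word, frequency, percentage) tuples, most frequent first."""
    counts = Counter(words)
    total = sum(counts.values())
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, round(100 * freq / total, 2)))
    return rows


def to_csv(rows):
    """Serialise the rows with a header line; the column names are assumed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "word", "frequency", "percentage"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(frequency_rows(["дом", "дом", "кот", "дом"])))
```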

Detailed Report (`output.txt`)

Comprehensive statistics about your text
Top 50 most frequent words
Processing summary and metadata

Time Period Formats

The tool supports various time period formats:

| Format | Example | Description |
| --- | --- | --- |
| Year | `2024` | All messages from 2024 |
| Year Range | `2023-2024` | Messages from 2023 to 2024 |
| Month (YYYY-MM) | `2024-01` | January 2024 |
| Month (Mon-YYYY) | `Jan-2024` | January 2024 |
| Month (Full) | `January 2024` | January 2024 |
| Specific Date | `2024-01-15` | January 15, 2024 |
| Relative | `this-year` | Current year |
| Relative | `last-year` | Previous year |
| Quarter | `Q1-2024` | First quarter of 2024 |
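
As an illustration of how such period strings can map to date ranges, here is a sketch of a parser for a subset of the formats. `parse_period` is a hypothetical function, not the tool's implementation; it assumes ranges are inclusive on both ends and leaves out the relative and full-month-name forms.

```python
import calendar
import re
from datetime import date

MONTHS = "jan feb mar apr may jun jul aug sep oct nov dec".split()


def _month_range(year, month):
    """Return the first and last day of a month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def parse_period(period):
    """Return an inclusive (start, end) date pair for a few period formats."""
    text = period.strip()
    if m := re.fullmatch(r"(\d{4})", text):                  # 2024
        return date(int(m[1]), 1, 1), date(int(m[1]), 12, 31)
    if m := re.fullmatch(r"(\d{4})-(\d{4})", text):          # 2023-2024
        return date(int(m[1]), 1, 1), date(int(m[2]), 12, 31)
    if m := re.fullmatch(r"(\d{4})-(\d{2})", text):          # 2024-01
        return _month_range(int(m[1]), int(m[2]))
    if m := re.fullmatch(r"([A-Za-z]{3})-(\d{4})", text):    # Jan-2024
        return _month_range(int(m[2]), MONTHS.index(m[1].lower()) + 1)
    if m := re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text):  # 2024-01-15
        day = date(int(m[1]), int(m[2]), int(m[3]))
        return day, day
    if m := re.fullmatch(r"[Qq]([1-4])-(\d{4})", text):      # Q1-2024
        first = 3 * int(m[1]) - 2                            # first month of quarter
        return date(int(m[2]), first, 1), _month_range(int(m[2]), first + 2)[1]
    raise ValueError(f"unsupported period: {period!r}")
```

Because `re.fullmatch` must consume the whole string, `2023-2024` and `2024-01` are told apart by length alone, so the order of the checks does not matter for these patterns.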

Performance Tips

Large Files

Files over 100MB automatically use the streaming parser
Memory usage stays low regardless of file size
Processing time scales linearly with file size
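
The idea behind streaming can be sketched with the standard library alone: read the file in chunks and decode one message object at a time instead of loading the whole document. This is not the tool's parser. `iter_messages` is a hypothetical helper that assumes a single-chat export with one top-level `"messages"` array of objects, and it takes the first occurrence of that key as the start of the array.

```python
import json


def iter_messages(fileobj, chunk_size=65536):
    """Yield objects from the "messages" array without loading the whole file."""
    decoder = json.JSONDecoder()
    marker = '"messages"'
    buffer = ""
    # Phase 1: skip ahead to the opening bracket of the "messages" array.
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        buffer += chunk
        pos = buffer.find(marker)
        if pos == -1:
            buffer = buffer[-len(marker):]  # keep overlap for a split marker
            continue
        bracket = buffer.find("[", pos)
        if bracket != -1:
            buffer = buffer[bracket + 1:]
            break
        buffer = buffer[pos:]
    # Phase 2: decode one array element at a time, reading more on demand.
    while True:
        buffer = buffer.lstrip(" \t\r\n,")
        if buffer.startswith("]"):
            return
        try:
            obj, end = decoder.raw_decode(buffer)
        except json.JSONDecodeError:
            chunk = fileobj.read(chunk_size)
            if not chunk:
                return  # truncated input: stop quietly in this sketch
            buffer += chunk
            continue
        yield obj
        buffer = buffer[end:]
```

Typical use would be `with open("result.json", encoding="utf-8") as f: for message in iter_messages(f): ...`. Only the current chunk and one partially decoded message are held in memory, which is why memory use does not grow with file size.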

Optimization

Use specific time periods to reduce processing time
Filter by author early to reduce text processing load
Use --log-level WARNING for faster processing (less output)

Voice/Video Performance

Model Selection: Use --model-name tiny for fastest transcription, large-v3 for best accuracy
Caching: Transcription results are cached automatically - subsequent runs are much faster
Language Hints: Use --language ru or --language en to improve accuracy and speed
Memory: Whisper models need roughly 1-10GB of RAM depending on model size (see the model table above)
GPU: CUDA-capable GPUs dramatically accelerate transcription (requires separate CUDA setup)
Processing Time: Expect transcription to take roughly 10-30% of the audio duration (varies by model/hardware)
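
To show why cached runs are faster, here is a sketch of a content-addressed transcription cache. It is an illustration under stated assumptions, not the tool's cache format: `cache_key` and `cached_transcript` are hypothetical names, the key is a SHA-256 hash of the media file plus the model name and language, and each entry is stored as a small JSON file.

```python
import hashlib
import json
from pathlib import Path


def cache_key(media_path, model_name, language=None):
    """Hash the file contents together with the settings that affect the result."""
    digest = hashlib.sha256()
    digest.update(f"{model_name}|{language or ''}|".encode("utf-8"))
    with open(media_path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_transcript(media_path, model_name, transcribe,
                      cache_dir=".transcription_cache", language=None):
    """Return a cached transcript, calling transcribe(media_path) only on a miss."""
    key = cache_key(media_path, model_name, language)
    entry = Path(cache_dir) / f"{key}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["text"]
    text = transcribe(media_path)
    entry.parent.mkdir(parents=True, exist_ok=True)
    entry.write_text(json.dumps({"text": text}, ensure_ascii=False), encoding="utf-8")
    return text
```

Including the model name in the key means that switching from `base` to `large-v3` triggers a fresh transcription instead of reusing a lower-quality result.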

Typical Performance

Text-only: 1000 messages/second
With transcription: 1-5 minutes of audio per minute of processing (model-dependent)
Cache hits: Instant processing of previously transcribed files

Troubleshooting

Common Issues

"No words found after filtering"

Check that the author name matches the name shown in the export (matching is case-insensitive)
Verify time period format
Try broader filters (e.g., -w all instead of -w nouns)
If using only voice/video, ensure transcription is working (check logs)

"File not found" errors

Use absolute paths for input files
Check file permissions
Verify JSON export is complete
For voice/video files, ensure the export includes media files

Missing Russian characters in word cloud

The tool automatically finds system fonts
On Linux (Debian/Ubuntu), install the fonts-dejavu package
On Windows, ensure Arial or similar fonts are available
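
If automatic detection fails, it can help to check by hand which Cyrillic-capable font files exist. The sketch below is a hypothetical helper, not the tool's lookup logic; the candidate paths are assumptions about common install locations and may need adjusting for your system.

```python
from pathlib import Path

# Assumed candidate locations; adjust for your system.
CANDIDATE_FONTS = [
    "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",  # Debian/Ubuntu fonts-dejavu
    "/System/Library/Fonts/Supplemental/Arial.ttf",     # macOS
    "C:/Windows/Fonts/arial.ttf",                       # Windows
]


def find_font(candidates=CANDIDATE_FONTS):
    """Return the first existing font file, or None if none is found."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


if __name__ == "__main__":
    print(find_font() or "no candidate font found")
```

The `wordcloud` library's `WordCloud` class accepts a `font_path` argument, so a path found this way can be used when rendering Cyrillic text with that library directly.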

Memory issues with very large files

The streaming parser should handle files of 10GB and larger
If issues persist, try filtering by time period first
Close other memory-intensive applications

Voice/Video Transcription Issues

"faster-whisper is not installed"

pip install faster-whisper

"FFmpeg not found" or audio decoding errors

Install FFmpeg (see installation instructions above)
Verify FFmpeg is in PATH: ffmpeg -version
On Windows, restart terminal after FFmpeg installation

Transcription is slow

Use smaller models: --model-name tiny or --model-name base
Enable GPU acceleration (requires CUDA setup)
Process smaller time periods or specific authors

Empty transcriptions

Check if audio files exist in the export
Try different language hints: --language ru, --language en
Check log output with --log-level DEBUG

Cache issues

Clear cache: manually delete .transcription_cache directory
Use --no-cache to disable caching temporarily
Check disk space for cache directory

Debug Mode

tg-wordcloud -i export.json -o debug --log-level DEBUG

This provides detailed logging including:

File parsing progress
Filtering statistics
Text processing details
Error traces

Contributing

See CONTRIBUTING.md for contribution guidelines and expectations.

Development Setup

git clone https://github.com/brzvsk/telegram-wordcloud-cli.git
cd telegram-wordcloud-cli
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Run tests
pytest

# Type checking
mypy telegram_wordcloud_cli/

# Code formatting
black telegram_wordcloud_cli/ main.py
isort telegram_wordcloud_cli/ main.py

License

This project is licensed under the MIT License - see the LICENSE file for details.

Limitations

Voice & Video Transcription

Requires significant computational resources (CPU/RAM)
Transcription accuracy depends on audio quality and language
Supported formats: OGG, M4A, MP4, WAV (via FFmpeg)
Internet connection required for initial model download
Processing time scales with audio duration

General

Processing is optimized for Russian; other languages are supported but may need tuning
Very large exports (>10GB) may require streaming mode and adequate disk space
For voice/video transcription, the Telegram export must include the media files (not just metadata)

Acknowledgments

pymorphy3 for Russian morphological analysis
wordcloud for word cloud generation
NLTK for text processing utilities
faster-whisper for efficient speech transcription
OpenAI Whisper for the underlying speech recognition models
Telegram for providing data export functionality
