brzvsk/telegram-wordcloud-cli

Telegram Word Cloud CLI

A powerful command-line tool for generating word clouds from Telegram chat exports. Analyze your chat history, filter by author and time periods, and create beautiful visualizations of your conversations.

Features

  • 🚀 Efficient Processing: Handles large Telegram export files (multi-GB) with a streaming parser
  • 🎯 Smart Filtering: Filter by author, time periods, and part-of-speech types
  • 🌍 Russian Language Support: Advanced Russian morphological analysis with pymorphy3
  • 🎙️ Voice & Video Analysis: Transcribe voice messages and video notes using local Whisper models
  • 💾 Smart Caching: Cache transcription results to avoid re-processing
  • 📊 Multiple Outputs: Generate word clouds, frequency tables, and detailed reports
  • 🎨 Customizable: Various word cloud themes and styles
  • ⚡ Fast: Memory-efficient processing with progress tracking

Installation

Prerequisites

  • Python 3.8 or higher
  • Telegram export data in JSON format
  • For voice/video transcription: FFmpeg (see installation instructions below)

Install from Source

# Clone the repository
git clone https://github.com/brzvsk/telegram-wordcloud-cli.git
cd telegram-wordcloud-cli

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install the package
pip install -e .

# Optional: install contributor tools
pip install -e ".[dev]"

# Optional: enable voice/video transcription support
pip install -e ".[transcription]"

# Optional: run the module directly
python -m telegram_wordcloud_cli --help

Install as Package

# Install directly from the repository
pip install git+https://github.com/brzvsk/telegram-wordcloud-cli.git

# Or include optional transcription support
pip install "telegram-wordcloud-cli[transcription] @ git+https://github.com/brzvsk/telegram-wordcloud-cli.git"

Voice & Video Transcription Setup

To enable voice message and video note transcription:

1. Install FFmpeg

macOS (with Homebrew):

brew install ffmpeg

Ubuntu/Debian:

sudo apt update
sudo apt install ffmpeg

Windows: Download from https://ffmpeg.org/download.html

2. Verify Installation

# Test that faster-whisper and ffmpeg work
python -c "import faster_whisper; print('Whisper OK')"
ffmpeg -version
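
The same check can be scripted. The sketch below is not part of the tool; the function name `check_transcription_prereqs` is made up for illustration. It only reports whether `ffmpeg` is on PATH and whether the `faster_whisper` module can be found, without loading a model.

```python
import importlib.util
import shutil


def check_transcription_prereqs():
    """Report whether the transcription prerequisites can be found."""
    return {
        # shutil.which returns None when the executable is not on PATH
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_spec returns None when the module cannot be imported
        "faster_whisper": importlib.util.find_spec("faster_whisper") is not None,
    }


for name, found in check_transcription_prereqs().items():
    print(f"{name}: {'OK' if found else 'missing'}")
```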

3. Whisper Model Selection

Choose the right model for your needs:

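| Model    | Size   | Speed   | Accuracy  | Memory | Best For                      |
|----------|--------|---------|-----------|--------|-------------------------------|
| tiny     | 39MB   | Fastest | Basic     | ~1GB   | Quick testing, large datasets |
| base     | 74MB   | Fast    | Good      | ~1GB   | General use (default)         |
| small    | 244MB  | Medium  | Better    | ~2GB   | Balanced speed/accuracy       |
| medium   | 769MB  | Slower  | Very Good | ~5GB   | High accuracy needed          |
| large-v3 | 1550MB | Slowest | Best      | ~10GB  | Maximum accuracy              |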
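
As a rough aid for choosing, the memory column above can be turned into a lookup. This is an illustrative sketch only: `pick_model` is a hypothetical helper, and the figures are the approximate values from the table, not measurements.

```python
# Approximate memory needs in GB, copied from the table above.
MODEL_MEMORY_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large-v3": 10,
}


def pick_model(available_gb):
    """Return the largest model whose listed memory need fits, or None."""
    best = None
    # Dicts keep insertion order, so models are visited smallest first.
    for name, needed_gb in MODEL_MEMORY_GB.items():
        if needed_gb <= available_gb:
            best = name
    return best
```

The result can be passed to `--model-name`. Leave headroom for the rest of the pipeline rather than matching the figure exactly.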

Quick Start

1. Get Your Telegram Data

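  1. Open Telegram Desktop (chat export is a desktop feature)
  2. Go to Settings > Advanced > Export Telegram data
  3. Select the chats to include and choose "Machine-readable JSON" as the format
  4. Wait for the export to complete, then locate result.json in the export folder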
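
Before filtering, it can help to see which author names the export actually contains. The sketch below is independent of the tool and rests on assumptions about the export layout: a single-chat export keeps `messages` at the top level, a full-account export nests chats under `chats` > `list`, and each regular message has a `from` field. The function names are hypothetical.

```python
import json
from collections import Counter


def count_authors(export):
    """Count messages per author in a parsed Telegram export."""
    # Assumption: single-chat exports have "messages" at the top level,
    # full-account exports nest chats under "chats" -> "list".
    if "messages" in export:
        chats = [export]
    else:
        chats = export.get("chats", {}).get("list", [])
    counts = Counter()
    for chat in chats:
        for message in chat.get("messages", []):
            author = message.get("from")
            if author:  # service messages carry no usable "from" value
                counts[author] += 1
    return counts


def count_authors_in_file(path):
    """Load a small export fully into memory and count its authors."""
    with open(path, encoding="utf-8") as handle:
        return count_authors(json.load(handle))
```

`count_authors_in_file("result.json").most_common(5)` lists the five most active names. This loads the whole file, so it suits small exports only.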

2. Generate Your First Word Cloud

# Basic usage - analyze all your messages
tg-wordcloud -i path/to/result.json -o my_wordcloud

# Filter by author and time period
tg-wordcloud -i result.json -a "Your Name" -p 2024 -o my_2024_words

# Include all word types (nouns, adjectives, verbs) and exclude custom words
tg-wordcloud -i result.json -w all -e exclude_words.txt -o comprehensive

# Analyze voice messages and text with Russian language hint
tg-wordcloud -i result.json --language ru -o with_voice

# Only analyze voice and video messages (no text)
tg-wordcloud -i result.json --no-text -o voice_only

Command Line Options

usage: tg-wordcloud [-h] -i INPUT [-o OUTPUT] [-e EXCLUDE_WORDS] [-p PERIOD] 
                    [-a AUTHOR] [-w {nouns,adjectives,verbs,all}] 
                    [--include-text] [--no-text] [--include-voice] [--no-voice]
                    [--include-video] [--no-video]
                    [--skip-transcription] [--transcriber {local}]
                    [--model-name MODEL_NAME] [--language LANGUAGE]
                    [--cache-dir CACHE_DIR] [--no-cache]
                    [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Generate word clouds from Telegram export data

required arguments:
  -i, --input INPUT     Path to Telegram JSON export file

optional arguments:
  -h, --help            Show this help message and exit
  -o, --output OUTPUT   Output prefix for generated files (default: output)
  -e, --exclude-words EXCLUDE_WORDS
                        Path to text file containing words to exclude (one per line)
  -p, --period PERIOD   Time period filter (examples: "2024", "2023-2024", "Jan-2024")
  -a, --author AUTHOR   Filter by author name (case-insensitive, comma-separated)
  -w, --word-types {nouns,adjectives,verbs,all}
                        Part-of-speech types to include (default: nouns)
  --include-text        Include text messages (enabled by default)
  --no-text             Exclude text messages
  --include-voice       Include voice messages via transcription (enabled by default)
  --no-voice            Exclude voice messages
  --include-video       Include video messages via transcription (enabled by default)
  --no-video            Exclude video messages
  --skip-transcription  Skip transcription (only process cached transcripts)
  --transcriber {local} Transcription backend to use (default: local)
  --model-name MODEL_NAME
                        Whisper model name (tiny, base, small, medium, large-v3) (default: base)
  --language LANGUAGE   Language hint for transcription (e.g., ru, en)
  --cache-dir CACHE_DIR Directory for transcription cache (default: .transcription_cache)
  --no-cache            Disable transcription caching
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set logging level (default: INFO)
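
The paired `--include-X` / `--no-X` switches above follow a common argparse pattern: both flags write to the same destination, which defaults to enabled. The sketch below shows that pattern with a few of the options; it is an assumption about how such a parser could be built, not the tool's actual source.

```python
import argparse


def build_parser():
    """Build a reduced parser that mirrors a few of the options above."""
    parser = argparse.ArgumentParser(prog="tg-wordcloud")
    parser.add_argument("-i", "--input", required=True)
    parser.add_argument(
        "-w", "--word-types",
        choices=["nouns", "adjectives", "verbs", "all"],
        default="nouns",
    )
    for kind in ("text", "voice", "video"):
        # Both flags share one destination; the content type is on by default.
        parser.add_argument(f"--include-{kind}", dest=f"include_{kind}",
                            action="store_true", default=True)
        parser.add_argument(f"--no-{kind}", dest=f"include_{kind}",
                            action="store_false")
    return parser
```

With this layout, `build_parser().parse_args(["-i", "result.json", "--no-text"])` yields `include_text=False` while voice and video stay enabled.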

Usage Examples

Basic Word Cloud Generation

# Generate word cloud from all messages
tg-wordcloud -i telegram_export.json -o my_messages

# This creates:
# - my_messages.png (word cloud image)
# - my_messages.csv (frequency table)
# - my_messages.txt (detailed report)

Time Period Filtering

# Analyze messages from a specific year
tg-wordcloud -i export.json -p 2024 -o messages_2024

# Analyze a date range
tg-wordcloud -i export.json -p "2023-2024" -o messages_recent

# Analyze specific month
tg-wordcloud -i export.json -p "Jan-2024" -o messages_january
tg-wordcloud -i export.json -p "2024-01" -o messages_january_alt

# Relative periods
tg-wordcloud -i export.json -p "this-year" -o messages_this_year
tg-wordcloud -i export.json -p "last-year" -o messages_last_year

Author Filtering

# Filter by your messages only
tg-wordcloud -i export.json -a "Your Name" -o my_words

# Filter by multiple authors
tg-wordcloud -i export.json -a "Alice,Bob,Charlie" -o group_words
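
A minimal sketch of case-insensitive, comma-separated matching, as described for `-a`. Both helper names are hypothetical, and whether the tool compares whole names or substrings is not stated here; this version compares whole names.

```python
def parse_author_filter(value):
    """Split a comma-separated author string into a set of lowercase names."""
    return {name.strip().casefold() for name in value.split(",") if name.strip()}


def author_matches(author, wanted):
    """True when the author is in the filter set, ignoring case."""
    return author is not None and author.strip().casefold() in wanted
```

`casefold()` is used instead of `lower()` because it handles more non-ASCII case pairs, which matters for names written in Cyrillic or other scripts.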

Part-of-Speech Filtering

# Only nouns (default)
tg-wordcloud -i export.json -w nouns -o nouns_only

# Only adjectives
tg-wordcloud -i export.json -w adjectives -o descriptive_words

# Only verbs
tg-wordcloud -i export.json -w verbs -o action_words

# All word types
tg-wordcloud -i export.json -w all -o all_words
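
For readers curious how part-of-speech filtering can work with pymorphy3, here is a hedged sketch. The grouping of OpenCorpora tags into the CLI's word types is an assumption, `lemmas_of_type` is a hypothetical helper, and the analyzer is passed in so the function itself has no hard dependency.

```python
# OpenCorpora part-of-speech tags grouped by the CLI's word-type names.
# This grouping is an assumption; the tool's own mapping may differ.
POS_GROUPS = {
    "nouns": {"NOUN"},
    "adjectives": {"ADJF", "ADJS"},
    "verbs": {"VERB", "INFN"},
}
POS_GROUPS["all"] = set().union(*POS_GROUPS.values())


def lemmas_of_type(words, word_type, analyzer):
    """Return lemmas of the words whose most likely parse has a wanted POS."""
    wanted = POS_GROUPS[word_type]
    lemmas = []
    for word in words:
        best = analyzer.parse(word)[0]  # parses are sorted by likelihood
        if best.tag.POS in wanted:
            lemmas.append(best.normal_form)
    return lemmas
```

With pymorphy3 installed, pass `pymorphy3.MorphAnalyzer()` as the `analyzer` argument. Creating the analyzer is comparatively expensive, so build it once and reuse it.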

Custom Word Exclusion

Create a file exclude_words.txt with one word per line:

привет
спасибо
пожалуйста
хорошо

Then pass it with -e:

tg-wordcloud -i export.json -e exclude_words.txt -o filtered_words
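
The effect of an exclusion file can be sketched in a few lines. This is an illustration under assumptions (one word per line, case-insensitive comparison, hypothetical function names), not the tool's implementation.

```python
from collections import Counter


def load_exclusions(path):
    """Read one word per line, ignoring blank lines and case."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().casefold() for line in handle if line.strip()}


def drop_excluded(counts, excluded):
    """Return a new Counter without the excluded words."""
    return Counter({word: n for word, n in counts.items()
                    if word.casefold() not in excluded})
```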

Voice & Video Message Examples

# Analyze all content including voice and video messages
tg-wordcloud -i export.json --language ru -o complete_analysis

# Use a smaller/faster Whisper model
tg-wordcloud -i export.json --model-name tiny -o fast_transcription

# Use a more accurate Whisper model
tg-wordcloud -i export.json --model-name large-v3 -o accurate_transcription

# Only analyze voice messages (skip text and video)
tg-wordcloud -i export.json --no-text --no-video -o voice_only

# Skip transcription, only use cached results
tg-wordcloud -i export.json --skip-transcription -o cached_only

# Disable caching
tg-wordcloud -i export.json --no-cache -o no_cache

# Custom cache directory
tg-wordcloud -i export.json --cache-dir /path/to/cache -o custom_cache

Advanced Examples

# Analyze your messages from 2024, include all word types, with custom exclusions
tg-wordcloud -i export.json -a "Your Name" -p 2024 -w all -e custom_stops.txt -o analysis_2024

# Analyze voice messages from specific author with custom model
tg-wordcloud -i export.json -a "Friend Name" --model-name medium --language en -o friend_voice

# Debug mode for troubleshooting
tg-wordcloud -i export.json -o debug_output --log-level DEBUG

# Minimal output (only essential info)
tg-wordcloud -i export.json -o quiet_output --log-level WARNING

Output Files

The tool generates multiple output files:

Word Cloud Image (output.png)

  • High-resolution word cloud visualization
  • Automatic font selection for Russian text support
  • Customizable colors and layout

Frequency Table (output.csv)

  • Ranked list of words with frequencies
  • Includes rank, word, frequency, and percentage columns
  • Perfect for further analysis in Excel or other tools
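
For analysis outside a spreadsheet, the CSV can be read with Python's standard library. The sketch assumes the header row uses the column names `word` and `frequency`; check the first line of your file and adjust the keys if they differ.

```python
import csv


def top_words(csv_path, limit=10):
    """Return (word, frequency) pairs for the most frequent words."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    # Sort explicitly instead of relying on the file's row order.
    rows.sort(key=lambda row: int(row["frequency"]), reverse=True)
    return [(row["word"], int(row["frequency"])) for row in rows[:limit]]
```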

Detailed Report (output.txt)

  • Comprehensive statistics about your text
  • Top 50 most frequent words
  • Processing summary and metadata

Time Period Formats

The tool supports various time period formats:

| Format           | Example      | Description                |
|------------------|--------------|----------------------------|
| Year             | 2024         | All messages from 2024     |
| Year Range       | 2023-2024    | Messages from 2023 to 2024 |
| Month (YYYY-MM)  | 2024-01      | January 2024               |
| Month (Mon-YYYY) | Jan-2024     | January 2024               |
| Month (Full)     | January 2024 | January 2024               |
| Specific Date    | 2024-01-15   | January 15, 2024           |
| Relative         | this-year    | Current year               |
| Relative         | last-year    | Previous year              |
| Quarter          | Q1-2024      | First quarter of 2024      |
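
To make the formats concrete, the sketch below maps each one to an inclusive date range. It is an independent illustration: `parse_period` is a hypothetical function, English month names are assumed, and the tool's own parser may accept more or fewer variants.

```python
import calendar
import re
from datetime import date

# Accept both "jan" and "january" (English names assumed, default locale).
MONTHS = {name.lower(): i for i, name in enumerate(calendar.month_name) if name}
MONTHS.update({name.lower(): i for i, name in enumerate(calendar.month_abbr) if name})


def month_range(year, month):
    """First and last day of the given month."""
    return date(year, month, 1), date(year, month, calendar.monthrange(year, month)[1])


def parse_period(text, today=None):
    """Return an inclusive (start, end) date pair for a period string."""
    today = today or date.today()
    s = text.strip().lower()
    if s in ("this-year", "last-year"):
        year = today.year - 1 if s == "last-year" else today.year
        return date(year, 1, 1), date(year, 12, 31)
    if m := re.fullmatch(r"(\d{4})", s):
        return date(int(m[1]), 1, 1), date(int(m[1]), 12, 31)
    if m := re.fullmatch(r"(\d{4})-(\d{4})", s):
        return date(int(m[1]), 1, 1), date(int(m[2]), 12, 31)
    if m := re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", s):
        day = date(int(m[1]), int(m[2]), int(m[3]))
        return day, day
    if m := re.fullmatch(r"(\d{4})-(\d{2})", s):
        return month_range(int(m[1]), int(m[2]))
    if m := re.fullmatch(r"q([1-4])-(\d{4})", s):
        first = 3 * int(m[1]) - 2  # first month of the quarter
        return month_range(int(m[2]), first)[0], month_range(int(m[2]), first + 2)[1]
    if (m := re.fullmatch(r"([a-z]+)[- ](\d{4})", s)) and m[1] in MONTHS:
        return month_range(int(m[2]), MONTHS[m[1]])
    raise ValueError(f"unrecognized period: {text!r}")
```

Returning an inclusive pair keeps the comparison simple: a message belongs to the period when `start <= message_date <= end`.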

Performance Tips

Large Files

  • Files over 100MB automatically use the streaming parser
  • Memory usage stays low regardless of file size
  • Processing time scales linearly with file size
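
The idea behind streaming can be sketched with the standard library alone: read the file in chunks and decode one message object at a time instead of loading the whole document. This is not the tool's parser. It assumes a single-chat export whose first `"messages":` key introduces the message array, and `iter_messages` is a hypothetical name.

```python
import json


def iter_messages(handle, chunk_size=1 << 16):
    """Yield message objects one at a time from a single-chat export."""
    decoder = json.JSONDecoder()
    marker = '"messages":'
    buffer = ""
    # Phase 1: read until the start of the messages array is in the buffer.
    while True:
        pos = buffer.find(marker)
        if pos != -1:
            bracket = buffer.find("[", pos + len(marker))
            if bracket != -1:
                buffer = buffer[bracket + 1:]
                break
        chunk = handle.read(chunk_size)
        if not chunk:
            return  # no messages array found
        buffer += chunk
    # Phase 2: decode one object at a time, refilling the buffer as needed.
    while True:
        buffer = buffer.lstrip(" \t\r\n,")
        if buffer.startswith("]"):
            return  # end of the array
        try:
            message, end = decoder.raw_decode(buffer)
        except json.JSONDecodeError:
            chunk = handle.read(chunk_size)
            if not chunk:
                return  # truncated or malformed input
            buffer += chunk
            continue
        yield message
        buffer = buffer[end:]
```

Open the export in text mode (`open(path, encoding="utf-8")`) and loop over `iter_messages(handle)`. Only the current chunk and message stay in memory, which is what keeps usage flat as the file grows. A dedicated streaming JSON library would be faster than this retry-based approach.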

Optimization

  • Use specific time periods to reduce processing time
  • Filter by author early to reduce text processing load
  • Use --log-level WARNING for faster processing (less output)

Voice/Video Performance

  • Model Selection: Use --model-name tiny for fastest transcription, large-v3 for best accuracy
  • Caching: Transcription results are cached automatically - subsequent runs are much faster
  • Language Hints: Use --language ru or --language en to improve accuracy and speed
  • Memory: Whisper models require roughly 1-10GB RAM depending on model size (see the model table above)
  • GPU: CUDA-capable GPUs dramatically accelerate transcription (requires separate CUDA setup)
  • Processing Time: Expect ~10-30% of audio duration for transcription time (varies by model/hardware)
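
Caching of this kind is typically keyed on the audio content plus the transcription settings, so that changing the model or language does not return a stale result. The sketch below illustrates that idea. The file layout, key scheme, and function names are assumptions and do not describe the tool's actual cache format.

```python
import hashlib
import json
from pathlib import Path


def cache_key(media_path, model_name, language=None):
    """Hash the file contents together with the transcription settings."""
    digest = hashlib.sha256()
    with open(media_path, "rb") as handle:
        # Read in 1 MiB blocks so large media files are not loaded at once.
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    digest.update(f"|{model_name}|{language or ''}".encode("utf-8"))
    return digest.hexdigest()


def cached_transcript(cache_dir, key, transcribe):
    """Return the cached text for key, calling transcribe() only on a miss."""
    entry = Path(cache_dir) / f"{key}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["text"]
    text = transcribe()
    entry.parent.mkdir(parents=True, exist_ok=True)
    entry.write_text(json.dumps({"text": text}, ensure_ascii=False), encoding="utf-8")
    return text
```

Here `transcribe` is any zero-argument callable that returns the transcript, so the expensive model call runs only when no cache entry exists.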

Typical Performance

  • Text-only: 1000 messages/second
  • With transcription: 1-5 minutes of audio per minute of processing (model-dependent)
  • Cache hits: Instant processing of previously transcribed files

Troubleshooting

Common Issues

"No words found after filtering"

  • Check that the author name is spelled as it appears in the export (matching is case-insensitive)
  • Verify time period format
  • Try broader filters (e.g., -w all instead of -w nouns)
  • If using only voice/video, ensure transcription is working (check logs)

"File not found" errors

  • Use absolute paths for input files
  • Check file permissions
  • Verify JSON export is complete
  • For voice/video files, ensure the export includes media files

Missing Russian characters in word cloud

  • The tool automatically finds system fonts
  • On Linux, install fonts-dejavu package
  • On Windows, ensure Arial or similar fonts are available

Memory issues with very large files

  • The streaming parser should handle files up to 10GB+
  • If issues persist, try filtering by time period first
  • Close other memory-intensive applications

Voice/Video Transcription Issues

"faster-whisper is not installed"

pip install faster-whisper

"FFmpeg not found" or audio decoding errors

  • Install FFmpeg (see installation instructions above)
  • Verify FFmpeg is in PATH: ffmpeg -version
  • On Windows, restart terminal after FFmpeg installation

Transcription is slow

  • Use smaller models: --model-name tiny or --model-name base
  • Enable GPU acceleration (requires CUDA setup)
  • Process smaller time periods or specific authors

Empty transcriptions

  • Check if audio files exist in the export
  • Try different language hints: --language ru, --language en
  • Check log output with --log-level DEBUG

Cache issues

  • Clear cache: manually delete .transcription_cache directory
  • Use --no-cache to disable caching temporarily
  • Check disk space for cache directory

Debug Mode

tg-wordcloud -i export.json -o debug --log-level DEBUG

This provides detailed logging including:

  • File parsing progress
  • Filtering statistics
  • Text processing details
  • Error traces

Contributing

See CONTRIBUTING.md for contribution guidelines and expectations.

Development Setup

git clone https://github.com/brzvsk/telegram-wordcloud-cli.git
cd telegram-wordcloud-cli
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Run tests
pytest

# Type checking
mypy telegram_wordcloud_cli/

# Code formatting
black telegram_wordcloud_cli/ main.py
isort telegram_wordcloud_cli/ main.py

License

This project is licensed under the MIT License - see the LICENSE file for details.

Limitations

Voice & Video Transcription

  • Requires significant computational resources (CPU/RAM)
  • Transcription accuracy depends on audio quality and language
  • Supported formats: OGG, M4A, MP4, WAV (via FFmpeg)
  • Internet connection required for initial model download
  • Processing time scales with audio duration

General

  • Russian language processing optimized, other languages supported but may need tuning
  • Very large exports (>10GB) may require streaming mode and adequate disk space
  • Telegram must include media files in export (not just metadata)
