Work in Progress - This project is under active development. Some features may be incomplete or subject to change.
A Flask application to extract YouTube comments and perform topic modeling analysis.
- Search YouTube channels by handle (@channelname) or ID
- Multi-channel support: Extract multiple channels at once (comma-separated)
- Parallel extraction with configurable worker count (1 to 2x CPU cores)
- Queue system: Add multiple channels to a queue, processed sequentially
- Real-time progress bar with live updates
- Stop button to cancel extraction mid-process
- Skip already downloaded videos to resume interrupted extractions
- Progressive saving: each video saved individually (no data loss on interruption)
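The parallel, resumable extraction described in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `extract_channel` and the `fetch_comments` callable are hypothetical names, and the real downloader (yt-dlp, per the tech stack) is replaced by whatever function you pass in.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def extract_channel(video_ids, fetch_comments, videos_dir, max_workers=4):
    """Fetch comments for each video in parallel and write one JSON file per video.

    fetch_comments is a hypothetical callable: video_id -> list of comment dicts.
    Returns (saved_ids, skipped_ids).
    """
    videos_dir = Path(videos_dir)
    videos_dir.mkdir(parents=True, exist_ok=True)

    # Skip videos whose file already exists so an interrupted run can resume.
    skipped = [v for v in video_ids if (videos_dir / f"{v}.json").exists()]
    skipped_set = set(skipped)
    pending = [v for v in video_ids if v not in skipped_set]

    saved = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_comments, v): v for v in pending}
        for future in as_completed(futures):
            video_id = futures[future]
            comments = future.result()
            payload = {
                "video_id": video_id,
                "comment_count": len(comments),
                "comments": comments,
            }
            # Progressive saving: write each video as soon as its download finishes.
            target = videos_dir / f"{video_id}.json"
            target.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
            saved.append(video_id)
    return sorted(saved), skipped
```

Because every video is written as soon as it completes, stopping the run loses at most the videos still in flight, and a second call with the same folder only fetches what is missing.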
Each channel is saved in its own folder:
data/
  @ChannelName/
    info.json          # Channel metadata (subscribers, description, etc.)
    videos/
      <video_id>.json  # One file per video with comments
      <video_id>.json
      ...
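Given this layout, reading a channel back for analysis could look like the sketch below. The function name `load_channel_comments` is hypothetical; the `comments`, `text`, and `is_reply` keys are the ones used in this README's per-video JSON format.

```python
import json
from pathlib import Path


def load_channel_comments(channel_dir, include_replies=True):
    """Collect comment texts from every videos/<video_id>.json file of one channel."""
    texts = []
    for video_file in sorted(Path(channel_dir, "videos").glob("*.json")):
        video = json.loads(video_file.read_text(encoding="utf-8"))
        for comment in video.get("comments", []):
            # Optionally keep only top-level comments and drop replies.
            if include_replies or not comment.get("is_reply", False):
                texts.append(comment["text"])
    return texts
```

For example, `load_channel_comments("data/@ChannelName")` would return all comment texts of that channel as a flat list.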
- View all extracted channels
- Channel statistics (subscribers, videos, comments)
- Comments per video chart
- Comments timeline visualization
- Video list sorted by engagement
Complete pipeline for analyzing YouTube comments:
- Data Selection - Multi-channel selection with preview (total comments, languages detected, recommended topics)
- Preprocessing - Intelligent text cleaning:
  - Auto language detection (French/English)
  - spaCy lemmatization
  - Custom stopwords (including YouTube-specific terms)
  - Emoji and URL removal
- Algorithms - Choose from:
  - LDA (Latent Dirichlet Allocation) - Fast, probabilistic, good for <5k comments
  - NMF (Non-negative Matrix Factorization) - Balanced, deterministic, good for 1-10k comments
- Configurable Parameters:
  - Number of topics (2-20, with auto-recommendation)
  - N-gram range (unigrams, bigrams, or both)
  - Language processing mode (auto, French, English, mixed)
- Results Visualization:
  - Topic keywords with weights
  - Representative comments per topic
  - Topic distribution chart (Plotly)
  - Diversity score
- Real-time Progress - Live tracking of preprocessing, training, and finalization stages
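As an illustration of the emoji and URL removal step in the preprocessing stage, here is a small standard-library sketch. It is an assumption about how such cleaning might be done, not the contents of `nlp/preprocessing.py`; the name `clean_comment` and the chosen Unicode ranges are illustrative and do not cover every emoji.

```python
import re

# URLs starting with http(s):// or www.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# A few common emoji blocks (illustrative, not exhaustive).
EMOJI_RE = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)


def clean_comment(text):
    """Remove URLs and emoji, then collapse repeated whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Accented characters are left untouched, which matters for French comments: only the URL and emoji patterns are stripped before lemmatization and stopword filtering.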
Want to contribute? Fork the repository first, then follow the steps below.
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Download spaCy language models (required for topic modeling)
python -m spacy download fr_core_news_sm # French
python -m spacy download en_core_web_sm # English

Quick setup (Linux/Mac):
source .venv/bin/activate
./setup_modeling.sh # Automated installation script

python app.py

To use a different port:
python app.py --port 8080

Enter multiple channels separated by commas:
@MrBeast, @Fireship, @TechWithTim
All channels will be added to the queue and processed one after another.
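A comma-separated input like the one above could be turned into a processing queue along these lines. This is a sketch under assumptions: `parse_channels` is a hypothetical helper, and dropping empty entries and exact duplicates is a design choice of this example rather than documented behavior.

```python
from collections import deque


def parse_channels(raw):
    """Split a comma-separated string into an ordered list of handles or IDs."""
    seen = set()
    channels = []
    for part in raw.split(","):
        name = part.strip()
        # Ignore empty entries and exact duplicates, keep first-seen order.
        if name and name not in seen:
            seen.add(name)
            channels.append(name)
    return channels


# Channels are then taken from the left, one after another.
queue = deque(parse_channels("@MrBeast, @Fireship, @TechWithTim"))
```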
Use the slider to set the number of parallel workers (from 1 up to 2x your CPU cores). More workers speed up extraction but increase the risk of hitting YouTube rate limits.
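The worker range can be derived from the machine's core count. The helper names below are hypothetical; the sketch only shows the 1 to 2x CPU cores rule stated above.

```python
import os


def worker_bounds():
    """Return (minimum, maximum) worker counts: 1 up to twice the CPU cores."""
    cores = os.cpu_count() or 1  # cpu_count() may return None on some platforms
    return 1, 2 * cores


def clamp_workers(requested):
    """Force a requested worker count into the allowed range."""
    low, high = worker_bounds()
    return max(low, min(requested, high))
```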
- Extract Comments - Use the Extraction tab to download comments from YouTube channels
- Navigate to Modeling Tab - Click on "Modeling" in the sidebar
- Select Data:
  - Choose one or more channels from the dropdown
  - Click "Preview Data" to see statistics (total comments, languages, recommended topics)
- Configure Algorithm:
  - Choose LDA (fast, <5k comments) or NMF (balanced, 1-10k comments)
  - Adjust number of topics (auto-recommended based on comment count)
  - Select n-gram range (unigrams, bigrams, or both)
  - Choose language processing mode (auto-detect, French, English, or mixed)
- Start Modeling - Click "Start Modeling" and watch real-time progress
- Analyze Results:
  - View discovered topics with keywords and weights
  - Read representative comments for each topic
  - Explore topic distribution chart
  - Check diversity score (higher = more distinct topics)
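This README does not define how the diversity score is computed. One common definition, assumed here purely for illustration, is the share of unique words among the top keywords of all topics: 1.0 means no keyword is shared between topics, and lower values mean more overlap. The function name `topic_diversity` is hypothetical.

```python
def topic_diversity(topics, top_n=10):
    """Share of unique words among the top_n keywords of all topics (0.0 to 1.0).

    topics is a list of keyword lists, each ordered by decreasing weight.
    """
    top_words = [word for topic in topics for word in topic[:top_n]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Two topics sharing one keyword: 5 unique words out of 6.
score = topic_diversity([["python", "code", "tutorial"], ["python", "ai", "model"]])
```

If the application uses a different formula, the interpretation "higher = more distinct topics" still holds, but the absolute values will not be comparable with this sketch.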
Example: Analyzing @defendintelligence (11k comments):
- Select channel → Preview Data → Choose LDA → 5 topics → Start Modeling
- Wait ~30-60 seconds
- Results: Topics about "machine learning", "intelligence artificielle", "code python", etc.
topic-modeling-youtube/
├── app.py                    # Flask application with extraction & modeling routes
├── requirements.txt          # Python dependencies
├── README.md                 # Documentation
├── IMPLEMENTATION.md         # Topic modeling implementation guide
├── setup_modeling.sh         # Automated setup script
├── nlp/                      # NLP preprocessing modules
│   ├── language_detector.py  # Auto language detection (FR/EN)
│   ├── preprocessing.py      # Text cleaning, lemmatization, stopwords
│   └── stopwords.py          # Custom stopwords lists
├── modeling/                 # Topic modeling algorithms
│   ├── base_model.py         # Abstract base class
│   ├── lda_model.py          # LDA implementation
│   └── nmf_model.py          # NMF implementation
├── export/                   # Export utilities (planned)
│   └── (JSON/HTML exporters)
├── templates/
│   └── index.html            # Web interface (3 tabs: Extraction, Data, Modeling)
└── data/                     # Extracted data (per channel)
    └── @ChannelName/
        ├── info.json         # Channel metadata
        └── videos/
            └── *.json        # Individual video comments
{
"channel_name": "ChannelName",
"channel_id": "UCxxxxxx",
"channel_url": "https://www.youtube.com/channel/UCxxxxxx",
"description": "Channel description...",
"subscriber_count": 1500000,
"total_videos": 150,
"videos_extracted": 150,
"total_comments": 25000,
"last_updated": "2025-12-22T15:30:00"
}

{
"video_id": "abc123",
"title": "Video Title",
"url": "https://www.youtube.com/watch?v=abc123",
"comment_count": 500,
"comments": [
{
"author": "User1",
"author_id": "UC...",
"text": "Great video!",
"likes": 42,
"timestamp": 1703257800,
"parent": "root",
"is_reply": false
}
]
}

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Web interface |
| `/api/channel-info` | POST | Get channel info |
| `/api/scrape-comments` | POST | Queue channel(s) extraction |
| `/api/extraction-status` | GET | Get real-time extraction progress |
| `/api/stop-extraction` | POST | Stop current extraction |
| `/api/clear-queue` | POST | Clear completed queue items |
| `/api/system-info` | GET | Get CPU/worker info |
| `/api/files-stats` | GET | List channels with statistics |
| `/api/file-detail/<folder>` | GET | Get channel details |

| Endpoint | Method | Description |
|---|---|---|
| `/api/modeling/select-data` | POST | Preview data selection (comments count, languages) |
| `/api/modeling/run` | POST | Start topic modeling job |
| `/api/modeling/status/<job_id>` | GET | Get job progress and status |
| `/api/modeling/results/<job_id>` | GET | Get completed job results |
| `/api/modeling/jobs` | GET | List all modeling jobs |
| `/api/modeling/jobs/<job_id>` | DELETE | Delete a modeling job |

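
A client that waits for a modeling job might poll the status endpoint as in the sketch below. Only the URL path comes from the table above. The response shape is not documented in this README, so the `status` key and its `completed` / `failed` values are assumptions, as are the helper names and the base URL in the usage note.

```python
import json
import time
from urllib.request import urlopen


def fetch_status(base_url, job_id):
    """GET /api/modeling/status/<job_id> and decode the JSON body."""
    with urlopen(f"{base_url}/api/modeling/status/{job_id}") as response:
        return json.load(response)


def wait_for_job(job_id, fetch, interval=1.0, max_polls=600):
    """Poll until the job reports a terminal state, then return the last status.

    fetch is any callable job_id -> dict. The "status" key and the
    "completed"/"failed" values are assumed, not taken from the API.
    """
    for _ in range(max_polls):
        status = fetch(job_id)
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Against a locally running instance this could be called as `wait_for_job(job_id, lambda j: fetch_status("http://localhost:5000", j))`, assuming the app listens on port 5000. Passing the fetch function in keeps the polling loop testable without a server.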
- Backend: Flask, yt-dlp, ThreadPoolExecutor
- Frontend: HTML/CSS/JavaScript, Plotly.js
- Topic Modeling: scikit-learn (LDA, NMF), Gensim
- NLP: spaCy (lemmatization), langdetect (language detection)
- Data Processing: NumPy, Pandas
- Visualization: Plotly.js (interactive charts)
- Future: BERTopic, sentence-transformers, UMAP, t-SNE
- YouTube comment extraction
- Parallel extraction (configurable workers)
- Multi-channel queue system
- Real-time progress bar
- Stop/cancel extraction
- Skip already downloaded videos
- Per-video JSON storage
- Channel metadata (subscribers, description)
- Web interface with tabs
- Data insights dashboard
- NLP preprocessing pipeline (auto language detection FR/EN, spaCy lemmatization, custom stopwords)
- LDA/NMF implementation (scikit-learn, configurable parameters)
- Topic modeling UI (4-step workflow: data selection, configuration, progress, results)
- Real-time topic modeling progress tracking
- Basic visualization (topic distribution chart with Plotly)
- Topic analysis (keywords, representative comments, diversity score)
- Export functionality
  - JSON export with full results
  - HTML report generation with embedded visualizations
  - CSV export for topic assignments
- Advanced visualizations
  - Word clouds per topic
  - Document-topic heatmap
  - Topic timeline/trends over time
  - Inter-topic distance map (2D projection)
- BERTopic integration
  - Sentence transformer embeddings
  - Multilingual model support
  - Dynamic topic modeling
  - Hierarchical topic structure
- Advanced features
  - Topic labeling with GPT/LLM API
  - Sentiment analysis per topic
  - Comment search by topic
  - Topic evolution tracking
  - Compare topics across channels
- Performance improvements
  - Results caching and persistence
  - Incremental topic modeling
  - GPU acceleration for large datasets
- UI enhancements
  - Topic renaming/merging
  - Interactive topic exploration
  - Filter comments by topic
  - Topic comparison view
MIT