Transform your D&D session recordings into searchable, organized transcripts with automatic speaker identification and in-character/out-of-character separation.
```bash
# Install dependencies
pip install -r requirements.txt

# Start web interface
python app.py

# Or use CLI
python cli.py process your_session.m4a
```

See SETUP.md for detailed installation instructions.
- 💾 Resumable Processing (Checkpoints): Automatically saves progress after each major pipeline stage, allowing you to resume processing from where it left off if interrupted. Essential for long-running sessions.
- 🎤 Multi-Speaker Diarization: Automatically identify who is speaking
- 🗣️ Dutch Language Support: Optimized for Dutch D&D sessions
- 🎭 IC/OOC Classification: Separate in-character dialogue from meta-discussion
- 📊 Multiple Output Formats: Plain text, IC-only, OOC-only, and JSON
- 🎯 Party Configuration: Save and reuse character/player setups
- 👤 Character Profiles: Track character development, actions, inventory, and relationships
- 📚 Campaign Knowledge Base: Automatic extraction of quests, NPCs, locations, and plot hooks
- 📋 Campaign Dashboard: Visual health check of all campaign components
- 📝 Import Session Notes: Backfill early sessions from written notes
- 💬 LLM Chat: Interact directly with the local LLM and role-play as characters
- 📖 Session Notebooks: Transform transcripts into narrative perspectives (narrator + character POV)
- 💰 Zero Budget Compatible: 100% free with local models
- ⚡ Fast Processing: Optional cloud APIs for speed
- 🔄 Learning System: Speaker profiles improve over time
- 🖥️ Dual Interface: Modern web UI and powerful CLI
The Gradio web UI has been completely redesigned with a modern, streamlined interface:
- 5 Consolidated Tabs (down from 16!) - No more hidden overflow menus
- Visual Workflow Stepper - Clear step-by-step guidance for processing sessions
- Progressive Disclosure - Advanced options hidden in collapsible sections until needed
- Modern Design - Clean, professional aesthetic with Indigo/Cyan color palette
- Card-Based Layouts - Beautiful visual organization for characters, sessions, and knowledge
The 5 Main Sections:
- 🎬 Process Session - Upload and process your recordings with clear workflow
- 📚 Campaign - Dashboard, knowledge base, session library, and party management
- 👥 Characters - Visual character browser with auto-extraction tools
- 📖 Stories & Output - View transcripts, stories, and export content
- ⚙️ Settings & Tools - Configuration, diagnostics, logs, and advanced features
- 4-hour D&D session recorded on Google Recorder (M4A format)
- Single room microphone (not ideally placed)
- 4 speakers: 3 players + 1 DM
- All in Dutch
Each session creates a timestamped folder in the output directory:
```
output/
└── YYYYMMDD_HHMMSS_session_id/
    ├── session_id_full.txt       # Everything with speaker labels and IC/OOC markers
    ├── session_id_ic_only.txt    # Game narrative only (perfect for session notes!)
    ├── session_id_ooc_only.txt   # Banter and meta-discussion
    ├── session_id_data.json      # Complete data for further processing
    ├── session_id_full.srt       # Full transcript as subtitles
    ├── session_id_ic_only.srt    # IC-only as subtitles
    └── session_id_ooc_only.srt   # OOC-only as subtitles
```
Optional: Enable audio snippet export to save per-segment WAV clips plus a manifest.json in a segments/ subdirectory.
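If you're curious what that export involves, here is a minimal sketch with `pydub` (the paths, segment fields, and helper name are illustrative, not the project's actual API). Note that pydub loads the entire WAV into memory, which is why long sessions need the RAM headroom mentioned in the challenges table below:

```python
import json
from pathlib import Path

from pydub import AudioSegment  # pip install pydub

def export_snippets(wav_path: str, segments: list[dict], out_dir: str = "segments") -> None:
    """Cut one WAV clip per transcript segment and write a manifest.json."""
    audio = AudioSegment.from_wav(wav_path)  # loads the whole file into memory
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    manifest = []
    for i, seg in enumerate(segments):  # seg: {"start": sec, "end": sec, "speaker": ...}
        clip = audio[int(seg["start"] * 1000):int(seg["end"] * 1000)]  # pydub slices in ms
        name = f"segment_{i:04d}.wav"
        clip.export(str(out / name), format="wav")
        manifest.append({"file": name, **seg})
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
```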
```
[00:15:23] DM (IC): Je betreedt een donkere grot. De muren druipen van het vocht.
[00:15:45] Alice as Thorin (IC): Ik steek mijn fakkel aan en kijk om me heen.
[00:16:02] Bob (OOC): Haha, alweer een grot! Hoeveel grotten zijn dit nu al?
[00:16:30] DM (IC): Je ziet in het licht van de fakkel oude runen op de muur.
```
- SETUP.md - Installation and configuration
- FIRST_TIME_SETUP.md - Quick setup guide for new users
- USAGE.md - Detailed usage guide with examples
- QUICKREF.md - One-page command reference
- PARTY_CONFIG.md - Party configuration system guide
- CHARACTER_PROFILES.md - Character profiling and overview generation
- CAMPAIGN_KNOWLEDGE_BASE.md - Automatic campaign knowledge extraction
- CAMPAIGN_DASHBOARD.md - Campaign health check and overview
- SESSION_NOTEBOOK.md - Session notebook and perspective transformations
- DEVELOPMENT.md - Technical implementation details
```
M4A Recording
    ↓
Audio Conversion (16kHz mono WAV)
    ↓
Smart Chunking (10-min chunks with 10s overlap)
    ↓
Transcription (Whisper - Dutch optimized)
    ↓
Overlap Merging (LCS algorithm)
    ↓
Speaker Diarization (PyAnnote.audio)
    ↓
IC/OOC Classification (Ollama + Llama 3.1)
    ↓
Output Formats (TXT + SRT + JSON)
```
| Component | Technology | Why |
|---|---|---|
| Audio Conversion | FFmpeg | Universal format support |
| Voice Detection | Silero VAD | Best free VAD |
| Transcription | faster-whisper / Groq / OpenAI | Local or cloud options |
| Diarization | PyAnnote.audio 3.1 | State-of-the-art |
| Classification | Ollama (Llama 3.1) | Free, local, Dutch support |
| UI | Gradio + Click + Rich | User-friendly interfaces |
A production-ready system for processing long-form Dutch D&D session recordings with intelligent speaker diarization, character distinction, and in-character/out-of-character (IC/OOC) content separation.
- Source: Google Recorder app (M4A format, convertible to WAV)
- Duration: ~4 hours per session
- Audio Quality: Single room microphone, not ideally placed
- Language: Dutch
- Speakers: 3 players + 1 DM (+ rare passersby)
- Multi-layered Speaker Identity
  - Each player has their own voice (persona)
  - Each player voices their character(s) (in-character)
  - DM voices themselves + multiple NPCs
  - Need to distinguish: Player A → Character X vs Player A → OOC
- Content Classification
  - In-character dialogue (game narrative)
  - Out-of-character banter (meta-discussion, jokes, breaks)
  - DM narration vs DM as NPC
- Technical Constraints
  - Zero budget (free APIs/local models only)
  - Dutch language support required
  - Single-mic recording (overlapping speech possible)
M4A Input → Audio Conversion → VAD Segmentation → Chunks
Tools:
- FFmpeg: Convert M4A to WAV (free, local)
- Silero VAD (Voice Activity Detection): Detect speech segments and silence
- PyAnnote.audio or Resemblyzer: Create speaker embeddings
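A minimal sketch of the conversion and VAD steps, assuming FFmpeg is on the PATH and Silero VAD is loaded via `torch.hub` (filenames are illustrative):

```python
import subprocess

import torch

# Convert M4A to 16 kHz mono WAV (-ar = sample rate, -ac = channel count)
subprocess.run(
    ["ffmpeg", "-i", "session.m4a", "-ar", "16000", "-ac", "1", "session_16k.wav"],
    check=True,
)

# Detect speech regions with Silero VAD
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("session_16k.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000)
# Gaps between consecutive speech regions are candidate chunk boundaries
print(speech[:3])  # [{'start': ..., 'end': ...}, ...] in samples
```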
Chunking Strategy (based on research & best practices):
- Hybrid Approach: Combine silence detection with fixed-length chunking
- Primary: Use VAD to detect natural speech pauses
- Fallback: 10-minute (600 second) maximum chunks if no suitable pause found
- Overlap: 10-second overlap between chunks to prevent word splitting
- Only 1.67% overhead with 10-min chunks
- Prevents context loss at boundaries
- Audio Format: Convert to 16kHz mono WAV/FLAC for optimal Whisper performance
- Merge Strategy: Use longest common subsequence (LCS) algorithm to merge overlapping transcriptions without duplicates
Why This Works:
- Whisper was trained on 30s segments, but longer chunks (up to 10 min) provide better context
- Groq API and local Whisper both handle longer chunks well
- Overlap prevents cutting words mid-utterance
- Natural pauses (silence) create better semantic boundaries for D&D sessions
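As a sketch of the boundary logic under these rules (names are illustrative; `silences` would come from the VAD step above, as midpoints of detected pauses in seconds):

```python
def plan_chunks(duration_s: float, silences: list[float],
                max_len: float = 600.0, overlap: float = 10.0) -> list[tuple[float, float]]:
    """Plan (start, end) chunk boundaries: prefer a natural pause, cap at max_len."""
    chunks, start = [], 0.0
    while start < duration_s:
        hard_end = min(start + max_len, duration_s)
        # Prefer the latest silence in the second half of the window over a hard cut
        pauses = [s for s in silences if start + max_len / 2 < s <= hard_end]
        end = max(pauses) if pauses else hard_end
        chunks.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap  # back up 10s so no word is split at a boundary
    return chunks

# A 4-hour session (14400s) yields ~25 chunks; overlap overhead is 10/600 ≈ 1.67%
print(len(plan_chunks(14400, silences=[])))
```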
Audio Chunks → Speaker Embeddings → Clustering → Speaker Labels
Tools:
- PyAnnote.audio (free, local): State-of-the-art speaker diarization
- Pre-trained models available
- Can be fine-tuned with speaker samples
- Alternative: Resemblyzer + UMAP/HDBSCAN clustering
Process:
- Extract speaker embeddings from audio
- Cluster embeddings to identify unique speakers
- Build speaker profiles over time (learning across sessions)
- Manual labeling in first session → auto-labeling in subsequent sessions
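The core PyAnnote call is compact; here is a usage sketch (the 3.1 pipeline is a gated Hugging Face model, so an access token is required, shown as a placeholder):

```python
from pyannote.audio import Pipeline

# Load the pretrained diarization pipeline (gated model: needs an HF token)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # placeholder; substitute your own token
)

diarization = pipeline("session_16k.wav", num_speakers=4)  # 3 players + 1 DM

# Each turn carries a start/end time and an anonymous label (SPEAKER_00, ...)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.1f}s - {turn.end:7.1f}s  {speaker}")
```

The labels are anonymous (`SPEAKER_00`, …); mapping them to player names is exactly what the manual labeling in the first session and the cross-session speaker profiles handle.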
Speaker-labeled Chunks → Whisper API/Local → Dutch Transcription
Tools (Multiple Options):
- Local Whisper (Recommended for zero budget):
  - `faster-whisper`: Optimized version, 4x faster than the original
  - Model: `large-v3` for best Dutch accuracy
  - Fully free, runs on your hardware
  - Supports `verbose_json` for detailed timestamps
- Groq API (Fast & free cloud option):
  - Uses Whisper models with hardware acceleration
  - Much faster than local processing
  - Free tier: significant daily allowance
  - Good for testing/prototyping
- OpenAI Whisper API (Official cloud option):
  - Official OpenAI implementation (whisper-1 model)
  - High quality, reliable results
  - Pay-per-use pricing
  - Excellent Dutch support
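For the local route, the `faster-whisper` call is straightforward; a minimal sketch (model size, device, and the VAD filter are choices, not requirements):

```python
from faster_whisper import WhisperModel

# large-v3 gives the best Dutch accuracy; pick a smaller model or CPU if needed
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "chunk_0000.wav",
    language="nl",          # skip language detection, force Dutch
    word_timestamps=True,   # word-level timing for diarization alignment
    vad_filter=True,        # drop long silences inside the chunk
)

for seg in segments:
    print(f"[{seg.start:8.2f} -> {seg.end:8.2f}] {seg.text}")
```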
Process:
- Transcribe each chunk with the `language="nl"` parameter for faster/better results
- Use the `verbose_json` response format for detailed timestamps and word-level data
- Implement retry logic for API rate limiting (if using Groq/OpenAI)
- Merge overlapping chunk transcriptions using LCS alignment
- Associate speaker labels from diarization with transcribed segments
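A minimal sketch of the merge idea using `difflib` — strictly a longest common word run rather than a full LCS, which is usually enough when the 10s overlap transcribes near-identically (names illustrative; a production version would also reconcile timestamps):

```python
from difflib import SequenceMatcher

def merge_overlap(prev_words: list[str], next_words: list[str],
                  window: int = 40) -> list[str]:
    """Splice two chunk transcripts at their longest common word run."""
    tail, head = prev_words[-window:], next_words[:window]
    m = SequenceMatcher(a=tail, b=head).find_longest_match(0, len(tail), 0, len(head))
    if m.size < 3:  # no reliable overlap found; fall back to plain concatenation
        return prev_words + next_words
    # Keep the previous chunk up to the match, then continue from the match onward
    cut = len(prev_words) - len(tail) + m.a
    return prev_words[:cut] + next_words[m.b:]

a = "ik steek mijn fakkel aan en loop de grot in".split()
b = "loop de grot in en kijk om me heen".split()
print(" ".join(merge_overlap(a, b)))  # ... fakkel aan en loop de grot in en kijk om me heen
```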
Transcribed Text → LLM Analysis → IC/OOC Classification + Character Attribution
Tools:
- Ollama (free, local LLM): Run GPT-OSS 20B by default (fall back to Llama/Qwen if hardware-limited)
- Alternative: GPT-4o-mini API (has free tier)
Process:
- Semantic Analysis of transcribed text (no audio cues available):
  - IC Indicators: Narrative language, character actions ("I do X"), in-world dialogue context, fantasy/game vocabulary
  - OOC Indicators: Meta-discussion about rules/mechanics, real-world topics (food, bathroom breaks), game strategy discussion, laughter/jokes about the game itself
  - Context clues: Character names being used vs player names, present-tense action vs past-tense discussion
- LLM Prompt Strategy:
  - Use a context window of surrounding segments (not just individual sentences)
  - Provide character names and player names as reference
  - Ask the LLM to classify with confidence scores
  - Use few-shot examples from manually labeled early sessions
- Classification Output for each segment:
  - Speaker: [Player Name | DM]
  - Character: [Character Name | NPC Name | OOC | Narration]
  - Type: [IC | OOC | MIXED]
  - Confidence: 0.0-1.0
- Iterative Learning:
  - Build character voice profiles over sessions
  - Learn common OOC patterns specific to your group
  - User can manually correct classifications to improve future sessions
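Tying the prompt strategy together, a classification call through the `ollama` Python client might look like the sketch below (the prompt wording and JSON schema are illustrative; the model won't always emit clean JSON, so real code needs validation and a retry path):

```python
import json

import ollama  # pip install ollama

PROMPT = """Context: D&D session in Dutch with 3 players + 1 DM.
Characters: {characters}. Players: {players}.

Previous segment: "{prev}"
Current segment: "{cur}"
Next segment: "{nxt}"

Classify the CURRENT segment. Reply with JSON only:
{{"type": "IC|OOC|MIXED", "character": "<name or null>", "confidence": 0.0}}"""

def classify(prev: str, cur: str, nxt: str,
             characters: list[str], players: list[str]) -> dict:
    response = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": PROMPT.format(
            characters=", ".join(characters), players=", ".join(players),
            prev=prev, cur=cur, nxt=nxt)}],
    )
    return json.loads(response["message"]["content"])  # validate in real code
```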
Classified Segments → Formatting → Multiple Output Formats
Output Formats:
- Full Transcript (plain text with markers)

```
[00:15:23] DM (Narration): Je betreedt de donkere grot...
[00:15:45] Player1 as CharacterA (IC): Ik steek mijn fakkel aan.
[00:16:02] Player2 (OOC): Haha, weer een grot!
```
- IC-Only Transcript (game narrative only)

```
[00:15:23] DM: Je betreedt de donkere grot...
[00:15:45] CharacterA: Ik steek mijn fakkel aan.
```
- Structured JSON (for further processing)

```json
{
  "segments": [
    {
      "timestamp": "00:15:23",
      "speaker": "DM",
      "character": null,
      "type": "narration",
      "text": "Je betreedt de donkere grot..."
    }
  ]
}
```

- Python 3.10+: Main language
- FFmpeg: Audio conversion
- PyTorch: ML framework for models
- `pyannote.audio`: Speaker diarization
- `faster-whisper`: Optimized Whisper transcription (recommended)
- `pydub`: Audio chunking and manipulation
- `silero-vad`: Voice activity detection for smart chunking
- `groq`: Optional API client for faster transcription
- `ollama-python`: Local LLM for IC/OOC classification
- `numpy`, `scipy`: Audio processing utilities
- Gradio: Simple web UI for Python
- Streamlit: Alternative with more customization
- Or Electron + Python backend: For desktop app
- ✅ Audio conversion (M4A → WAV)
- ✅ Basic chunking by silence detection
- ✅ Whisper transcription with timestamps
- ✅ Simple speaker diarization (PyAnnote)
- ✅ Plain text output with speaker labels
- Speaker profile learning across sessions
- IC/OOC classification using LLM
- Character attribution
- Multiple output formats
- Web UI for processing and review
- Manual correction interface
- Speaker profile management
- Batch processing multiple sessions
- Character voice profile refinement
- Automatic OOC filtering improvement
- Session summary generation
- Search functionality across transcripts
- Speaker Identification: >85% accuracy
- Character Attribution: >80% correct IC/OOC classification
- Processing Time: <1x real-time (4hr audio processed in <4hrs)
- Usability: Non-technical user can process sessions with <5 min setup
| Challenge | Impact | Mitigation |
|---|---|---|
| Single mic = overlapping speech | Harder to separate speakers | Use stricter VAD, accept some loss |
| DM voices multiple NPCs | Confusion in speaker ID | Use contextual LLM analysis |
| Dutch language support | Fewer pre-trained models | Whisper has excellent Dutch support |
| Zero budget | Limited API calls | Prioritize local models (Whisper, Ollama) |
| Voice similarity (same person, different characters) | Character attribution errors | Learn character patterns over time |
| Long-session audio exports | Per-segment clipping loads full WAV (~450 MB for 4hr) | Recommend 16 GB RAM (documented) or process sessions in smaller blocks |
These guides informed the chunking and transcription approach:
- Groq Community: Chunking Audio for Whisper - 10-min chunks with overlap strategy
- Murray Cole: Whisper Audio to Text - Practical implementation patterns
- Set up development environment (Python 3.10+, FFmpeg, dependencies)
- Implement audio conversion (M4A → 16kHz mono WAV)
- Implement hybrid chunking (VAD + 10-min max with 10s overlap)
- Test Whisper transcription quality on sample chunk
- Implement LCS merge algorithm for overlapping transcriptions
- Evaluate PyAnnote diarization on sample
- Build MVP command-line tool
- Iterate on IC/OOC classification
Challenge: There are no explicit audio or verbal cues when conversation shifts between in-character and out-of-character.
Solution: Post-transcription semantic analysis using LLM reasoning:
```
Context: D&D session in Dutch with 3 players + 1 DM
Characters: [CharacterA, CharacterB, CharacterC]
Players: [Player1, Player2, Player3, DM]

Analyze this segment and classify as IC (in-character) or OOC (out-of-character):

Previous segment: "Ik steek mijn fakkel aan en loop de grot in."
Current segment: "Wacht, moet ik daar een perception check voor doen?"
Next segment: "Ja, gooi maar een d20."

Classification: OOC (discussing game mechanics)
Confidence: 0.95
Reason: Discussion about dice rolls and game rules
```
This approach relies on the LLM understanding:
- D&D game context and terminology
- Narrative vs meta-discussion patterns
- Dutch language nuances
✅ COMPLETE - Full production system implemented!
All phases completed:
- ✅ Audio conversion and chunking
- ✅ Multi-backend transcription (local + Groq API)
- ✅ Overlap merging with LCS algorithm
- ✅ Speaker diarization with PyAnnote
- ✅ IC/OOC classification with Ollama
- ✅ Multiple output formats
- ✅ Web UI (Gradio)
- ✅ CLI (Click + Rich)
- ✅ Complete documentation
- Session Notes: Automatic IC-only transcripts for campaign journal
- Quote Mining: Search for memorable moments and quotes
- Analysis: Track character speaking time and participation
- Accessibility: Make sessions accessible to deaf/hard-of-hearing players
- Recap Creation: Quick session recaps from IC-only output
- Rules Reference: Find when specific rules were discussed (OOC-only)
Implemented features (see documentation for details):
- Session Notebooks: Transform IC transcripts into narrative formats
- Character first-person POV
- Third-person narrator style
- Character journal/diary entries
- CLI and Python API
- Campaign Knowledge Base: Automatic extraction and tracking of campaign entities
- Campaign Dashboard: Visual health check of campaign configuration
- Import Session Notes: Backfill early sessions from written notes
Planned features:
- SRT subtitle export for video overlay
- Automatic session summary generation
- Character emotion/tone detection
- Combat encounter extraction
- Multi-session search and analysis
- Voice cloning for TTS playback
- Multi-session chronicle compilation
- Custom narrative style templates
This is a personal project, but suggestions and improvements are welcome!
This project is provided as-is for personal use. See individual library licenses for dependencies.
- OpenAI Whisper team for excellent multilingual transcription
- PyAnnote.audio team for state-of-the-art diarization
- Ollama team for making local LLMs accessible
- Research from Groq community and Murray Cole for chunking strategies
Built with love for D&D players who want to preserve their campaign memories! 🎲✨
- Diagnostics tab (`app.py`): Run `Discover Tests` to list collected pytest node IDs and execute targeted cases or the full suite directly inside the Gradio UI.
- Session Manager dashboard: The landing page now auto-refreshes every 2s, shows the exact options provided for the run (party config, skip flags, output paths), and tracks per-stage start/end times with live progress indicators.
- Event timeline: Recent status entries highlight when each pipeline stage finished and which stage starts next so you can monitor long-running jobs without tailing logs.
- Load the campaign notebook via Document Viewer (Google Doc export) so the LLM can weave in contextual cues.
- Open Story Notebooks to pick a processed session, then generate the narrator summary and per-character POVs.
- Each run saves Markdown files under the session's `narratives/` folder, making it easy to iterate on edits or share drafts.
- Dashboard idle state: When `app.py` isn't listening, the manager now shows an idle summary instead of stale pipeline data, so you always know when a run is actually active.