🎲 D&D Session Transcription & Diarization System

Transform your D&D session recordings into searchable, organized transcripts with automatic speaker identification and in-character/out-of-character separation.

🚀 Quick Start

# Install dependencies
pip install -r requirements.txt

# Start web interface
python app.py

# Or use CLI
python cli.py process your_session.m4a

See SETUP.md for detailed installation instructions.

✨ Features

  • 💾 Resumable Processing (Checkpoints): Automatically saves progress after each major pipeline stage, allowing you to resume processing from where it left off if interrupted. Essential for long-running sessions.
  • 🎤 Multi-Speaker Diarization: Automatically identify who is speaking
  • 🗣️ Dutch Language Support: Optimized for Dutch D&D sessions
  • 🎭 IC/OOC Classification: Separate in-character dialogue from meta-discussion
  • 📊 Multiple Output Formats: Plain text, IC-only, OOC-only, and JSON
  • 🎯 Party Configuration: Save and reuse character/player setups
  • 👤 Character Profiles: Track character development, actions, inventory, and relationships
  • 📚 Campaign Knowledge Base: Automatic extraction of quests, NPCs, locations, and plot hooks
  • 📋 Campaign Dashboard: Visual health check of all campaign components
  • 📝 Import Session Notes: Backfill early sessions from written notes
  • 💬 LLM Chat: Interact directly with the local LLM and role-play as characters
  • 📖 Session Notebooks: Transform transcripts into narrative perspectives (narrator + character POV)
  • 💰 Zero Budget Compatible: 100% free with local models
  • ⚡ Fast Processing: Optional cloud APIs for speed
  • 🔄 Learning System: Speaker profiles improve over time
  • 🖥️ Dual Interface: Modern web UI and powerful CLI

🎨 Modern Web Interface

The Gradio web UI has been completely redesigned with a modern, streamlined interface:

  • 5 Consolidated Tabs (down from 16!) - No more hidden overflow menus
  • Visual Workflow Stepper - Clear step-by-step guidance for processing sessions
  • Progressive Disclosure - Advanced options hidden in collapsible sections until needed
  • Modern Design - Clean, professional aesthetic with Indigo/Cyan color palette
  • Card-Based Layouts - Beautiful visual organization for characters, sessions, and knowledge

The 5 Main Sections:

  1. 🎬 Process Session - Upload and process your recordings with clear workflow
  2. 📚 Campaign - Dashboard, knowledge base, session library, and party management
  3. 👥 Characters - Visual character browser with auto-extraction tools
  4. 📖 Stories & Output - View transcripts, stories, and export content
  5. ⚙️ Settings & Tools - Configuration, diagnostics, logs, and advanced features

📋 What It Does

Input

  • 4-hour D&D session recorded on Google Recorder (M4A format)
  • Single room microphone (not ideally placed)
  • 4 speakers: 3 players + 1 DM
  • All in Dutch

Output

Each session creates a timestamped folder in the output directory:

output/
  └── YYYYMMDD_HHMMSS_session_id/
      ├── session_id_full.txt       # Everything with speaker labels and IC/OOC markers
      ├── session_id_ic_only.txt    # Game narrative only (perfect for session notes!)
      ├── session_id_ooc_only.txt   # Banter and meta-discussion
      ├── session_id_data.json      # Complete data for further processing
      ├── session_id_full.srt       # Full transcript as subtitles
      ├── session_id_ic_only.srt    # IC-only as subtitles
      └── session_id_ooc_only.srt   # OOC-only as subtitles

Optional: Enable audio snippet export to save per-segment WAV clips plus a manifest.json in a segments/ subdirectory.

Example Output

[00:15:23] DM (IC): Je betreedt een donkere grot. De muren druipen van het vocht.
[00:15:45] Alice as Thorin (IC): Ik steek mijn fakkel aan en kijk om me heen.
[00:16:02] Bob (OOC): Haha, alweer een grot! Hoeveel grotten zijn dit nu al?
[00:16:30] DM (IC): Je ziet in het licht van de fakkel oude runen op de muur.

🏗️ Architecture Overview

M4A Recording
    ↓
Audio Conversion (16kHz mono WAV)
    ↓
Smart Chunking (10-min chunks with 10s overlap)
    ↓
Transcription (Whisper - Dutch optimized)
    ↓
Overlap Merging (LCS algorithm)
    ↓
Speaker Diarization (PyAnnote.audio)
    ↓
IC/OOC Classification (Ollama + Llama 3.1)
    ↓
Output Generation (TXT + SRT + JSON)

🛠️ Technology Stack

Component        | Technology                     | Why
Audio Conversion | FFmpeg                         | Universal format support
Voice Detection  | Silero VAD                     | Best free VAD
Transcription    | faster-whisper / Groq / OpenAI | Local or cloud options
Diarization      | PyAnnote.audio 3.1             | State-of-the-art
Classification   | Ollama (Llama 3.1)             | Free, local, Dutch support
UI               | Gradio + Click + Rich          | User-friendly interfaces

📦 Project Overview

A production-ready system for processing long-form Dutch D&D session recordings with intelligent speaker diarization, character distinction, and in-character/out-of-character (IC/OOC) content separation.

The Challenge

Input Characteristics

  • Source: Google Recorder app (M4A format, convertible to WAV)
  • Duration: ~4 hours per session
  • Audio Quality: Single room microphone, not ideally placed
  • Language: Dutch
  • Speakers: 3 players + 1 DM (+ rare passersby)

Complexity Factors

  1. Multi-layered Speaker Identity

    • Each player has their own voice (persona)
    • Each player voices their character(s) (in-character)
    • DM voices themselves + multiple NPCs
    • Need to distinguish: Player A → Character X vs Player A → OOC
  2. Content Classification

    • In-character dialogue (game narrative)
    • Out-of-character banter (meta-discussion, jokes, breaks)
    • DM narration vs DM as NPC
  3. Technical Constraints

    • Zero budget (free APIs/local models only)
    • Dutch language support required
    • Single-mic recording (overlapping speech possible)

Proposed Architecture

Phase 1: Audio Processing & Chunking

M4A Input → Audio Conversion → VAD Segmentation → Chunks

Tools:

  • FFmpeg: Convert M4A to WAV (free, local)
  • Silero VAD (Voice Activity Detection): Detect speech segments and silence
  • PyAnnote.audio or Resemblyzer: Create speaker embeddings
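
The conversion step is a single FFmpeg invocation. A minimal sketch via Python's subprocess (assumptions: ffmpeg is on PATH, filenames are placeholders):

# Sketch: convert M4A to 16 kHz mono WAV for Whisper.
import subprocess

subprocess.run(
    ["ffmpeg", "-i", "session.m4a",
     "-ar", "16000",   # resample to 16 kHz
     "-ac", "1",       # downmix to mono
     "session.wav"],
    check=True,
)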

Chunking Strategy (based on research and best practices; a code sketch follows the rationale below):

  • Hybrid Approach: Combine silence detection with fixed-length chunking
    • Primary: Use VAD to detect natural speech pauses
    • Fallback: 10-minute (600 second) maximum chunks if no suitable pause found
  • Overlap: 10-second overlap between chunks to prevent word splitting
    • Only 1.67% overhead with 10-min chunks
    • Prevents context loss at boundaries
  • Audio Format: Convert to 16kHz mono WAV/FLAC for optimal Whisper performance
  • Merge Strategy: Use longest common subsequence (LCS) algorithm to merge overlapping transcriptions without duplicates

Why This Works:

  • Whisper was trained on 30s segments, but longer chunks (up to 10 min) provide better context
  • Groq API and local Whisper both handle longer chunks well
  • Overlap prevents cutting words mid-utterance
  • Natural pauses (silence) create better semantic boundaries for D&D sessions
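
A minimal sketch of the fixed-length half of the hybrid strategy, using pydub; VAD-based pause snapping (e.g. via silero-vad's speech timestamps) is omitted here, and the helper name is illustrative:

# Sketch: split a WAV into 10-minute chunks with 10-second overlap.
import os
from pydub import AudioSegment

CHUNK_MS = 10 * 60 * 1000   # 10-minute maximum chunk
OVERLAP_MS = 10 * 1000      # 10-second overlap between neighbours

def chunk_audio(path, out_dir="chunks"):
    os.makedirs(out_dir, exist_ok=True)
    audio = AudioSegment.from_file(path)
    paths, start = [], 0
    while start < len(audio):
        end = min(start + CHUNK_MS, len(audio))
        out = os.path.join(out_dir, f"chunk_{start // 1000:06d}.wav")
        audio[start:end].export(out, format="wav")
        paths.append(out)
        if end == len(audio):
            break
        start = end - OVERLAP_MS  # step back so consecutive chunks overlap
    return paths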

Phase 2: Speaker Diarization

Audio Chunks → Speaker Embeddings → Clustering → Speaker Labels

Tools:

  • PyAnnote.audio (free, local): State-of-the-art speaker diarization
    • Pre-trained models available
    • Can be fine-tuned with speaker samples
  • Alternative: Resemblyzer + UMAP/HDBSCAN clustering

Process:

  1. Extract speaker embeddings from audio
  2. Cluster embeddings to identify unique speakers
  3. Build speaker profiles over time (learning across sessions)
  4. Manual labeling in first session → auto-labeling in subsequent sessions
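
With pyannote.audio 3.1, the core of this phase can look like the sketch below (not the project's exact code; the pretrained pipeline requires accepting its terms on Hugging Face, and reading the token from an HF_TOKEN environment variable is an assumption):

# Sketch: diarize a session with pyannote.audio 3.1.
import os
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],  # assumption: token in env var
)
diarization = pipeline("session.wav", num_speakers=4)  # 3 players + 1 DM
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.1f}s - {turn.end:7.1f}s  {speaker}")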

Phase 3: Transcription

Speaker-labeled Chunks → Whisper API/Local → Dutch Transcription

Tools (Multiple Options):

  1. Local Whisper (Recommended for zero budget):

    • faster-whisper: Optimized version, 4x faster than original
    • Model: large-v3 for best Dutch accuracy
    • Fully free, runs on your hardware
    • Provides detailed segment- and word-level timestamps
  2. Groq API (Fast & free cloud option):

    • Uses Whisper models with hardware acceleration
    • Much faster than local processing
    • Free tier: significant daily allowance
    • Good for testing/prototyping
  3. OpenAI Whisper API (Official cloud option):

    • Official OpenAI implementation (whisper-1 model)
    • High quality, reliable results
    • Pay-per-use pricing
    • Excellent Dutch support
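
For the recommended local option, a transcription call might look like this sketch (model size and device are assumptions about your hardware):

# Sketch: local Dutch transcription with faster-whisper.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "chunks/chunk_000000.wav",
    language="nl",          # Dutch hint: faster, more accurate decoding
    word_timestamps=True,   # word-level timing helps diarization alignment
)
for seg in segments:
    print(f"[{seg.start:8.2f} -> {seg.end:8.2f}] {seg.text.strip()}")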

Process:

  1. Transcribe each chunk with the language="nl" parameter for faster, more accurate results
  2. Use verbose_json response format for detailed timestamps and word-level data
  3. Implement retry logic for API rate limiting (if using Groq/OpenAI)
  4. Merge overlapping chunk transcriptions using LCS alignment (sketched after this list)
  5. Associate speaker labels from diarization with transcribed segments
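
A minimal sketch of the merge in step 4, using difflib's longest matching token run as a stand-in for a full LCS alignment (the function name and window size are illustrative):

# Sketch: join two transcriptions whose edges cover the same ~10 s of audio.
from difflib import SequenceMatcher

def merge_overlap(prev, curr, window=50):
    prev_words, curr_words = prev.split(), curr.split()
    tail = prev_words[-window:]   # end of previous chunk's transcript
    head = curr_words[:window]    # start of current chunk's transcript
    m = SequenceMatcher(None, tail, head).find_longest_match(
        0, len(tail), 0, len(head))
    if m.size < 3:  # no reliable overlap found; fall back to concatenation
        return prev + " " + curr
    # Keep prev through the matched run, then the rest of curr after it.
    cut_prev = len(prev_words) - len(tail) + m.a + m.size
    return " ".join(prev_words[:cut_prev] + curr_words[m.b + m.size:])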

Phase 4: Character & Context Classification

Transcribed Text → LLM Analysis → IC/OOC Classification + Character Attribution

Tools:

  • Ollama (free, local LLM): Run GPT-OSS 20B by default (fall back to Llama/Qwen if hardware-limited)
  • Alternative: GPT-4o-mini API (has free tier)

Process:

  1. Semantic Analysis of transcribed text (no audio cues available):

    • IC Indicators: Narrative language, character actions ("I do X"), dialogue in-world context, fantasy/game vocabulary
    • OOC Indicators: Meta-discussion about rules/mechanics, real-world topics (food, bathroom breaks), game strategy discussion, laughter/jokes about the game itself
    • Context clues: Character names being used vs player names, present tense action vs past tense discussion
  2. LLM Prompt Strategy:

    • Use context window of surrounding segments (not just individual sentences)
    • Provide character names and player names as reference
    • Ask LLM to classify with confidence scores
    • Use few-shot examples from manually labeled early sessions
  3. Classification Output for each segment:

    • Speaker: [Player Name | DM]
    • Character: [Character Name | NPC Name | OOC | Narration]
    • Type: [IC | OOC | MIXED]
    • Confidence: 0.0-1.0
  4. Iterative Learning:

    • Build character voice profiles over sessions
    • Learn common OOC patterns specific to your group
    • User can manually correct classifications to improve future sessions
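
A sketch of the per-segment classification call with the ollama Python client (the prompt wording and model tag are assumptions, not the project's exact prompt; the model must already be pulled locally):

# Sketch: classify one segment as IC/OOC with a local LLM via Ollama.
import ollama

def classify_segment(prev, curr, nxt):
    prompt = (
        "D&D session in Dutch with 3 players + 1 DM.\n"
        f"Previous segment: {prev}\n"
        f"Current segment: {curr}\n"
        f"Next segment: {nxt}\n"
        "Classify the CURRENT segment as IC or OOC, with a confidence "
        "between 0.0 and 1.0 and a one-line reason."
    )
    resp = ollama.chat(model="llama3.1",
                       messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"]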

Phase 5: Output Generation

Classified Segments → Formatting → Multiple Output Formats

Output Formats:

  1. Full Transcript (plain text with markers)
[00:15:23] DM (Narration): Je betreedt de donkere grot...
[00:15:45] Player1 as CharacterA (IC): Ik steek mijn fakkel aan.
[00:16:02] Player2 (OOC): Haha, weer een grot!
  2. IC-Only Transcript (game narrative only)
[00:15:23] DM: Je betreedt de donkere grot...
[00:15:45] CharacterA: Ik steek mijn fakkel aan.
  3. Structured JSON (for further processing)
{
  "segments": [
    {
      "timestamp": "00:15:23",
      "speaker": "DM",
      "character": null,
      "type": "narration",
      "text": "Je betreedt de donkere grot..."
    }
  ]
}
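
Rendering a classified segment into the full-transcript line format above is then a small formatting step (field names follow the JSON sketch; the helper is illustrative):

# Sketch: format one classified segment as a full-transcript line.
def format_line(seg):
    who = seg["speaker"]
    if seg.get("character"):
        who += f" as {seg['character']}"
    tag = "Narration" if seg["type"] == "narration" else seg["type"].upper()
    return f"[{seg['timestamp']}] {who} ({tag}): {seg['text']}"

print(format_line({"timestamp": "00:15:45", "speaker": "Player1",
                   "character": "CharacterA", "type": "IC",
                   "text": "Ik steek mijn fakkel aan."}))
# [00:15:45] Player1 as CharacterA (IC): Ik steek mijn fakkel aan.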

Technology Stack

Core Components

  • Python 3.10+: Main language
  • FFmpeg: Audio conversion
  • PyTorch: ML framework for models

Libraries

  • pyannote.audio: Speaker diarization
  • faster-whisper: Optimized Whisper transcription (recommended)
  • pydub: Audio chunking and manipulation
  • silero-vad: Voice activity detection for smart chunking
  • groq: Optional API client for faster transcription
  • ollama-python: Local LLM for IC/OOC classification
  • numpy, scipy: Audio processing utilities

UI Framework (for future phases)

  • Gradio: Simple web UI for Python
  • Streamlit: Alternative with more customization
  • Or Electron + Python backend: For desktop app

Development Phases

MVP (Minimum Viable Product)

  1. ✅ Audio conversion (M4A → WAV)
  2. ✅ Basic chunking by silence detection
  3. ✅ Whisper transcription with timestamps
  4. ✅ Simple speaker diarization (PyAnnote)
  5. ✅ Plain text output with speaker labels

Phase 2 Enhancements

  • Speaker profile learning across sessions
  • IC/OOC classification using LLM
  • Character attribution
  • Multiple output formats

Phase 3 Advanced Features

  • Web UI for processing and review
  • Manual correction interface
  • Speaker profile management
  • Batch processing multiple sessions

Phase 4 Polish

  • Character voice profile refinement
  • Automatic OOC filtering improvement
  • Session summary generation
  • Search functionality across transcripts

Success Metrics

  • Speaker Identification: >85% accuracy
  • Classification: >80% correct character attribution and IC/OOC labels
  • Processing Time: <1x real time (4 hr of audio processed in under 4 hrs)
  • Usability: a non-technical user can process a session with <5 min of setup

Known Limitations & Mitigation

Challenge | Impact | Mitigation
Single mic = overlapping speech | Harder to separate speakers | Use stricter VAD, accept some loss
DM voices multiple NPCs | Confusion in speaker ID | Use contextual LLM analysis
Dutch language support | Fewer pre-trained models | Whisper has excellent Dutch support
Zero budget | Limited API calls | Prioritize local models (Whisper, Ollama)
Voice similarity (same person, different characters) | Character attribution errors | Learn character patterns over time
Long-session audio exports | Per-segment clipping loads the full WAV (~450 MB for 4 hr) | Recommend 16 GB RAM (documented) or process sessions in smaller blocks


Next Steps

  1. Set up development environment (Python 3.10+, FFmpeg, dependencies)
  2. Implement audio conversion (M4A → 16kHz mono WAV)
  3. Implement hybrid chunking (VAD + 10-min max with 10s overlap)
  4. Test Whisper transcription quality on sample chunk
  5. Implement LCS merge algorithm for overlapping transcriptions
  6. Evaluate PyAnnote diarization on sample
  7. Build MVP command-line tool
  8. Iterate on IC/OOC classification

IC/OOC Classification Strategy

Challenge: There are no explicit audio or verbal cues when conversation shifts between in-character and out-of-character.

Solution: Post-transcription semantic analysis using LLM reasoning:

Example Prompting Approach

Context: D&D session in Dutch with 3 players + 1 DM
Characters: [CharacterA, CharacterB, CharacterC]
Players: [Player1, Player2, Player3, DM]

Analyze this segment and classify as IC (in-character) or OOC (out-of-character):

Previous segment: "Ik steek mijn fakkel aan en loop de grot in."
Current segment: "Wacht, moet ik daar een perception check voor doen?"
Next segment: "Ja, gooi maar een d20."

Classification: OOC (discussing game mechanics)
Confidence: 0.95
Reason: Discussion about dice rolls and game rules

This approach relies on the LLM understanding:

  • D&D game context and terminology
  • Narrative vs meta-discussion patterns
  • Dutch language nuances

🎯 Project Status

COMPLETE - Full production system implemented!

All phases completed:

  • ✅ Audio conversion and chunking
  • ✅ Multi-backend transcription (local + Groq API)
  • ✅ Overlap merging with LCS algorithm
  • ✅ Speaker diarization with PyAnnote
  • ✅ IC/OOC classification with Ollama
  • ✅ Multiple output formats
  • ✅ Web UI (Gradio)
  • ✅ CLI (Click + Rich)
  • ✅ Complete documentation

💡 Use Cases

  • Session Notes: Automatic IC-only transcripts for campaign journal
  • Quote Mining: Search for memorable moments and quotes
  • Analysis: Track character speaking time and participation
  • Accessibility: Make sessions accessible to deaf/hard-of-hearing players
  • Recap Creation: Quick session recaps from IC-only output
  • Rules Reference: Find when specific rules were discussed (OOC-only)

🔮 Future Enhancements

Implemented features (see documentation for details):

  • Session Notebooks: Transform IC transcripts into narrative formats
    • Character first-person POV
    • Third-person narrator style
    • Character journal/diary entries
    • CLI and Python API
  • Campaign Knowledge Base: Automatic extraction and tracking of campaign entities
  • Campaign Dashboard: Visual health check of campaign configuration
  • Import Session Notes: Backfill early sessions from written notes

Planned features:

  • Automatic session summary generation
  • Character emotion/tone detection
  • Combat encounter extraction
  • Multi-session search and analysis
  • Voice cloning for TTS playback
  • Multi-session chronicle compilation
  • Custom narrative style templates

🤝 Contributing

This is a personal project, but suggestions and improvements are welcome!

📝 License

This project is provided as-is for personal use. See individual library licenses for dependencies.

🙏 Acknowledgments

  • OpenAI Whisper team for excellent multilingual transcription
  • PyAnnote.audio team for state-of-the-art diarization
  • Ollama team for making local LLMs accessible
  • Research from Groq community and Murray Cole for chunking strategies

Built with love for D&D players who want to preserve their campaign memories! 🎲✨

Diagnostics & Monitoring Enhancements

  • Diagnostics tab (app.py): Run Discover Tests to list collected pytest node IDs and execute targeted cases or the full suite directly inside the Gradio UI.
  • Session Manager dashboard: The landing page now auto-refreshes every 2s, shows the exact options provided for the run (party config, skip flags, output paths), and tracks per-stage start/end times with live progress indicators.
  • Event timeline: Recent status entries highlight when each pipeline stage finished and which stage starts next, so you can monitor long-running jobs without tailing logs.
  • Dashboard idle state: When app.py isn't listening, the manager shows an idle summary instead of stale pipeline data, so you always know when a run is actually active.

Story Notebooks Tab

  • Load the campaign notebook via Document Viewer (Google Doc export) so the LLM can weave in contextual cues.
  • Open Story Notebooks to pick a processed session, then generate the narrator summary and per-character POVs.
  • Each run saves Markdown files under the session's narratives/ folder, making it easy to iterate on edits or share drafts.
