FoxNoseTech · loookashow · May 6, 2026 · May 6, 2026
diff --git a/README.md b/README.md
@@ -23,9 +23,9 @@ for seg in result.segments:
     print(f"  [{seg.start:.1f}s - {seg.end:.1f}s] {seg.speaker}")
 ```
 
-**~5.0% weighted DER** on VoxConverse dev. Processes audio **~8x faster than real-time** on CPU. Automatically detects the number of speakers.
+**~4.8% weighted DER** on VoxConverse dev. Processes audio **~8x faster than real-time** on CPU. Automatically detects the number of speakers.
 
-> Benchmarked on a single dataset ([VoxConverse](https://github.com/joonson/voxconverse)). Cross-dataset validation is [in progress](#roadmap).
+> Primary benchmark: [VoxConverse](https://github.com/joonson/voxconverse). A preliminary AMI meeting-domain check is reported below; broader cross-dataset validation is [in progress](#roadmap).
 
 ## How diarize compares
 
@@ -35,7 +35,7 @@ for seg in result.segments:
 | GPU required | No | No (7x slower on CPU) | No |
 | HuggingFace account | No | Yes | Yes |
 | Auto speaker count | Yes | Yes | Yes |
-| DER (VoxConverse dev) | **~5.0%** | ~11.2% | ~8.5% |
+| DER (VoxConverse dev) | **~4.8%** | ~11.2% | ~8.5% |
 | CPU speed (RTF) | **0.12** | 0.86 | — |
 | Install | `pip install diarize` | `pip install pyannote.audio` | `pip install pyannote.audio` |
 
@@ -102,7 +102,7 @@ Evaluated on [VoxConverse](https://github.com/joonson/voxconverse) dev set (216
 | System | Weighted DER | Notes |
 |--------|----------|-------|
 | pyannote precision-2 | ~8.5% | Commercial license |
-| **diarize** | **~5.0%** | **Apache 2.0, CPU-only, no API key** |
+| **diarize** | **~4.8%** | **Apache 2.0, CPU-only, no API key** |
 | pyannote community-1 | ~11.2% | CC-BY-4.0, needs HF token |
 | pyannote 3.1 (legacy) | ~11.2% | MIT, needs HF token |
 
@@ -111,26 +111,36 @@ Evaluated on [VoxConverse](https://github.com/joonson/voxconverse) dev set (216
 | Metric | Result |
 |--------|--------|
 | Files | 216 |
-| Exact match | 117/216 (54%) |
-| Within ±1 | 175/216 (81%) |
+| Exact match | 125/216 (58%) |
+| Within ±1 | 178/216 (82%) |
 
 Many-speaker files remain the weak spot: automatic count estimation degrades above 7 speakers. Pass `num_speakers` when the count is known.
 
+Preliminary AMI meeting-domain check (16 Mix-Headset test files, 4–9 speakers):
+
+| Metric | Result |
+|--------|--------|
+| Weighted DER | 14.96% |
+| Speaker count exact match | 4/16 (25%) |
+| Speaker count within ±1 | 8/16 (50%) |
+
+AMI confirms that meeting-domain speaker counting is harder: the estimator often collapses 6+ speaker meetings to 4–5 speakers.
+
 Full benchmark results, speed comparison, and methodology: [benchmarks](https://foxnosetech.github.io/diarize/benchmarks/).
 
 ## When to use something else
 
-- **You need commercial support or cross-dataset validation.** pyannote's commercial model has published production-oriented benchmarks beyond this single VoxConverse evaluation. If accuracy is the top priority and you have budget, compare on your own data.
+- **You need commercial support or broad cross-dataset validation.** pyannote's commercial model has published production-oriented benchmarks beyond this limited VoxConverse/AMI evaluation. If accuracy is the top priority and you have budget, compare on your own data.
 - **You need very stable speaker labels in transcripts.** Temporal smoothing reduces short label jumps, but diarize can still show speaker fragmentation / label switching: one real speaker may be split across multiple `SPEAKER_XX` labels, especially on noisy real-world audio.
 - **Your audio has 8+ speakers.** Automatic speaker count estimation degrades above 7 speakers. You can pass `num_speakers` explicitly, but test carefully.
 - **You need overlapping speech detection.** diarize assigns each segment to one speaker. Overlapping speech is not modeled.
 - **You need GPU-accelerated throughput.** diarize is CPU-only by design. For processing thousands of hours with GPU infrastructure, NeMo or pyannote on GPU will be faster.
 
 ## Roadmap
 
-Current benchmarks are based on VoxConverse dev set only. We are actively working on:
+Current benchmarks cover the VoxConverse dev set and a preliminary AMI test-set check. We are actively working on:
 
-- **Cross-dataset validation** — AMI, DIHARD III, CALLHOME, and other standard benchmarks in isolated environments
+- **Cross-dataset validation** — DIHARD III, CALLHOME, and other standard benchmarks in isolated environments
 - **Speaker count estimation benchmarks** — comparison of speaker counting accuracy against other systems
 - **Broader system comparison** — NeMo, WhisperX, and other diarization solutions
 - **Streaming / real-time diarization** — live audio streams with real-time speaker detection
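
Review note: a minimal sketch of how the table rows in this README change could be recomputed from per-file results. The function names and inputs are hypothetical, and pooling error time over pooled scored time is an assumption about what "weighted DER" means here; the diff itself does not define it.

```python
from typing import Sequence


def count_accuracy(true_counts: Sequence[int], est_counts: Sequence[int]) -> tuple[int, int]:
    """Return (exact matches, matches within +/-1) over paired per-file counts."""
    if len(true_counts) != len(est_counts):
        raise ValueError("need one estimated count per file")
    pairs = list(zip(true_counts, est_counts))
    exact = sum(t == e for t, e in pairs)
    within_one = sum(abs(t - e) <= 1 for t, e in pairs)
    return exact, within_one


def weighted_der(error_seconds: Sequence[float], scored_seconds: Sequence[float]) -> float:
    """Assumed weighting: pool error time over all files, divide by pooled scored time."""
    total = sum(scored_seconds)
    if total <= 0:
        raise ValueError("no scored speech")
    return sum(error_seconds) / total


def as_percent(hits: int, total: int) -> str:
    """Format a row the way the tables do, e.g. '125/216 (58%)'."""
    return f"{hits}/{total} ({round(100 * hits / total)}%)"
```

With the counts from the tables in this diff, `as_percent(125, 216)` gives `125/216 (58%)` and `as_percent(4, 16)` gives `4/16 (25%)`, which matches the rounded figures in the change.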

diff --git a/docs/benchmarks.md b/docs/benchmarks.md
@@ -1,15 +1,17 @@
 # Benchmarks
 
-Evaluated on the [VoxConverse](https://github.com/joonson/voxconverse)
-dev set (216 files, 1--20 speakers per file).
+Primary published numbers are evaluated on the
+[VoxConverse](https://github.com/joonson/voxconverse) dev set
+(216 files, 1--20 speakers per file). We also run preliminary
+cross-dataset checks on AMI meetings to track generalisation.
 
 ## Speaker Count Estimation
 
 | Metric | Result |
 |--------|--------|
 | Files | 216 |
-| Exact match | 117/216 (54%) |
-| Within +/-1 | 175/216 (81%) |
+| Exact match | 125/216 (58%) |
+| Within +/-1 | 178/216 (82%) |
 
 The automatic estimator is usually close, but exact counting remains the
 main weak spot. Accuracy drops for many-speaker files --- see
@@ -23,21 +25,42 @@ DER is the standard metric for speaker diarization, computed with
 | System | Weighted DER | Median DER | Notes |
 |--------|----------|------------|-------|
 | pyannote precision-2 | ~8.5% | -- | Commercial license |
-| **diarize** | **~5.0%** | **~2.2%** | **Apache 2.0, CPU-only, no API key** |
+| **diarize** | **~4.8%** | **~2.1%** | **Apache 2.0, CPU-only, no API key** |
 | pyannote community-1 | ~11.2% | -- | CC-BY-4.0, needs HF token |
 | pyannote 3.1 (legacy) | ~11.2% | -- | MIT, needs HF token |
 
 pyannote DER numbers are self-reported from the
 [pyannote benchmark page](https://huggingface.co/pyannote/speaker-diarization-3.1)
 on VoxConverse v0.3.
 
-!!! note "VoxConverse-only result"
+!!! note "Dataset-specific result"
     On this VoxConverse dev evaluation, `diarize` reports lower weighted
     DER than the published pyannote VoxConverse figures, while requiring
     no HuggingFace token or account registration. Treat this as a
-    single-dataset benchmark and compare on your own audio when accuracy
+    VoxConverse-specific benchmark and compare on your own audio when accuracy
     is the top priority.
 
+## Cross-Dataset Check: AMI
+
+Preliminary AMI test-set evaluation uses 16 Mix-Headset meeting
+recordings (4--9 speakers per file), RTTM annotations from the
+standard AMI speaker-diarization benchmark, and the same DER settings
+(`collar=0.25`, `skip_overlap=True`).
+
+| Metric | Result |
+|--------|--------|
+| Files | 16 |
+| Weighted DER | 14.96% |
+| Mean DER | 14.63% |
+| Median DER | 14.18% |
+| Speaker count exact match | 4/16 (25%) |
+| Speaker count within +/-1 | 8/16 (50%) |
+
+This confirms that meeting-domain audio is a harder case for automatic
+speaker counting. The estimator often collapses 6+ speaker meetings to
+4--5 speakers. Aggregate DER can stay moderate in those cases because
+some ground-truth speakers have little speaking time.
+
 ## CPU Speed (Real Time Factor)
 
 RTF = processing_time / audio_duration.  Lower is faster; RTF < 1.0 means
@@ -76,6 +99,34 @@ Measured on VoxConverse dev files on Apple M2 Pro / M2 Max
   warm-up.  RTF = processing_time / audio_duration.
 - **Hardware:** Apple M2 Pro, macOS, CPU only (no GPU).
 
+## Reproducing and Extending Benchmarks
+
+The repository includes a dataset-agnostic RTTM runner for local
+experiments:
+
+```bash
+python scripts/benchmark_rttm.py \
+  --dataset voxconverse-dev \
+  --audio-dir /path/to/voxconverse/dev/audio \
+  --rttm-dir /path/to/voxconverse/rttm_annotations/dev \
+  --output results_voxconverse_dev.json
+```
+
+It also supports combined RTTM files and targeted diagnostics:
+
+```bash
+python scripts/benchmark_rttm.py \
+  --dataset ami-test \
+  --audio-dir /path/to/ami/mix-headset/test \
+  --rttm-file /path/to/AMI.SpeakerDiarization.Benchmark.test.rttm \
+  --oracle-speakers \
+  --file-id IS1009a
+```
+
+Use `--oracle-speakers` to isolate speaker assignment and clustering
+quality when the true speaker count is known. Use `--list-only` to
+verify audio/RTTM matching without running inference.
+
 ## Limitations
 
 !!! warning "Speaker count > 7"
@@ -108,14 +159,14 @@ Measured on VoxConverse dev files on Apple M2 Pro / M2 Max
 
 ## Future Work
 
-!!! info "Single-dataset disclaimer"
-    All results above are from VoxConverse dev set only.  We are actively
-    expanding evaluation to ensure the algorithm generalises well and is
-    not overfit to a single benchmark.
+!!! info "Cross-dataset validation in progress"
+    VoxConverse remains the primary published benchmark. AMI is now used
+    as an additional meeting-domain check, and more datasets are needed
+    before making broad accuracy claims.
 
 **Planned evaluation:**
 
-- **Cross-dataset validation** --- AMI, DIHARD III, CALLHOME, and other
+- **Cross-dataset validation** --- DIHARD III, CALLHOME, and other
   standard benchmarks, run in isolated environments with controlled
   CPU/memory limits.
 - **Speaker count estimation comparison** --- dedicated benchmarks comparing
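
Review note: the combined-RTTM mode added above implies grouping reference turns by file ID. The sketch below shows one way to derive ground-truth speaker counts per file from RTTM `SPEAKER` records. It is not the implementation of `scripts/benchmark_rttm.py`; the function name and the sample speaker labels are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def speakers_per_file(rttm_lines: Iterable[str]) -> dict[str, set[str]]:
    """Group speaker labels by file ID from RTTM SPEAKER records."""
    speakers: defaultdict[str, set[str]] = defaultdict(set)
    for line in rttm_lines:
        fields = line.split()
        # RTTM SPEAKER record layout: type, file ID, channel, onset,
        # duration, two unused fields, speaker label, two more unused fields.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank lines and non-SPEAKER records
        speakers[fields[1]].add(fields[7])
    return dict(speakers)


# Toy combined RTTM content covering two recordings (labels are made up).
sample = """\
SPEAKER IS1009a 1 0.00 2.50 <NA> <NA> spk_A <NA> <NA>
SPEAKER IS1009a 1 2.50 1.00 <NA> <NA> spk_B <NA> <NA>
SPEAKER IS1009a 1 3.50 0.75 <NA> <NA> spk_A <NA> <NA>
SPEAKER toy_meeting 1 0.00 4.00 <NA> <NA> spk_C <NA> <NA>
""".splitlines()

reference_counts = {fid: len(spk) for fid, spk in speakers_per_file(sample).items()}
print(reference_counts)
```

Counts derived this way can be compared with the estimated counts to fill the exact-match and within +/-1 rows.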

diff --git a/docs/how-it-works.md b/docs/how-it-works.md
@@ -76,9 +76,10 @@ speakers while keeping computational cost low.
 
 **Step 3 --- Silhouette refinement.** BIC is used as an anchor, then a
 small neighbourhood around it is scored with silhouette over cosine
-distance. The candidate range is clamped by `min_speakers`,
-`max_speakers`, and the number of available embeddings. This catches
-some BIC undercounts and overcounts without searching the full range.
+distance plus a small logarithmic bonus for larger *k*. The candidate
+range is clamped by `min_speakers`, `max_speakers`, and the number of
+available embeddings. This catches some BIC undercounts and overcounts
+without searching the full range.
 
 !!! warning
     For **8 or more speakers** the estimator can undercount.
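
Review note: a minimal sketch of the selection rule described in Step 3, assuming the silhouette score for each candidate *k* is already available through a callable. The neighbourhood `radius`, the `bonus_weight`, and the function name are hypothetical; the diff does not state the values the library actually uses.

```python
import math
from typing import Callable


def refine_speaker_count(
    bic_k: int,
    silhouette_for: Callable[[int], float],
    n_embeddings: int,
    min_speakers: int = 1,
    max_speakers: int = 20,
    radius: int = 2,             # hypothetical neighbourhood half-width
    bonus_weight: float = 0.02,  # hypothetical size of the logarithmic bonus
) -> int:
    """Pick k near the BIC anchor by silhouette plus a log bonus for larger k."""
    # Silhouette is only informative for 2 <= k <= n_embeddings - 1.
    low = max(min_speakers, bic_k - radius, 2)
    high = min(max_speakers, bic_k + radius, n_embeddings - 1)
    if low > high:
        # No scorable candidate: fall back to the clamped BIC estimate.
        return max(min_speakers, min(bic_k, max_speakers))
    return max(
        range(low, high + 1),
        key=lambda k: silhouette_for(k) + bonus_weight * math.log(k),
    )


# Toy scores: k=4 has the best raw silhouette, k=5 is a close second.
toy_scores = {3: 0.40, 4: 0.41, 5: 0.407}
with_bonus = refine_speaker_count(4, toy_scores.__getitem__, n_embeddings=50, radius=1)
without_bonus = refine_speaker_count(
    4, toy_scores.__getitem__, n_embeddings=50, radius=1, bonus_weight=0.0
)
print(with_bonus, without_bonus)
```

In this toy case the bonus tips a near-tie toward the larger count, which is the kind of undercount correction the paragraph describes; a larger `bonus_weight` would risk overcounting.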

diff --git a/docs/index.md b/docs/index.md
@@ -29,7 +29,7 @@ for seg in result.segments:
 | GPU required | No | No (7x slower on CPU) | No |
 | HuggingFace account | No | Yes | Yes |
 | Auto speaker count | Yes | Yes | Yes |
-| DER (VoxConverse dev) | **~5.0%** | ~11.2% | ~8.5% |
+| DER (VoxConverse dev) | **~4.8%** | ~11.2% | ~8.5% |
 | CPU speed (RTF) | **0.12** | 0.86 | --- |
 
 DER and speed numbers for pyannote are from their
@@ -40,7 +40,7 @@ The diarize number is from the VoxConverse dev evaluation described in
 ## Next Steps
 
 - [How It Works](how-it-works.md) --- pipeline architecture and algorithms
-- [Benchmarks](benchmarks.md) --- VoxConverse evaluation, speed comparison, limitations
+- [Benchmarks](benchmarks.md) --- VoxConverse, AMI, speed comparison, limitations
 - [API Reference](api.md) --- full auto-generated API documentation
 
 ## License

diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "diarize"
-version = "0.1.1"
+version = "0.1.2"
 description = "Speaker diarization for Python — detect who spoke when in audio files. CPU-only, no GPU, no API keys, no account signup. Automatic speaker count detection."
 readme = "README.md"
 license = "Apache-2.0"