Merged
28 changes: 19 additions & 9 deletions README.md
@@ -23,9 +23,9 @@ for seg in result.segments:
print(f" [{seg.start:.1f}s - {seg.end:.1f}s] {seg.speaker}")
```

**~5.0% weighted DER** on VoxConverse dev. Processes audio **~8x faster than real-time** on CPU. Automatically detects the number of speakers.
**~4.8% weighted DER** on VoxConverse dev. Processes audio **~8x faster than real-time** on CPU. Automatically detects the number of speakers.

> Benchmarked on a single dataset ([VoxConverse](https://github.com/joonson/voxconverse)). Cross-dataset validation is [in progress](#roadmap).
> Primary benchmark: [VoxConverse](https://github.com/joonson/voxconverse). Preliminary AMI meeting-domain validation is [in progress](#roadmap).

## How diarize compares

@@ -35,7 +35,7 @@ for seg in result.segments:
| GPU required | No | No (7x slower on CPU) | No |
| HuggingFace account | No | Yes | Yes |
| Auto speaker count | Yes | Yes | Yes |
| DER (VoxConverse dev) | **~5.0%** | ~11.2% | ~8.5% |
| DER (VoxConverse dev) | **~4.8%** | ~11.2% | ~8.5% |
| CPU speed (RTF) | **0.12** | 0.86 | — |
| Install | `pip install diarize` | `pip install pyannote.audio` | `pip install pyannote.audio` |

@@ -102,7 +102,7 @@ Evaluated on [VoxConverse](https://github.com/joonson/voxconverse) dev set (216
| System | Weighted DER | Notes |
|--------|----------|-------|
| pyannote precision-2 | ~8.5% | Commercial license |
| **diarize** | **~5.0%** | **Apache 2.0, CPU-only, no API key** |
| **diarize** | **~4.8%** | **Apache 2.0, CPU-only, no API key** |
| pyannote community-1 | ~11.2% | CC-BY-4.0, needs HF token |
| pyannote 3.1 (legacy) | ~11.2% | MIT, needs HF token |

@@ -111,26 +111,36 @@ Evaluated on [VoxConverse](https://github.com/joonson/voxconverse) dev set (216
| Metric | Result |
|--------|--------|
| Files | 216 |
| Exact match | 117/216 (54%) |
| Within ±1 | 175/216 (81%) |
| Exact match | 125/216 (58%) |
| Within ±1 | 178/216 (82%) |

Many-speaker files remain the weak spot: automatic count estimation degrades above 7 speakers. Pass `num_speakers` when the count is known.

Preliminary AMI meeting-domain check (16 Mix-Headset test files, 4–9 speakers):

| Metric | Result |
|--------|--------|
| Weighted DER | 14.96% |
| Speaker count exact match | 4/16 (25%) |
| Speaker count within ±1 | 8/16 (50%) |

AMI confirms that meeting-domain speaker counting is harder: the estimator often collapses 6+ speaker meetings to 4–5 speakers.

Full benchmark results, speed comparison, and methodology: [benchmarks](https://foxnosetech.github.io/diarize/benchmarks/).

## When to use something else

- **You need commercial support or cross-dataset validation.** pyannote's commercial model has published production-oriented benchmarks beyond this single VoxConverse evaluation. If accuracy is the top priority and you have budget, compare on your own data.
- **You need commercial support or broad cross-dataset validation.** pyannote's commercial model has published production-oriented benchmarks beyond this limited VoxConverse/AMI evaluation. If accuracy is the top priority and you have budget, compare on your own data.
- **You need very stable speaker labels in transcripts.** Temporal smoothing reduces short label jumps, but diarize can still show speaker fragmentation / label switching: one real speaker may be split across multiple `SPEAKER_XX` labels, especially on noisy real-world audio.
- **Your audio has 8+ speakers.** Automatic speaker count estimation degrades above 7 speakers. You can pass `num_speakers` explicitly, but test carefully.
- **You need overlapping speech detection.** diarize assigns each segment to one speaker. Overlapping speech is not modeled.
- **You need GPU-accelerated throughput.** diarize is CPU-only by design. For processing thousands of hours with GPU infrastructure, NeMo or pyannote on GPU will be faster.

## Roadmap

Current benchmarks are based on VoxConverse dev set only. We are actively working on:
Current benchmarks include VoxConverse dev and preliminary AMI test validation. We are actively working on:

- **Cross-dataset validation** — AMI, DIHARD III, CALLHOME, and other standard benchmarks in isolated environments
- **Cross-dataset validation** — DIHARD III, CALLHOME, and other standard benchmarks in isolated environments
- **Speaker count estimation benchmarks** — comparison of speaker counting accuracy against other systems
- **Broader system comparison** — NeMo, WhisperX, and other diarization solutions
- **Streaming / real-time diarization** — live audio streams with real-time speaker detection
75 changes: 63 additions & 12 deletions docs/benchmarks.md
@@ -1,15 +1,17 @@
# Benchmarks

Evaluated on the [VoxConverse](https://github.com/joonson/voxconverse)
dev set (216 files, 1--20 speakers per file).
Primary published numbers are evaluated on the
[VoxConverse](https://github.com/joonson/voxconverse) dev set
(216 files, 1--20 speakers per file). We also run preliminary
cross-dataset checks on AMI meetings to track generalisation.

## Speaker Count Estimation

| Metric | Result |
|--------|--------|
| Files | 216 |
| Exact match | 117/216 (54%) |
| Within +/-1 | 175/216 (81%) |
| Exact match | 125/216 (58%) |
| Within +/-1 | 178/216 (82%) |
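
The two rows above can be recomputed from paired per-file speaker counts. The following is a minimal sketch, not the project's evaluation code: `count_accuracy` and `as_row` are hypothetical helper names, and the input is assumed to be two equal-length lists of true and estimated counts.

```python
def count_accuracy(true_counts, predicted_counts):
    """Return (exact, within_one, total) for paired per-file speaker counts."""
    if len(true_counts) != len(predicted_counts):
        raise ValueError("count lists must have the same length")
    # Absolute counting error for each file.
    diffs = [abs(t - p) for t, p in zip(true_counts, predicted_counts)]
    exact = sum(d == 0 for d in diffs)
    within_one = sum(d <= 1 for d in diffs)
    return exact, within_one, len(diffs)


def as_row(hits, total):
    """Format a hit count the way the tables on this page do."""
    return f"{hits}/{total} ({round(100 * hits / total)}%)"


# Hypothetical counts for four files (illustration only).
exact, within_one, total = count_accuracy([4, 6, 8, 5], [4, 5, 5, 5])
print(as_row(exact, total), as_row(within_one, total))
```

"Within +/-1" includes exact matches, so it can never be lower than the exact-match row.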

The automatic estimator is usually close, but exact counting remains the
main weak spot. Accuracy drops for many-speaker files --- see
@@ -23,21 +25,42 @@ DER is the standard metric for speaker diarization, computed with
| System | Weighted DER | Median DER | Notes |
|--------|----------|------------|-------|
| pyannote precision-2 | ~8.5% | -- | Commercial license |
| **diarize** | **~5.0%** | **~2.2%** | **Apache 2.0, CPU-only, no API key** |
| **diarize** | **~4.8%** | **~2.1%** | **Apache 2.0, CPU-only, no API key** |
| pyannote community-1 | ~11.2% | -- | CC-BY-4.0, needs HF token |
| pyannote 3.1 (legacy) | ~11.2% | -- | MIT, needs HF token |

pyannote DER numbers are self-reported from the
[pyannote benchmark page](https://huggingface.co/pyannote/speaker-diarization-3.1)
on VoxConverse v0.3.

!!! note "VoxConverse-only result"
!!! note "Dataset-specific result"
On this VoxConverse dev evaluation, `diarize` reports lower weighted
DER than the published pyannote VoxConverse figures, while requiring
no HuggingFace token or account registration. Treat this as a
single-dataset benchmark and compare on your own audio when accuracy
VoxConverse-specific benchmark and compare on your own audio when accuracy
is the top priority.

## Cross-Dataset Check: AMI

Preliminary AMI test-set evaluation uses 16 Mix-Headset meeting
recordings (4--9 speakers per file), RTTM annotations from the
standard AMI speaker-diarization benchmark, and the same DER settings
(``collar=0.25``, ``skip_overlap=True``).

| Metric | Result |
|--------|--------|
| Files | 16 |
| Weighted DER | 14.96% |
| Mean DER | 14.63% |
| Median DER | 14.18% |
| Speaker count exact match | 4/16 (25%) |
| Speaker count within +/-1 | 8/16 (50%) |

This confirms that meeting-domain audio is a harder case for automatic
speaker counting. The estimator often collapses 6+ speaker meetings to
4--5 speakers, even when aggregate DER remains moderate because some
ground-truth speakers have little speaking time.
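
Weighted, mean, and median DER can differ because they aggregate per-file results differently. The sketch below assumes "weighted" means that each file contributes in proportion to its scored speech duration, which is the usual convention; the exact weighting used for the table is not shown here. `aggregate_der` is a hypothetical helper and the input numbers are made up.

```python
from statistics import mean, median


def aggregate_der(files):
    """files: list of (error_seconds, scored_seconds) pairs, one per recording."""
    per_file = [err / total for err, total in files]
    # Duration-weighted DER: pool errors and scored time across all files.
    weighted = sum(err for err, _ in files) / sum(total for _, total in files)
    return {
        "weighted": weighted,
        "mean": mean(per_file),
        "median": median(per_file),
    }


# Hypothetical recordings: (seconds in error, seconds scored).
files = [(30.0, 600.0), (6.0, 300.0), (45.0, 300.0)]
for name, value in aggregate_der(files).items():
    print(f"{name}: {value:.2%}")
```

A long file with low error pulls the weighted figure down, while a short file with high error raises the mean more than the weighted figure, so reporting all three gives a fuller picture than any single number.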

## CPU Speed (Real Time Factor)

RTF = processing_time / audio_duration. Lower is faster; RTF < 1.0 means
@@ -76,6 +99,34 @@ Measured on VoxConverse dev files on Apple M2 Pro / M2 Max
warm-up. RTF = processing_time / audio_duration.
- **Hardware:** Apple M2 Pro, macOS, CPU only (no GPU).
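
The RTF definition above translates directly into a timing wrapper. This is a minimal sketch under stated assumptions: `real_time_factor` and `speedup` are hypothetical helpers, `process` stands for any zero-argument callable that runs diarization on one file, and warm-up handling is left to the caller.

```python
import time


def real_time_factor(process, audio_duration_s):
    """Time one call to `process` and divide by the audio duration in seconds."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    start = time.perf_counter()
    process()
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_s


def speedup(rtf):
    """Convert RTF into 'times faster than real time'."""
    return 1.0 / rtf


# Stand-in workload: a short sleep instead of a real diarization call.
rtf = real_time_factor(lambda: time.sleep(0.01), audio_duration_s=10.0)
print(f"RTF = {rtf:.4f}")
```

By the same formula, an RTF of 0.12 corresponds to `speedup(0.12)`, roughly 8.3 times faster than real time.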

## Reproducing and Extending Benchmarks

The repository includes a dataset-agnostic RTTM runner for local
experiments:

```bash
python scripts/benchmark_rttm.py \
--dataset voxconverse-dev \
--audio-dir /path/to/voxconverse/dev/audio \
--rttm-dir /path/to/voxconverse/rttm_annotations/dev \
--output results_voxconverse_dev.json
```

It also supports combined RTTM files and targeted diagnostics:

```bash
python scripts/benchmark_rttm.py \
--dataset ami-test \
--audio-dir /path/to/ami/mix-headset/test \
--rttm-file /path/to/AMI.SpeakerDiarization.Benchmark.test.rttm \
--oracle-speakers \
--file-id IS1009a
```

Use ``--oracle-speakers`` to isolate speaker assignment and clustering
quality when the true speaker count is known. Use ``--list-only`` to
verify audio/RTTM matching without running inference.

## Limitations

!!! warning "Speaker count > 7"
@@ -108,14 +159,14 @@ Measured on VoxConverse dev files on Apple M2 Pro / M2 Max

## Future Work

!!! info "Single-dataset disclaimer"
All results above are from VoxConverse dev set only. We are actively
expanding evaluation to ensure the algorithm generalises well and is
not overfit to a single benchmark.
!!! info "Cross-dataset validation in progress"
VoxConverse remains the primary published benchmark. AMI is now used
as an additional meeting-domain check, and more datasets are needed
before making broad accuracy claims.

**Planned evaluation:**

- **Cross-dataset validation** --- AMI, DIHARD III, CALLHOME, and other
- **Cross-dataset validation** --- DIHARD III, CALLHOME, and other
standard benchmarks, run in isolated environments with controlled
CPU/memory limits.
- **Speaker count estimation comparison** --- dedicated benchmarks comparing
7 changes: 4 additions & 3 deletions docs/how-it-works.md
@@ -76,9 +76,10 @@ speakers while keeping computational cost low.

**Step 3 --- Silhouette refinement.** BIC is used as an anchor, then a
small neighbourhood around it is scored with silhouette over cosine
distance. The candidate range is clamped by `min_speakers`,
`max_speakers`, and the number of available embeddings. This catches
some BIC undercounts and overcounts without searching the full range.
distance plus a small logarithmic bonus for larger *k*. The candidate
range is clamped by `min_speakers`, `max_speakers`, and the number of
available embeddings. This catches some BIC undercounts and overcounts
without searching the full range.
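
The selection logic in Step 3 can be sketched as follows. This is an illustration only, not the library's implementation: `refine_speaker_count`, the neighbourhood `radius`, the `bonus_weight` value, and the default limits are all assumptions, and the silhouette scores are taken as precomputed inputs rather than computed from embeddings.

```python
import math


def refine_speaker_count(bic_k, silhouette_by_k, n_embeddings,
                         min_speakers=1, max_speakers=20,
                         radius=2, bonus_weight=0.02):
    """Pick k near the BIC anchor by silhouette plus a small log(k) bonus.

    silhouette_by_k maps candidate k -> silhouette score (cosine distance).
    radius and bonus_weight are hypothetical values chosen for illustration.
    """
    # Silhouette is only defined for 2 <= k <= n_embeddings - 1.
    low = max(min_speakers, bic_k - radius, 2)
    high = min(max_speakers, bic_k + radius, n_embeddings - 1)
    candidates = [k for k in range(low, high + 1) if k in silhouette_by_k]
    if not candidates:
        # Nothing to score: fall back to the clamped BIC anchor.
        return max(min_speakers, min(bic_k, max_speakers))
    return max(
        candidates,
        key=lambda k: silhouette_by_k[k] + bonus_weight * math.log(k),
    )


# Hypothetical silhouette scores around a BIC anchor of 3.
scores = {2: 0.30, 3: 0.41, 4: 0.405, 5: 0.35, 6: 0.20}
print(refine_speaker_count(3, scores, n_embeddings=50))
```

With these made-up scores, plain silhouette would pick 3, while the logarithmic bonus tips the choice to 4. That is the intended effect of the bonus: breaking near-ties in favour of the larger speaker count.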

!!! warning
For **8 or more speakers** the estimator can undercount.
4 changes: 2 additions & 2 deletions docs/index.md
@@ -29,7 +29,7 @@ for seg in result.segments:
| GPU required | No | No (7x slower on CPU) | No |
| HuggingFace account | No | Yes | Yes |
| Auto speaker count | Yes | Yes | Yes |
| DER (VoxConverse dev) | **~5.0%** | ~11.2% | ~8.5% |
| DER (VoxConverse dev) | **~4.8%** | ~11.2% | ~8.5% |
| CPU speed (RTF) | **0.12** | 0.86 | --- |

DER and speed numbers for pyannote are from their
@@ -40,7 +40,7 @@ The diarize number is from the VoxConverse dev evaluation described in
## Next Steps

- [How It Works](how-it-works.md) --- pipeline architecture and algorithms
- [Benchmarks](benchmarks.md) --- VoxConverse evaluation, speed comparison, limitations
- [Benchmarks](benchmarks.md) --- VoxConverse, AMI, speed comparison, limitations
- [API Reference](api.md) --- full auto-generated API documentation

## License
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "diarize"
version = "0.1.1"
version = "0.1.2"
description = "Speaker diarization for Python — detect who spoke when in audio files. CPU-only, no GPU, no API keys, no account signup. Automatic speaker count detection."
readme = "README.md"
license = "Apache-2.0"