# First steps in setting up a diarisation benchmark for Slovenian and related languages
Run from the repository root:

```bash
# 1) Prepare dataset artifacts (gold RTTM, optional trimming)
./prepare_data.sh --yes rog_dialog

# 2) Build inference backends
./build_backends.sh

# 3) Run inference
./run_inference.sh --dataset ROG-Dialog --hf-token "$HF_TOKEN"

# 4) Evaluate and generate report
./scripts/run_evaluation_report.sh --dataset rog_dialog --yes
```

Detailed full-flow docs: End-to-end pipeline and Inference guide.
We currently support three datasets:

- **ROG-Dialog** (primary benchmark)
  - official source: http://hdl.handle.net/11356/2073
  - use `prepare_data_rog_dialog.sh` to download/unpack/reorganize and generate `data/ROG-Dialog/ref_rttm/*.rttm`
  - converter: `rog_dialog_data_process.py` (merge/min filtering, optional `.pog` vs `.std` selection)
- **ROG-Art** (Training corpus of spoken Slovenian ROG 1.1)
  - official source: https://www.clarin.si/repository/xmlui/handle/11356/2062
  - use `prepare_data_rog_art.sh` to download/unpack/reorganize and generate `data/ROG-Art/ref_rttm/*.rttm`
  - converter: `rog_art_data_process.py` (multi-speaker subset from `ROG-speeches.tsv`, merge/min filtering, `.pog`/`.std` preference)
  - this benchmark uses only multi-speaker recordings, as filtered by the `SPK-IDs`/`UTTS` fields
- **CHILDES Croatian Corpus of Preschool Child Language (CCPCL)**
  - source: requires registration/login via https://talkbank.org/childes/access/Slavic/Croatian/CCPCL.html
  - download the archive manually to `data/raw/CCPCL.zip`, then run `./prepare_data_ccpcl.sh` (optional first argument: gold RTTM basename, default `ccpcl_gold_standard`; see docs/data_preparation.md)
  - `prepare_data_ccpcl.sh` extracts to `data/raw/CCPCL`, validates `.wav` availability, and optionally runs `ccpcl_data_process.py`
  - `ccpcl_data_process.py` reads `data/raw/CCPCL/CCPCL/*.cha` (or a nested layout resolved by the shell script) and writes `data/CHILDES-CCPCL/ref_rttm/<basename>.rttm` with the same merge/min-duration defaults as the other datasets
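The merge/min-duration filtering that all three converters apply can be sketched as follows. This is an illustrative helper, not the repo's actual code; the function names (`merge_and_filter`, `to_rttm`) and the default `max_gap`/`min_dur` values are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Seg:
    file_id: str
    start: float   # seconds
    dur: float     # seconds
    speaker: str

def merge_and_filter(segs, max_gap=0.5, min_dur=0.2):
    """Merge adjacent same-speaker segments separated by <= max_gap seconds,
    then drop any segment shorter than min_dur seconds."""
    segs = sorted(segs, key=lambda s: s.start)
    merged = []
    for s in segs:
        last = merged[-1] if merged else None
        if last and last.speaker == s.speaker and s.start - (last.start + last.dur) <= max_gap:
            # extend the previous segment to cover this one
            last.dur = max(last.start + last.dur, s.start + s.dur) - last.start
        else:
            merged.append(Seg(s.file_id, s.start, s.dur, s.speaker))
    return [s for s in merged if s.dur >= min_dur]

def to_rttm(segs):
    """Format segments as standard 10-field RTTM SPEAKER lines."""
    return "\n".join(
        f"SPEAKER {s.file_id} 1 {s.start:.3f} {s.dur:.3f} <NA> <NA> {s.speaker} <NA> <NA>"
        for s in segs
    )
```

For example, two `A` segments 0.2 s apart are merged into one, while a 0.1 s `B` segment is dropped before the RTTM lines are written.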
Explicit supported model names (canonical Hugging Face or vendor IDs), as stored in each run's `benchmark_metadata.json` under `model_name`:

- NVIDIA NeMo (Sortformer)
  - `nvidia/diar_sortformer_4spk-v1`
  - `nvidia/diar_streaming_sortformer_4spk-v2`
  - `nvidia/diar_streaming_sortformer_4spk-v2.1`
- PyAnnote
  - `pyannote/speaker-diarization-3.1`
  - `pyannote/speaker-diarization-community-1`
  - `pyannote/speaker-diarization-precision-2`
- Revai
  - `Revai/reverb-diarization-v2`
- DiariZen
  - `BUT-FIT/diarizen-wavlm-large-s80-md`
  - `BUT-FIT/diarizen-wavlm-large-s80-md-v2`
Metadata for reporting is read from results/<Dataset>/<run_folder>/benchmark_metadata.json (for example results/ROG-Dialog/pyannote_3_1/benchmark_metadata.json). The run_folder names in the repo mirror the benchmark layout (e.g. diarizen, diar_streaming_sortformer_4spk-v2, speaker-diarization-precision-2).
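Given that layout, collecting the recorded model name for every run can be sketched like this (a minimal helper of my own, assuming only the `model_name` key mentioned above; the real metadata files contain more fields):

```python
import json
from pathlib import Path

def collect_model_names(results_root="results"):
    """Map (dataset, run_folder) to the model_name recorded in
    results/<Dataset>/<run_folder>/benchmark_metadata.json."""
    runs = {}
    for meta_path in Path(results_root).glob("*/*/benchmark_metadata.json"):
        with open(meta_path) as f:
            meta = json.load(f)
        dataset, run_folder = meta_path.parts[-3], meta_path.parts[-2]
        runs[(dataset, run_folder)] = meta.get("model_name")
    return runs
```

Calling `collect_model_names()` from the repo root yields a dict such as `{("ROG-Dialog", "pyannote_3_1"): "pyannote/speaker-diarization-3.1"}`.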
Inference backends: models/pyannote/, models/nemo/, models/diarizen/, and Revai via the pyannote container where applicable—see each module’s README.
Preferred: from the repository root, run scripts/run_evaluation_report.sh (--help lists options; reuses/builds the eval Docker image or falls back to uv; --dataset all --yes runs every dataset with defaults). Full detail (Docker/uv, trimmed gold, errata — manual file ROG-Dialog only in this repo, optional auto AUTO_DATASET_ERRATA.json, OK-only headline aggregates) is in docs/evaluation.md. Module-level notes: evaluation/README.md.
The Markdown report includes DER / Miss / FA / Conf, JER, boundary P/R/F1, purity and coverage, hardware RTF/VRAM, per-file deep dive, and (for generate_report_universal.py) category plots driven by dataset metadata.
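For reference, the headline metric combines the three error components over the total reference speech time. A minimal sketch (standard DER definition, not the evaluation code itself):

```python
def der(miss, fa, conf, total_ref_speech):
    """Diarization Error Rate: missed speech + false-alarm speech +
    speaker confusion (all in seconds) over total reference speech time."""
    return (miss + fa + conf) / total_ref_speech

# e.g. 30 s missed, 10 s false alarm, 5 s confusion over 600 s of reference
# speech gives DER = 45 / 600 = 0.075, i.e. 7.5%
```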
generate_report_universal.py also emits a machine-readable JSON file alongside the Markdown report (default name <report_stem>.machine.json, schema_version 1.0). See Machine-readable report JSON in docs/evaluation.md for the stable top-level keys and compatibility rules.
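A downstream consumer of that JSON might guard on the major schema version before trusting the top-level keys. This sketch assumes only the `schema_version` field named above; all other keys and the compatibility policy are documented in docs/evaluation.md:

```python
import json

def load_machine_report(path):
    """Load the machine-readable report and check the major schema
    version before relying on its top-level keys."""
    with open(path) as f:
        report = json.load(f)
    major = str(report.get("schema_version", "")).split(".")[0]
    if major != "1":
        raise ValueError(f"unsupported schema_version: {report.get('schema_version')}")
    return report
```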
Additional report controls:

- `--boundary_tolerance`: boundary P/R/F1 tolerance (seconds)
- `--analysis_collar`: collar for category/domain plots and tables (snapped to `COLLAR_SETTINGS`)
- `--no_auto_errata`: skip the merged auto errata beside `--gold` (reports); `--no-auto-errata` on `score.py`
- `--audio_dir`, `--json_output`, `--no_json`: universal report only; dataset technical probing and JSON output path (see docs/evaluation.md)
- `scripts/run_evaluation_report.sh`: `--dataset all`, `--batch`/`--non-interactive`, and `--rebuild` for forced Docker rebuilds
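To illustrate what a boundary tolerance does, here is a toy boundary P/R/F1 computation. This is my own greedy one-to-one matcher under the stated tolerance, not necessarily the matching strategy the evaluation stack uses:

```python
def boundary_prf(ref, hyp, tol=0.25):
    """Match each hypothesis boundary (seconds) to at most one reference
    boundary within +/- tol seconds; return (precision, recall, F1)."""
    unmatched = sorted(ref)
    tp = 0
    for b in sorted(hyp):
        hit = next((r for r in unmatched if abs(r - b) <= tol), None)
        if hit is not None:
            unmatched.remove(hit)  # each reference boundary matches once
            tp += 1
    p = tp / len(hyp) if hyp else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

With `ref=[0.0, 1.0, 2.0]` and `hyp=[0.1, 1.5, 2.05]` at the default 0.25 s tolerance, two of three boundaries match on each side, so P = R = F1 = 2/3.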
Optional uv setups isolate the silence-trimming dependencies (Parselmouth) and the evaluation stack from your system Python. Install uv first if it is missing (see Installing uv), then follow Python environment (uv): run `uv sync --group trim` at the repo root, export `DIABENCH_PYTHON="uv run --group trim python"` before `./prepare_data.sh …`, and run `cd evaluation && uv sync` for reports. Remove the envs when finished: `rm -rf .venv` (repo root) and `rm -rf evaluation/.venv`.
- Download the data:

  ```bash
  bash prepare_data.sh
  ```
- Run pyannote:
  - Build the docker image:

    ```bash
    cd models/pyannote
    docker build -t benchmark-pyannote .
    cd ../..
    ```

  - Run the first model:

    ```bash
    sudo docker run --rm \
      -v "$(pwd)/data/ROG-Dialog/audio:/data/audio" \
      -v "$(pwd)/results/ROG-Dialog/pyannote_3_1:/data/output" \
      -e HOST_UID=$(id -u) \
      -e HOST_GID=$(id -g) \
      -e HF_TOKEN="YOURTOKEN" \
      benchmark-pyannote \
      --input /data/audio \
      --output /data/output \
      --model pyannote/speaker-diarization-3.1
    ```

    This works, but GPU2 has no Docker and my laptop has no GPU. Consequently, processing takes ages, with an RTF of 2!
- Nemo models:
  - Build the docker image:

    ```bash
    cd models/nemo
    sudo docker build -t benchmark-nemo .
    cd ../..
    ```

  - Run the first model:

    ```bash
    sudo docker run --rm \
      -v "$(pwd)/data/ROG-Dialog/audio:/data/audio" \
      -v "$(pwd)/results/ROG-Dialog/diar_streaming_sortformer_4spk-v2:/data/output" \
      -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
      -e HOST_UID=$(id -u) \
      -e HOST_GID=$(id -g) \
      -e HF_TOKEN="YOURTOKEN" \
      benchmark-nemo \
      --input /data/audio \
      --output /data/output
    ```

    This runs faster, at an RTF of about 0.1.
- DiariZen models:

  Build and run as in models/diarizen/README.md (CPU-oriented image; mount `results/ROG-Dialog/diarizen` or `results/ROG-Dialog/diarizen_v2` for output). Example:

  ```bash
  cd models/diarizen
  docker build -t benchmark-diarizen .
  cd ../..
  sudo docker run --rm \
    -v "$(pwd)/data/ROG-Dialog/audio:/data/audio" \
    -v "$(pwd)/results/ROG-Dialog/diarizen_v2:/data/output" \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    -e HOST_UID=$(id -u) \
    -e HOST_GID=$(id -g) \
    -e HF_TOKEN="YOURTOKEN" \
    benchmark-diarizen \
    --input /data/audio \
    --output /data/output \
    --model BUT-FIT/diarizen-wavlm-large-s80-md-v2
  ```
- Run the eval:

  ```bash
  cd evaluation
  sudo docker build -t benchmark-eval .
  cd ..
  sudo docker run --rm \
    -v "$(pwd)/data/ROG-Dialog:/data/rog" \
    -v "$(pwd)/results/ROG-Dialog:/data/results" \
    -v "$(pwd)/reports:/data/reports" \
    -v "$(pwd)/evaluation/DATASET_ERRATA.json:/app/DATASET_ERRATA.json" \
    -e HOST_UID=$(id -u) -e HOST_GID=$(id -g) \
    benchmark-eval \
    --gold /data/rog/ref_rttm/default_gold_standard_trimmed.rttm \
    --results_dir /data/results \
    --metadata /data/rog/docs/ROG-Dia-meta-speeches.tsv \
    --errata /app/DATASET_ERRATA.json \
    --output /data/reports/ROG-Dia-Trim
  ```

Human-annotated segment boundaries in the gold RTTM often include leading/trailing silence (annotators clicking a bit too early or too late). The trim_gold_silences_rttm.py module uses Praat's pitch and intensity analysis (via Parselmouth) to detect the actual speech onset/offset and tighten those boundaries automatically. It can also split segments at long internal silences.
- Loads each audio file and analyses segments using pitch detection + intensity relative to segment peak
- Trims segment edges to where speech actually starts/ends, with a configurable guard margin
- Optionally splits segments at internal silences (e.g. ≥500ms gaps within a single annotation)
- Drops segments that become too short after trimming (configurable threshold)
- Writes results incrementally per file (crash-safe — no data lost if it fails mid-run)
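The edge-trimming step in the list above can be sketched on a per-frame intensity contour. This is a simplified toy version of the rule (intensity relative to segment peak, guard margin, minimum-duration drop); the real module additionally uses Parselmouth pitch analysis, and the parameter names here are illustrative:

```python
def trim_edges(intensity_db, frame_s=0.01, drop_db=15.0, guard_s=0.05, min_dur=0.2):
    """Given per-frame intensity (dB) for one segment, keep the span where
    intensity is within drop_db of the segment peak, widened by a guard
    margin; return (start_s, end_s) or None if the result is too short."""
    peak = max(intensity_db)
    loud = [i for i, v in enumerate(intensity_db) if v >= peak - drop_db]
    if not loud:
        return None
    start = max(0.0, loud[0] * frame_s - guard_s)
    end = min(len(intensity_db) * frame_s, (loud[-1] + 1) * frame_s + guard_s)
    if end - start < min_dur:
        return None  # segment became too short after trimming
    return start, end
```

For a 0.7 s segment whose first and last 0.1 s sit 30 dB below the peak, this returns boundaries pulled in to roughly 0.05 s and 0.65 s (the loud span plus the guard margin on each side).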
Trimming the gold standard consistently lowers DER across all models, driven almost entirely by reduced Miss rates — the original annotations include silence at segment edges that unfairly penalizes models for not predicting speech where there is none. FA increases slightly (smaller speech denominator), while Confusion stays stable since trimming doesn't affect speaker identity.
Example using the best-performing model (pyannote speaker-diarization-precision-2, collar=0.25):
| Metric | Original Gold | Trimmed Gold |
|---|---|---|
| DER | 20.25% | 9.52% |
| Miss | 17.40% | 5.78% |
| FA | 1.26% | 2.37% |
| Conf | 1.22% | 1.36% |
| Purity | 86.91% | 86.89% |
| Coverage | 89.32% | 89.09% |
The trimmed evaluation better reflects actual diarisation performance by removing measurement artifacts from imprecise annotation boundaries.
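The slight FA increase described above is a pure denominator effect, which a toy calculation makes concrete (hypothetical numbers, not taken from the table):

```python
# Suppose 600 s of gold speech originally, and trimming removes 100 s of
# edge silence. The model's false-alarm speech is a fixed 10 s either way.
fa_seconds = 10.0
fa_pct_original = fa_seconds / 600.0  # about 1.67%
fa_pct_trimmed = fa_seconds / 500.0   # 2.0%: same errors, smaller denominator
```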
```bash
python trim_gold_silences_rttm.py \
  --rttm data/ROG-Dialog/ref_rttm/gold_standard.rttm \
  --audio-dir data/ROG-Dialog/audio \
  --output data/ROG-Dialog/ref_rttm/gold_trimmed.rttm \
  --trim-silence-within \
  --verbose
```

Run `python trim_gold_silences_rttm.py --help` for all options (pitch range, intensity threshold, guard margin, max trim, etc.).
convert_trs_to_trim_rttm.py imports the trimming module and runs the full pipeline: parse TRS → merge segments → trim with audio → write RTTM + optional EXB files. All settings are configured at the top of the script:

```python
ENABLE_TRIMMING = True  # set False to skip audio analysis
GENERATE_EXB = True     # generate EXB files for visual inspection
TRIM_PARAMS = TrimParams(
    intensity_drop_db=15.0,  # dB below segment peak = "silence"
    trim_silence_within=True,
    min_silence_dur=0.5,
    verbose=False,
    # ... other params with sensible defaults
)
```

Run it with:

```bash
python convert_trs_to_trim_rttm.py
```

The output filename is automatic: `gold_standard_trimmed_{int}db.rttm` when trimming is on, `gold_standard.rttm` when off. A `_metadata.txt` file with full parameters and statistics is written alongside.
When GENERATE_EXB = True, the script produces EXB files with [Dia_gold_trim] tiers that can be opened in EXMARaLDA Partitur Editor alongside the original transcription tiers for visual verification of trim quality.
TRS filenames (e.g. ROG-Dia-GSO-P0005-std.trs) are stripped of -std/-pog suffixes to derive the file ID (ROG-Dia-GSO-P0005). This must match the corresponding .wav and .exb filenames.
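The suffix stripping is simple enough to sketch directly (an illustrative helper, not the script's own function):

```python
import re

def file_id_from_trs(filename):
    """Derive the file ID from a TRS filename by stripping a trailing
    -std or -pog suffix, e.g. 'ROG-Dia-GSO-P0005-std.trs' ->
    'ROG-Dia-GSO-P0005'. The result must match the .wav/.exb names."""
    stem = filename.rsplit(".", 1)[0]
    return re.sub(r"-(std|pog)$", "", stem)
```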