Speaker-naming simulator + same-voice consolidation to cut speakers-to-name #1114

Merged
r3dbars merged 5 commits into main from claude/speaker-naming-simulator-guviy7
Jun 16, 2026

@r3dbars r3dbars commented Jun 13, 2026

What & why

Speaker naming feedback says the review sheet can ask the user to name 4-7 "people" after a 1-on-1 call. The likely failure is offline VBx over-segmenting one remote voice into several large clusters that survive small-cluster absorption.

This PR does two things:

  • adds a conservative same-voice consolidation pass in EmbeddingClusterer
  • adds a deterministic speaker-naming simulator that measures the user-facing review-row count, not just raw diarizer cluster count

Changes

Same-voice consolidation

  • EmbeddingClusterer.postProcess now runs consolidateSameVoiceClusters after small-cluster absorption and before DB-informed split.
  • Offline system and opt-in mic diarization paths both get the pass through their existing postProcess(... pairwiseMergeThreshold: nil) calls.
  • The pass agglomeratively merges clusters whose mean-embedding cosine similarity is strictly above the SpeakerNamingPolicy auto-accept bar (> 0.88) and recomputes centroids after each merge.
  • Added a boundary test so clusters at exactly 0.88 similarity stay separate, matching the naming policy edge.
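
A minimal Python sketch of the consolidation idea described above, not the Swift implementation in EmbeddingClusterer. It assumes unweighted mean embeddings and a "merge the most similar pair first" order. Every name in it (`consolidate_same_voice`, `centroid`, `unit`, `AUTO_ACCEPT_BAR`) is illustrative and does not come from the repository.

```python
import math

AUTO_ACCEPT_BAR = 0.88  # value quoted in this PR; the constant name is illustrative


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(embeddings):
    """Unweighted mean embedding; duration weighting is not modeled here."""
    return [sum(col) / len(embeddings) for col in zip(*embeddings)]


def consolidate_same_voice(clusters, threshold=AUTO_ACCEPT_BAR):
    """Greedily merge the most similar pair while it is strictly above threshold."""
    clusters = [list(c) for c in clusters]  # copy so the caller's data is untouched
    while len(clusters) > 1:
        centroids = [centroid(c) for c in clusters]  # recomputed after every merge
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(centroids[i], centroids[j])
                if sim > threshold and (best is None or sim > best[0]):
                    best = (sim, i, j)
        if best is None:
            break  # no remaining pair clears the bar
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is still valid
    return clusters


def unit(degrees):
    """2-D unit vector at the given angle, used only to build toy embeddings."""
    t = math.radians(degrees)
    return [math.cos(t), math.sin(t)]


toy = [[unit(0)], [unit(10)], [unit(80)]]
print(len(consolidate_same_voice(toy)))  # prints 2: 0 and 10 degrees merge, 80 stays apart
```

Because the comparison is `sim > threshold` rather than `>=`, a pair sitting exactly on the bar stays separate, which is the edge the boundary test above pins down.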

Speaker-naming simulator

scripts/ops/speaker-naming-simulator.py now reports:

  • review_before / review_after
  • expected review rows
  • expected labels
  • remote vs local role
  • false-merge flags

Coverage now includes:

  • cold unknown 1:1 over-segmentation: 5 review rows -> 1
  • named repeat speaker persistence: saved name auto-applies, 0 review rows
  • tentative known speaker: 1 confirmation row, not duplicate rows
  • remote small/crowded groups
  • near-threshold similar distinct voices
  • local mic default-off behavior: You, 0 review rows
  • opt-in local split behavior: room speakers become local review rows
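
The cold 1:1 case and the threshold sweep can be approximated with a toy model like the one below. This is a hedged sketch and not the contents of scripts/ops/speaker-naming-simulator.py. It treats every surviving cluster as one review row, ignores saved-name auto-apply and local/remote roles, and uses made-up 2-D embeddings. `merged_count`, `unit`, and both scenario names are hypothetical.

```python
import math


def unit(degrees):
    """2-D unit vector at the given angle (toy stand-in for a voice embedding)."""
    t = math.radians(degrees)
    return [math.cos(t), math.sin(t)]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def merged_count(embeddings, threshold):
    """Clusters left after greedy merging of pairs strictly above threshold."""
    clusters = [[e] for e in embeddings]
    while True:
        cents = [[sum(col) / len(m) for col in zip(*m)] for m in clusters]
        pairs = [(cosine(cents[i], cents[j]), i, j)
                 for i in range(len(cents)) for j in range(i + 1, len(cents))]
        best = max(pairs, default=None)
        if best is None or best[0] <= threshold:
            return len(clusters)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)


# One remote voice that the diarizer split into five slightly drifted clusters.
cold_1on1 = [unit(d) for d in (0, 4, 8, 12, 16)]
# The same call plus a distinct voice that is similar but not the same person.
with_similar = cold_1on1 + [unit(45)]

print("review_before:", len(cold_1on1))
print("review_after:", merged_count(cold_1on1, 0.88))
for threshold in (0.70, 0.88, 0.999):
    print(threshold, merged_count(with_similar, threshold))
```

In this toy data the sweep shows the tradeoff in both directions: a bar that is too low folds the distinct voice into the remote speaker, and a bar that is too high leaves every fragment as its own row.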

Docs

  • docs/qa-test-bench.md documents the simulator as review-row/label/false-merge coverage.

Local verification

Passed on macOS in this worktree:

python3 -m py_compile scripts/ops/speaker-naming-simulator.py
scripts/ops/speaker-naming-simulator.py
scripts/ops/speaker-naming-simulator.py --sweep
git diff --check
bash scripts/dev/agent-preflight.sh
bash -n scripts/ops/transcripted-qa-bench.sh
bash -n scripts/ops/run-local-summary-fixture.sh
python3 -m py_compile scripts/ops/validate-meeting-corpus.py
python3 -m py_compile scripts/ops/compare-meeting-corpus.py
bash build-deps.sh --force
bash build.sh --no-open
bash run-tests.sh
bash run-integration-smoke.sh
swift test
bash scripts/ops/run-local-summary-fixture.sh
swift test --package-path Tools/TranscriptedQA
bash scripts/ops/transcripted-qa-bench.sh --mode quick

Notable counts:

  • run-tests.sh: 5255 passed, 0 failed
  • swift test: 474 passed, 1 skipped, 0 failed
  • Tools/TranscriptedQA: 34 passed, 0 failed
  • quick QA bench: PASS, 6/6 checks

Review status / blockers

Keep this PR as a draft for now.

  • codex review --base origin/main was attempted but hit the Codex account usage limit before producing findings.
  • Claude Code second-opinion review was attempted through claude -p but hung without output and was interrupted.
  • CI should re-run on the pushed head 80ca3e7f.
  • No live audio/manual corpus proof has been run; this simulator is deterministic model coverage, not real speaker-ID accuracy proof on Justin's recordings.

Merge recommendation

Hold as draft until an independent review completes cleanly and, ideally, at least one real or corpus meeting verifies that the reduced review-row count does not hide a true distinct speaker.

@r3dbars r3dbars mentioned this pull request Jun 14, 2026
12 tasks
@r3dbars r3dbars force-pushed the claude/speaker-naming-simulator-guviy7 branch from 80ca3e7 to d858f9c June 15, 2026 13:19
r3dbars commented Jun 15, 2026

Rebased onto main + cleared the merge-state CI failure

This branch was stale (forked at 0c72c29c, before SpeakerNamingSimulationRunner landed on main). On the merged tree CI was red on a single test:
SpeakerNamingSimulationRunnerTests.testSimulationReportFlagsConfusionFalseMergeAndFalseSplit (XCTAssertFalse at :107 + XCTAssertTrue at :110).

Root cause (verified locally, not just assumed)

The negative-control suite built its false split from two clusters with identical alex embeddings. The new same-voice consolidation pass (cosine > 0.88) correctly merges identical voices, so the two clusters collapsed into one review row and the false split disappeared — falseSplitIndicators came back empty. Reproduced on the rebased tree; the report showed False split indicators: none while false-merge/confusion detection still fired. The detector was never broken — consolidation (the feature) legitimately removed the split the test depended on.

Fixes

  1. Rebased the 2 PR commits onto current main (clean, no conflicts; main had not touched any PR file since the fork point).
  2. Negative control now models a drifted same-voice over-segmentation (near(alex, degrees: 35) ≈ 0.82 cosine, below the 0.88 bar) instead of an identical voice. Consolidation correctly leaves it split, the user mislabels it, and the runner's unchanged detector flags a genuine false split. No assertion was weakened — the scenario now exercises the residual false-split risk that consolidation cannot fix.
  3. Threshold drift guard: named both thresholds (SpeakerNamingPolicy.autoAcceptSimilarityThreshold and EmbeddingClusterer.sameVoiceConsolidationThreshold), wired the consolidation defaults to the constant, and added EmbeddingClustererTests.testConsolidationThresholdMatchesAutoAcceptBar asserting they stay equal. Pure refactor — both remain 0.88.
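
To see why a 35 degree drift lands below the bar, here is a small Python sketch of a `near`-style helper. The test suite's `near(alex, degrees: 35)` is Swift and its implementation is not shown in this PR, so the construction below (tilt the base vector toward an arbitrary helper direction by a fixed angle) is an assumption, as are the `helper` parameter and the made-up `alex` embedding.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def near(base, degrees, helper):
    """Unit vector at `degrees` from `base`, tilted toward `helper`.

    `helper` must not be parallel to `base`.
    """
    b = normalize(base)
    along = sum(h * x for h, x in zip(helper, b))
    # Gram-Schmidt step: keep only the part of helper orthogonal to base.
    ortho = normalize([h - along * x for h, x in zip(helper, b)])
    t = math.radians(degrees)
    return [math.cos(t) * x + math.sin(t) * y for x, y in zip(b, ortho)]


alex = [0.2, 0.9, 0.1, 0.4]  # made-up embedding, not from the test suite
drifted = near(alex, 35, helper=[1.0, 0.0, 0.0, 0.0])

print(round(cosine(alex, drifted), 4))          # 0.8192, below the 0.88 bar
print(round(math.degrees(math.acos(0.88)), 1))  # 28.4, the angle where cosine is 0.88
```

Under this construction the cosine depends only on the angle, so any drift wider than roughly 28.4 degrees is left unmerged by a 0.88 bar.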

Tests (local swift test, the same path CI runs)

  • Full package: 486 executed, 1 skipped, 0 failures (was 1 failure pre-fix).
  • Speaker/clusterer/pipeline suites: 128 passed, 0 failed — incl. EmbeddingClustererTests (12, with the new guard) and SpeakerNamingSimulationRunnerTests (6, with the previously-failing test now green for the right reason).

Still manual / unproven (unchanged by this PR)

No real-audio or corpus validation was run, so speaker attribution on real meetings is still checked manually and its accuracy is unknown. The honest claim is guardrails against over-merging, not solved speaker attribution: aggressive same-voice consolidation can still over-merge genuinely distinct speakers (e.g. a 6-person call collapsing). Recommend a real-corpus pass before relying on the review-row reduction in production.

r3dbars commented Jun 15, 2026

Codex review follow-up: known-profile consolidation guard

Pushed f77aa172 to keep same-voice consolidation from merging two plausible different known speakers before review. The fix adds a lower known-profile conflict guard (aligned to the 0.70 DB match floor) plus regressions for both exact known-profile conflicts and looser profile matches below the 0.88 consolidation bar.
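
One way to picture the guard, as a hedged Python sketch rather than the code pushed in f77aa172: two clusters that clear the 0.88 bar are still kept apart when each one's closest saved profile (at or above the 0.70 floor) is a different person. Whether the real guard uses the best match, any match, or an inclusive floor is not stated here, so those choices are assumptions. `may_consolidate`, `best_known_profile`, and the profile names are hypothetical.

```python
import math

CONSOLIDATION_BAR = 0.88    # same-voice consolidation threshold quoted in this PR
PROFILE_MATCH_FLOOR = 0.70  # DB match floor quoted in this PR


def unit(degrees):
    """2-D unit vector at the given angle (toy stand-in for a voice embedding)."""
    t = math.radians(degrees)
    return [math.cos(t), math.sin(t)]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def best_known_profile(centroid, profiles, floor=PROFILE_MATCH_FLOOR):
    """Name of the closest saved profile at or above the floor, else None."""
    best_name, best_sim = None, floor
    for name, embedding in profiles.items():
        sim = cosine(centroid, embedding)
        if sim >= best_sim:
            best_name, best_sim = name, sim
    return best_name


def may_consolidate(a, b, profiles):
    """True when two cluster centroids may merge without a known-profile conflict."""
    if cosine(a, b) <= CONSOLIDATION_BAR:
        return False  # not similar enough to be treated as the same voice
    name_a = best_known_profile(a, profiles)
    name_b = best_known_profile(b, profiles)
    conflict = name_a is not None and name_b is not None and name_a != name_b
    return not conflict


a, b = unit(0), unit(20)  # cosine about 0.94, above the consolidation bar
tight = {"alex": unit(-5), "blake": unit(25)}   # near-exact profile matches
loose = {"alex": unit(-35), "blake": unit(55)}  # matches near 0.82, below 0.88

print(may_consolidate(a, b, {}))     # True: no saved profiles, nothing to conflict
print(may_consolidate(a, b, tight))  # False: each cluster maps to a different person
print(may_consolidate(a, b, loose))  # False: looser matches still block the merge
```

The `loose` case is the reason the guard sits at the lower floor: a profile match around 0.82 would never clear 0.88, but it is still strong enough evidence that the two clusters may belong to different known people.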

Local proof on the final patch:

  • bash build-deps.sh --force
  • bash build.sh --no-open
  • bash run-tests.sh (5350/5350)
  • bash run-integration-smoke.sh
  • swift test (488 executed, 1 skipped)
  • swift test --filter EmbeddingClustererTests (14/14)
  • swift test --filter SpeakerNamingSimulationRunnerTests (6/6)
  • scripts/ops/speaker-naming-simulator.py + --sweep
  • bash scripts/ops/transcripted-qa-bench.sh --mode quick (6/6)
  • codex-review --mode branch: clean, no accepted/actionable findings

Hold reason: PR is still draft; GitHub repo-hygiene passed after push, but build-and-test was still pending when I stopped watching. Real-audio/corpus validation is still not proven, so keep this YELLOW until CI is green and a real meeting corpus/manual audio pass confirms attribution behavior.

claude and others added 5 commits June 16, 2026 05:54
Reduce how many speakers a meeting asks you to name. Offline VBx clustering
often splits one remote voice into several clusters that each exceed the 30s
small-cluster floor, so a one-on-one call surfaces 4-7 "speakers" for a single
person in the post-meeting naming sheet.

- EmbeddingClusterer gains a same-voice consolidation pass that agglomeratively
  merges clusters whose mean embeddings clear the SpeakerNamingPolicy
  auto-accept bar (0.88), recomputing centroids after each merge so distinct
  speakers in a crowded meeting do not chain-collapse. Runs for both the system
  and mic offline paths via postProcess; opt out with consolidationThreshold: nil.
- New scripts/ops/speaker-naming-simulator.py models the naming count without
  audio or ML models: synthetic over-segmented scenarios run through a faithful
  port of the post-processing, reporting names_before vs names_after and a
  threshold sweep that shows the merge tradeoff. Exits non-zero on regression.
- EmbeddingClustererTests cover consolidation, distinct-speaker preservation,
  chain-collapse resistance, and the end-to-end one-on-one case.
- Document the simulator in docs/qa-test-bench.md.

https://claude.ai/code/session_01NSNvZsNWRaqU1k27DTAXNZ
…plit post-consolidation

After rebasing onto main, the negative-control suite built its false split from
two clusters with identical "alex" embeddings. The new same-voice consolidation
pass (cosine > 0.88) correctly merges them, so only one review row survived and
the false split vanished — testSimulationReportFlagsConfusionFalseMergeAndFalseSplit
failed because falseSplitIndicators was empty.

Model the second "alex" cluster as a drifted same-voice over-segmentation
(~0.82 cosine, below the 0.88 consolidation bar) instead. Consolidation
legitimately leaves it split, the user mislabels it, and the runner's
(unchanged) detector flags a real false split. No assertion was weakened; the
scenario now exercises the residual false-split risk consolidation cannot fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The consolidation pass merges clusters above 0.88 — described as "the
SpeakerNamingPolicy auto-accept bar" — but nothing tied the two values, so a
future change to one could silently diverge from the other.

Name both thresholds (SpeakerNamingPolicy.autoAcceptSimilarityThreshold and
EmbeddingClusterer.sameVoiceConsolidationThreshold), wire the consolidation
defaults to the named constant, and add a regression test asserting the two
stay equal. Pure refactor: both values remain 0.88, behavior is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
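
The drift guard from the commit above amounts to two named constants plus an equality test. A Python analogue is sketched below for illustration only. The real constants and test live in Swift, and the class layout, the snake-case names, and the placeholder `post_process` here are assumptions.

```python
import unittest


class SpeakerNamingPolicy:
    AUTO_ACCEPT_SIMILARITY_THRESHOLD = 0.88


class EmbeddingClusterer:
    SAME_VOICE_CONSOLIDATION_THRESHOLD = 0.88

    @staticmethod
    def post_process(clusters,
                     consolidation_threshold=SAME_VOICE_CONSOLIDATION_THRESHOLD):
        """Placeholder that returns the threshold it would use.

        Passing None stands in for opting out of the consolidation pass.
        """
        return consolidation_threshold


class ThresholdDriftGuardTests(unittest.TestCase):
    def test_consolidation_threshold_matches_auto_accept_bar(self):
        # Fails as soon as one constant changes without the other.
        self.assertEqual(
            EmbeddingClusterer.SAME_VOICE_CONSOLIDATION_THRESHOLD,
            SpeakerNamingPolicy.AUTO_ACCEPT_SIMILARITY_THRESHOLD,
        )

    def test_default_is_wired_to_named_constant(self):
        self.assertEqual(
            EmbeddingClusterer.post_process([]),
            EmbeddingClusterer.SAME_VOICE_CONSOLIDATION_THRESHOLD,
        )
```

Saved to a file, the tests run with `python -m unittest <file>`. Keeping two literals and asserting equality, instead of deriving one from the other, means an intentional future split has to delete the test explicitly.
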
@r3dbars r3dbars force-pushed the claude/speaker-naming-simulator-guviy7 branch from f77aa17 to 7768b03 June 16, 2026 11:01
@r3dbars r3dbars marked this pull request as ready for review June 16, 2026 11:01
@r3dbars r3dbars merged commit 8d75259 into main Jun 16, 2026
3 checks passed
@r3dbars r3dbars deleted the claude/speaker-naming-simulator-guviy7 branch June 16, 2026 14:05