Speaker-naming simulator + same-voice consolidation to cut speakers-to-name #1114
80ca3e7 to d858f9c
Rebased onto …
Codex review follow-up: known-profile consolidation guard
Pushed
Local proof on the final patch: …
Hold reason: PR is still draft; GitHub …
Reduce how many speakers a meeting asks you to name.

Offline VBx clustering often splits one remote voice into several clusters that each exceed the 30s small-cluster floor, so a one-on-one call surfaces 4-7 "speakers" for a single person in the post-meeting naming sheet.

- EmbeddingClusterer gains a same-voice consolidation pass that agglomeratively merges clusters whose mean embeddings clear the SpeakerNamingPolicy auto-accept bar (0.88), recomputing centroids after each merge so distinct speakers in a crowded meeting do not chain-collapse. Runs for both the system and mic offline paths via postProcess; opt out with consolidationThreshold: nil.
- New scripts/ops/speaker-naming-simulator.py models the naming count without audio or ML models: synthetic over-segmented scenarios run through a faithful port of the post-processing, reporting names_before vs names_after and a threshold sweep that shows the merge tradeoff. Exits non-zero on regression.
- EmbeddingClustererTests cover consolidation, distinct-speaker preservation, chain-collapse resistance, and the end-to-end one-on-one case.
- Document the simulator in docs/qa-test-bench.md.

https://claude.ai/code/session_01NSNvZsNWRaqU1k27DTAXNZ
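The consolidation pass described in this commit can be sketched in a few lines of Python. This is an illustrative sketch only, not the Swift implementation or the simulator's port. The names `cosine`, `mean_vector`, and `consolidate` are hypothetical. Merging the most similar pair first and taking the centroid as the mean of all member embeddings are assumptions about details the commit message does not spell out.

```python
import math

# 0.88 is the auto-accept bar quoted in the commit message.
CONSOLIDATION_THRESHOLD = 0.88


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def consolidate(clusters, threshold=CONSOLIDATION_THRESHOLD):
    """Merge clusters whose centroids exceed `threshold`, one pair at a time.

    `clusters` is a list of clusters; each cluster is a list of embeddings.
    Centroids are recomputed after every merge, so a merged cluster must
    itself clear the bar before it can absorb another one.
    Passing threshold=None disables the pass (assumed opt-out behaviour).
    """
    groups = [list(cluster) for cluster in clusters]
    if threshold is None:
        return groups
    while len(groups) > 1:
        centroids = [mean_vector(group) for group in groups]
        best = None  # (similarity, i, j) of the most similar qualifying pair
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                similarity = cosine(centroids[i], centroids[j])
                if similarity > threshold and (best is None or similarity > best[0]):
                    best = (similarity, i, j)
        if best is None:
            break  # no remaining pair clears the bar
        _, i, j = best
        groups[i].extend(groups[j])
        del groups[j]
    return groups


# Three single-embedding clusters at 0, 25 and 50 degrees: neighbours are
# similar (cos 25 deg is about 0.906) but the two ends are not (cos 50 deg
# is about 0.643).
chain = [[[math.cos(math.radians(d)), math.sin(math.radians(d))]] for d in (0, 25, 50)]
print(len(consolidate(chain)))
```

In the chain example, one neighbouring pair merges, the merged centroid moves between its two members, and the remaining cluster no longer clears the bar (about 0.79), so two groups remain instead of one. That is the chain-collapse resistance the commit message refers to, under the assumptions stated above.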
…plit post-consolidation

After rebasing onto main, the negative-control suite built its false split from two clusters with identical "alex" embeddings. The new same-voice consolidation pass (cosine > 0.88) correctly merges them, so only one review row survived and the false split vanished; testSimulationReportFlagsConfusionFalseMergeAndFalseSplit failed because falseSplitIndicators was empty.

Model the second "alex" cluster as a drifted same-voice over-segmentation (~0.82 cosine, below the 0.88 consolidation bar) instead. Consolidation legitimately leaves it split, the user mislabels it, and the runner's (unchanged) detector flags a real false split. No assertion was weakened; the scenario now exercises the residual false-split risk consolidation cannot fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
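One way to build a "drifted" embedding with a chosen cosine against a base embedding is to mix the base with an orthogonal direction. The sketch below is a hypothetical illustration, not the test fixture from this commit. The helper names `drifted` and `would_consolidate` and the 2-D vectors are assumptions made for brevity.

```python
import math

# 0.88 is the consolidation bar quoted in the commit message.
CONSOLIDATION_BAR = 0.88


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def drifted(base, orthogonal, target_cosine):
    """Return a unit vector whose cosine against `base` equals `target_cosine`.

    Assumes `base` and `orthogonal` are unit vectors at right angles.
    """
    residual = math.sqrt(1.0 - target_cosine ** 2)
    return [target_cosine * b + residual * o for b, o in zip(base, orthogonal)]


def would_consolidate(a, b, bar=CONSOLIDATION_BAR):
    """True when two centroids clear the (strict) consolidation bar."""
    return cosine(a, b) > bar


alex = [1.0, 0.0]                             # hypothetical base embedding
alex_drifted = drifted(alex, [0.0, 1.0], 0.82)

print(would_consolidate(alex, alex))          # identical embeddings clear the bar
print(would_consolidate(alex, alex_drifted))  # 0.82 stays below the bar
```

With identical embeddings the pair merges and the false split disappears. At 0.82 the pair stays split, which is what the revised scenario relies on.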
The consolidation pass merges clusters above 0.88 (described as "the SpeakerNamingPolicy auto-accept bar"), but nothing tied the two values, so a future change to one could silently diverge from the other.

Name both thresholds (SpeakerNamingPolicy.autoAcceptSimilarityThreshold and EmbeddingClusterer.sameVoiceConsolidationThreshold), wire the consolidation defaults to the named constant, and add a regression test asserting the two stay equal. Pure refactor: both values remain 0.88, behavior is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
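The pattern of deriving one threshold from the other and guarding it with a regression test might look like the following in Python. This is a hypothetical mirror of the idea, not the Swift code in this PR: the class names follow the commit message, while the upper-case attribute names and the test function are illustrative assumptions.

```python
class SpeakerNamingPolicy:
    # Single source of truth for the auto-accept bar (0.88 per the commit message).
    AUTO_ACCEPT_SIMILARITY_THRESHOLD = 0.88


class EmbeddingClusterer:
    # Derived from the policy constant rather than repeating the literal,
    # so a change to the policy value carries over automatically.
    SAME_VOICE_CONSOLIDATION_THRESHOLD = SpeakerNamingPolicy.AUTO_ACCEPT_SIMILARITY_THRESHOLD


def test_thresholds_stay_equal():
    """Regression guard: fails if someone hard-codes a different value later."""
    assert (
        EmbeddingClusterer.SAME_VOICE_CONSOLIDATION_THRESHOLD
        == SpeakerNamingPolicy.AUTO_ACCEPT_SIMILARITY_THRESHOLD
    )


test_thresholds_stay_equal()
```

The test looks redundant while one constant is defined in terms of the other. Its value shows up if a later change replaces the derived constant with a literal.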
f77aa17 to 7768b03
What & why
Speaker naming feedback says the review sheet can ask the user to name 4-7 "people" after a 1-on-1 call. The likely failure is offline VBx over-segmenting one remote voice into several large clusters that survive small-cluster absorption.
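This PR does two things: it adds a same-voice consolidation pass to EmbeddingClusterer, and it adds a speaker-naming simulator that models how many speakers the user is asked to name.
Changes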
Same-voice consolidation
- EmbeddingClusterer.postProcess now runs consolidateSameVoiceClusters after small-cluster absorption and before DB-informed split.
- … postProcess(... pairwiseMergeThreshold: nil) calls.
- … SpeakerNamingPolicy auto-accept bar (> 0.88) and recomputes centroids after each merge.
- … 0.88 similarity stay separate, matching the naming policy edge.
Speaker-naming simulator
scripts/ops/speaker-naming-simulator.py now reports:
- review_before / review_after …
Coverage now includes:
- … You, 0 review rows
Docs
docs/qa-test-bench.md documents the simulator as review-row/label/false-merge coverage.
Local verification
Passed on macOS in this worktree:
Notable counts:
- run-tests.sh: 5255 passed, 0 failed
- swift test: 474 passed, 1 skipped, 0 failed
- Tools/TranscriptedQA: 34 passed, 0 failed
Review status / blockers
Keep this PR draft for now.
- codex review --base origin/main was attempted but hit the Codex account usage limit before producing findings.
- … claude -p but hung without output and was interrupted.
- … 80ca3e7f.
Merge recommendation
Hold as draft until an independent review completes cleanly and, ideally, at least one real or corpus meeting verifies that the reduced review-row count does not hide a true distinct speaker.