-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
feat(realtime): Semantic VAD EOU token #10444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
richiejp
wants to merge
2
commits into
mudler:master
Choose a base branch
from
richiejp:feat/realtime-semantic-vad-eou
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+9,107
−595
Open
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| --- | ||
| name: 'realtime-conformance' | ||
|
|
||
| # Verifies the realtime state-machine implementations conform to their formal | ||
| # designs (docs/design/realtime-state-machines.md, formal-verification/). BOTH | ||
| # layers are enforced and the gate is fail-closed: the Go conformance layer | ||
| # (respcoord + turncoord transition/rapid tests under -race) AND the FizzBee model check of | ||
| # the authoritative specs. FizzBee is pinned + checksum-verified | ||
| # (formal-verification/fizzbee.sha256), so a failed install fails the job rather | ||
| # than silently skipping verification. | ||
|
|
||
| on: | ||
| pull_request: | ||
| paths: | ||
| - 'core/http/endpoints/openai/coordinator/**' | ||
| - 'core/http/endpoints/openai/respcoord/**' | ||
| - 'core/http/endpoints/openai/turncoord/**' | ||
| - 'core/http/endpoints/openai/conncoord/**' | ||
| - 'core/http/endpoints/openai/compactcoord/**' | ||
| - 'core/http/endpoints/openai/ttscoord/**' | ||
| - 'formal-verification/**' | ||
| - 'scripts/realtime-conformance.sh' | ||
| - 'scripts/install-fizzbee.sh' | ||
| - '.github/workflows/realtime-conformance.yml' | ||
| push: | ||
| branches: | ||
| - master | ||
| paths: | ||
| - 'core/http/endpoints/openai/coordinator/**' | ||
| - 'core/http/endpoints/openai/respcoord/**' | ||
| - 'core/http/endpoints/openai/turncoord/**' | ||
| - 'core/http/endpoints/openai/conncoord/**' | ||
| - 'core/http/endpoints/openai/compactcoord/**' | ||
| - 'core/http/endpoints/openai/ttscoord/**' | ||
| - 'formal-verification/**' | ||
| - 'scripts/realtime-conformance.sh' | ||
|
|
||
| concurrency: | ||
| group: realtime-conformance-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }} | ||
| cancel-in-progress: ${{ github.event_name == 'pull_request' }} | ||
|
|
||
| jobs: | ||
| conformance: | ||
| runs-on: ubuntu-latest | ||
| strategy: | ||
| matrix: | ||
| go-version: ['1.26.x'] | ||
| steps: | ||
| - name: Clone | ||
| uses: actions/checkout@v7 | ||
| - name: Setup Go ${{ matrix.go-version }} | ||
| uses: actions/setup-go@v5 | ||
| with: | ||
| go-version: ${{ matrix.go-version }} | ||
| cache: false | ||
| - name: Cache FizzBee | ||
| uses: actions/cache@v4 | ||
| with: | ||
| path: .tools/fizzbee | ||
| key: fizzbee-v0.5.2-${{ runner.os }}-${{ hashFiles('formal-verification/fizzbee.sha256') }} | ||
| - name: Install FizzBee (pinned, checksum-verified) | ||
| # No `|| true`: a failed/forged download must fail the job, not silently | ||
| # drop the design verification. install-fizzbee.sh is a no-op if the | ||
| # cached binary is already present and valid. | ||
| run: ./scripts/install-fizzbee.sh | ||
| - name: Run conformance gate (fail-closed) | ||
| # No skip env: both the Go conformance and the FizzBee model check are | ||
| # required. The gate auto-detects .tools/fizzbee/fizz. | ||
| run: make test-realtime-conformance |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| package main | ||
|
|
||
| // utteranceBoundary is the single definition of a small state machine that was | ||
| // previously open-coded three times — as a bare `finalEou` bool with an ad-hoc | ||
| // toggle — in the live feed (live.go), the file-stream text path, and the | ||
| // file-stream JSON path (goparakeetcpp.go). | ||
| // | ||
| // It answers one running question: does the decode currently rest on an | ||
| // end-of-utterance boundary? That is the value a closing FinalResult reports as | ||
| // .Eou and the realtime turn detector treats as a commit point. | ||
| // | ||
| // parakeet auto-resets its decoder after every <EOU>/<EOB>, so one streaming | ||
| // session is a sequence of utterances and this is a LATCH, not a monotonic | ||
| // flag: it closes on an <EOU> and reopens as soon as the next utterance starts. | ||
| // (Contrast the realtime API's per-turn `eouSeen`, which only ever goes | ||
| // false->true because each turn gets a fresh stream. Here the stream outlives | ||
| // the turn, so the boundary status must be able to reopen.) | ||
| // | ||
| // The only transitions, over the events one streamFeedResult carries — an | ||
| // <EOU>, an <EOB> (backchannel), or plain speech output (text and/or words): | ||
| // | ||
| // <EOU> | ||
| // open ───────────► closed | ||
| // ▲ ▲ │ │ │ | ||
| // │ └─┘ <EOB>|speech │ │ <EOU> | ||
| // │ (stay open) │ └─┘ (stay closed) | ||
| // └──────────────────┘ | ||
| // <EOB>|speech | ||
| // | ||
| // open = NOT on an utterance boundary: mid-utterance, the last boundary was | ||
| // a backchannel <EOB>, or the stream just began (the initial state). | ||
| // closed = the last meaningful event was an <EOU> with no later speech: a real | ||
| // turn boundary. | ||
| // | ||
| // A feed that carries nothing (no eou/eob/text/words — e.g. a finalize flush | ||
| // that produced no tail) is a no-op and leaves the state unchanged, matching | ||
| // the legacy "leave finalEou as it was" behaviour. | ||
| // | ||
| // The state carries no data, so it is modelled as a two-valued type (a named | ||
| // bool) rather than an int enum: every inhabitant is legal, so illegal states | ||
| // are unrepresentable — the payload-free analog of the sealed sum types the | ||
| // realtime machines use (those need interfaces because their states carry data, | ||
| // e.g. Active{ID}, where "Active with no ID" is the illegal combination a scalar | ||
| // cannot even express). | ||
| type utteranceBoundary bool | ||
|
|
||
| const ( | ||
| // boundaryOpen is the zero value (false), so a fresh decode starts open — | ||
| // exactly the legacy `var finalEou bool` (false) initial condition. | ||
| boundaryOpen utteranceBoundary = false | ||
| boundaryClosed utteranceBoundary = true | ||
| ) | ||
|
|
||
| // observe folds one decode increment into the latch and returns the new state. | ||
| // | ||
| // <EOU> takes priority when a single feed carries both an <EOU> and speech | ||
| // (e.g. {"text":"hello","eou":1}): the utterance both produced that text AND | ||
| // ended, so the decode rests on the boundary. This matches the legacy | ||
| // eou-checked-first ordering at every call site. | ||
| func (b utteranceBoundary) observe(r streamFeedResult) utteranceBoundary { | ||
| switch { | ||
| case r.Eou: | ||
| return boundaryClosed | ||
| case r.Eob || r.Delta != "" || len(r.Words) > 0: | ||
| return boundaryOpen | ||
| default: | ||
| return b | ||
| } | ||
| } | ||
|
|
||
| // ended reports whether the decode currently rests on an end-of-utterance | ||
| // boundary (a real <EOU>, not a backchannel <EOB>). This is what a closing | ||
| // FinalResult carries as .Eou. | ||
| func (b utteranceBoundary) ended() bool { return b == boundaryClosed } | ||
|
|
||
| func (b utteranceBoundary) String() string { | ||
| if b == boundaryClosed { | ||
| return "closed" | ||
| } | ||
| return "open" | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
| package main | ||
|
|
||
| import ( | ||
| "math/rand/v2" | ||
|
|
||
| . "github.com/onsi/ginkgo/v2" | ||
| . "github.com/onsi/gomega" | ||
| ) | ||
|
|
||
| var _ = Describe("utteranceBoundary (decode end-of-utterance latch)", func() { | ||
| It("starts open: a fresh decode is not on a boundary", func() { | ||
| var b utteranceBoundary | ||
| Expect(b).To(Equal(boundaryOpen)) | ||
| Expect(b.ended()).To(BeFalse()) | ||
| }) | ||
|
|
||
| DescribeTable("single feed transition from the open state", | ||
| func(r streamFeedResult, wantEnded bool) { | ||
| Expect(boundaryOpen.observe(r).ended()).To(Equal(wantEnded)) | ||
| }, | ||
| Entry("<EOU> closes it", streamFeedResult{Eou: true}, true), | ||
| Entry("<EOU> with text closes it (eou wins)", streamFeedResult{Delta: "hi", Eou: true}, true), | ||
| Entry("<EOB> stays open (backchannel is not a turn boundary)", streamFeedResult{Eob: true}, false), | ||
| Entry("plain text stays open", streamFeedResult{Delta: "hello"}, false), | ||
| Entry("words-only stays open", streamFeedResult{Words: []transcriptWord{{W: "x"}}}, false), | ||
| Entry("empty feed is a no-op (stays open)", streamFeedResult{}, false), | ||
| ) | ||
|
|
||
| DescribeTable("single feed transition from the closed state", | ||
| func(r streamFeedResult, wantEnded bool) { | ||
| Expect(boundaryClosed.observe(r).ended()).To(Equal(wantEnded)) | ||
| }, | ||
| Entry("another <EOU> stays closed", streamFeedResult{Eou: true}, true), | ||
| Entry("trailing speech reopens it", streamFeedResult{Delta: "and more"}, false), | ||
| Entry("words reopen it", streamFeedResult{Words: []transcriptWord{{W: "x"}}}, false), | ||
| Entry("a backchannel <EOB> reopens it", streamFeedResult{Eob: true}, false), | ||
| Entry("empty feed is a no-op (stays closed)", streamFeedResult{}, true), | ||
| ) | ||
|
|
||
| It("is a latch: <EOU> then trailing speech reopens, then <EOU> closes again", func() { | ||
| b := boundaryOpen | ||
| b = b.observe(streamFeedResult{Delta: "turn one", Eou: true}) | ||
| Expect(b.ended()).To(BeTrue()) | ||
| b = b.observe(streamFeedResult{Delta: " and more"}) | ||
| Expect(b.ended()).To(BeFalse(), "trailing speech without an EOU is an open utterance") | ||
| b = b.observe(streamFeedResult{Eou: true}) | ||
| Expect(b.ended()).To(BeTrue()) | ||
| }) | ||
|
|
||
| It("treats a backchannel before a real EOU correctly", func() { | ||
| b := boundaryOpen | ||
| b = b.observe(streamFeedResult{Delta: "uh huh", Eob: true}) | ||
| Expect(b.ended()).To(BeFalse(), "a backchannel must not masquerade as a turn boundary") | ||
| b = b.observe(streamFeedResult{Delta: "done", Eou: true}) | ||
| Expect(b.ended()).To(BeTrue()) | ||
| }) | ||
|
|
||
| It("matches the reference fold over seeded random feed sequences", func() { | ||
| // The invariant: after any sequence of feeds, ended() is true iff the | ||
| // last feed that carried ANY event was an <EOU>. <EOU> takes priority | ||
| // when a feed carries both an EOU and speech; empty feeds are ignored. | ||
| for seed := uint64(1); seed <= 200; seed++ { | ||
| rng := rand.New(rand.NewPCG(seed, seed*2654435761)) | ||
| b := boundaryOpen | ||
| lastWasEou := false // reference: did the last meaningful feed end on EOU? | ||
| steps := rng.IntN(30) | ||
| for i := 0; i < steps; i++ { | ||
| var r streamFeedResult | ||
| switch rng.IntN(5) { | ||
| case 0: | ||
| r = streamFeedResult{Eou: true} | ||
| case 1: | ||
| r = streamFeedResult{Eob: true} | ||
| case 2: | ||
| r = streamFeedResult{Delta: "w"} | ||
| case 3: | ||
| r = streamFeedResult{Delta: "w", Eou: true} // eou + speech, eou wins | ||
| case 4: | ||
| r = streamFeedResult{} // empty: no-op | ||
| } | ||
| b = b.observe(r) | ||
| if r.Eou { | ||
| lastWasEou = true | ||
| } else if r.Eob || r.Delta != "" || len(r.Words) > 0 { | ||
| lastWasEou = false | ||
| } | ||
| } | ||
| Expect(b.ended()).To(Equal(lastWasEou), | ||
| "seed %d: latch disagreed with the reference fold", seed) | ||
| } | ||
| }) | ||
| }) |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if doing tihs uplift, I think would make sense to then deprecate
rpc AudioTranscriptionStream(TranscriptRequest) returns (stream TranscriptStreamResponse) {}above, and since at it re-wire the backends to use AudioTranscriptionLive directly. Mainly to avoid code/logic dup split across where it could be unified in a single placeThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most transcription backends can't really stream input. The Stream RPC only really means streaming the output and on live both are streamed. So to only use the live RPC you have to buffer input on each backend that doesn't support input streaming. Which would also mean that backends that don't really support bidirectional streaming would implement the RPC instead of returning unsupported which IMO is bad for UX.
I was very confused by what was going on in this code though so have made some changes. Including adding explicit state machines that are formally verified...