feat(eot): add audio models AGT-2919 #1719
🦋 Changeset detected
Latest commit: 03a6b3e
The changes in this PR will be included in the next version bump.
ed2e02f to 3e2fb33
add audio eot model and local inference support, deprecating silero and turn detector plugins
…frame

The AudioFrame emitted on START_OF_SPEECH / END_OF_SPEECH sliced off the prefix-padding samples but still reported `samplesPerChannel = speechBufferIndex`, so the frame's metadata claimed more samples than its data contained, and downstream consumers (STT, transcription) lost the pre-roll context the buffer machinery is designed to preserve.

Slice from 0 instead so that the data length matches `samplesPerChannel` and the prefix-padding pre-roll is delivered, matching the Python original.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
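The invariant this commit restores can be illustrated with a minimal sketch. `SimpleFrame` and `makeSpeechFrame` are hypothetical stand-ins (not the real `AudioFrame` API), and mono audio is assumed so that one sample equals one sample per channel:

```typescript
// Hypothetical stand-in for the real AudioFrame type; mono audio is assumed.
interface SimpleFrame {
  data: Int16Array;
  sampleRate: number;
  channels: number;
  samplesPerChannel: number;
}

// Builds the frame emitted on a speech boundary from the rolling speech buffer.
// `speechBufferIndex` is the number of valid samples written so far, including
// the prefix-padding pre-roll at the start of the buffer.
function makeSpeechFrame(
  speechBuffer: Int16Array,
  speechBufferIndex: number,
  sampleRate: number,
): SimpleFrame {
  // Slice from 0 (not from the end of the pre-roll) so the pre-roll samples are
  // delivered and data.length equals the reported samplesPerChannel.
  const data = speechBuffer.slice(0, speechBufferIndex);
  return { data, sampleRate, channels: 1, samplesPerChannel: speechBufferIndex };
}
```

With a 10-sample buffer and `speechBufferIndex = 6`, the returned frame carries 6 samples and reports `samplesPerChannel = 6`, so metadata and payload agree.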
… to version
Rename the unreleased `inference.AudioTurnDetector` to `inference.TurnDetector`
and replace its `model` constructor option with `version` (`'v1' | 'v1-mini'`).
The `version` is the constructor knob only; the `model` field/getter is kept and
now holds the full model name (`turn-detector-v1` / `turn-detector-v1-mini`),
which telemetry/billing read via `detector.model` (metric `modelName` →
`EOTModelUsage.model` → remote sessions) unchanged.
Mirrors the upstream Python rename. The private base peers are renamed to the
modality-agnostic streaming scheme: `BaseStreamingTurnDetector`,
`BaseStreamingTurnDetectorStream`, `StreamingTurnDetectionTransport`,
`BaseStreamingTurnDetectorCallbacks`, `BaseStreamingTurnDetectorOptions`
(resolving the public-opts `TurnDetectorOptions` collision). Adds
`TurnDetectorVersion`; keeps `TurnDetectorModel` with updated values.
Also folds in in-flight AGT-2520 EOU work: VAD slow-inference guard fix,
`turnDetection: null` opt-out preserved distinctly from `undefined`,
silero `VAD.load()` delegating to `inference.VAD({ model: 'silero' })` for
16 kHz, and a `LocalTransport` cleanup refactor.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
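A minimal sketch of the `version` to `model` relationship described in this commit. The type names and string values come from the commit message; `SketchTurnDetector` and `resolveTurnDetection` are hypothetical illustrations, and the assumption that `undefined` falls back to a default is not confirmed by the source:

```typescript
type TurnDetectorVersion = 'v1' | 'v1-mini';
type TurnDetectorModel = 'turn-detector-v1' | 'turn-detector-v1-mini';

// The short version is the constructor knob; the full model name is derived from it.
const MODEL_NAMES: Record<TurnDetectorVersion, TurnDetectorModel> = {
  'v1': 'turn-detector-v1',
  'v1-mini': 'turn-detector-v1-mini',
};

// Hypothetical stand-in for inference.TurnDetector.
class SketchTurnDetector {
  private readonly modelName: TurnDetectorModel;

  constructor(version: TurnDetectorVersion) {
    this.modelName = MODEL_NAMES[version];
  }

  // Telemetry and billing would keep reading the full model name from here.
  get model(): TurnDetectorModel {
    return this.modelName;
  }
}

// Keeps an explicit `null` (opt-out) distinct from `undefined` (option not set).
// Assumption: an unset option resolves to a caller-supplied default.
function resolveTurnDetection<T>(option: T | null | undefined, fallback: T): T | null {
  if (option === null) return null; // explicit opt-out is preserved
  if (option === undefined) return fallback; // unset: use the default
  return option;
}
```

Keeping the mapping in one `Record` means adding a version fails to compile until its model name is supplied.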
Add a copy of the turn detection model license, call it out in the root README alongside the Apache-2.0 license, and annotate it in REUSE.toml to keep the REUSE-3.2 lint green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
main dropped the flat `export *` re-exports (AgentSession, tool, logMetrics) in favor of namespace-only exports, and does not have the 1.5.0 Toolset API (Agent.create / array-style tools). Adapt basic_agent.ts to main's namespace conventions (new voice.Agent, object tools, voice.*/metrics.* prefixes) while preserving the multimodal-EOU session config. Regenerate pnpm-lock.yaml against the rebased package.json set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d3ee830 to 7e24939
prewarm: async (proc: JobProcess) => {
  proc.userData.vad = await silero.VAD.load();
},
Why did we remove these?
// to use realtime model, replace the stt, llm, tts and vad with the following
// llm: new openai.realtime.RealtimeModel(),
userData: userdata,
turnDetection: new livekit.turnDetector.EnglishModel(),

}

/**
 * Speaking-guard wrapper for the bounce-EOU task, mirroring Python's
Should we remove the comment phrasing that references "mirroring Python's", etc.?
// A different stream means a fresh request lifecycle: drop any held
// prediction future and re-arm so the adopting recognition starts its own
// request on the next VAD event.
Claude tends to add a lot of inline comments; it would be nice to clean them up and leave only those that are necessary.
…dal-EOU

# Conflicts:
#	agents/src/inference/utils.ts
#	agents/src/voice/agent_activity.ts
#	agents/src/voice/audio_recognition.ts
"LIVEKIT_INFERENCE_URL",
"LIVEKIT_OUTBOUND_TRUNK_ID",
"LIVEKIT_URL",
"LIVEKIT_WORKER_TOKEN",
🟡 Duplicate LIVEKIT_WORKER_TOKEN entry in turbo.json global env passthrough
The PR adds LIVEKIT_WORKER_TOKEN at line 45 (after LIVEKIT_URL), but the original file already has it at line 49 (after LIVEKIT_AGENT_NAME). This produces a duplicate entry in the globalPassThroughEnv array. While Turbo likely deduplicates or ignores redundant entries at runtime, the duplicate is unnecessary noise and potentially confusing.
"LIVEKIT_WORKER_TOKEN",
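A duplicate like this can be caught mechanically. The sketch below is a hypothetical helper, not part of Turbo or this repository, and the sample array is an abbreviated illustration of `globalPassThroughEnv` rather than the real file contents:

```typescript
// Returns each entry that appears more than once, in order of first repeat.
function findDuplicates(entries: readonly string[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const entry of entries) {
    if (seen.has(entry)) {
      duplicates.add(entry);
    } else {
      seen.add(entry);
    }
  }
  return Array.from(duplicates);
}

// Abbreviated, illustrative list reproducing the situation flagged above.
const globalPassThroughEnv = [
  'LIVEKIT_INFERENCE_URL',
  'LIVEKIT_OUTBOUND_TRUNK_ID',
  'LIVEKIT_URL',
  'LIVEKIT_WORKER_TOKEN',
  'LIVEKIT_AGENT_NAME',
  'LIVEKIT_WORKER_TOKEN',
];

const duplicateEntries = findDuplicates(globalPassThroughEnv);
```

Here `duplicateEntries` contains only `LIVEKIT_WORKER_TOKEN`; a check of this kind could run in a lint step if the team wanted to guard against regressions.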
add audio eot model and local inference support, deprecating silero and turn detector plugins

## Description
## Changes Made
Adds streaming audio end-of-turn detection. A single user-facing `AudioTurnDetector` selects between two backends:
- `turn-detector`
- `turn-detector-mini`

On a cloud transport error or `predict_end_of_turn` timeout, the session swaps to mini/local for the rest of the stream (sticky per session, one warning per failure mode).

Local failures emit the default `1.0` prediction and retry on the next turn.

A user-set `unlikely_threshold` is scaled multiplicatively against the cloud default so the operating point survives a fallback.

## Pre-Review Checklist
## Testing
- `restaurant_agent.ts` and `realtime_agent.ts` work properly (for major changes)

## Additional Notes
Python PR: livekit/agents#4722
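The fallback behavior described under Changes Made can be sketched roughly as follows. Every name here is hypothetical, and the two default thresholds are placeholder values (the PR does not state the real ones); only the `1.0` default prediction comes from the description:

```typescript
type Backend = 'cloud' | 'local';
type CloudFailure = 'transport_error' | 'timeout';

const CLOUD_DEFAULT_THRESHOLD = 0.5; // assumed placeholder, not from the PR
const LOCAL_DEFAULT_THRESHOLD = 0.25; // assumed placeholder, not from the PR
const DEFAULT_PREDICTION = 1.0; // stated in the PR description

// Scales a user threshold by the ratio of the two defaults, so a value set
// relative to the cloud default keeps its relative position on the local model.
function scaleThreshold(userThreshold: number, fromDefault: number, toDefault: number): number {
  const scaled = userThreshold * (toDefault / fromDefault);
  return Math.min(1, Math.max(0, scaled));
}

class FallbackState {
  private backend: Backend = 'cloud';
  private readonly warned = new Set<CloudFailure>();
  private readonly userThreshold: number | undefined;
  readonly warnings: string[] = [];

  constructor(userThreshold?: number) {
    this.userThreshold = userThreshold;
  }

  get current(): Backend {
    return this.backend;
  }

  get threshold(): number {
    if (this.backend === 'cloud') return this.userThreshold ?? CLOUD_DEFAULT_THRESHOLD;
    if (this.userThreshold === undefined) return LOCAL_DEFAULT_THRESHOLD;
    return scaleThreshold(this.userThreshold, CLOUD_DEFAULT_THRESHOLD, LOCAL_DEFAULT_THRESHOLD);
  }

  // Sticky: once a cloud failure is seen, the stream stays on the local backend.
  onCloudFailure(mode: CloudFailure): void {
    this.backend = 'local';
    if (!this.warned.has(mode)) {
      this.warned.add(mode); // one warning per failure mode
      this.warnings.push(`cloud turn detector failed (${mode}); using local model`);
    }
  }
}

// A local failure yields the default prediction; the next turn simply tries again.
function localPredictionOrDefault(run: () => number): number {
  try {
    return run();
  } catch {
    return DEFAULT_PREDICTION;
  }
}
```

Under these placeholder defaults, a user threshold of `0.6` becomes `0.3` after a fallback, which keeps it at 1.2 times the active backend's default in both cases.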
Note to reviewers: Please ensure the pre-review checklist is completed before starting your review.