feat(eot): add audio models AGT-2919#1719

Open
chenghao-mou wants to merge 28 commits into main from feat/AGT-2520-multimodal-EOU

@chenghao-mou chenghao-mou commented Jun 5, 2026

Add audio EOT model and local inference support, deprecating the silero and turn detector plugins.

Description

Changes Made

Adds streaming audio end-of-turn (EOT) detection through a single user-facing AudioTurnDetector that selects between two backends:

  • turn-detector
  • turn-detector-mini

On a cloud transport error or a predict_end_of_turn timeout, the session switches to the mini/local backend for the rest of the stream. The switch is sticky per session, and each failure mode logs one warning.
Local failures emit the default 1.0 prediction and retry on the next turn.
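
The fallback rules above can be sketched as a small state holder. This is a minimal illustration, not the PR's implementation: `FallbackState`, `FailureMode`, `predictLocal`, and `DEFAULT_PREDICTION` are hypothetical names, and only the behaviour (sticky switch, one warning per failure mode, default 1.0 on local failure) comes from the description.

```typescript
// Hypothetical names throughout; a sketch of the described behaviour only.
type Backend = 'cloud' | 'local';
type FailureMode = 'transport_error' | 'predict_timeout';

// Default prediction emitted when local inference fails (1.0 per the description).
const DEFAULT_PREDICTION = 1.0;

class FallbackState {
  private backend: Backend = 'cloud';
  private readonly warned = new Set<FailureMode>();

  get active(): Backend {
    return this.backend;
  }

  /** Records a cloud failure. Returns true when a warning should be logged. */
  onCloudFailure(mode: FailureMode): boolean {
    // Sticky: once switched, the session stays on the local backend.
    this.backend = 'local';
    if (this.warned.has(mode)) return false;
    this.warned.add(mode);
    return true;
  }
}

/** Runs local inference; on error, emits the default and lets the next turn retry. */
function predictLocal(infer: () => number): number {
  try {
    return infer();
  } catch {
    return DEFAULT_PREDICTION;
  }
}
```

Keeping the warned set keyed by failure mode means a second timeout stays silent while a first transport error still logs.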

A user-set unlikely_threshold is scaled multiplicatively against the cloud default so the operating point survives a fallback.
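
One way to read the multiplicative scaling: the user's threshold is expressed as a ratio of the cloud default, and that ratio is applied to the default of whichever backend is active. The sketch below assumes this interpretation. The default values and the `scaleThreshold` helper are invented for illustration, and the clamp to [0, 1] is an added assumption.

```typescript
// Assumed defaults for illustration only; the real values are not stated in this PR.
const CLOUD_DEFAULT_THRESHOLD = 0.5;
const LOCAL_DEFAULT_THRESHOLD = 0.25;

/**
 * Scales a user-set threshold (given relative to the cloud default) onto the
 * default of the backend that is actually running.
 */
function scaleThreshold(
  userThreshold: number | undefined,
  targetDefault: number,
  cloudDefault: number = CLOUD_DEFAULT_THRESHOLD,
): number {
  // No user override: use the target backend's own default.
  if (userThreshold === undefined) return targetDefault;
  const ratio = userThreshold / cloudDefault;
  // Clamp so the result remains a valid probability threshold (assumption).
  return Math.min(1, Math.max(0, ratio * targetDefault));
}

// A user who set 0.4 against a cloud default of 0.5 asked for 80% of the default.
const localThreshold = scaleThreshold(0.4, LOCAL_DEFAULT_THRESHOLD);
console.log(localThreshold);
```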

Pre-Review Checklist

  • Build passes: All builds (lint, typecheck, tests) pass locally
  • AI-generated code reviewed: Removed unnecessary comments and ensured code quality
  • Changes explained: All changes are properly documented and justified above
  • Scope appropriate: All changes relate to the PR title, or explanations provided for why they're included
  • Video demo: A small video demo showing that the changes work as expected and do not break any existing functionality, using Agent Playground (if applicable)

Testing

  • Automated tests added/updated (if applicable)
  • All tests pass
  • Make sure both restaurant_agent.ts and realtime_agent.ts work properly (for major changes)

Additional Notes

Python PR: livekit/agents#4722


Note to reviewers: Please ensure the pre-review checklist is completed before starting your review.

changeset-bot Bot commented Jun 5, 2026

🦋 Changeset detected

Latest commit: 03a6b3e

The changes in this PR will be included in the next version bump.

devin-ai-integration[bot]

This comment was marked as resolved.

@chenghao-mou chenghao-mou changed the title feat(eot): add audio eot model support feat(eot): add audio models AGT-2919 Jun 7, 2026
@chenghao-mou chenghao-mou requested a review from a team June 10, 2026 08:56
@chenghao-mou chenghao-mou force-pushed the feat/AGT-2520-multimodal-EOU branch from ed2e02f to 3e2fb33 Compare June 11, 2026 14:06
@chenghao-mou chenghao-mou changed the base branch from main to 1.5.0 June 11, 2026 14:06
@toubatbrian toubatbrian changed the base branch from 1.5.0 to main June 11, 2026 17:51
@toubatbrian toubatbrian changed the base branch from main to 1.5.0 June 11, 2026 17:54
chenghao-mou and others added 13 commits June 12, 2026 13:23
add audio eot model and local inference support, deprecating silero and turn detector plugins
…frame

The AudioFrame emitted on START_OF_SPEECH / END_OF_SPEECH sliced off
the prefix-padding samples but still reported `samplesPerChannel =
speechBufferIndex`, so the frame's metadata claimed more samples than
its data contained and downstream consumers (STT, transcription) lost
the pre-roll context the buffer machinery is designed to preserve.

Slice from 0 instead so data length matches samplesPerChannel and the
prefix-padding pre-roll is delivered, matching the Python original.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
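
The frame fix in the commit message above comes down to keeping the data length and the reported sample count in agreement. The sketch below uses a plain object in place of the real AudioFrame class; `Frame`, `buggyFrame`, `fixedFrame`, and `prefixPaddingSamples` are hypothetical names, and mono audio is assumed.

```typescript
// Stand-in for an audio frame; not the real AudioFrame class.
interface Frame {
  data: Int16Array;
  sampleRate: number;
  channels: number;
  samplesPerChannel: number;
}

// Before: the pre-roll is sliced off, but the metadata still counts it.
function buggyFrame(buffer: Int16Array, speechBufferIndex: number, prefixPaddingSamples: number): Frame {
  return {
    data: buffer.subarray(prefixPaddingSamples, speechBufferIndex),
    sampleRate: 16000,
    channels: 1,
    samplesPerChannel: speechBufferIndex, // claims more samples than data holds
  };
}

// After: slice from 0 so the pre-roll is delivered and the lengths match.
function fixedFrame(buffer: Int16Array, speechBufferIndex: number): Frame {
  return {
    data: buffer.subarray(0, speechBufferIndex),
    sampleRate: 16000,
    channels: 1,
    samplesPerChannel: speechBufferIndex,
  };
}

const speechBuffer = Int16Array.from([1, 2, 3, 4, 5, 6, 7, 8]);
const before = buggyFrame(speechBuffer, 6, 2);
const after = fixedFrame(speechBuffer, 6);
console.log(before.data.length, before.samplesPerChannel);
console.log(after.data.length, after.samplesPerChannel);
```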
chenghao-mou and others added 11 commits June 12, 2026 13:23
… to version

Rename the unreleased `inference.AudioTurnDetector` to `inference.TurnDetector`
and replace its `model` constructor option with `version` (`'v1' | 'v1-mini'`).
The `version` is the constructor knob only; the `model` field/getter is kept and
now holds the full model name (`turn-detector-v1` / `turn-detector-v1-mini`),
which telemetry/billing read via `detector.model` (metric `modelName` →
`EOTModelUsage.model` → remote sessions) unchanged.

Mirrors the upstream Python rename. The private base peers are renamed to the
modality-agnostic streaming scheme: `BaseStreamingTurnDetector`,
`BaseStreamingTurnDetectorStream`, `StreamingTurnDetectionTransport`,
`BaseStreamingTurnDetectorCallbacks`, `BaseStreamingTurnDetectorOptions`
(resolving the public-opts `TurnDetectorOptions` collision). Adds
`TurnDetectorVersion`; keeps `TurnDetectorModel` with updated values.
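
As a rough illustration of the `version` to `model` relationship described above, the mapping can be written as a lookup table. The two type names and the four string values come from the commit message. The `MODEL_BY_VERSION` table and `modelFromVersion` helper are hypothetical and may not match how the PR derives the name.

```typescript
type TurnDetectorVersion = 'v1' | 'v1-mini';
type TurnDetectorModel = 'turn-detector-v1' | 'turn-detector-v1-mini';

// Hypothetical lookup table: constructor knob -> full model name.
const MODEL_BY_VERSION: Record<TurnDetectorVersion, TurnDetectorModel> = {
  v1: 'turn-detector-v1',
  'v1-mini': 'turn-detector-v1-mini',
};

/** Resolves the full model name that telemetry and billing would read. */
function modelFromVersion(version: TurnDetectorVersion): TurnDetectorModel {
  return MODEL_BY_VERSION[version];
}

console.log(modelFromVersion('v1-mini'));
```

Typing the table as a `Record` over the version union makes the compiler flag any version added later without a model name.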

Also folds in in-flight AGT-2520 EOU work: VAD slow-inference guard fix,
`turnDetection: null` opt-out preserved distinctly from `undefined`,
silero `VAD.load()` delegating to `inference.VAD({ model: 'silero' })` for
16 kHz, and a `LocalTransport` cleanup refactor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
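
The `null` versus `undefined` distinction mentioned in the commit message above is easy to lose with a plain `??` fallback, which treats both the same. A minimal sketch of keeping them apart follows; the mode names and the `resolveTurnDetection` helper are assumptions made for the example.

```typescript
// Hypothetical mode names; only the null/undefined handling is the point here.
type TurnDetectionMode = 'vad' | 'stt' | 'manual';

/**
 * null      -> explicit opt-out, returned as null
 * undefined -> option not set, so the fallback applies
 */
function resolveTurnDetection(
  option: TurnDetectionMode | null | undefined,
  fallback: TurnDetectionMode,
): TurnDetectionMode | null {
  // Check null before `??`, because `??` would also replace null with the fallback.
  if (option === null) return null;
  return option ?? fallback;
}

console.log(resolveTurnDetection(null, 'vad'));
console.log(resolveTurnDetection(undefined, 'vad'));
```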
Add a copy of the turn detection model license, call it out in the root
README alongside the Apache-2.0 license, and annotate it in REUSE.toml
to keep the REUSE-3.2 lint green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
main dropped the flat `export *` re-exports (AgentSession, tool, logMetrics)
in favor of namespace-only exports, and does not have the 1.5.0 Toolset API
(Agent.create / array-style tools). Adapt basic_agent.ts to main's namespace
conventions (new voice.Agent, object tools, voice.*/metrics.* prefixes) while
preserving the multimodal-EOU session config. Regenerate pnpm-lock.yaml against
the rebased package.json set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chenghao-mou chenghao-mou force-pushed the feat/AGT-2520-multimodal-EOU branch from d3ee830 to 7e24939 Compare June 12, 2026 12:33
@chenghao-mou chenghao-mou changed the base branch from 1.5.0 to main June 12, 2026 12:33

Comment on lines -75 to -77
prewarm: async (proc: JobProcess) => {
  proc.userData.vad = await silero.VAD.load();
},

Why did we remove these?

// to use realtime model, replace the stt, llm, tts and vad with the following
// llm: new openai.realtime.RealtimeModel(),
userData: userdata,
turnDetection: new livekit.turnDetector.EnglishModel(),

^

}

/**
* Speaking-guard wrapper for the bounce-EOU task, mirroring Python's

Should we remove the comment phrasing that references "mirroring Python's", etc.?

Comment on lines +583 to +585
// A different stream means a fresh request lifecycle: drop any held
// prediction future and re-arm so the adopting recognition starts its own
// request on the next VAD event.

Claude tends to add a lot of inline comments; it would be nice to clean them up and leave only those that are necessary.

…dal-EOU

# Conflicts:
#	agents/src/inference/utils.ts
#	agents/src/voice/agent_activity.ts
#	agents/src/voice/audio_recognition.ts

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 new potential issue.

Comment thread turbo.json
"LIVEKIT_INFERENCE_URL",
"LIVEKIT_OUTBOUND_TRUNK_ID",
"LIVEKIT_URL",
"LIVEKIT_WORKER_TOKEN",

🟡 Duplicate LIVEKIT_WORKER_TOKEN entry in turbo.json global env passthrough

The PR adds LIVEKIT_WORKER_TOKEN at line 45 (after LIVEKIT_URL), but the original file already has it at line 49 (after LIVEKIT_AGENT_NAME). This produces a duplicate entry in the globalPassThroughEnv array. While Turbo likely deduplicates or ignores redundant entries at runtime, the duplicate is unnecessary noise and potentially confusing.
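
A duplicate like this can be caught mechanically. The sketch below is a hypothetical helper, not part of the PR, and the array is a shortened illustration built from the names mentioned in this thread rather than the actual contents of turbo.json.

```typescript
/** Returns each entry that appears more than once, in first-repeat order. */
function findDuplicates(entries: readonly string[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const entry of entries) {
    if (seen.has(entry)) {
      duplicates.add(entry);
    } else {
      seen.add(entry);
    }
  }
  return Array.from(duplicates);
}

// Illustrative subset only; not the real globalPassThroughEnv list.
const globalPassThroughEnv = [
  'LIVEKIT_INFERENCE_URL',
  'LIVEKIT_OUTBOUND_TRUNK_ID',
  'LIVEKIT_URL',
  'LIVEKIT_WORKER_TOKEN',
  'LIVEKIT_AGENT_NAME',
  'LIVEKIT_WORKER_TOKEN',
];

console.log(findDuplicates(globalPassThroughEnv));
```

A check along these lines could run in a lint step so a repeated entry fails fast instead of relying on the tool to tolerate it.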

Suggested change
"LIVEKIT_WORKER_TOKEN",