diff --git a/README.md b/README.md index 2a09aac241..059dc65bdb 100644 --- a/README.md +++ b/README.md @@ -1,375 +1,275 @@ - +# Context-Aware Interruption Logic for LiveKit Voice Agents - - - - The LiveKit icon, the name of the repository and some sample code in the background. - +This document explains the context-aware interruption logic implementation that distinguishes between passive acknowledgements (backchanneling) and active interruptions in real-time AI voice agents. - -
+## Overview -![PyPI - Version](https://img.shields.io/pypi/v/livekit-agents) -[![PyPI Downloads](https://static.pepy.tech/badge/livekit-agents/month)](https://pepy.tech/projects/livekit-agents) -[![Slack community](https://img.shields.io/endpoint?url=https%3A%2F%2Flivekit.io%2Fbadges%2Fslack)](https://livekit.io/join-slack) -[![Twitter Follow](https://img.shields.io/twitter/follow/livekit)](https://twitter.com/livekit) -[![Ask DeepWiki for understanding the codebase](https://deepwiki.com/badge.svg)](https://deepwiki.com/livekit/agents) -[![License](https://img.shields.io/github/license/livekit/livekit)](https://github.com/livekit/livekit/blob/master/LICENSE) +LiveKit's default Voice Activity Detection (VAD) is highly sensitive and immediately triggers interruptions whenever user audio is detected. This causes incorrect behavior when users provide passive acknowledgements such as "yeah", "ok", "aha", or "hmm" (commonly known as backchanneling). Instead of recognizing these as signs of engagement, the agent would abruptly stop speaking, breaking conversational flow. -
+**The Solution**: A context-aware decision system that correctly distinguishes between: +1. **Passive acknowledgements** (soft inputs) - should not interrupt when agent is speaking +2. **Active interruptions** (hard commands) - should always interrupt -Looking for the JS/TS library? Check out [AgentsJS](https://github.com/livekit/agents-js) +## How It Works -## What is Agents? +The interruption logic operates **above the VAD layer** and validates the final Speech-to-Text (STT) transcript before committing to an interruption. This ensures that: - +- VAD may fire before STT completes (false starts) +- The system waits for transcript validation before pausing +- Passive acknowledgements are ignored when the agent is speaking +- Interrupt commands are always respected immediately -The Agent Framework is designed for building realtime, programmable participants -that run on servers. Use it to create conversational, multi-modal voice -agents that can see, hear, and understand. +### Decision Flow - - -## Features - -- **Flexible integrations**: A comprehensive ecosystem to mix and match the right STT, LLM, TTS, and Realtime API to suit your use case. -- **Integrated job scheduling**: Built-in task scheduling and distribution with [dispatch APIs](https://docs.livekit.io/agents/build/dispatch/) to connect end users to agents. -- **Extensive WebRTC clients**: Build client applications using LiveKit's open-source SDK ecosystem, supporting all major platforms. -- **Telephony integration**: Works seamlessly with LiveKit's [telephony stack](https://docs.livekit.io/sip/), allowing your agent to make calls to or receive calls from phones. -- **Exchange data with clients**: Use [RPCs](https://docs.livekit.io/home/client/data/rpc/) and other [Data APIs](https://docs.livekit.io/home/client/data/) to seamlessly exchange data with clients. -- **Semantic turn detection**: Uses a transformer model to detect when a user is done with their turn, helps to reduce interruptions. 
-- **MCP support**: Native support for MCP. Integrate tools provided by MCP servers with one loc. -- **Builtin test framework**: Write tests and use judges to ensure your agent is performing as expected. -- **Open-source**: Fully open-source, allowing you to run the entire stack on your own servers, including [LiveKit server](https://github.com/livekit/livekit), one of the most widely used WebRTC media servers. - -## Installation - -To install the core Agents library, along with plugins for popular model providers: - -```bash -pip install "livekit-agents[openai,silero,deepgram,cartesia,turn-detector]~=1.0" +``` +User Audio Detected (VAD fires) + ↓ +Wait for STT Transcript (if agent is speaking) + ↓ +Classify Interruption: + ├─ Contains interrupt command? → INTERRUPT + ├─ Agent SPEAKING + passive acknowledgement? → IGNORE + ├─ Agent SILENT + passive acknowledgement? → RESPOND + └─ Agent SILENT + normal input? → RESPOND ``` -## Docs and guides - -Documentation on the framework and how to use it can be found [here](https://docs.livekit.io/agents/) - -## Core concepts - -- Agent: An LLM-based application with defined instructions. -- AgentSession: A container for agents that manages interactions with end users. -- entrypoint: The starting point for an interactive session, similar to a request handler in a web server. -- Worker: The main process that coordinates job scheduling and launches agents for user sessions. - -## Usage - -### Simple voice agent - ---- +## Decision Rules + +The system outputs exactly **ONE** of three decisions: + +### 1. `INTERRUPT` - Stop the agent immediately +- **Trigger**: Transcript contains ANY interrupt command, even if mixed with passive acknowledgements +- **Examples**: + - "stop" + - "wait" + - "yeah wait a second" → INTERRUPT (contains "wait") + - "ok no stop" → INTERRUPT (contains "no" and "stop") + +### 2. 
`IGNORE` - Continue speaking without interruption +- **Trigger**: Agent is SPEAKING AND transcript contains ONLY passive acknowledgement words +- **Examples**: + - Agent speaking: "yeah" → IGNORE + - Agent speaking: "ok" → IGNORE + - Agent speaking: "yeah ok" → IGNORE + - Agent speaking: "yeah but" → INTERRUPT (contains non-passive word) + +### 3. `RESPOND` - Process as new user turn +- **Trigger**: Agent is SILENT AND transcript contains any input +- **Examples**: + - Agent silent: "yeah" → RESPOND + - Agent silent: "hello" → RESPOND + - Agent silent: "tell me a story" → RESPOND + +## Passive Acknowledgements vs Interrupt Commands + +### Passive Acknowledgements (Soft Inputs) +These words indicate engagement but should not interrupt when the agent is speaking: +- `yeah` +- `ok` +- `okay` +- `hmm` +- `aha` +- `uh-huh` +- `uh huh` +- `right` + +### Interrupt Commands (Hard Inputs) +These words always trigger an interruption, regardless of context: +- `stop` +- `wait` +- `no` +- `cancel` +- `hold on` +- `hold` + +## Implementation Details + +### Core Function + +The logic is implemented in `livekit-agents/livekit/agents/voice/interruption_logic.py`: ```python -from livekit.agents import ( - Agent, - AgentSession, - JobContext, - RunContext, - WorkerOptions, - cli, - function_tool, -) -from livekit.plugins import deepgram, elevenlabs, openai, silero - -@function_tool -async def lookup_weather( - context: RunContext, - location: str, -): - """Used to look up weather information.""" - - return {"weather": "sunny", "temperature": 70} - - -async def entrypoint(ctx: JobContext): - await ctx.connect() - - agent = Agent( - instructions="You are a friendly voice assistant built by LiveKit.", - tools=[lookup_weather], - ) - session = AgentSession( - vad=silero.VAD.load(), - # any combination of STT, LLM, TTS, or realtime API can be used - stt=deepgram.STT(model="nova-3"), - llm=openai.LLM(model="gpt-4o-mini"), - tts=elevenlabs.TTS(), - ) - - await session.start(agent=agent, 
room=ctx.room)
-    await session.generate_reply(instructions="greet the user and ask about their day")
-
-
-if __name__ == "__main__":
-    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
+def classify_interruption(
+    agent_state: Literal["SPEAKING", "SILENT"],
+    user_transcript: str,
+) -> InterruptionDecision:
+    """Classify user input as INTERRUPT, IGNORE, or RESPOND."""
+    # Rule 1: Check for interrupt commands (highest priority)
+    # Rule 2: If speaking + passive acknowledgement → IGNORE
+    # Rule 3: If silent + passive acknowledgement → RESPOND
+    # Rule 4: If silent + normal input → RESPOND
 ```

-You'll need the following environment variables for this example:
-
-- DEEPGRAM_API_KEY
-- OPENAI_API_KEY
-- ELEVEN_API_KEY
+### Integration Points
+
+The logic is integrated into the agent's audio activity handling at four key points:
+
+1. **`_interrupt_by_audio_activity()`** - Prevents premature pauses when VAD fires before a transcript is available
+2. **`on_interim_transcript()`** - Checks interim transcripts and resumes speech if a false interruption is detected
+3. **`on_final_transcript()`** - Validates final transcripts and prevents interruption for passive acknowledgements
+4. **`on_end_of_turn()`** - Prevents passive acknowledgements from being processed as new user turns
+
+### Key Features
+
+- **Transcript-first approach**: When the agent is speaking, waits for a transcript before pausing
+- **False interruption recovery**: Automatically resumes if a pause occurred before transcript validation
+- **Turn prevention**: Prevents passive acknowledgements from triggering new LLM responses
+- **Punctuation handling**: Normalizes transcripts to handle variations like "yeah!", "ok?", "hmm."
+
+## How to Run the Agent
+
+### Prerequisites
+
+1. **Python 3.12+** installed
+2. 
**Environment variables** set: + ```bash + # LiveKit credentials + LIVEKIT_URL=wss://your-livekit-server.com + LIVEKIT_API_KEY=your-api-key + LIVEKIT_API_SECRET=your-api-secret + + # Groq API (for LLM and STT) + GROQ_API_KEY=your-groq-api-key + + # Cartesia API (for TTS) + CARTESIA_API_KEY=your-cartesia-api-key + ``` + +3. **Install dependencies**: + ```bash + # Install workspace packages in editable mode + uv pip install -e livekit-agents + uv pip install -e livekit-plugins/livekit-plugins-groq + uv pip install -e livekit-plugins/livekit-plugins-silero + uv pip install -e livekit-plugins/livekit-plugins-turn-detector + uv pip install -e livekit-plugins/livekit-plugins-cartesia + ``` + +### Running the Agent + +1. **Navigate to the examples directory**: + ```bash + cd examples/voice_agents + ``` + +2. **Start the agent in development mode**: + ```bash + python basic_agent.py dev + ``` + + This will: + - Start the agent server with hot reloading + - Connect to your LiveKit server + - Enable the interruption logic automatically + +3. **Connect using LiveKit Playground**: + - Visit https://agents-playground.livekit.io/ + - Enter your LiveKit credentials + - Start a conversation with the agent + +### Testing the Interruption Logic + +#### Test Case 1: Ignore "yeah" while agent is speaking +1. Start a conversation with the agent +2. Wait for the agent to start speaking +3. While the agent is speaking, say "yeah" or "ok" +4. **Expected**: Agent continues speaking without pause or interruption + +#### Test Case 2: Respond to "yeah" when agent is silent +1. Start a conversation with the agent +2. Wait for the agent to finish speaking (agent becomes silent) +3. Say "yeah" or "ok" +4. **Expected**: Agent processes this as a new user turn and responds + +#### Test Case 3: Stop for "stop" command +1. Start a conversation with the agent +2. While the agent is speaking, say "stop" or "wait" +3. 
**Expected**: Agent immediately stops speaking + +#### Test Case 4: Mixed commands +1. While agent is speaking, say "yeah wait a second" +2. **Expected**: Agent interrupts (because "wait" is an interrupt command) + +## Code Structure -### Multi-agent handoff - ---- - -This code snippet is abbreviated. For the full example, see [multi_agent.py](examples/voice_agents/multi_agent.py) - -```python -... -class IntroAgent(Agent): - def __init__(self) -> None: - super().__init__( - instructions=f"You are a story teller. Your goal is to gather a few pieces of information from the user to make the story personalized and engaging." - "Ask the user for their name and where they are from" - ) - - async def on_enter(self): - self.session.generate_reply(instructions="greet the user and gather information") - - @function_tool - async def information_gathered( - self, - context: RunContext, - name: str, - location: str, - ): - """Called when the user has provided the information needed to make the story personalized and engaging. - - Args: - name: The name of the user - location: The location of the user - """ - - context.userdata.name = name - context.userdata.location = location - - story_agent = StoryAgent(name, location) - return story_agent, "Let's start the story!" - - -class StoryAgent(Agent): - def __init__(self, name: str, location: str) -> None: - super().__init__( - instructions=f"You are a storyteller. Use the user's information in order to make the story personalized." 
-            f"The user's name is {name}, from {location}"
-            # override the default model, switching to Realtime API from standard LLMs
-            llm=openai.realtime.RealtimeModel(voice="echo"),
-            chat_ctx=chat_ctx,
-        )
-
-    async def on_enter(self):
-        self.session.generate_reply()
-
-
-async def entrypoint(ctx: JobContext):
-    await ctx.connect()
-
-    userdata = StoryData()
-    session = AgentSession[StoryData](
-        vad=silero.VAD.load(),
-        stt=deepgram.STT(model="nova-3"),
-        llm=openai.LLM(model="gpt-4o-mini"),
-        tts=openai.TTS(voice="echo"),
-        userdata=userdata,
-    )
-
-    await session.start(
-        agent=IntroAgent(),
-        room=ctx.room,
-    )
-...
+```
+livekit-agents/
+└── livekit/
+    └── agents/
+        └── voice/
+            ├── interruption_logic.py    # Core classification logic
+            └── agent_activity.py        # Integration with agent lifecycle
 ```

-### Testing
+## Technical Constraints

-Automated tests are essential for building reliable agents, especially with the non-deterministic behavior of LLMs. LiveKit Agents include native test integration to help you create dependable agents.
+- **Latency**: Classification is a keyword lookup on the transcript and adds negligible processing time; while the agent is speaking, an interruption is deferred until an STT transcript is available
+- **VAD Independence**: The logic operates above the VAD layer; the VAD kernel is not modified
+- **Transcript Validation**: The system validates the final STT transcript before committing to an interruption
+- **False Start Handling**: Supports "false start" scenarios where VAD fires but the transcript resolves to a passive acknowledgement

-```python
-@pytest.mark.asyncio
-async def test_no_availability() -> None:
-    llm = google.LLM()
-    async AgentSession(llm=llm) as sess:
-        await sess.start(MyAgent())
-        result = await sess.run(
-            user_input="Hello, I need to place an order."
- ) - result.expect.skip_next_event_if(type="message", role="assistant") - result.expect.next_event().is_function_call(name="start_order") - result.expect.next_event().is_function_call_output() - await ( - result.expect.next_event() - .is_message(role="assistant") - .judge(llm, intent="assistant should be asking the user what they would like") - ) +## Example Scenarios +### Scenario 1: Natural Backchanneling ``` - -## Examples - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-🎙️ Starter Agent: A starter agent optimized for voice conversations. (Code)
-🔄 Multi-user push to talk: Responds to multiple users in the room via push-to-talk. (Code)
-🎵 Background audio: Background ambient and thinking audio to improve realism. (Code)
-🛠️ Dynamic tool creation: Creating function tools dynamically. (Code)
-☎️ Outbound caller: Agent that makes outbound phone calls (Code)
-📋 Structured output: Using structured output from LLM to guide TTS tone. (Code)
-🔌 MCP support: Use tools from MCP servers (Code)
-💬 Text-only agent: Skip voice altogether and use the same code for text-only integrations (Code)
-📝 Multi-user transcriber: Produce transcriptions from all users in the room (Code)
-🎥 Video avatars: Add an AI avatar with Tavus, Beyond Presence, and Bithuman (Code)
-🍽️ Restaurant ordering and reservations: Full example of an agent that handles calls for a restaurant. (Code)
-👁️ Gemini Live vision: Full example (including iOS app) of Gemini Live agent that can see. (Code)
- -## Running your agent - -### Testing in terminal - -```shell -python myagent.py console +Agent: "So the weather in London is typically..." +User: "yeah" +Agent: "...rainy during the winter months..." [continues seamlessly] ``` +**Result**: `IGNORE` - Agent continues speaking -Runs your agent in terminal mode, enabling local audio input and output for testing. -This mode doesn't require external servers or dependencies and is useful for quickly validating behavior. +### Scenario 2: User Wants to Interrupt +``` +Agent: "Let me explain how this works..." +User: "stop, I have a question" +Agent: [stops immediately] +``` +**Result**: `INTERRUPT` - Contains "stop" command -### Developing with LiveKit clients +### Scenario 3: User Responds When Agent is Silent +``` +Agent: [finished speaking, now silent] +User: "yeah, that makes sense" +Agent: [processes as new turn, generates response] +``` +**Result**: `RESPOND` - Agent is silent, so "yeah" triggers a response -```shell -python myagent.py dev +### Scenario 4: Ambiguous Input ``` +Agent: "The process involves three steps..." +User: "yeah but what about the cost?" +Agent: [interrupts] +``` +**Result**: `INTERRUPT` - Contains non-passive word "but", so treated as interruption -Starts the agent server and enables hot reloading when files change. This mode allows each process to host multiple concurrent agents efficiently. +## Debugging -The agent connects to LiveKit Cloud or your self-hosted server. Set the following environment variables: -- LIVEKIT_URL -- LIVEKIT_API_KEY -- LIVEKIT_API_SECRET +To see the interruption logic in action, check the logs for: +- `"Resumed speech after passive acknowledgement"` - Indicates false interruption was detected and corrected +- `"Ignoring passive acknowledgement as new turn"` - Indicates passive acknowledgement was prevented from triggering a new turn -You can connect using any LiveKit client SDK or telephony integration. 
-To get started quickly, try the [Agents Playground](https://agents-playground.livekit.io/). +## Limitations -### Running for production +- The logic is deterministic and rule-based. It does not use ML/NLP models for semantic understanding beyond keyword matching. +- Passive acknowledgements list is fixed. Adding new acknowledgements requires code changes. +- The system relies on accurate STT transcription. Mis-transcribed words may not be classified correctly. -```shell -python myagent.py start -``` +## Future Enhancements -Runs the agent with production-ready optimizations. +Potential improvements: +- Expand passive acknowledgements list dynamically +- Use semantic similarity for better classification +- Support multiple languages +- Learn from user behavior patterns ## Contributing -The Agents framework is under active development in a rapidly evolving field. We welcome and appreciate contributions of any kind, be it feedback, bugfixes, features, new plugins and tools, or better documentation. You can file issues under this repo, open a PR, or chat with us in LiveKit's [Slack community](https://livekit.io/join-slack). - - -
-LiveKit Ecosystem
-LiveKit SDKs: Browser · iOS/macOS/visionOS · Android · Flutter · React Native · Rust · Node.js · Python · Unity · Unity (WebGL) · ESP32
-Server APIs: Node.js · Golang · Ruby · Java/Kotlin · Python · Rust · PHP (community) · .NET (community)
-UI Components: React · Android Compose · SwiftUI · Flutter
-Agents Frameworks: Python · Node.js · Playground
-Services: LiveKit server · Egress · Ingress · SIP
-Resources: Docs · Example apps · Cloud · Self-hosting · CLI
- +When modifying the interruption logic: +1. Update tests in `tests/test_interruption_logic.py` +2. Test all three decision types (INTERRUPT, IGNORE, RESPOND) +3. Verify edge cases (punctuation, mixed case, empty transcripts) +4. Ensure no regression in existing behavior + +--- + +**Note**: This interruption logic is automatically enabled when using the `AgentSession` with VAD and STT configured. No additional configuration is required. diff --git a/examples/voice_agents/basic_agent.py b/examples/voice_agents/basic_agent.py index f064dab5d7..4f9069b591 100644 --- a/examples/voice_agents/basic_agent.py +++ b/examples/voice_agents/basic_agent.py @@ -1,4 +1,28 @@ import logging +import os +import sys + +# CRITICAL: Ensure workspace livekit-agents is prioritized BEFORE any imports +# This must happen before dotenv or any other imports that might trigger livekit imports +_workspace_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..")) +_livekit_agents_path = os.path.join(_workspace_root, "livekit-agents") +# Remove from path if already there, then insert at position 0 +if _livekit_agents_path in sys.path: + sys.path.remove(_livekit_agents_path) +sys.path.insert(0, _livekit_agents_path) + +# Add all plugin paths BEFORE any imports +_plugin_paths = [ + os.path.join(_workspace_root, "livekit-plugins", "livekit-plugins-groq"), + os.path.join(_workspace_root, "livekit-plugins", "livekit-plugins-openai"), + os.path.join(_workspace_root, "livekit-plugins", "livekit-plugins-silero"), + os.path.join(_workspace_root, "livekit-plugins", "livekit-plugins-turn-detector"), + os.path.join(_workspace_root, "livekit-plugins", "livekit-plugins-cartesia"), +] +for _plugin_path in _plugin_paths: + if _plugin_path in sys.path: + sys.path.remove(_plugin_path) + sys.path.insert(0, _plugin_path) from dotenv import load_dotenv @@ -15,9 +39,31 @@ room_io, ) from livekit.agents.llm import function_tool -from livekit.plugins import silero +from livekit.plugins import 
cartesia, silero from livekit.plugins.turn_detector.multilingual import MultilingualModel +# Import Groq plugin +try: + from livekit.plugins import groq +except ImportError: + # Try direct import + try: + from livekit.plugins.groq import LLM, STT, TTS + + class GroqNamespace: + LLM = LLM + STT = STT + TTS = TTS + + groq = GroqNamespace() + except ImportError as e: + raise ImportError( + f"Groq plugin not found. Please install it:\n" + f" uv pip install -e livekit-plugins/livekit-plugins-groq\n" + f"Or: pip install livekit-plugins-groq\n" + f"Original error: {e}" + ) from e + # uncomment to enable Krisp background voice/noise cancellation # from livekit.plugins import noise_cancellation @@ -81,14 +127,15 @@ async def entrypoint(ctx: JobContext): } session = AgentSession( # Speech-to-text (STT) is your agent's ears, turning the user's speech into text that the LLM can understand - # See all available models at https://docs.livekit.io/agents/models/stt/ - stt="deepgram/nova-3", + # Using Groq STT (whisper-large-v3-turbo) - fast and accurate + stt=groq.STT(model="whisper-large-v3-turbo"), # A Large Language Model (LLM) is your agent's brain, processing user input and generating a response - # See all available models at https://docs.livekit.io/agents/models/llm/ - llm="openai/gpt-4.1-mini", + # Using Groq LLM (llama-3.3-70b-versatile) - fast inference with high quality + llm=groq.LLM(model="llama-3.3-70b-versatile"), # Text-to-speech (TTS) is your agent's voice, turning the LLM's text into speech that the user can hear - # See all available models as well as voice selections at https://docs.livekit.io/agents/models/tts/ - tts="cartesia/sonic-2:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc", + # Using Cartesia TTS (Groq TTS playai-tts has been decommissioned) + # Cartesia provides high-quality, fast TTS with natural voices + tts=cartesia.TTS(), # VAD and turn detection are used to determine when the user is speaking and when the agent should respond # See more at 
https://docs.livekit.io/agents/build/turns turn_detection=MultilingualModel(), diff --git a/livekit-agents/livekit/agents/voice/agent_activity.py b/livekit-agents/livekit/agents/voice/agent_activity.py index 0c3f7c743d..367af05401 100644 --- a/livekit-agents/livekit/agents/voice/agent_activity.py +++ b/livekit-agents/livekit/agents/voice/agent_activity.py @@ -74,6 +74,7 @@ remove_instructions, update_instructions, ) +from .interruption_logic import classify_interruption from .speech_handle import SpeechHandle if TYPE_CHECKING: @@ -1174,6 +1175,44 @@ def _interrupt_by_audio_activity(self) -> None: # ignore if realtime model has turn detection enabled return + # Get current transcript for interruption decision logic + transcript = "" + if self._audio_recognition is not None: + transcript = self._audio_recognition.current_transcript + + agent_state = self._session.agent_state + + # If agent is speaking and we don't have a transcript yet, wait for it + # This prevents false pauses from VAD firing before STT provides transcript + if agent_state == "speaking" and not transcript: + # Don't interrupt yet - wait for transcript to make an informed decision + # The transcript handlers (on_interim_transcript, on_final_transcript) will + # call this method again with the transcript available + return + + # Apply interruption decision logic if we have a transcript + if transcript: + # Map agent state to interruption logic format + if agent_state == "speaking": + interruption_state = "SPEAKING" + else: + interruption_state = "SILENT" + + # Classify the interruption + decision = classify_interruption(interruption_state, transcript) + + # If decision is IGNORE, don't interrupt (agent continues speaking) + if decision == "IGNORE": + return + + # If decision is RESPOND but agent is speaking, treat as interrupt + # (RESPOND is only valid when agent is SILENT) + if decision == "RESPOND" and agent_state == "speaking": + # This shouldn't happen, but handle gracefully + decision = 
"INTERRUPT" + else: + # No transcript available yet - apply legacy word count check + # (This case should rarely happen now since we return early above when speaking) if ( self.stt is not None and opt.min_interruption_words > 0 @@ -1261,6 +1300,36 @@ def on_interim_transcript(self, ev: stt.SpeechEvent, *, speaking: bool | None) - "manual", "realtime_llm", ): + # Check if this is a passive acknowledgement that should be ignored + transcript = ev.alternatives[0].text + agent_state = self._session.agent_state + if agent_state == "speaking": + interruption_state = "SPEAKING" + else: + interruption_state = "SILENT" + + decision = classify_interruption(interruption_state, transcript) + + # If it's a passive acknowledgement and agent is speaking, don't interrupt + # and resume if already paused + if decision == "IGNORE" and agent_state == "speaking": + # Resume if we paused earlier (false interruption) + if ( + self._paused_speech + and self._session.options.resume_false_interruption + and (audio_output := self._session.output.audio) + and audio_output.can_pause + ): + self._session._update_agent_state("speaking") + audio_output.resume() + self._paused_speech = None + if self._false_interruption_timer: + self._false_interruption_timer.cancel() + self._false_interruption_timer = None + logger.debug("Resumed speech after passive acknowledgement", extra={"transcript": transcript}) + return # Don't interrupt for passive acknowledgements + + # Otherwise, proceed with normal interruption logic self._interrupt_by_audio_activity() if ( @@ -1292,6 +1361,36 @@ def on_final_transcript(self, ev: stt.SpeechEvent, *, speaking: bool | None = No "manual", "realtime_llm", ): + # Check if this is a passive acknowledgement that should be ignored + transcript = ev.alternatives[0].text + agent_state = self._session.agent_state + if agent_state == "speaking": + interruption_state = "SPEAKING" + else: + interruption_state = "SILENT" + + decision = classify_interruption(interruption_state, 
transcript) + + # If it's a passive acknowledgement and agent is speaking, don't interrupt + # and resume if already paused + if decision == "IGNORE" and agent_state == "speaking": + # Resume if we paused earlier (false interruption) + if ( + self._paused_speech + and self._session.options.resume_false_interruption + and (audio_output := self._session.output.audio) + and audio_output.can_pause + ): + self._session._update_agent_state("speaking") + audio_output.resume() + self._paused_speech = None + if self._false_interruption_timer: + self._false_interruption_timer.cancel() + self._false_interruption_timer = None + logger.debug("Resumed speech after passive acknowledgement", extra={"transcript": transcript}) + return # Don't interrupt for passive acknowledgements + + # Otherwise, proceed with normal interruption logic self._interrupt_by_audio_activity() if ( @@ -1365,6 +1464,38 @@ def on_end_of_turn(self, info: _EndOfTurnInfo) -> bool: # TODO(theomonnom): should we "forward" this new turn to the next agent/activity? return True + # Check if this is a passive acknowledgement that should be ignored + # This prevents the agent from processing "yeah", "ok", etc. 
as new turns + agent_state = self._session.agent_state + if agent_state == "speaking": + interruption_state = "SPEAKING" + else: + interruption_state = "SILENT" + + decision = classify_interruption(interruption_state, info.new_transcript) + + # If it's a passive acknowledgement and agent is speaking, ignore this turn completely + if decision == "IGNORE" and agent_state == "speaking": + self._cancel_preemptive_generation() + logger.debug( + "Ignoring passive acknowledgement as new turn", + extra={"transcript": info.new_transcript}, + ) + # Resume speech if it was paused + if ( + self._paused_speech + and self._session.options.resume_false_interruption + and (audio_output := self._session.output.audio) + and audio_output.can_pause + ): + self._session._update_agent_state("speaking") + audio_output.resume() + self._paused_speech = None + if self._false_interruption_timer: + self._false_interruption_timer.cancel() + self._false_interruption_timer = None + return False # Don't process this as a new turn + if ( self.stt is not None and self._turn_detection != "manual" diff --git a/livekit-agents/livekit/agents/voice/interruption_logic.py b/livekit-agents/livekit/agents/voice/interruption_logic.py new file mode 100644 index 0000000000..284e603ca2 --- /dev/null +++ b/livekit-agents/livekit/agents/voice/interruption_logic.py @@ -0,0 +1,117 @@ +"""Interruption decision logic for LiveKit voice agents. + +This module implements context-aware interruption classification that distinguishes +between passive acknowledgements (backchanneling) and active interruptions. 
+"""
+
+from __future__ import annotations
+
+from typing import Literal
+
+# Decision types
+InterruptionDecision = Literal["INTERRUPT", "IGNORE", "RESPOND"]
+
+# Passive acknowledgements (soft inputs)
+PASSIVE_ACKNOWLEDGEMENTS = {
+    "yeah",
+    "ok",
+    "okay",
+    "hmm",
+    "aha",
+    "uh-huh",
+    "uh huh",
+    "right",
+}
+
+# Interrupt commands (hard inputs)
+INTERRUPT_COMMANDS = {
+    "stop",
+    "wait",
+    "no",
+    "cancel",
+    "hold on",
+    "hold",
+}
+
+
+def classify_interruption(
+    agent_state: Literal["SPEAKING", "SILENT"],
+    user_transcript: str,
+) -> InterruptionDecision:
+    """Classify user input as INTERRUPT, IGNORE, or RESPOND.
+
+    The decision is based on:
+    - Agent state (SPEAKING or SILENT)
+    - Keywords found in the user transcript
+
+    Args:
+        agent_state: Current state of the agent - "SPEAKING" or "SILENT"
+        user_transcript: The transcribed user speech text
+
+    Returns:
+        One of: "INTERRUPT", "IGNORE", or "RESPOND"
+
+    Rules:
+        1. If transcript contains ANY interrupt command, return INTERRUPT
+        2. If agent is SPEAKING and transcript contains ONLY passive acknowledgements, return IGNORE
+        3. If agent is SILENT and transcript contains passive acknowledgement, return RESPOND
+        4. If agent is SILENT and transcript contains normal input, return RESPOND
+    """
+    # Normalize: lowercase, split on whitespace, strip surrounding punctuation
+    words = [w.strip(".,!?;:") for w in user_transcript.lower().split()]
+    words = [w for w in words if w]
+
+    # Empty transcript: nothing to classify, so fall back on the agent state
+    if not words:
+        return "INTERRUPT" if agent_state == "SPEAKING" else "RESPOND"
+
+    # Pad with spaces so commands match whole words only
+    # (a plain substring test would find "no" inside "know")
+    normalized = " ".join(words)
+    padded = f" {normalized} "
+
+    # Rule 1 (highest priority): any interrupt command, single- or multi-word
+    if any(f" {cmd} " in padded for cmd in INTERRUPT_COMMANDS):
+        return "INTERRUPT"
+
+    # Rules 3 & 4: a silent agent responds to any input
+    if agent_state == "SILENT":
+        return "RESPOND"
+
+    # Rule 2: agent is speaking and the transcript is only passive acknowledgements.
+    # The exact-match test covers multi-word entries such as "uh huh".
+    if normalized in PASSIVE_ACKNOWLEDGEMENTS or set(words) <= PASSIVE_ACKNOWLEDGEMENTS:
+        return "IGNORE"
+
+    # Default: speaking and at least one non-passive word, so interrupt
+    return "INTERRUPT"
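
The decision rules in the README can be exercised without a LiveKit session. The block below is a minimal standalone sketch, not the module from this diff: `decide` is a hypothetical stand-in for `classify_interruption`, the word lists are copied from the diff, the test cases come from the README's rules and scenarios, and matching commands on whole words is a choice made for this sketch.

```python
from typing import Literal

Decision = Literal["INTERRUPT", "IGNORE", "RESPOND"]

# Word lists copied from the diff
PASSIVE = {"yeah", "ok", "okay", "hmm", "aha", "uh-huh", "uh huh", "right"}
COMMANDS = {"stop", "wait", "no", "cancel", "hold on", "hold"}


def decide(agent_state: Literal["SPEAKING", "SILENT"], transcript: str) -> Decision:
    """Hypothetical stand-in for classify_interruption (an assumption of this sketch)."""
    # Lowercase, split on whitespace, strip surrounding punctuation
    words = [w.strip(".,!?;:") for w in transcript.lower().split()]
    words = [w for w in words if w]
    if not words:
        # Nothing to classify: fall back on the agent state
        return "INTERRUPT" if agent_state == "SPEAKING" else "RESPOND"

    normalized = " ".join(words)
    padded = f" {normalized} "  # padding gives whole-word matches

    # Interrupt commands win over everything else
    if any(f" {cmd} " in padded for cmd in COMMANDS):
        return "INTERRUPT"
    # A silent agent treats any input as a new turn
    if agent_state == "SILENT":
        return "RESPOND"
    # Speaking + only passive acknowledgements: keep talking
    if normalized in PASSIVE or set(words) <= PASSIVE:
        return "IGNORE"
    return "INTERRUPT"


# (agent state, transcript, expected decision), taken from the README
CASES = [
    ("SPEAKING", "yeah", "IGNORE"),
    ("SPEAKING", "Yeah ok!", "IGNORE"),
    ("SPEAKING", "yeah wait a second", "INTERRUPT"),
    ("SPEAKING", "yeah but what about the cost?", "INTERRUPT"),
    ("SILENT", "yeah", "RESPOND"),
    ("SILENT", "tell me a story", "RESPOND"),
]

if __name__ == "__main__":
    for state, text, expected in CASES:
        got = decide(state, text)
        print(f"{state:8} {text!r:34} -> {got}")
        assert got == expected
```

Run as a script, this prints one line per case and fails on the first mismatch, which makes it a quick check when either word list changes.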