From 78718729490b5e3f12dabe5c276833152c822a43 Mon Sep 17 00:00:00 2001
From: Sirjan Singh
Date: Mon, 2 Feb 2026 23:33:17 +0530
Subject: [PATCH 1/5] Revise README for intelligent interruption handling

Updated README to reflect intelligent interruption handling implementation for LiveKit Voice Agent, detailing challenges, solutions, and key code changes.
---
 examples/voice_agents/README.md | 332 ++++++++++++++++++++++++++------
 1 file changed, 276 insertions(+), 56 deletions(-)

diff --git a/examples/voice_agents/README.md b/examples/voice_agents/README.md
index aa401505d1..50bbe91911 100644
--- a/examples/voice_agents/README.md
+++ b/examples/voice_agents/README.md
@@ -1,78 +1,298 @@
-# Voice Agents Examples
+# Intelligent Interruption Handling for LiveKit Voice Agent

-This directory contains a comprehensive collection of voice-based agent examples demonstrating various capabilities and integrations with the LiveKit Agents framework.
+## Overview

-## 📋 Table of Contents
+This document explains the modifications made to `basic_agent.py` to implement intelligent interruption handling that distinguishes between **filler words** (acknowledgments like "yeah", "okay") and **command words** (interruptions like "stop", "wait").

-### 🚀 Getting Started
+---

-- [`basic_agent.py`](./basic_agent.py) - A fundamental voice agent with metrics collection
+## The Challenge

-### 🛠️ Tool Integration & Function Calling
+In a natural voice conversation, users often say acknowledgment words like "yeah", "okay", or "hmm" while the agent is speaking. These are **backchannel responses** that mean "I'm listening, continue" — not "stop talking."

-- [`annotated_tool_args.py`](./annotated_tool_args.py) - Using Python type annotations for tool arguments
-- [`dynamic_tool_creation.py`](./dynamic_tool_creation.py) - Creating and registering tools dynamically at runtime
-- [`raw_function_description.py`](./raw_function_description.py) - Using raw JSON schema definitions for tool descriptions
-- [`silent_function_call.py`](./silent_function_call.py) - Executing function calls without verbal responses to user
-- [`long_running_function.py`](./long_running_function.py) - Handling long running function calls with interruption support
+However, LiveKit's default Voice Activity Detection (VAD) treats ALL user speech as potential interruptions, causing the agent to stop mid-sentence when hearing these fillers.

-### ⚡ Real-time Models
+**Requirements:**
+1. **When agent is speaking + user says filler** → Agent continues uninterrupted
+2. **When agent is speaking + user says command** → Agent stops immediately
+3. **When agent is silent** → All user speech is valid input
+4. **Mixed input** → Commands always take priority over fillers

-- [`weather_agent.py`](./weather_agent.py) - OpenAI Realtime API with function calls for weather information
-- [`realtime_video_agent.py`](./realtime_video_agent.py) - Google Gemini with multimodal video and voice capabilities
-- [`realtime_joke_teller.py`](./realtime_joke_teller.py) - Amazon Nova Sonic real-time model with function calls
-- [`realtime_load_chat_history.py`](./realtime_load_chat_history.py) - Loading previous chat history into real-time models
-- [`realtime_turn_detector.py`](./realtime_turn_detector.py) - Using LiveKit's turn detection with real-time models
-- [`realtime_with_tts.py`](./realtime_with_tts.py) - Combining external TTS providers with real-time models
+---

-### 🎯 Pipeline Nodes & Hooks
+## The Core Problem: Timing

-- [`fast-preresponse.py`](./fast-preresponse.py) - Generating quick responses using the `on_user_turn_completed` node
-- [`flush_llm_node.py`](./flush_llm_node.py) - Flushing partial LLM output to TTS in `llm_node`
-- [`structured_output.py`](./structured_output.py) - Structured data and JSON outputs from agent responses
-- [`speedup_output_audio.py`](./speedup_output_audio.py) - Dynamically adjusting agent audio playback speed
-- [`timed_agent_transcript.py`](./timed_agent_transcript.py) - Reading timestamped transcripts from `transcription_node`
-- [`inactive_user.py`](./inactive_user.py) - Handling inactive users with the `user_state_changed` event hook
-- [`resume_interrupted_agent.py`](./resume_interrupted_agent.py) - Resuming agent speech after false interruption detection
-- [`toggle_io.py`](./toggle_io.py) - Dynamically toggling audio input/output during conversations
+The fundamental challenge is **VAD interrupts BEFORE transcripts arrive**:

-### 🤖 Multi-agent & AgentTask Use Cases
+```
+Time 0.0s: User starts saying "yeah"
+Time 0.3s: VAD detects speech → Interrupts agent
+Time 0.5s: User finishes saying "yeah"
+Time 0.8s: Transcript arrives → "Yeah."
+```

-- [`restaurant_agent.py`](./restaurant_agent.py) - Multi-agent system for restaurant ordering and reservation management
-- [`multi_agent.py`](./multi_agent.py) - Collaborative storytelling with multiple specialized agents
-- [`email_example.py`](./email_example.py) - Using AgentTask to collect and validate email addresses
+By the time we know it was a filler word, the agent has already stopped!

-### 🔗 MCP & External Integrations
+---

-- [`web_search.py`](./web_search.py) - Integrating web search capabilities into voice agents
-- [`langgraph_agent.py`](./langgraph_agent.py) - LangGraph integration
-- [`mcp/`](./mcp/) - Model Context Protocol (MCP) integration examples
-  - [`mcp-agent.py`](./mcp/mcp-agent.py) - MCP agent integration
-  - [`server.py`](./mcp/server.py) - MCP server example
-- [`zapier_mcp_integration.py`](./zapier_mcp_integration.py) - Automating workflows with Zapier through MCP
+## The Solution: Hybrid Approach

-### 💾 RAG & Knowledge Management
+We use a **three-layer defense system**:

-- [`llamaindex-rag/`](./llamaindex-rag/) - Complete RAG implementation with LlamaIndex
-  - [`chat_engine.py`](./llamaindex-rag/chat_engine.py) - Chat engine integration
-  - [`query_engine.py`](./llamaindex-rag/query_engine.py) - Query engine used in a function tool
-  - [`retrieval.py`](./llamaindex-rag/retrieval.py) - Document retrieval
+### Layer 1: Medium VAD Thresholds
+```python
+min_interruption_duration=0.6,  # Requires 0.6 seconds of speech
+min_interruption_words=2,  # Requires at least 2 words
+```

-### 🎵 Specialized Use Cases
+**Purpose:** Filters out very quick, single-word fillers ("yeah!", "okay!")

-- [`background_audio.py`](./background_audio.py) - Playing background audio or ambient sounds during conversations
-- [`push_to_talk.py`](./push_to_talk.py) - Push-to-talk interaction
-- [`tts_text_pacing.py`](./tts_text_pacing.py) - Pacing control for TTS requests
-- [`speaker_id_multi_speaker.py`](./speaker_id_multi_speaker.py) - Multi-speaker identification
+**Tradeoff:** Longer fillers (1.5s "okaaaay") can still slip through

-### 📊 Tracing & Error Handling
+---

-- [`langfuse_trace.py`](./langfuse_trace.py) - LangFuse integration for conversation tracing
-- [`error_callback.py`](./error_callback.py) - Error handling callback
-- [`session_close_callback.py`](./session_close_callback.py) - Session lifecycle management
+### Layer 2: Automatic Resume on False Interruptions
+```python
+resume_false_interruption=True,
+false_interruption_timeout=1.0,
+```

-## 📖 Additional Resources
+**Purpose:** If VAD interrupts the agent, LiveKit waits 1 second for more user speech. If nothing substantial comes, it automatically resumes the agent's speech.

-- [LiveKit Agents Documentation](https://docs.livekit.io/agents/)
-- [Agents Starter Example](https://github.com/livekit-examples/agent-starter-python)
-- [More Agents Examples](https://github.com/livekit-examples/python-agents-examples)
+**How it helps:** When a slow filler ("okaaaay") interrupts the agent, this mechanism resumes automatically within 1 second.
+
+---
+
+### Layer 3: Transcript-Based Manual Control
+The most important layer — our custom logic that analyzes transcripts:
+
+```python
+@session.on("user_input_transcribed")
+def on_user_input_transcribed(ev):
+    # Analyze what the user actually said
+    if contains_command(text):
+        session.interrupt()  # Force stop
+    elif is_filler_input(text):
+        return  # Ignore completely
+    else:
+        # Real input - allow processing
+```
+
+This handles three cases:
+
+#### Case 1: Agent Was Just Interrupted by VAD
+```python
+if kelly.was_interrupted_by_vad:
+    if contains_command(text):
+        # Real command - stay stopped
+    elif is_filler_input(text):
+        # False alarm - resume_false_interruption handles it
+    else:
+        # Real input - process normally
+```
+
+#### Case 2: Agent Is Currently Speaking (VAD Hasn't Triggered Yet)
+```python
+if kelly.is_speaking:
+    if contains_command(text):
+        session.interrupt()  # Force interrupt NOW
+    elif is_filler_input(text):
+        return  # Completely ignore
+    else:
+        session.interrupt()  # Real input - allow interrupt
+```
+
+#### Case 3: Agent Is Idle
+```python
+if not kelly.is_speaking:
+    if is_filler_input(text):
+        return  # Suppress from LLM
+    # Otherwise process normally
+```
+
+---
+
+## Key Code Changes
+
+### 1. Word Lists Configuration
+
+**Filler Words** (acknowledgments to ignore):
+```python
+FILLER_WORDS = {
+    "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup",
+    "hmm", "right", "uh", "um", "ah", "gotit", "isee", "ok",
+    # ... more
+}
+
+FILLER_PHRASES = {
+    "all right", "got it", "i see", "uh huh", "oh okay"
+}
+```
+
+**Command Words** (explicit stop requests):
+```python
+STOP_WORDS = {
+    "wait", "stop", "finish", "hold", "pause", "halt"
+}
+```
+
+### 2. Detection Functions
+
+**`is_filler_input(transcript)`** — Returns `True` if input is purely acknowledgment:
+- Removes punctuation
+- Checks against filler word/phrase lists
+- Validates all words are filler tokens
+
+**`contains_command(transcript)`** — Returns `True` if input contains stop command:
+- Checks if sentence starts with stop word
+- Detects "filler + command" patterns ("yeah wait", "okay stop")
+- Avoids false positives in longer sentences
+
+### 3. State Tracking
+
+```python
+class IntelligentAgent(Agent):
+    def __init__(self):
+        self.is_speaking = False  # Currently generating speech
+        self.was_interrupted_by_vad = False  # Just got interrupted by VAD
+        self.last_speech_content = ""  # Content being spoken
+```
+
+### 4. Event Handlers
+
+**`on_speech_created`** — Tracks when agent starts speaking:
+```python
+@session.on("speech_created")
+def on_speech_created(ev):
+    kelly.is_speaking = True
+    kelly.was_interrupted_by_vad = False
+```
+
+**`on_agent_state_changed`** — Detects interruptions:
+```python
+if ev.old_state == "speaking" and ev.new_state == "listening":
+    if kelly.is_speaking:
+        kelly.was_interrupted_by_vad = True
+```
+
+**`on_user_input_transcribed`** — Main interruption logic (see Layer 3 above)
+
+---
+
+## Configuration Parameters
+
+### AgentSession Settings
+
+| Parameter | Value | Purpose |
+|-----------|-------|---------|
+| `allow_interruptions` | `True` | Enable VAD-based interruptions |
+| `min_interruption_duration` | `0.6` | Require 0.6s of speech to interrupt |
+| `min_interruption_words` | `2` | Require 2+ words to interrupt |
+| `resume_false_interruption` | `True` | Auto-resume after false interruptions |
+| `false_interruption_timeout` | `1.0` | Wait 1s before resuming |
+| `preemptive_generation` | `False` | Disabled for more predictable flow |
+| `min_endpointing_delay` | `0.5` | Min silence before turn ends |
+| `max_endpointing_delay` | `2.5` | Max silence before turn ends |
+
+---
+
+## How It All Works Together
+
+### Scenario 1: User says "yeah" (0.3s, quick acknowledgment)
+1. ✅ **VAD Layer:** Too short (0.3s < 0.6s) → No interrupt
+2. ✅ **Transcript Handler:** Detects filler while speaking → Ignores
+3. ✅ **Result:** Agent continues speaking smoothly
+
+### Scenario 2: User says "okaaaay" (1.5s, slow filler)
+1. ❌ **VAD Layer:** Long enough (1.5s > 0.6s) → Interrupts agent
+2. ✅ **Resume Layer:** Waits 1s for more speech, nothing comes → Resumes
+3. ✅ **Transcript Handler:** Marks as filler → Suppresses from LLM
+4. ✅ **Result:** Brief pause (1s), then agent resumes
+
+### Scenario 3: User says "stop" (0.5s, quick command)
+1. ✅ **VAD Layer:** Too short (0.5s < 0.6s) → No interrupt
+2. ✅ **Transcript Handler:** Detects command → `session.interrupt()`
+3. ✅ **Result:** Agent stops immediately via manual interrupt
+
+### Scenario 4: User says "wait a second" (1.2s, clear command)
+1. ✅ **VAD Layer:** Long enough (1.2s > 0.6s) → Interrupts agent
+2. ✅ **Transcript Handler:** Detects command → Stays stopped
+3. ✅ **Result:** Agent stops, processes user's request
+
+---
+
+## Testing the Solution
+
+### Test Cases
+
+1. **Filler while speaking:**
+   - Say "yeah", "okay", "hmm" while agent is talking
+   - **Expected:** Agent continues without stopping
+
+2. **Command while speaking:**
+   - Say "wait", "stop", "hold on" while agent is talking
+   - **Expected:** Agent stops immediately
+
+3. **Mixed input:**
+   - Say "yeah wait" while agent is talking
+   - **Expected:** Agent stops (command wins)
+
+4. **Filler while silent:**
+   - Say "okay" when agent is idle
+   - **Expected:** Ignored, doesn't trigger new response
+
+5. **Normal conversation:**
+   - Ask questions when agent is idle
+   - **Expected:** Normal response flow
+
+### Logs to Watch For
+
+```
+🎤 KELLY STARTED SPEAKING
+📝 TRANSCRIPT: 'yeah' | Kelly speaking: True
+🔇 FILLER while speaking: 'yeah' - completely ignored
+```
+
+```
+📝 TRANSCRIPT: 'wait' | Kelly speaking: True
+🛑 STOP COMMAND while speaking: 'wait' - forcing interrupt NOW
+```
+
+```
+⚠️ KELLY INTERRUPTED - waiting for transcript...
+📝 TRANSCRIPT: 'okay' | Just interrupted: True
+🔄 FALSE INTERRUPT: 'okay' was just a filler - should resume
+```
+
+---
+
+## Files Modified
+
+- **`basic_agent.py`** — Main implementation with all intelligent interruption logic
+
+## Dependencies
+
+No additional dependencies required beyond standard LiveKit Agents SDK.
+
+---
+
+## Limitations
+
+1. **Brief pause on slow fillers:** If user says a filler slowly (>0.6s), there may be a ~1s pause before auto-resume
+2. **Language-specific:** Word lists are currently English-focused (though some Hindi words are included)
+3. **Context-unaware:** Doesn't understand semantic context (e.g., "no" as answer vs. "no" as stop command)
+
+---
+
+## Future Improvements
+
+1. **Sentiment analysis:** Use LLM to determine if "no" is a stop command or an answer
+2. **Adaptive thresholds:** Learn user's speech patterns and adjust thresholds
+3. **Multi-language support:** Extended word lists for other languages
+4. **Prosody analysis:** Use tone/pitch to distinguish acknowledgments from commands
+
+---
+
+## Credits
+
+Implementation for the **LiveKit Intelligent Interruption Handling Challenge**.

From 76fefa40ac2e8da1aabf7bebe5145549e07f976c Mon Sep 17 00:00:00 2001
From: Sirjan Singh
Date: Mon, 2 Feb 2026 23:56:21 +0530
Subject: [PATCH 2/5] Enhance hybrid interruption handling in basic agent

Refactor basic agent to implement a hybrid interruption strategy for better command recognition and filler handling.
---
 examples/voice_agents/basic_agent.py | 332 +++++++++++++++++++--------
 1 file changed, 231 insertions(+), 101 deletions(-)

diff --git a/examples/voice_agents/basic_agent.py b/examples/voice_agents/basic_agent.py
index f064dab5d7..472b198181 100644
--- a/examples/voice_agents/basic_agent.py
+++ b/examples/voice_agents/basic_agent.py
@@ -1,133 +1,263 @@
-import logging
+"""
+THE ACTUAL WORKING HYBRID SOLUTION

-from dotenv import load_dotenv
+Problem statement:
+- "okay" said slowly (1.5s) should NOT interrupt
+- "stop" said quickly (0.5s) SHOULD interrupt immediately
+- Can't use duration-based filtering because it fails one of these cases
+
+Solution:
+- Use MEDIUM thresholds (catches most fillers, but some slip through)
+- Have transcript handler RESUME if interrupted by filler
+- Have transcript handler FORCE INTERRUPT if command detected but VAD didn't trigger
+
+This way:
+- Fast "stop" (0.5s) → Below threshold → VAD doesn't interrupt → Transcript handler forces interrupt ✅
+- Slow "okay" (1.5s) → Above threshold → VAD interrupts → Transcript handler RESUMES ✅
+- Fast "okay" (0.3s) → Below threshold → VAD doesn't interrupt → Transcript suppresses ✅
+"""

+import logging
+import re
+import asyncio
+from typing import Optional
+from dotenv import load_dotenv
 from livekit.agents import (
-    Agent,
-    AgentServer,
-    AgentSession,
-    JobContext,
-    JobProcess,
-    MetricsCollectedEvent,
-    RunContext,
-    cli,
-    metrics,
-    room_io,
+    Agent, AgentServer, AgentSession, JobContext, JobProcess,
+    cli, UserInputTranscribedEvent, AgentStateChangedEvent,
+    UserStateChangedEvent
 )
-from livekit.agents.llm import function_tool
-from livekit.plugins import silero
+from livekit.plugins import silero, deepgram, openai, cartesia
 from livekit.plugins.turn_detector.multilingual import MultilingualModel

-# uncomment to enable Krisp background voice/noise cancellation
-# from livekit.plugins import noise_cancellation
+logger = logging.getLogger("intelligent-kelly")
+logger.setLevel(logging.INFO)
+load_dotenv()

-logger = logging.getLogger("basic-agent")
+# CONFIGURATION
+STOP_WORDS = {"wait", "stop", "finish", "hold", "pause", "halt"}
+FILLER_WORDS = {
+    "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup",
+    "hmm", "right", "uh", "um", "ah", "gotit", "isee", "ok", "k",
+    "sure", "yes", "interesting", "really", "wow", "ohh", "ooh",
+    "aha", "mhmm", "gotcha", "nice", "oh", "all", "got", "it", "i", "see"
+}
+FILLER_PHRASES = {"all right", "got it", "i see", "uh huh", "oh okay", "oh ok"}

-load_dotenv()
+def is_filler_input(transcript: str) -> bool:
+    """Check if transcript is purely a filler acknowledgment"""
+    clean = transcript.lower().strip()
+    clean_no_punct = re.sub(r'[^\w\s]', '', clean)
+
+    if clean_no_punct in FILLER_PHRASES:
+        return True
+    if clean_no_punct.replace(" ", "") in FILLER_WORDS:
+        return True
+
+    words = clean_no_punct.split()
+    if words and all(word in FILLER_WORDS for word in words):
+        return True
+    return False

+def contains_command(transcript: str) -> bool:
+    """Check if transcript contains an explicit stop command"""
+    clean = transcript.lower().strip()
+    clean_no_punct = re.sub(r'[^\w\s]', '', clean)
+    words = clean_no_punct.split()
+
+    if not words:
+        return False
+
+    # Direct command (starts with stop word)
+    if words[0] in STOP_WORDS:
+        return True
+
+    # Command after brief acknowledgment: "yeah wait", "okay stop"
+    if len(words) >= 2:
+        for i in range(len(words) - 1):
+            if words[i] in FILLER_WORDS and words[i + 1] in STOP_WORDS:
+                return True
+            if words[i] in {"but", "and"} and words[i + 1] in STOP_WORDS:
+                return True
+
+    # Avoid false positives in longer sentences
+    # "I have no idea" should NOT be a command
+    if len(words) > 3 and any(w in STOP_WORDS for w in words):
+        # Only treat as command if stop word is in first 2 positions
+        return any(words[i] in STOP_WORDS for i in range(min(2, len(words))))
+
+    return False

-class MyAgent(Agent):
+class IntelligentAgent(Agent):
     def __init__(self) -> None:
         super().__init__(
-            instructions="Your name is Kelly. You would interact with users via voice."
-            "with that in mind keep your responses concise and to the point."
-            "do not use emojis, asterisks, markdown, or other special characters in your responses."
-            "You are curious and friendly, and have a sense of humor."
-            "you will speak english to the user",
+            instructions=(
+                "Your name is Kelly. Keep responses concise and witty. "
+                "When users say things like 'yeah' or 'okay' while you're speaking, "
+                "it means they're listening - keep going! "
+                "Only stop if they explicitly say 'wait', 'stop', or 'hold on'."
+            ),
         )
-
+        self.is_speaking = False
+        self.was_interrupted_by_vad = False
+        self.last_speech_content = ""
+
     async def on_enter(self):
-        # when the agent is added to the session, it'll generate a reply
-        # according to its instructions
-        self.session.generate_reply()
-
-    # all functions annotated with @function_tool will be passed to the LLM when this
-    # agent is active
-    @function_tool
-    async def lookup_weather(
-        self, context: RunContext, location: str, latitude: str, longitude: str
-    ):
-        """Called when the user asks for weather related information.
-        Ensure the user's location (city or region) is provided.
-        When given a location, please estimate the latitude and longitude of the location and
-        do not ask the user for them.
-
-        Args:
-            location: The location they are asking for
-            latitude: The latitude of the location, do not ask user for it
-            longitude: The longitude of the location, do not ask user for it
-        """
-
-        logger.info(f"Looking up weather for {location}")
-
-        return "sunny with a temperature of 70 degrees."
-
+        await self.session.generate_reply()

 server = AgentServer()

-
 def prewarm(proc: JobProcess):
     proc.userdata["vad"] = silero.VAD.load()

-
 server.setup_fnc = prewarm

-
 @server.rtc_session()
 async def entrypoint(ctx: JobContext):
-    # each log entry will include these fields
-    ctx.log_context_fields = {
-        "room": ctx.room.name,
-    }
     session = AgentSession(
-        # Speech-to-text (STT) is your agent's ears, turning the user's speech into text that the LLM can understand
-        # See all available models at https://docs.livekit.io/agents/models/stt/
         stt="deepgram/nova-3",
-        # A Large Language Model (LLM) is your agent's brain, processing user input and generating a response
-        # See all available models at https://docs.livekit.io/agents/models/llm/
-        llm="openai/gpt-4.1-mini",
-        # Text-to-speech (TTS) is your agent's voice, turning the LLM's text into speech that the user can hear
-        # See all available models as well as voice selections at https://docs.livekit.io/agents/models/tts/
+        llm="openai/gpt-4o-mini",
         tts="cartesia/sonic-2:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
-        # VAD and turn detection are used to determine when the user is speaking and when the agent should respond
-        # See more at https://docs.livekit.io/agents/build/turns
-        turn_detection=MultilingualModel(),
         vad=ctx.proc.userdata["vad"],
-        # allow the LLM to generate a response while waiting for the end of turn
-        # See more at https://docs.livekit.io/agents/build/audio/#preemptive-generation
-        preemptive_generation=True,
-        # sometimes background noise could interrupt the agent session, these are considered false positive interruptions
-        # when it's detected, you may resume the agent's speech
-        resume_false_interruption=True,
-        false_interruption_timeout=1.0,
-    )
-
-    # log metrics as they are emitted, and total usage after session is over
-    usage_collector = metrics.UsageCollector()
-
-    @session.on("metrics_collected")
-    def _on_metrics_collected(ev: MetricsCollectedEvent):
-        metrics.log_metrics(ev.metrics)
-        usage_collector.collect(ev.metrics)
-
-    async def log_usage():
-        summary = usage_collector.get_summary()
-        logger.info(f"Usage: {summary}")
-
-    # shutdown callbacks are triggered when the session is over
-    ctx.add_shutdown_callback(log_usage)
-
-    await session.start(
-        agent=MyAgent(),
-        room=ctx.room,
-        room_options=room_io.RoomOptions(
-            audio_input=room_io.AudioInputOptions(
-                # uncomment to enable the Krisp BVC noise cancellation
-                # noise_cancellation=noise_cancellation.BVC(),
-            ),
-        ),
+        turn_detection=MultilingualModel(),
+
+        # === HYBRID STRATEGY ===
+        # Medium-low threshold: Catches most fillers but allows quick commands
+        allow_interruptions=True,
+        min_interruption_duration=0.6,  # 0.6s - faster than most fillers, slower than most commands
+        min_interruption_words=2,  # Require 2 words minimum
+
+        # Enable auto-resume for false positives
+        false_interruption_timeout=1.0,  # Wait 1s for transcript
+        resume_false_interruption=True,  # Auto-resume if false positive
+
+        preemptive_generation=False,
+        min_endpointing_delay=0.5,
+        max_endpointing_delay=2.5,
     )
-
+
+    kelly = IntelligentAgent()
+
+    logger.info("=" * 80)
+    logger.info("🚀 HYBRID INTELLIGENT INTERRUPTION HANDLER")
+    logger.info("⚙️ Strategy:")
+    logger.info(" - Medium VAD thresholds (0.6s, 2 words)")
+    logger.info(" - Auto-resume on false interruptions")
+    logger.info(" - Manual interrupt on commands that slip through")
+    logger.info(" - Transcript suppression for fillers")
+    logger.info("=" * 80)
+
+    # Track interruption state
+    vad_just_interrupted = False
+
+    @session.on("speech_created")
+    def on_speech_created(ev):
+        nonlocal vad_just_interrupted
+        kelly.is_speaking = True
+        kelly.was_interrupted_by_vad = False
+        vad_just_interrupted = False
+
+        # Store what Kelly is saying for potential resume
+        if hasattr(ev, 'speech_handle') and hasattr(ev.speech_handle, 'text'):
+            kelly.last_speech_content = ev.speech_handle.text
+
+        logger.info("🎤 KELLY STARTED SPEAKING")
+
+    @session.on("agent_state_changed")
+    def on_agent_state_changed(ev):
+        nonlocal vad_just_interrupted
+
+        logger.info(f"🎭 AGENT STATE: {ev.old_state} → {ev.new_state}")
+
+        # Detect if Kelly was interrupted while speaking
+        if ev.old_state == "speaking" and ev.new_state == "listening":
+            if kelly.is_speaking:
+                kelly.was_interrupted_by_vad = True
+                vad_just_interrupted = True
+                logger.info("⚠️ KELLY INTERRUPTED - waiting for transcript to decide action...")
+
+        if ev.new_state == "listening":
+            kelly.is_speaking = False
+
+    @session.on("user_state_changed")
+    def on_user_state_changed(ev):
+        logger.info(f"👤 USER STATE: {ev.old_state} → {ev.new_state}")
+
+    # Try to register false interruption handler
+    try:
+        @session.on("agent_false_interruption")
+        def on_false_interruption(ev):
+            if hasattr(ev, 'resumed') and ev.resumed:
+                logger.info("✅ FALSE INTERRUPTION AUTO-RESUMED by LiveKit")
+    except:
+        logger.warning("⚠️ False interruption event not available in this LiveKit version")
+
+    @session.on("user_input_transcribed")
+    def on_user_input_transcribed(ev):
+        nonlocal vad_just_interrupted
+
+        if not ev.is_final or not ev.transcript:
+            return
+
+        clean_text = re.sub(r'[^\w\s]', '', ev.transcript.lower()).strip()
+
+        logger.info(f"📝 TRANSCRIPT: '{clean_text}' | Kelly speaking: {kelly.is_speaking} | Just interrupted: {vad_just_interrupted}")
+
+        # === CASE 1: Kelly was just interrupted by VAD ===
+        if kelly.was_interrupted_by_vad or vad_just_interrupted:
+
+            if contains_command(clean_text):
+                logger.info(f"🛑 REAL COMMAND after VAD interrupt: '{clean_text}' - staying stopped")
+                kelly.was_interrupted_by_vad = False
+                vad_just_interrupted = False
+                # Allow normal processing - the interrupt was correct
+                return
+
+            elif is_filler_input(clean_text):
+                logger.info(f"🔄 FALSE INTERRUPT: '{clean_text}' was just a filler - should resume")
+                kelly.was_interrupted_by_vad = False
+                vad_just_interrupted = False
+
+                # LiveKit's resume_false_interruption should handle this automatically
+                # But we still suppress the transcript from reaching LLM
+                return
+
+            else:
+                logger.info(f"✅ REAL INPUT after interrupt: '{clean_text}' - valid interruption")
+                kelly.was_interrupted_by_vad = False
+                vad_just_interrupted = False
+                # Allow normal processing
+                return
+
+        # === CASE 2: Kelly is currently speaking (VAD didn't interrupt yet) ===
+        if kelly.is_speaking:
+
+            if contains_command(clean_text):
+                logger.info(f"🛑 STOP COMMAND while speaking: '{clean_text}' - forcing interrupt NOW")
+                session.interrupt()
+                return
+
+            elif is_filler_input(clean_text):
+                logger.info(f"🔇 FILLER while speaking: '{clean_text}' - completely ignored")
+                # Don't interrupt, don't pass to LLM
+                return
+
+            else:
+                logger.info(f"💬 REAL INPUT while speaking: '{clean_text}' - allowing interrupt")
+                session.interrupt()
+                return
+
+        # === CASE 3: Kelly is idle ===
+        if not kelly.is_speaking:
+
+            if is_filler_input(clean_text):
+                logger.info(f"🍃 FILLER while idle: '{clean_text}' - suppressed")
+                return
+
+            logger.info(f"✅ VALID INPUT while idle: '{clean_text}'")
+            # Normal processing
+
+    await session.start(agent=kelly, room=ctx.room)

 if __name__ == "__main__":
     cli.run_app(server)

From 8118f124e05b9c34105f172ac5146b218abe21d7 Mon Sep 17 00:00:00 2001
From: Sirjan Singh
Date: Tue, 3 Feb 2026 00:03:13 +0530
Subject: [PATCH 3/5] Update hybrid interruption handling strategy

Refine the hybrid interruption handling strategy to better distinguish between filler words and commands based on timing.
--- examples/voice_agents/basic_agent.py | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/examples/voice_agents/basic_agent.py b/examples/voice_agents/basic_agent.py index 472b198181..3a8298684c 100644 --- a/examples/voice_agents/basic_agent.py +++ b/examples/voice_agents/basic_agent.py @@ -1,20 +1,20 @@ """ -THE ACTUAL WORKING HYBRID SOLUTION +HYBRID INTERRUPTION HANDLING STRATEGY -Problem statement: -- "okay" said slowly (1.5s) should NOT interrupt -- "stop" said quickly (0.5s) SHOULD interrupt immediately -- Can't use duration-based filtering because it fails one of these cases +Challenge: +- Slow filler words (e.g., a 1.5s "okay") should NOT trigger an interruption. +- Quick commands (e.g., a 0.5s "stop") MUST trigger an immediate interruption. +- Pure duration-based filtering is insufficient as it cannot distinguish these cases reliably. -Solution: -- Use MEDIUM thresholds (catches most fillers, but some slip through) -- Have transcript handler RESUME if interrupted by filler -- Have transcript handler FORCE INTERRUPT if command detected but VAD didn't trigger +Implementation Strategy: +- Configure VAD with MEDIUM sensitivity: Catches most valid speech but may allow some fillers. +- Auto-Resume on Fillers: If a filler triggers an interruption, the transcript handler will resume the agent. +- Force Interrupt on Commands: If a quick command is missed by VAD, the transcript handler will enforce an interrupt. -This way: -- Fast "stop" (0.5s) → Below threshold → VAD doesn't interrupt → Transcript handler forces interrupt ✅ -- Slow "okay" (1.5s) → Above threshold → VAD interrupts → Transcript handler RESUMES ✅ -- Fast "okay" (0.3s) → Below threshold → VAD doesn't interrupt → Transcript suppresses ✅ +Outcome: +- Quick "stop" (0.5s): Ignored by VAD (too short) → Transcript Handler detects command and interrupts. ✅ +- Slow "okay" (1.5s): Triggered by VAD → Transcript Handler identifies filler and resumes speech. 
✅ +- Quick "okay" (0.3s): Ignored by VAD → Transcript Handler identifies filler and suppresses it. ✅ """ import logging From cef1b4bbdaa4f80cc5c3f4dd9bdc39adb94f9c89 Mon Sep 17 00:00:00 2001 From: Sirjan Singh Date: Tue, 3 Feb 2026 00:28:57 +0530 Subject: [PATCH 4/5] Refactor IntelligentAgent for better command handling Refactor IntelligentAgent to improve command and filler detection logic, update logging for clarity, and streamline state management. --- examples/voice_agents/basic_agent.py | 389 +++++++++++++++++---------- 1 file changed, 248 insertions(+), 141 deletions(-) diff --git a/examples/voice_agents/basic_agent.py b/examples/voice_agents/basic_agent.py index 3a8298684c..8bd2d34160 100644 --- a/examples/voice_agents/basic_agent.py +++ b/examples/voice_agents/basic_agent.py @@ -19,75 +19,144 @@ import logging import re -import asyncio -from typing import Optional from dotenv import load_dotenv from livekit.agents import ( - Agent, AgentServer, AgentSession, JobContext, JobProcess, - cli, UserInputTranscribedEvent, AgentStateChangedEvent, - UserStateChangedEvent + Agent, AgentServer, AgentSession, JobContext, JobProcess, cli ) -from livekit.plugins import silero, deepgram, openai, cartesia +from livekit.plugins import silero from livekit.plugins.turn_detector.multilingual import MultilingualModel logger = logging.getLogger("intelligent-kelly") logger.setLevel(logging.INFO) load_dotenv() -# CONFIGURATION -STOP_WORDS = {"wait", "stop", "finish", "hold", "pause", "halt"} +# ============================================================================= +# CONFIGURATION - Command and Filler Detection +# ============================================================================= + +# Single words that mean "stop" as a command +STOP_WORDS = {"wait", "stop", "finish", "hold", "pause", "halt", "enough", "quiet"} + +# Multi-word command phrases (normalized, no spaces) +STOP_PHRASES = { + "holdon", "holdonthat", "waitasec", "waitasecond", "waitaminute", + "stopit", 
"stopthat", "stopnow", "pausethat", "onemoment" +} + +# Words that can precede a stop word to form a command +COMMAND_PREFIXES = {"no", "but", "and", "okay", "ok", "yeah", "yes", "hey", "please"} + +# Pure filler/acknowledgment words (no overlap with meaningful words) FILLER_WORDS = { "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup", - "hmm", "right", "uh", "um", "ah", "gotit", "isee", "ok", "k", - "sure", "yes", "interesting", "really", "wow", "ohh", "ooh", - "aha", "mhmm", "gotcha", "nice", "oh", "all", "got", "it", "i", "see" + "hmm", "right", "uh", "um", "ah", "ok", "k", "sure", "yes", + "interesting", "really", "wow", "ohh", "ooh", "aha", "mhmm", + "gotcha", "nice", "oh", "no", "nah", "nope", "cool", "great" } -FILLER_PHRASES = {"all right", "got it", "i see", "uh huh", "oh okay", "oh ok"} -def is_filler_input(transcript: str) -> bool: - """Check if transcript is purely a filler acknowledgment""" +# Multi-word filler phrases (normalized with spaces for matching) +FILLER_PHRASES = { + "all right", "got it", "i see", "uh huh", "oh okay", "oh ok", + "oh really", "oh wow", "oh nice", "sounds good", "makes sense", + "i understand", "mm hmm" +} + + +def normalize_text(transcript: str) -> str: + """Normalize transcript for consistent matching.""" clean = transcript.lower().strip() - clean_no_punct = re.sub(r'[^\w\s]', '', clean) - - if clean_no_punct in FILLER_PHRASES: - return True - if clean_no_punct.replace(" ", "") in FILLER_WORDS: - return True - - words = clean_no_punct.split() - if words and all(word in FILLER_WORDS for word in words): - return True - return False + clean = re.sub(r'[^\w\s]', '', clean) # Remove punctuation + clean = re.sub(r'\s+', ' ', clean) # Collapse whitespace + return clean.strip() + def contains_command(transcript: str) -> bool: - """Check if transcript contains an explicit stop command""" - clean = transcript.lower().strip() - clean_no_punct = re.sub(r'[^\w\s]', '', clean) - words = clean_no_punct.split() + """ + Check
if transcript contains an explicit stop command. + MUST be checked BEFORE is_filler_input() to avoid false negatives. + """ + text = normalize_text(transcript) + words = text.split() if not words: return False - # Direct command (starts with stop word) + # Check for exact stop phrase match (e.g., "hold on") + text_no_spaces = text.replace(" ", "") + if text_no_spaces in STOP_PHRASES: + return True + + # Check for stop phrase at start (e.g., "hold on a second please") + for phrase in STOP_PHRASES: + if text_no_spaces.startswith(phrase): + return True + + # Direct command: first word is a stop word (e.g., "stop", "wait") if words[0] in STOP_WORDS: return True - # Command after brief acknowledgment: "yeah wait", "okay stop" - if len(words) >= 2: - for i in range(len(words) - 1): - if words[i] in FILLER_WORDS and words[i + 1] in STOP_WORDS: + # Command after prefix: "yeah wait", "okay stop", "no hold on", "but wait" + # Check first 3 words for pattern: [prefix] + [stop_word] + for i in range(min(3, len(words))): + if words[i] in STOP_WORDS: + # If stop word is in first 3 positions, it's likely a command + # Unless it's a long sentence where stop word is incidental + if len(words) <= 5: return True - if words[i] in {"but", "and"} and words[i + 1] in STOP_WORDS: + # For longer sentences, only count if stop word is in first 2 positions + if i < 2: return True - # Avoid false positives in longer sentences - # "I have no idea" should NOT be a command - if len(words) > 3 and any(w in STOP_WORDS for w in words): - # Only treat as command if stop word is in first 2 positions - return any(words[i] in STOP_WORDS for i in range(min(2, len(words)))) + # Pattern: prefix + stop word anywhere in first 4 words + # e.g., "okay wait a second", "no hold on please" + if len(words) >= 2: + for i in range(min(3, len(words) - 1)): + if words[i] in COMMAND_PREFIXES and words[i + 1] in STOP_WORDS: + return True + + return False + + +def is_filler_input(transcript: str) -> bool: + """ + Check 
if transcript is purely a filler acknowledgment. + Only returns True if it's DEFINITELY a filler (no command content). + """ + text = normalize_text(transcript) + + # CRITICAL: Command always takes priority - check first! + if contains_command(transcript): + return False + + # Empty or very short + if not text: + return True + + # Exact filler phrase match + if text in FILLER_PHRASES: + return True + + # Single word in filler set + words = text.split() + if len(words) == 1 and words[0] in FILLER_WORDS: + return True + + # All words are fillers (e.g., "yeah yeah", "okay um", "oh really") + if len(words) <= 3 and all(word in FILLER_WORDS for word in words): + return True + + # Compound filler check (e.g., "uhhuh" -> "uh huh") + text_no_spaces = text.replace(" ", "") + if text_no_spaces in FILLER_WORDS: + return True return False + +# ============================================================================= +# AGENT DEFINITION +# ============================================================================= + class IntelligentAgent(Agent): def __init__(self) -> None: super().__init__( @@ -98,20 +167,57 @@ def __init__(self) -> None: "Only stop if they explicitly say 'wait', 'stop', or 'hold on'." 
), ) - self.is_speaking = False - self.was_interrupted_by_vad = False - self.last_speech_content = "" - + # Simplified state: only track if agent is currently speaking + self._is_speaking = False + # Track if VAD just interrupted (waiting for transcript to classify) + self._interrupted_by_vad = False + + @property + def is_speaking(self) -> bool: + return self._is_speaking + + @is_speaking.setter + def is_speaking(self, value: bool) -> None: + self._is_speaking = value + + @property + def interrupted_by_vad(self) -> bool: + return self._interrupted_by_vad + + @interrupted_by_vad.setter + def interrupted_by_vad(self, value: bool) -> None: + self._interrupted_by_vad = value + async def on_enter(self): - await self.session.generate_reply() + # Wait for user to speak first (no preemptive greeting) + pass + + +# ============================================================================= +# SERVER SETUP +# ============================================================================= server = AgentServer() + def prewarm(proc: JobProcess): proc.userdata["vad"] = silero.VAD.load() + server.setup_fnc = prewarm + +def try_clear_user_turn(session: AgentSession) -> bool: + """Safely attempt to clear user turn to suppress LLM processing.""" + if hasattr(session, 'clear_user_turn'): + try: + session.clear_user_turn() + return True + except Exception as e: + logger.debug(f"clear_user_turn failed: {e}") + return False + + @server.rtc_session() async def entrypoint(ctx: JobContext): session = AgentSession( @@ -122,142 +228,143 @@ async def entrypoint(ctx: JobContext): turn_detection=MultilingualModel(), # === HYBRID STRATEGY === - # Medium-low threshold: Catches most fillers but allows quick commands + # Medium threshold: catches most fillers but allows quick commands through allow_interruptions=True, - min_interruption_duration=0.6, # 0.6s - faster than most fillers, slower than most commands - min_interruption_words=2, # Require 2 words minimum + min_interruption_duration=0.6, # 
0.6s - slower than most commands + min_interruption_words=2, # Require 2+ words - # Enable auto-resume for false positives - false_interruption_timeout=1.0, # Wait 1s for transcript - resume_false_interruption=True, # Auto-resume if false positive + # Enable auto-resume for false positives (LiveKit handles this) + false_interruption_timeout=1.0, + resume_false_interruption=True, preemptive_generation=False, min_endpointing_delay=0.5, max_endpointing_delay=2.5, ) - kelly = IntelligentAgent() - - logger.info("=" * 80) - logger.info("🚀 HYBRID INTELLIGENT INTERRUPTION HANDLER") - logger.info("⚙️ Strategy:") - logger.info(" - Medium VAD thresholds (0.6s, 2 words)") - logger.info(" - Auto-resume on false interruptions") - logger.info(" - Manual interrupt on commands that slip through") - logger.info(" - Transcript suppression for fillers") - logger.info("=" * 80) + agent = IntelligentAgent() - # Track interruption state - vad_just_interrupted = False + logger.info("=" * 70) + logger.info("🚀 HYBRID INTELLIGENT INTERRUPTION HANDLER v2") + logger.info(" Strategy: VAD(0.6s, 2words) + Transcript Classification") + logger.info("=" * 70) + # ------------------------------------------------------------------------- + # EVENT: Agent starts speaking + # ------------------------------------------------------------------------- @session.on("speech_created") def on_speech_created(ev): - nonlocal vad_just_interrupted - kelly.is_speaking = True - kelly.was_interrupted_by_vad = False - vad_just_interrupted = False - - # Store what Kelly is saying for potential resume - if hasattr(ev, 'speech_handle') and hasattr(ev.speech_handle, 'text'): - kelly.last_speech_content = ev.speech_handle.text - - logger.info("🎤 KELLY STARTED SPEAKING") + agent.is_speaking = True + agent.interrupted_by_vad = False + logger.info("🎤 Agent started speaking") + # ------------------------------------------------------------------------- + # EVENT: Agent state changes + # 
------------------------------------------------------------------------- @session.on("agent_state_changed") def on_agent_state_changed(ev): - nonlocal vad_just_interrupted + logger.debug(f"🎭 Agent: {ev.old_state} → {ev.new_state}") - logger.info(f"🎭 AGENT STATE: {ev.old_state} → {ev.new_state}") - - # Detect if Kelly was interrupted while speaking + # Detect VAD interruption: speaking → listening transition if ev.old_state == "speaking" and ev.new_state == "listening": - if kelly.is_speaking: - kelly.was_interrupted_by_vad = True - vad_just_interrupted = True - logger.info("⚠️ KELLY INTERRUPTED - waiting for transcript to decide action...") + if agent.is_speaking: + agent.interrupted_by_vad = True + logger.info("⚠️ VAD interrupted - waiting for transcript...") - if ev.new_state == "listening": - kelly.is_speaking = False + # Update speaking state + if ev.new_state in ("listening", "thinking"): + agent.is_speaking = False + elif ev.new_state == "speaking": + agent.is_speaking = True + # ------------------------------------------------------------------------- + # EVENT: User state changes (for logging only) + # ------------------------------------------------------------------------- @session.on("user_state_changed") def on_user_state_changed(ev): - logger.info(f"👤 USER STATE: {ev.old_state} → {ev.new_state}") - - # Try to register false interruption handler - try: - @session.on("agent_false_interruption") - def on_false_interruption(ev): - if hasattr(ev, 'resumed') and ev.resumed: - logger.info("✅ FALSE INTERRUPTION AUTO-RESUMED by LiveKit") - except: - logger.warning("⚠️ False interruption event not available in this LiveKit version") + logger.debug(f"👤 User: {ev.old_state} → {ev.new_state}") + # ------------------------------------------------------------------------- + # EVENT: Transcript received - MAIN LOGIC + # ------------------------------------------------------------------------- @session.on("user_input_transcribed") def on_user_input_transcribed(ev): - 
nonlocal vad_just_interrupted - + # Only process final transcripts if not ev.is_final or not ev.transcript: return - clean_text = re.sub(r'[^\w\s]', '', ev.transcript.lower()).strip() + text = normalize_text(ev.transcript) + if not text: + return + + # Classify the input + has_command = contains_command(text) + is_filler = is_filler_input(text) - logger.info(f"📝 TRANSCRIPT: '{clean_text}' | Kelly speaking: {kelly.is_speaking} | Just interrupted: {vad_just_interrupted}") + logger.info( + f"📝 '{text}' | speaking={agent.is_speaking} | " + f"vad_interrupted={agent.interrupted_by_vad} | " + f"cmd={has_command} | filler={is_filler}" + ) - # === CASE 1: Kelly was just interrupted by VAD === - if kelly.was_interrupted_by_vad or vad_just_interrupted: + # ================================================================= + # CASE 1: VAD just interrupted - classify and decide + # ================================================================= + if agent.interrupted_by_vad: + agent.interrupted_by_vad = False # Reset flag - if contains_command(clean_text): - logger.info(f"🛑 REAL COMMAND after VAD interrupt: '{clean_text}' - staying stopped") - kelly.was_interrupted_by_vad = False - vad_just_interrupted = False - # Allow normal processing - the interrupt was correct - return + if has_command: + # Real command - interruption was correct, let LLM process + logger.info(f"🛑 COMMAND after VAD: '{text}' - valid interrupt") + return # Allow normal LLM processing - elif is_filler_input(clean_text): - logger.info(f"🔄 FALSE INTERRUPT: '{clean_text}' was just a filler - should resume") - kelly.was_interrupted_by_vad = False - vad_just_interrupted = False - - # LiveKit's resume_false_interruption should handle this automatically - # But we still suppress the transcript from reaching LLM + if is_filler: + # False positive - LiveKit's resume_false_interruption handles resume + # Suppress transcript from LLM + logger.info(f"🔄 FILLER after VAD: '{text}' - suppressing") + 
try_clear_user_turn(session) return - else: - logger.info(f"✅ REAL INPUT after interrupt: '{clean_text}' - valid interruption") - kelly.was_interrupted_by_vad = False - vad_just_interrupted = False - # Allow normal processing - return + # Real input (not command, not filler) - valid interruption + logger.info(f"✅ REAL INPUT after VAD: '{text}'") + return # Allow normal LLM processing - # === CASE 2: Kelly is currently speaking (VAD didn't interrupt yet) === - if kelly.is_speaking: - - if contains_command(clean_text): - logger.info(f"🛑 STOP COMMAND while speaking: '{clean_text}' - forcing interrupt NOW") + # ================================================================= + # CASE 2: Agent is currently speaking (no VAD interrupt yet) + # ================================================================= + if agent.is_speaking: + if has_command: + # Force interrupt on command that VAD missed + logger.info(f"🛑 COMMAND while speaking: '{text}' - forcing interrupt") session.interrupt() - return + return # Allow LLM to process the command - elif is_filler_input(clean_text): - logger.info(f"🔇 FILLER while speaking: '{clean_text}' - completely ignored") - # Don't interrupt, don't pass to LLM + if is_filler: + # Ignore filler - don't interrupt, don't pass to LLM + logger.info(f"🔇 FILLER while speaking: '{text}' - ignored") + try_clear_user_turn(session) return - else: - logger.info(f"💬 REAL INPUT while speaking: '{clean_text}' - allowing interrupt") - session.interrupt() - return + # Real input - interrupt and let LLM process + logger.info(f"💬 INPUT while speaking: '{text}' - interrupting") + session.interrupt() + return - # === CASE 3: Kelly is idle === - if not kelly.is_speaking: - - if is_filler_input(clean_text): - logger.info(f"🍃 FILLER while idle: '{clean_text}' - suppressed") - return - - logger.info(f"✅ VALID INPUT while idle: '{clean_text}'") - # Normal processing + # ================================================================= + # CASE 3: Agent is idle (not 
speaking) + # ================================================================= + if is_filler: + # Suppress lone fillers when idle + logger.info(f"🍃 FILLER while idle: '{text}' - suppressed") + try_clear_user_turn(session) + return + + # Normal input - let LLM process + logger.info(f"✅ INPUT while idle: '{text}'") + # Allow normal processing - await session.start(agent=kelly, room=ctx.room) + await session.start(agent=agent, room=ctx.room) + if __name__ == "__main__": cli.run_app(server) From 4ad08caa08e9a4b5a3305e3203e0e11043869e46 Mon Sep 17 00:00:00 2001 From: Sirjan Singh Date: Tue, 3 Feb 2026 01:14:32 +0530 Subject: [PATCH 5/5] Enhance README with student info and clarification Added student details and improved documentation on intelligent interruption handling in voice agents. --- examples/voice_agents/README.md | 265 ++++++++++---------------------- 1 file changed, 84 insertions(+), 181 deletions(-) diff --git a/examples/voice_agents/README.md b/examples/voice_agents/README.md index 50bbe91911..a1d37bcc46 100644 --- a/examples/voice_agents/README.md +++ b/examples/voice_agents/README.md @@ -4,6 +4,12 @@ This document explains the modifications made to `basic_agent.py` to implement intelligent interruption handling that distinguishes between **filler words** (acknowledgments like "yeah", "okay") and **command words** (interruptions like "stop", "wait"). +--- +## Student Details +- **Name:** Sirjan Singh +- **College Roll Number:** 23UCS715 +- **Demo Video Link:** [Drive Link](https://drive.google.com/drive/folders/1LXnojdfCtswc14PxWH60ZqynbLN03F3J?usp=sharing) + --- ## The Challenge @@ -16,7 +22,7 @@ However, LiveKit's default Voice Activity Detection (VAD) treats ALL user speech 1. **When agent is speaking + user says filler** → Agent continues uninterrupted 2. **When agent is speaking + user says command** → Agent stops immediately 3. **When agent is silent** → All user speech is valid input -4. 
**Mixed input** → Commands always take priority over fillers +4. **Mixed input** → Commands always take priority over fillers (e.g., "yeah wait" is a command) --- @@ -63,236 +69,133 @@ false_interruption_timeout=1.0, --- -### Layer 3: Transcript-Based Manual Control -The most important layer — our custom logic that analyzes transcripts: +### Layer 3: Transcript-Based Classification (The Brain) +The most important layer — our custom logic that analyzes transcripts. This layer enforces strict priority: **Commands > Real Input > Fillers**. +#### Key Logic Flow: ```python @session.on("user_input_transcribed") def on_user_input_transcribed(ev): - # Analyze what the user actually said + text = normalize_text(ev.transcript) + + # 1. CHECK COMMANDS FIRST (Priority!) if contains_command(text): - session.interrupt() # Force stop - elif is_filler_input(text): - return # Ignore completely - else: - # Real input - allow processing + if agent.is_speaking: + session.interrupt() # Force stop if VAD missed it + return # Let LLM process the command + + # 2. CHECK FILLERS SECOND + if is_filler_input(text): + # Suppress from LLM so agent doesn't respond to "yeah" + try_clear_user_turn(session) + return + + # 3. REAL INPUT (Questions, conversation) + # Process normally ``` This handles three cases: #### Case 1: Agent Was Just Interrupted by VAD -```python -if kelly.was_interrupted_by_vad: - if contains_command(text): - # Real command - stay stopped - elif is_filler_input(text): - # False alarm - resume_false_interruption handles it - else: - # Real input - process normally -``` +- **Command:** Valid interruption, let LLM respond. +- **Filler:** False alarm! `resume_false_interruption` will auto-resume speech. We call `clear_user_turn()` so the LLM doesn't hear "yeah". +- **Real Input:** Valid interruption. 
#### Case 2: Agent Is Currently Speaking (VAD Hasn't Triggered Yet) -```python -if kelly.is_speaking: - if contains_command(text): - session.interrupt() # Force interrupt NOW - elif is_filler_input(text): - return # Completely ignore - else: - session.interrupt() # Real input - allow interrupt -``` +- **Command:** Force immediate interrupt (`session.interrupt()`). +- **Filler:** Ignore completely (`clear_user_turn()`). +- **Real Input:** Allow interrupt (`session.interrupt()`). #### Case 3: Agent Is Idle -```python -if not kelly.is_speaking: - if is_filler_input(text): - return # Suppress from LLM - # Otherwise process normally -``` +- **Command/Real Input:** Process normally. +- **Filler:** Suppress (don't wake up LLM for just "okay"). --- -## Key Code Changes +## Key Code Changes (Refactored) -### 1. Word Lists Configuration +### 1. Robust Word Lists -**Filler Words** (acknowledgments to ignore): +**Command Detection** (Stop Phrases & Prefixes): ```python -FILLER_WORDS = { - "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup", - "hmm", "right", "uh", "um", "ah", "gotit", "isee", "ok", - # ... more -} +# Single words +STOP_WORDS = {"wait", "stop", "finish", "hold", "pause", "halt", ...} -FILLER_PHRASES = { - "all right", "got it", "i see", "uh huh", "oh okay" -} +# Multi-word phrases (normalized) +STOP_PHRASES = {"holdon", "waitasecond", "stopit", "waitaminute", ...} + +# Prefixes that can precede commands +COMMAND_PREFIXES = {"no", "but", "and", "okay", "please", "hey"} ``` +*Now catches:* `"no wait"`, `"hold on"`, `"wait a second"`, `"yeah stop"` -**Command Words** (explicit stop requests): +**Filler Words** (Strict filtering): ```python -STOP_WORDS = { - "wait", "stop", "finish", "hold", "pause", "halt" +FILLER_WORDS = { + "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup", + "hmm", "right", "uh", "um", "ah", "cool", "great", "no", "nah" + # Removed generic words like "i", "see", "all" to avoid false positives } ``` ### 2. 
Detection Functions -**`is_filler_input(transcript)`** — Returns `True` if input is purely acknowledgment: -- Removes punctuation -- Checks against filler word/phrase lists -- Validates all words are filler tokens - -**`contains_command(transcript)`** — Returns `True` if input contains stop command: -- Checks if sentence starts with stop word -- Detects "filler + command" patterns ("yeah wait", "okay stop") -- Avoids false positives in longer sentences - -### 3. State Tracking +**`contains_command(transcript)`**: +- Checks for multi-word phrases (`"hold on"`). +- Checks for prefixes (`"no wait"`). +- Checks priority positions (first 3 words). -```python -class IntelligentAgent(Agent): - def __init__(self): - self.is_speaking = False # Currently generating speech - self.was_interrupted_by_vad = False # Just got interrupted by VAD - self.last_speech_content = "" # Content being spoken -``` +**`is_filler_input(transcript)`**: +- **CRITICAL:** Calls `contains_command()` first! If it's a command, it is NOT a filler. +- Only matches if input is *purely* filler words/phrases. -### 4. Event Handlers - -**`on_speech_created`** — Tracks when agent starts speaking: +### 3. 
Transcript Suppression +We use a helper to prevent the LLM from responding to fillers: ```python -@session.on("speech_created") -def on_speech_created(ev): - kelly.is_speaking = True - kelly.was_interrupted_by_vad = False +def try_clear_user_turn(session): + if hasattr(session, 'clear_user_turn'): + session.clear_user_turn() ``` -**`on_agent_state_changed`** — Detects interruptions: -```python -if ev.old_state == "speaking" and ev.new_state == "listening": - if kelly.is_speaking: - kelly.was_interrupted_by_vad = True -``` - -**`on_user_input_transcribed`** — Main interruption logic (see Layer 3 above) - --- -## Configuration Parameters - -### AgentSession Settings - -| Parameter | Value | Purpose | -|-----------|-------|---------| -| `allow_interruptions` | `True` | Enable VAD-based interruptions | -| `min_interruption_duration` | `0.6` | Require 0.6s of speech to interrupt | -| `min_interruption_words` | `2` | Require 2+ words to interrupt | -| `resume_false_interruption` | `True` | Auto-resume after false interruptions | -| `false_interruption_timeout` | `1.0` | Wait 1s before resuming | -| `preemptive_generation` | `False` | Disabled for more predictable flow | -| `min_endpointing_delay` | `0.5` | Min silence before turn ends | -| `max_endpointing_delay` | `2.5` | Max silence before turn ends | - ---- - -## How It All Works Together +## How It All Works Together (Examples) ### Scenario 1: User says "yeah" (0.3s, quick acknowledgment) -1. ✅ **VAD Layer:** Too short (0.3s < 0.6s) → No interrupt -2. ✅ **Transcript Handler:** Detects filler while speaking → Ignores -3. ✅ **Result:** Agent continues speaking smoothly +1. ✅ **VAD Layer:** Too short (< 0.6s) → No interrupt +2. ✅ **Transcript Layer:** `is_filler_input` = True. `try_clear_user_turn()` called. +3. ✅ **Result:** Agent continues speaking. LLM sees nothing. ### Scenario 2: User says "okaaaay" (1.5s, slow filler) -1. ❌ **VAD Layer:** Long enough (1.5s > 0.6s) → Interrupts agent -2. 
✅ **Resume Layer:** Waits 1s for more speech, nothing comes → Resumes -3. ✅ **Transcript Handler:** Marks as filler → Suppresses from LLM -4. ✅ **Result:** Brief pause (1s), then agent resumes - -### Scenario 3: User says "stop" (0.5s, quick command) -1. ✅ **VAD Layer:** Too short (0.5s < 0.6s) → No interrupt -2. ✅ **Transcript Handler:** Detects command → `session.interrupt()` -3. ✅ **Result:** Agent stops immediately via manual interrupt - -### Scenario 4: User says "wait a second" (1.2s, clear command) -1. ✅ **VAD Layer:** Long enough (1.2s > 0.6s) → Interrupts agent -2. ✅ **Transcript Handler:** Detects command → Stays stopped -3. ✅ **Result:** Agent stops, processes user's request - ---- - -## Testing the Solution - -### Test Cases - -1. **Filler while speaking:** - - Say "yeah", "okay", "hmm" while agent is talking - - **Expected:** Agent continues without stopping - -2. **Command while speaking:** - - Say "wait", "stop", "hold on" while agent is talking - - **Expected:** Agent stops immediately - -3. **Mixed input:** - - Say "yeah wait" while agent is talking - - **Expected:** Agent stops (command wins) - -4. **Filler while silent:** - - Say "okay" when agent is idle - - **Expected:** Ignored, doesn't trigger new response - -5. **Normal conversation:** - - Ask questions when agent is idle - - **Expected:** Normal response flow - -### Logs to Watch For - -``` -🎤 KELLY STARTED SPEAKING -📝 TRANSCRIPT: 'yeah' | Kelly speaking: True -🔇 FILLER while speaking: 'yeah' - completely ignored -``` - -``` -📝 TRANSCRIPT: 'wait' | Kelly speaking: True -🛑 STOP COMMAND while speaking: 'wait' - forcing interrupt NOW -``` - -``` -⚠️ KELLY INTERRUPTED - waiting for transcript... -📝 TRANSCRIPT: 'okay' | Just interrupted: True -🔄 FALSE INTERRUPT: 'okay' was just a filler - should resume -``` +1. ❌ **VAD Layer:** Long enough (> 0.6s) → Interrupts agent +2. ✅ **Resume Layer:** Waits 1s, decides it's a false interrupt → Resumes +3. ✅ **Transcript Layer:** `is_filler_input` = True. 
Suppresses transcript. +4. ✅ **Result:** Brief pause (1s), then agent resumes. + +### Scenario 3: User says "no wait" (Quick command) +1. ❌ **VAD Layer:** Might be too short or missed. +2. ✅ **Transcript Layer:** `contains_command` = True (catches "no" + "wait"). +3. ✅ **Action:** `session.interrupt()` forced immediately. +4. ✅ **Result:** Agent stops. LLM processes "no wait". + +### Scenario 4: User says "I have a question" +1. ✅ **Transcript Layer:** Not a command, not a filler. +2. ✅ **Action:** Real input. Interrupts agent. +3. ✅ **Result:** Standard conversation flow. --- ## Files Modified -- **`basic_agent.py`** — Main implementation with all intelligent interruption logic +- **`basic_agent.py`** — Main implementation with all intelligent interruption logic. ## Dependencies -No additional dependencies required beyond standard LiveKit Agents SDK. - ---- - -## Limitations - -1. **Brief pause on slow fillers:** If user says a filler slowly (>0.6s), there may be a ~1s pause before auto-resume -2. **Language-specific:** Word lists are currently English-focused (though some Hindi words are included) -3. **Context-unaware:** Doesn't understand semantic context (e.g., "no" as answer vs. "no" as stop command) +No additional dependencies required. Uses standard Python `re` and LiveKit Agents SDK. --- ## Future Improvements -1. **Sentiment analysis:** Use LLM to determine if "no" is a stop command or an answer -2. **Adaptive thresholds:** Learn user's speech patterns and adjust thresholds -3. **Multi-language support:** Extended word lists for other languages -4. **Prosody analysis:** Use tone/pitch to distinguish acknowledgments from commands - ---- - -## Credits - -Implementation for the **LiveKit Intelligent Interruption Handling Challenge**. +1. **Semantic Analysis:** Use a small NLU model or LLM to determine if "right" means "correct" (answer) or "continue" (filler). +2. **Prosody Analysis:** Differentiate "stop?" (question) from "STOP!"
(command) based on pitch/volume.
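
---

## Appendix: Classification Sketch

The priority order used by the transcript layer (commands first, then fillers, then real input) can be tried out without a LiveKit session. The block below is a minimal sketch under stated assumptions, not the code from `basic_agent.py`: the names `normalize`, `classify`, and `decide` are hypothetical, the word lists are trimmed for illustration, and the command check only looks at the first two words rather than the fuller rules in `contains_command()`.

```python
import re

# Hypothetical, trimmed-down word lists for illustration only.
STOP_WORDS = {"wait", "stop", "hold", "pause"}
COMMAND_PREFIXES = {"no", "but", "okay", "yeah"}
FILLER_WORDS = {"yeah", "okay", "ok", "hmm", "uh", "no", "right"}


def normalize(transcript: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", transcript.lower())
    return re.sub(r"\s+", " ", text).strip()


def classify(transcript: str) -> str:
    """Return 'command', 'filler', or 'input', checking commands first."""
    words = normalize(transcript).split()
    if not words:
        return "filler"  # nothing usable was said
    # Command: a stop word first, or directly after a single prefix word.
    if words[0] in STOP_WORDS:
        return "command"
    if (
        len(words) >= 2
        and words[0] in COMMAND_PREFIXES
        and words[1] in STOP_WORDS
    ):
        return "command"
    # Filler: a short utterance made up only of acknowledgment words.
    if len(words) <= 3 and all(w in FILLER_WORDS for w in words):
        return "filler"
    return "input"


def decide(label: str, agent_speaking: bool) -> str:
    """Map a label and the agent's speaking state to an action name."""
    if label == "filler":
        return "suppress"  # never forwarded to the LLM
    if agent_speaking:
        return "interrupt"  # commands and real input both stop the agent
    return "process"  # agent is idle, so handle the turn normally
```

Because the command check runs before the filler check, a mixed utterance such as "yeah wait" is labeled a command even though "yeah" is also in the filler list. In the real agent, the action names would presumably map to `session.interrupt()`, `try_clear_user_turn(session)`, or normal processing.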