diff --git a/examples/voice_agents/README.md b/examples/voice_agents/README.md index aa401505d1..a1d37bcc46 100644 --- a/examples/voice_agents/README.md +++ b/examples/voice_agents/README.md @@ -1,78 +1,201 @@ -# Voice Agents Examples +# Intelligent Interruption Handling for LiveKit Voice Agent -This directory contains a comprehensive collection of voice-based agent examples demonstrating various capabilities and integrations with the LiveKit Agents framework. +## Overview -## 📋 Table of Contents +This document explains the modifications made to `basic_agent.py` to implement intelligent interruption handling that distinguishes between **filler words** (acknowledgments like "yeah", "okay") and **command words** (interruptions like "stop", "wait"). -### 🚀 Getting Started +--- +## Student Details +- **Name:** Sirjan Singh +- **College Roll Number:** 23UCS715 +- **Demo Video Link:** [Drive Link](https://drive.google.com/drive/folders/1LXnojdfCtswc14PxWH60ZqynbLN03F3J?usp=sharing) + +--- -- [`basic_agent.py`](./basic_agent.py) - A fundamental voice agent with metrics collection +## The Challenge -### 🛠️ Tool Integration & Function Calling +In a natural voice conversation, users often say acknowledgment words like "yeah", "okay", or "hmm" while the agent is speaking. These are **backchannel responses** that mean "I'm listening, continue" — not "stop talking." 
-- [`annotated_tool_args.py`](./annotated_tool_args.py) - Using Python type annotations for tool arguments -- [`dynamic_tool_creation.py`](./dynamic_tool_creation.py) - Creating and registering tools dynamically at runtime -- [`raw_function_description.py`](./raw_function_description.py) - Using raw JSON schema definitions for tool descriptions -- [`silent_function_call.py`](./silent_function_call.py) - Executing function calls without verbal responses to user -- [`long_running_function.py`](./long_running_function.py) - Handling long running function calls with interruption support +However, LiveKit's default Voice Activity Detection (VAD) treats ALL user speech as potential interruptions, causing the agent to stop mid-sentence when hearing these fillers. -### ⚡ Real-time Models +**Requirements:** +1. **When agent is speaking + user says filler** → Agent continues uninterrupted +2. **When agent is speaking + user says command** → Agent stops immediately +3. **When agent is silent** → All user speech is valid input +4. 
**Mixed input** → Commands always take priority over fillers (e.g., "yeah wait" is a command) -- [`weather_agent.py`](./weather_agent.py) - OpenAI Realtime API with function calls for weather information -- [`realtime_video_agent.py`](./realtime_video_agent.py) - Google Gemini with multimodal video and voice capabilities -- [`realtime_joke_teller.py`](./realtime_joke_teller.py) - Amazon Nova Sonic real-time model with function calls -- [`realtime_load_chat_history.py`](./realtime_load_chat_history.py) - Loading previous chat history into real-time models -- [`realtime_turn_detector.py`](./realtime_turn_detector.py) - Using LiveKit's turn detection with real-time models -- [`realtime_with_tts.py`](./realtime_with_tts.py) - Combining external TTS providers with real-time models +--- -### 🎯 Pipeline Nodes & Hooks +## The Core Problem: Timing -- [`fast-preresponse.py`](./fast-preresponse.py) - Generating quick responses using the `on_user_turn_completed` node -- [`flush_llm_node.py`](./flush_llm_node.py) - Flushing partial LLM output to TTS in `llm_node` -- [`structured_output.py`](./structured_output.py) - Structured data and JSON outputs from agent responses -- [`speedup_output_audio.py`](./speedup_output_audio.py) - Dynamically adjusting agent audio playback speed -- [`timed_agent_transcript.py`](./timed_agent_transcript.py) - Reading timestamped transcripts from `transcription_node` -- [`inactive_user.py`](./inactive_user.py) - Handling inactive users with the `user_state_changed` event hook -- [`resume_interrupted_agent.py`](./resume_interrupted_agent.py) - Resuming agent speech after false interruption detection -- [`toggle_io.py`](./toggle_io.py) - Dynamically toggling audio input/output during conversations +The fundamental challenge is **VAD interrupts BEFORE transcripts arrive**: -### 🤖 Multi-agent & AgentTask Use Cases +``` +Time 0.0s: User starts saying "yeah" +Time 0.3s: VAD detects speech → Interrupts agent +Time 0.5s: User finishes saying "yeah" +Time 
0.8s: Transcript arrives → "Yeah." +``` -- [`restaurant_agent.py`](./restaurant_agent.py) - Multi-agent system for restaurant ordering and reservation management -- [`multi_agent.py`](./multi_agent.py) - Collaborative storytelling with multiple specialized agents -- [`email_example.py`](./email_example.py) - Using AgentTask to collect and validate email addresses +By the time we know it was a filler word, the agent has already stopped! -### 🔗 MCP & External Integrations +--- -- [`web_search.py`](./web_search.py) - Integrating web search capabilities into voice agents -- [`langgraph_agent.py`](./langgraph_agent.py) - LangGraph integration -- [`mcp/`](./mcp/) - Model Context Protocol (MCP) integration examples - - [`mcp-agent.py`](./mcp/mcp-agent.py) - MCP agent integration - - [`server.py`](./mcp/server.py) - MCP server example -- [`zapier_mcp_integration.py`](./zapier_mcp_integration.py) - Automating workflows with Zapier through MCP +## The Solution: Hybrid Approach -### 💾 RAG & Knowledge Management +We use a **three-layer defense system**: -- [`llamaindex-rag/`](./llamaindex-rag/) - Complete RAG implementation with LlamaIndex - - [`chat_engine.py`](./llamaindex-rag/chat_engine.py) - Chat engine integration - - [`query_engine.py`](./llamaindex-rag/query_engine.py) - Query engine used in a function tool - - [`retrieval.py`](./llamaindex-rag/retrieval.py) - Document retrieval +### Layer 1: Medium VAD Thresholds +```python +min_interruption_duration=0.6, # Requires 0.6 seconds of speech +min_interruption_words=2, # Requires at least 2 words +``` -### 🎵 Specialized Use Cases +**Purpose:** Filters out very quick, single-word fillers ("yeah!", "okay!") -- [`background_audio.py`](./background_audio.py) - Playing background audio or ambient sounds during conversations -- [`push_to_talk.py`](./push_to_talk.py) - Push-to-talk interaction -- [`tts_text_pacing.py`](./tts_text_pacing.py) - Pacing control for TTS requests -- 
[`speaker_id_multi_speaker.py`](./speaker_id_multi_speaker.py) - Multi-speaker identification +**Tradeoff:** Longer fillers (1.5s "okaaaay") can still slip through -### 📊 Tracing & Error Handling +--- -- [`langfuse_trace.py`](./langfuse_trace.py) - LangFuse integration for conversation tracing -- [`error_callback.py`](./error_callback.py) - Error handling callback -- [`session_close_callback.py`](./session_close_callback.py) - Session lifecycle management +### Layer 2: Automatic Resume on False Interruptions +```python +resume_false_interruption=True, +false_interruption_timeout=1.0, +``` -## 📖 Additional Resources +**Purpose:** If VAD interrupts the agent, LiveKit waits 1 second for more user speech. If nothing substantial comes, it automatically resumes the agent's speech. -- [LiveKit Agents Documentation](https://docs.livekit.io/agents/) -- [Agents Starter Example](https://github.com/livekit-examples/agent-starter-python) -- [More Agents Examples](https://github.com/livekit-examples/python-agents-examples) +**How it helps:** When a slow filler ("okaaaay") interrupts the agent, this mechanism resumes automatically within 1 second. + +--- + +### Layer 3: Transcript-Based Classification (The Brain) +The most important layer — our custom logic that analyzes transcripts. This layer enforces strict priority: **Commands > Real Input > Fillers**. + +#### Key Logic Flow: +```python +@session.on("user_input_transcribed") +def on_user_input_transcribed(ev): + text = normalize_text(ev.transcript) + + # 1. CHECK COMMANDS FIRST (Priority!) + if contains_command(text): + if agent.is_speaking: + session.interrupt() # Force stop if VAD missed it + return # Let LLM process the command + + # 2. CHECK FILLERS SECOND + if is_filler_input(text): + # Suppress from LLM so agent doesn't respond to "yeah" + try_clear_user_turn(session) + return + + # 3. 
REAL INPUT (Questions, conversation) + # Process normally +``` + +This handles three cases: + +#### Case 1: Agent Was Just Interrupted by VAD +- **Command:** Valid interruption, let LLM respond. +- **Filler:** False alarm! `resume_false_interruption` will auto-resume speech. We call `clear_user_turn()` so the LLM doesn't hear "yeah". +- **Real Input:** Valid interruption. + +#### Case 2: Agent Is Currently Speaking (VAD Hasn't Triggered Yet) +- **Command:** Force immediate interrupt (`session.interrupt()`). +- **Filler:** Ignore completely (`clear_user_turn()`). +- **Real Input:** Allow interrupt (`session.interrupt()`). + +#### Case 3: Agent Is Idle +- **Command/Real Input:** Process normally. +- **Filler:** Suppress (don't wake up LLM for just "okay"). + +--- + +## Key Code Changes (Refactored) + +### 1. Robust Word Lists + +**Command Detection** (Stop Phrases & Prefixes): +```python +# Single words +STOP_WORDS = {"wait", "stop", "finish", "hold", "pause", "halt", ...} + +# Multi-word phrases (normalized) +STOP_PHRASES = {"holdon", "waitasecond", "stopit", "waitaminute", ...} + +# Prefixes that can precede commands +COMMAND_PREFIXES = {"no", "but", "and", "okay", "ok", "yeah", "yes", "hey", "please"} +``` +*Now catches:* `"no wait"`, `"hold on"`, `"wait a second"`, `"yeah stop"` + +**Filler Words** (Strict filtering): +```python +FILLER_WORDS = { + "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup", + "hmm", "right", "uh", "um", "ah", "cool", "great", "no", "nah" + # Removed generic words like "i", "see", "all" to avoid false positives +} +``` + +### 2. Detection Functions + +**`contains_command(transcript)`**: +- Checks for multi-word phrases (`"hold on"`). +- Checks for prefixes (`"no wait"`). +- Checks priority positions (first 3 words). + +**`is_filler_input(transcript)`**: +- **CRITICAL:** Calls `contains_command()` first! If it's a command, it is NOT a filler. +- Only matches if input is *purely* filler words/phrases. + +### 3. 
Transcript Suppression +We use a helper to prevent the LLM from responding to fillers: +```python +def try_clear_user_turn(session): + if hasattr(session, 'clear_user_turn'): + session.clear_user_turn() +``` + +--- + +## How It All Works Together (Examples) + +### Scenario 1: User says "yeah" (0.3s, quick acknowledgment) +1. ✅ **VAD Layer:** Too short (< 0.6s) → No interrupt +2. ✅ **Transcript Layer:** `is_filler_input` = True. `try_clear_user_turn()` called. +3. ✅ **Result:** Agent continues speaking. LLM sees nothing. + +### Scenario 2: User says "okaaaay" (1.5s, slow filler) +1. ❌ **VAD Layer:** Long enough (> 0.6s) → Interrupts agent +2. ✅ **Resume Layer:** Waits 1s, decides it's a false interrupt → Resumes +3. ✅ **Transcript Layer:** `is_filler_input` = True. Suppresses transcript. +4. ✅ **Result:** Brief pause (1s), then agent resumes. + +### Scenario 3: User says "no wait" (Quick command) +1. ❌ **VAD Layer:** Might be too short or missed. +2. ✅ **Transcript Layer:** `contains_command` = True (catches "no" + "wait"). +3. ✅ **Action:** `session.interrupt()` forced immediately. +4. ✅ **Result:** Agent stops. LLM processes "no wait". + +### Scenario 4: User says "I have a question" +1. ✅ **Transcript Layer:** Not a command, not a filler. +2. ✅ **Action:** Real input. Interrupts agent. +3. ✅ **Result:** Standard conversation flow. + +--- + +## Files Modified + +- **`basic_agent.py`**: Main implementation with all intelligent interruption logic. + +## Dependencies + +No additional dependencies required. Uses Python's standard `re` module and the LiveKit Agents SDK. + +--- + +## Future Improvements + +1. **Semantic Analysis:** Use a small NLU/LLM model to determine if "right" means "correct" (answer) or "continue" (filler). +2. **Prosody Analysis:** Differentiate "stop?" (question) from "STOP!" (command) based on pitch/volume. 
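
The priority order described in this README (commands first, then fillers, then real input) can be illustrated with a small standalone sketch. This is a simplified approximation, not the code in `basic_agent.py`: the word lists are abbreviated, and `classify` and `decide` are hypothetical helper names introduced only for this example. It assumes nothing about the LiveKit SDK and only returns a label for the action the handler would take.

```python
import re

# Abbreviated word lists for illustration; the full sets live in basic_agent.py.
STOP_WORDS = {"wait", "stop", "hold", "pause"}
COMMAND_PREFIXES = {"no", "but", "okay", "yeah", "please"}
FILLER_WORDS = {"yeah", "okay", "ok", "hmm", "uh", "um", "right", "no"}


def normalize_text(transcript: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    clean = re.sub(r"[^\w\s]", "", transcript.lower())
    return re.sub(r"\s+", " ", clean).strip()


def classify(transcript: str) -> str:
    """Hypothetical helper: return 'command', 'filler', or 'input'."""
    words = normalize_text(transcript).split()
    if not words:
        return "filler"
    # Commands win: a stop word first, or a stop word right after a prefix.
    if words[0] in STOP_WORDS:
        return "command"
    if len(words) >= 2 and words[0] in COMMAND_PREFIXES and words[1] in STOP_WORDS:
        return "command"
    # Only a short utterance made up entirely of fillers counts as a filler.
    if len(words) <= 3 and all(word in FILLER_WORDS for word in words):
        return "filler"
    return "input"


def decide(transcript: str, agent_speaking: bool) -> str:
    """Hypothetical helper: map a classification to the handler's action."""
    kind = classify(transcript)
    if kind == "filler":
        return "suppress"   # never forwarded to the LLM
    if agent_speaking:
        return "interrupt"  # commands and real input stop the agent
    return "process"        # agent is idle: handle as a normal turn


if __name__ == "__main__":
    for sample in ["Yeah.", "yeah wait", "I have a question"]:
        print(sample, "->", classify(sample), "/", decide(sample, agent_speaking=True))
```

The sketch leaves out `STOP_PHRASES` and the "first 3 words" position heuristic, so it is stricter than the real `contains_command()`; it is meant only to show why "yeah wait" is treated as a command even though "yeah" alone is a filler.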
diff --git a/examples/voice_agents/basic_agent.py b/examples/voice_agents/basic_agent.py index f064dab5d7..8bd2d34160 100644 --- a/examples/voice_agents/basic_agent.py +++ b/examples/voice_agents/basic_agent.py @@ -1,67 +1,201 @@ -import logging +""" +HYBRID INTERRUPTION HANDLING STRATEGY -from dotenv import load_dotenv +Challenge: +- Slow filler words (e.g., a 1.5s "okay") should NOT trigger an interruption. +- Quick commands (e.g., a 0.5s "stop") MUST trigger an immediate interruption. +- Pure duration-based filtering is insufficient as it cannot distinguish these cases reliably. + +Implementation Strategy: +- Configure VAD with MEDIUM sensitivity: Catches most valid speech but may allow some fillers. +- Auto-Resume on Fillers: If a filler triggers an interruption, the transcript handler suppresses the transcript and LiveKit's resume_false_interruption resumes the agent. +- Force Interrupt on Commands: If a quick command is missed by VAD, the transcript handler will enforce an interrupt. +Outcome: +- Quick "stop" (0.5s): Ignored by VAD (too short) → Transcript Handler detects command and interrupts. ✅ +- Slow "okay" (1.5s): Triggered by VAD → Transcript Handler identifies filler and suppresses it; resume_false_interruption resumes speech. ✅ +- Quick "okay" (0.3s): Ignored by VAD → Transcript Handler identifies filler and suppresses it. 
✅ +""" + +import logging +import re +from dotenv import load_dotenv from livekit.agents import ( - Agent, - AgentServer, - AgentSession, - JobContext, - JobProcess, - MetricsCollectedEvent, - RunContext, - cli, - metrics, - room_io, + Agent, AgentServer, AgentSession, JobContext, JobProcess, cli ) -from livekit.agents.llm import function_tool from livekit.plugins import silero from livekit.plugins.turn_detector.multilingual import MultilingualModel -# uncomment to enable Krisp background voice/noise cancellation -# from livekit.plugins import noise_cancellation +logger = logging.getLogger("intelligent-kelly") +logger.setLevel(logging.INFO) +load_dotenv() -logger = logging.getLogger("basic-agent") +# ============================================================================= +# CONFIGURATION - Command and Filler Detection +# ============================================================================= + +# Single words that mean "stop" as a command +STOP_WORDS = {"wait", "stop", "finish", "hold", "pause", "halt", "enough", "quiet"} + +# Multi-word command phrases (normalized, no spaces) +STOP_PHRASES = { + "holdon", "holdonthat", "waitasec", "waitasecond", "waitaminute", + "stopit", "stopthat", "stopnow", "pausethat", "onemoment" +} + +# Words that can precede a stop word to form a command +COMMAND_PREFIXES = {"no", "but", "and", "okay", "ok", "yeah", "yes", "hey", "please"} + +# Pure filler/acknowledgment words (no overlap with meaningful words) +FILLER_WORDS = { + "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup", + "hmm", "right", "uh", "um", "ah", "ok", "k", "sure", "yes", + "interesting", "really", "wow", "ohh", "ooh", "aha", "mhmm", + "gotcha", "nice", "oh", "no", "nah", "nope", "cool", "great" +} + +# Multi-word filler phrases (normalized with spaces for matching) +FILLER_PHRASES = { + "all right", "got it", "i see", "uh huh", "oh okay", "oh ok", + "oh really", "oh wow", "oh nice", "sounds good", "makes sense", + "i understand", "mm hmm", "uh huh" +} 
+ + +def normalize_text(transcript: str) -> str: + """Normalize transcript for consistent matching.""" + clean = transcript.lower().strip() + clean = re.sub(r'[^\w\s]', '', clean) # Remove punctuation + clean = re.sub(r'\s+', ' ', clean) # Collapse whitespace + return clean.strip() + + +def contains_command(transcript: str) -> bool: + """ + Check if transcript contains an explicit stop command. + MUST be checked BEFORE is_filler_input() to avoid false negatives. + """ + text = normalize_text(transcript) + words = text.split() + + if not words: + return False + + # Check for exact stop phrase match (e.g., "hold on") + text_no_spaces = text.replace(" ", "") + if text_no_spaces in STOP_PHRASES: + return True + + # Check for stop phrase at start (e.g., "hold on a second please") + for phrase in STOP_PHRASES: + if text_no_spaces.startswith(phrase): + return True + + # Direct command: first word is a stop word (e.g., "stop", "wait") + if words[0] in STOP_WORDS: + return True + + # Command after prefix: "yeah wait", "okay stop", "no hold on", "but wait" + # Check first 3 words for pattern: [prefix] + [stop_word] + for i in range(min(3, len(words))): + if words[i] in STOP_WORDS: + # If stop word is in first 3 positions, it's likely a command + # Unless it's a long sentence where stop word is incidental + if len(words) <= 5: + return True + # For longer sentences, only count if stop word is in first 2 positions + if i < 2: + return True + + # Pattern: prefix + stop word anywhere in first 4 words + # e.g., "okay wait a second", "no hold on please" + if len(words) >= 2: + for i in range(min(3, len(words) - 1)): + if words[i] in COMMAND_PREFIXES and words[i + 1] in STOP_WORDS: + return True + + return False + + +def is_filler_input(transcript: str) -> bool: + """ + Check if transcript is purely a filler acknowledgment. + Only returns True if it's DEFINITELY a filler (no command content). 
+ """ + text = normalize_text(transcript) + + # CRITICAL: Command always takes priority - check first! + if contains_command(transcript): + return False + + # Empty or very short + if not text: + return True + + # Exact filler phrase match + if text in FILLER_PHRASES: + return True + + # Single word in filler set + words = text.split() + if len(words) == 1 and words[0] in FILLER_WORDS: + return True + + # All words are fillers (e.g., "yeah yeah", "okay um", "oh really") + if len(words) <= 3 and all(word in FILLER_WORDS for word in words): + return True + + # Compound filler check (e.g., "uhhuh" -> "uh huh") + text_no_spaces = text.replace(" ", "") + if text_no_spaces in FILLER_WORDS: + return True + + return False -load_dotenv() +# ============================================================================= +# AGENT DEFINITION +# ============================================================================= -class MyAgent(Agent): +class IntelligentAgent(Agent): def __init__(self) -> None: super().__init__( - instructions="Your name is Kelly. You would interact with users via voice." - "with that in mind keep your responses concise and to the point." - "do not use emojis, asterisks, markdown, or other special characters in your responses." - "You are curious and friendly, and have a sense of humor." - "you will speak english to the user", + instructions=( + "Your name is Kelly. Keep responses concise and witty. " + "When users say things like 'yeah' or 'okay' while you're speaking, " + "it means they're listening - keep going! " + "Only stop if they explicitly say 'wait', 'stop', or 'hold on'." 
+ ), ) + # Simplified state: only track if agent is currently speaking + self._is_speaking = False + # Track if VAD just interrupted (waiting for transcript to classify) + self._interrupted_by_vad = False + + @property + def is_speaking(self) -> bool: + return self._is_speaking + + @is_speaking.setter + def is_speaking(self, value: bool) -> None: + self._is_speaking = value + + @property + def interrupted_by_vad(self) -> bool: + return self._interrupted_by_vad + + @interrupted_by_vad.setter + def interrupted_by_vad(self, value: bool) -> None: + self._interrupted_by_vad = value async def on_enter(self): - # when the agent is added to the session, it'll generate a reply - # according to its instructions - self.session.generate_reply() - - # all functions annotated with @function_tool will be passed to the LLM when this - # agent is active - @function_tool - async def lookup_weather( - self, context: RunContext, location: str, latitude: str, longitude: str - ): - """Called when the user asks for weather related information. - Ensure the user's location (city or region) is provided. - When given a location, please estimate the latitude and longitude of the location and - do not ask the user for them. - - Args: - location: The location they are asking for - latitude: The latitude of the location, do not ask user for it - longitude: The longitude of the location, do not ask user for it - """ + # Wait for user to speak first (no preemptive greeting) + pass - logger.info(f"Looking up weather for {location}") - - return "sunny with a temperature of 70 degrees." 
+# ============================================================================= +# SERVER SETUP +# ============================================================================= server = AgentServer() @@ -73,60 +207,163 @@ def prewarm(proc: JobProcess): server.setup_fnc = prewarm +def try_clear_user_turn(session: AgentSession) -> bool: + """Safely attempt to clear user turn to suppress LLM processing.""" + if hasattr(session, 'clear_user_turn'): + try: + session.clear_user_turn() + return True + except Exception as e: + logger.debug(f"clear_user_turn failed: {e}") + return False + + @server.rtc_session() async def entrypoint(ctx: JobContext): - # each log entry will include these fields - ctx.log_context_fields = { - "room": ctx.room.name, - } session = AgentSession( - # Speech-to-text (STT) is your agent's ears, turning the user's speech into text that the LLM can understand - # See all available models at https://docs.livekit.io/agents/models/stt/ stt="deepgram/nova-3", - # A Large Language Model (LLM) is your agent's brain, processing user input and generating a response - # See all available models at https://docs.livekit.io/agents/models/llm/ - llm="openai/gpt-4.1-mini", - # Text-to-speech (TTS) is your agent's voice, turning the LLM's text into speech that the user can hear - # See all available models as well as voice selections at https://docs.livekit.io/agents/models/tts/ + llm="openai/gpt-4o-mini", tts="cartesia/sonic-2:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc", - # VAD and turn detection are used to determine when the user is speaking and when the agent should respond - # See more at https://docs.livekit.io/agents/build/turns - turn_detection=MultilingualModel(), vad=ctx.proc.userdata["vad"], - # allow the LLM to generate a response while waiting for the end of turn - # See more at https://docs.livekit.io/agents/build/audio/#preemptive-generation - preemptive_generation=True, - # sometimes background noise could interrupt the agent session, these are 
considered false positive interruptions - # when it's detected, you may resume the agent's speech - resume_false_interruption=True, + turn_detection=MultilingualModel(), + + # === HYBRID STRATEGY === + # Medium threshold: filters out most quick fillers; quick commands that VAD skips are caught by the transcript handler + allow_interruptions=True, + min_interruption_duration=0.6, # Require 0.6s of speech before VAD interrupts + min_interruption_words=2, # Require 2+ words + + # Enable auto-resume for false positives (LiveKit handles this) false_interruption_timeout=1.0, + resume_false_interruption=True, + + preemptive_generation=False, + min_endpointing_delay=0.5, + max_endpointing_delay=2.5, ) - - # log metrics as they are emitted, and total usage after session is over - usage_collector = metrics.UsageCollector() - - @session.on("metrics_collected") - def _on_metrics_collected(ev: MetricsCollectedEvent): - metrics.log_metrics(ev.metrics) - usage_collector.collect(ev.metrics) - - async def log_usage(): - summary = usage_collector.get_summary() - logger.info(f"Usage: {summary}") - - # shutdown callbacks are triggered when the session is over - ctx.add_shutdown_callback(log_usage) - - await session.start( - agent=MyAgent(), - room=ctx.room, - room_options=room_io.RoomOptions( - audio_input=room_io.AudioInputOptions( - # uncomment to enable the Krisp BVC noise cancellation - # noise_cancellation=noise_cancellation.BVC(), - ), - ), - ) + + agent = IntelligentAgent() + + logger.info("=" * 70) + logger.info("🚀 HYBRID INTELLIGENT INTERRUPTION HANDLER v2") + logger.info(" Strategy: VAD(0.6s, 2words) + Transcript Classification") + logger.info("=" * 70) + + # ------------------------------------------------------------------------- + # EVENT: Agent starts speaking + # ------------------------------------------------------------------------- + @session.on("speech_created") + def on_speech_created(ev): + agent.is_speaking = True + agent.interrupted_by_vad = False + logger.info("🎤 Agent started speaking") + + # 
------------------------------------------------------------------------- + # EVENT: Agent state changes + # ------------------------------------------------------------------------- + @session.on("agent_state_changed") + def on_agent_state_changed(ev): + logger.debug(f"🎭 Agent: {ev.old_state} → {ev.new_state}") + + # Detect VAD interruption: speaking → listening transition + if ev.old_state == "speaking" and ev.new_state == "listening": + if agent.is_speaking: + agent.interrupted_by_vad = True + logger.info("⚠️ VAD interrupted - waiting for transcript...") + + # Update speaking state + if ev.new_state in ("listening", "thinking"): + agent.is_speaking = False + elif ev.new_state == "speaking": + agent.is_speaking = True + + # ------------------------------------------------------------------------- + # EVENT: User state changes (for logging only) + # ------------------------------------------------------------------------- + @session.on("user_state_changed") + def on_user_state_changed(ev): + logger.debug(f"👤 User: {ev.old_state} → {ev.new_state}") + + # ------------------------------------------------------------------------- + # EVENT: Transcript received - MAIN LOGIC + # ------------------------------------------------------------------------- + @session.on("user_input_transcribed") + def on_user_input_transcribed(ev): + # Only process final transcripts + if not ev.is_final or not ev.transcript: + return + + text = normalize_text(ev.transcript) + if not text: + return + + # Classify the input + has_command = contains_command(text) + is_filler = is_filler_input(text) + + logger.info( + f"📝 '{text}' | speaking={agent.is_speaking} | " + f"vad_interrupted={agent.interrupted_by_vad} | " + f"cmd={has_command} | filler={is_filler}" + ) + + # ================================================================= + # CASE 1: VAD just interrupted - classify and decide + # ================================================================= + if agent.interrupted_by_vad: + 
agent.interrupted_by_vad = False # Reset flag + + if has_command: + # Real command - interruption was correct, let LLM process + logger.info(f"🛑 COMMAND after VAD: '{text}' - valid interrupt") + return # Allow normal LLM processing + + if is_filler: + # False positive - LiveKit's resume_false_interruption handles resume + # Suppress transcript from LLM + logger.info(f"🔄 FILLER after VAD: '{text}' - suppressing") + try_clear_user_turn(session) + return + + # Real input (not command, not filler) - valid interruption + logger.info(f"✅ REAL INPUT after VAD: '{text}'") + return # Allow normal LLM processing + + # ================================================================= + # CASE 2: Agent is currently speaking (no VAD interrupt yet) + # ================================================================= + if agent.is_speaking: + if has_command: + # Force interrupt on command that VAD missed + logger.info(f"🛑 COMMAND while speaking: '{text}' - forcing interrupt") + session.interrupt() + return # Allow LLM to process the command + + if is_filler: + # Ignore filler - don't interrupt, don't pass to LLM + logger.info(f"🔇 FILLER while speaking: '{text}' - ignored") + try_clear_user_turn(session) + return + + # Real input - interrupt and let LLM process + logger.info(f"💬 INPUT while speaking: '{text}' - interrupting") + session.interrupt() + return + + # ================================================================= + # CASE 3: Agent is idle (not speaking) + # ================================================================= + if is_filler: + # Suppress lone fillers when idle + logger.info(f"🍃 FILLER while idle: '{text}' - suppressed") + try_clear_user_turn(session) + return + + # Normal input - let LLM process + logger.info(f"✅ INPUT while idle: '{text}'") + # Allow normal processing + + await session.start(agent=agent, room=ctx.room) if __name__ == "__main__":