From 78718729490b5e3f12dabe5c276833152c822a43 Mon Sep 17 00:00:00 2001
From: Sirjan Singh <sirjan.singh036@gmail.com>
Date: Mon, 2 Feb 2026 23:33:17 +0530
Subject: [PATCH 1/5] Revise README for intelligent interruption handling

Updated README to reflect intelligent interruption handling implementation for LiveKit Voice Agent, detailing challenges, solutions, and key code changes.
---
 examples/voice_agents/README.md | 332 ++++++++++++++++++++++++++------
 1 file changed, 276 insertions(+), 56 deletions(-)

diff --git a/examples/voice_agents/README.md b/examples/voice_agents/README.md
index aa401505d1..50bbe91911 100644
--- a/examples/voice_agents/README.md
+++ b/examples/voice_agents/README.md
@@ -1,78 +1,298 @@
-# Voice Agents Examples
+# Intelligent Interruption Handling for LiveKit Voice Agent
 
-This directory contains a comprehensive collection of voice-based agent examples demonstrating various capabilities and integrations with the LiveKit Agents framework.
+## Overview
 
-## 📋 Table of Contents
+This document explains the modifications made to `basic_agent.py` to implement intelligent interruption handling that distinguishes between **filler words** (acknowledgments like "yeah", "okay") and **command words** (interruptions like "stop", "wait").
 
-### 🚀 Getting Started
+---
 
-- [`basic_agent.py`](./basic_agent.py) - A fundamental voice agent with metrics collection
+## The Challenge
 
-### 🛠️ Tool Integration & Function Calling
+In a natural voice conversation, users often say acknowledgment words like "yeah", "okay", or "hmm" while the agent is speaking. These are **backchannel responses** that mean "I'm listening, continue" — not "stop talking."
 
-- [`annotated_tool_args.py`](./annotated_tool_args.py) - Using Python type annotations for tool arguments
-- [`dynamic_tool_creation.py`](./dynamic_tool_creation.py) - Creating and registering tools dynamically at runtime
-- [`raw_function_description.py`](./raw_function_description.py) - Using raw JSON schema definitions for tool descriptions
-- [`silent_function_call.py`](./silent_function_call.py) - Executing function calls without verbal responses to user
-- [`long_running_function.py`](./long_running_function.py) - Handling long running function calls with interruption support
+However, LiveKit's default Voice Activity Detection (VAD) treats ALL user speech as potential interruptions, causing the agent to stop mid-sentence when hearing these fillers.
 
-### ⚡ Real-time Models
+**Requirements:**
+1. **When agent is speaking + user says filler** → Agent continues uninterrupted
+2. **When agent is speaking + user says command** → Agent stops immediately  
+3. **When agent is silent** → All user speech is valid input
+4. **Mixed input** → Commands always take priority over fillers
 
-- [`weather_agent.py`](./weather_agent.py) - OpenAI Realtime API with function calls for weather information
-- [`realtime_video_agent.py`](./realtime_video_agent.py) - Google Gemini with multimodal video and voice capabilities
-- [`realtime_joke_teller.py`](./realtime_joke_teller.py) - Amazon Nova Sonic real-time model with function calls
-- [`realtime_load_chat_history.py`](./realtime_load_chat_history.py) - Loading previous chat history into real-time models
-- [`realtime_turn_detector.py`](./realtime_turn_detector.py) - Using LiveKit's turn detection with real-time models
-- [`realtime_with_tts.py`](./realtime_with_tts.py) - Combining external TTS providers with real-time models
+---
 
-### 🎯 Pipeline Nodes & Hooks
+## The Core Problem: Timing
 
-- [`fast-preresponse.py`](./fast-preresponse.py) - Generating quick responses using the `on_user_turn_completed` node
-- [`flush_llm_node.py`](./flush_llm_node.py) - Flushing partial LLM output to TTS in `llm_node`
-- [`structured_output.py`](./structured_output.py) - Structured data and JSON outputs from agent responses
-- [`speedup_output_audio.py`](./speedup_output_audio.py) - Dynamically adjusting agent audio playback speed
-- [`timed_agent_transcript.py`](./timed_agent_transcript.py) - Reading timestamped transcripts from `transcription_node`
-- [`inactive_user.py`](./inactive_user.py) - Handling inactive users with the `user_state_changed` event hook
-- [`resume_interrupted_agent.py`](./resume_interrupted_agent.py) - Resuming agent speech after false interruption detection
-- [`toggle_io.py`](./toggle_io.py) - Dynamically toggling audio input/output during conversations
+The fundamental challenge is that **VAD interrupts before transcripts arrive**:
 
-### 🤖 Multi-agent & AgentTask Use Cases
+```
+Time 0.0s: User starts saying "yeah"
+Time 0.3s: VAD detects speech → Interrupts agent
+Time 0.5s: User finishes saying "yeah"  
+Time 0.8s: Transcript arrives → "Yeah."
+```
 
-- [`restaurant_agent.py`](./restaurant_agent.py) - Multi-agent system for restaurant ordering and reservation management
-- [`multi_agent.py`](./multi_agent.py) - Collaborative storytelling with multiple specialized agents
-- [`email_example.py`](./email_example.py) - Using AgentTask to collect and validate email addresses
+By the time we know it was a filler word, the agent has already stopped!
 
-### 🔗 MCP & External Integrations
+---
 
-- [`web_search.py`](./web_search.py) - Integrating web search capabilities into voice agents
-- [`langgraph_agent.py`](./langgraph_agent.py) - LangGraph integration
-- [`mcp/`](./mcp/) - Model Context Protocol (MCP) integration examples
-  - [`mcp-agent.py`](./mcp/mcp-agent.py) - MCP agent integration
-  - [`server.py`](./mcp/server.py) - MCP server example
-- [`zapier_mcp_integration.py`](./zapier_mcp_integration.py) - Automating workflows with Zapier through MCP
+## The Solution: Hybrid Approach
 
-### 💾 RAG & Knowledge Management
+We use a **three-layer defense system**:
 
-- [`llamaindex-rag/`](./llamaindex-rag/) - Complete RAG implementation with LlamaIndex
-  - [`chat_engine.py`](./llamaindex-rag/chat_engine.py) - Chat engine integration
-  - [`query_engine.py`](./llamaindex-rag/query_engine.py) - Query engine used in a function tool
-  - [`retrieval.py`](./llamaindex-rag/retrieval.py) - Document retrieval
+### Layer 1: Medium VAD Thresholds
+```python
+min_interruption_duration=0.6,  # Requires 0.6 seconds of speech
+min_interruption_words=2,        # Requires at least 2 words
+```
 
-### 🎵 Specialized Use Cases
+**Purpose:** Filters out very quick, single-word fillers ("yeah!", "okay!")
 
-- [`background_audio.py`](./background_audio.py) - Playing background audio or ambient sounds during conversations
-- [`push_to_talk.py`](./push_to_talk.py) - Push-to-talk interaction
-- [`tts_text_pacing.py`](./tts_text_pacing.py) - Pacing control for TTS requests
-- [`speaker_id_multi_speaker.py`](./speaker_id_multi_speaker.py) - Multi-speaker identification
+**Tradeoff:** Longer fillers (1.5s "okaaaay") can still slip through
 
-### 📊 Tracing & Error Handling
+---
 
-- [`langfuse_trace.py`](./langfuse_trace.py) - LangFuse integration for conversation tracing
-- [`error_callback.py`](./error_callback.py) - Error handling callback
-- [`session_close_callback.py`](./session_close_callback.py) - Session lifecycle management
+### Layer 2: Automatic Resume on False Interruptions
+```python
+resume_false_interruption=True,
+false_interruption_timeout=1.0,
+```
 
-## 📖 Additional Resources
+**Purpose:** If VAD interrupts the agent, LiveKit waits 1 second for more user speech. If nothing substantial comes, it automatically resumes the agent's speech.
 
-- [LiveKit Agents Documentation](https://docs.livekit.io/agents/)
-- [Agents Starter Example](https://github.com/livekit-examples/agent-starter-python)
-- [More Agents Examples](https://github.com/livekit-examples/python-agents-examples)
+**How it helps:** When a slow filler ("okaaaay") interrupts the agent, this mechanism resumes automatically within 1 second.
+
+---
+
+### Layer 3: Transcript-Based Manual Control
+The most important layer — our custom logic that analyzes transcripts:
+
+```python
+@session.on("user_input_transcribed")
+def on_user_input_transcribed(ev):
+    text = ev.transcript  # What the user actually said
+    if contains_command(text):
+        session.interrupt()  # Force stop
+    elif is_filler_input(text):
+        return  # Filler: take no action
+    else:
+        pass  # Real input - allow processing
+```
+
+This handles three cases:
+
+#### Case 1: Agent Was Just Interrupted by VAD
+```python
+if kelly.was_interrupted_by_vad:
+    if contains_command(text):
+        pass  # Real command - stay stopped
+    elif is_filler_input(text):
+        pass  # False alarm - resume_false_interruption handles it
+    else:
+        pass  # Real input - process normally
+```
+
+#### Case 2: Agent Is Currently Speaking (VAD Hasn't Triggered Yet)
+```python
+if kelly.is_speaking:
+    if contains_command(text):
+        session.interrupt()  # Force interrupt NOW
+    elif is_filler_input(text):
+        return  # Completely ignore
+    else:
+        session.interrupt()  # Real input - allow interrupt
+```
+
+#### Case 3: Agent Is Idle
+```python
+if not kelly.is_speaking:
+    if is_filler_input(text):
+        return  # Filler while idle: the handler takes no action
+    # Otherwise process normally
+```
+
+---
+
+## Key Code Changes
+
+### 1. Word Lists Configuration
+
+**Filler Words** (acknowledgments to ignore):
+```python
+FILLER_WORDS = {
+    "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup",
+    "hmm", "right", "uh", "um", "ah", "gotit", "isee", "ok",
+    # ... more
+}
+
+FILLER_PHRASES = {
+    "all right", "got it", "i see", "uh huh", "oh okay"
+}
+```
+
+**Command Words** (explicit stop requests):
+```python
+STOP_WORDS = {
+    "wait", "stop", "finish", "hold", "pause", "halt"
+}
+```
+
+### 2. Detection Functions
+
+**`is_filler_input(transcript)`** — Returns `True` if input is purely acknowledgment:
+- Removes punctuation
+- Checks against filler word/phrase lists
+- Validates all words are filler tokens
+
+**`contains_command(transcript)`** — Returns `True` if input contains stop command:
+- Checks if sentence starts with stop word
+- Detects "filler + command" patterns ("yeah wait", "okay stop")
+- Avoids false positives in longer sentences
+
+### 3. State Tracking
+
+```python
+class IntelligentAgent(Agent):
+    def __init__(self):
+        self.is_speaking = False           # Currently generating speech
+        self.was_interrupted_by_vad = False  # Just got interrupted by VAD
+        self.last_speech_content = ""      # Content being spoken
+```
+
+### 4. Event Handlers
+
+**`on_speech_created`** — Tracks when agent starts speaking:
+```python
+@session.on("speech_created")
+def on_speech_created(ev):
+    kelly.is_speaking = True
+    kelly.was_interrupted_by_vad = False
+```
+
+**`on_agent_state_changed`** — Detects interruptions:
+```python
+if ev.old_state == "speaking" and ev.new_state == "listening":
+    if kelly.is_speaking:
+        kelly.was_interrupted_by_vad = True
+```
+
+**`on_user_input_transcribed`** — Main interruption logic (see Layer 3 above)
+
+---
+
+## Configuration Parameters
+
+### AgentSession Settings
+
+| Parameter | Value | Purpose |
+|-----------|-------|---------|
+| `allow_interruptions` | `True` | Enable VAD-based interruptions |
+| `min_interruption_duration` | `0.6` | Require 0.6s of speech to interrupt |
+| `min_interruption_words` | `2` | Require 2+ words to interrupt |
+| `resume_false_interruption` | `True` | Auto-resume after false interruptions |
+| `false_interruption_timeout` | `1.0` | Wait 1s before resuming |
+| `preemptive_generation` | `False` | Disabled for more predictable flow |
+| `min_endpointing_delay` | `0.5` | Min silence before turn ends |
+| `max_endpointing_delay` | `2.5` | Max silence before turn ends |
+
+---
+
+## How It All Works Together
+
+### Scenario 1: User says "yeah" (0.3s, quick acknowledgment)
+1. ✅ **VAD Layer:** Too short (0.3s < 0.6s) → No interrupt
+2. ✅ **Transcript Handler:** Detects filler while speaking → Ignores
+3. ✅ **Result:** Agent continues speaking smoothly
+
+### Scenario 2: User says "okaaaay" (1.5s, slow filler)
+1. ❌ **VAD Layer:** Long enough (1.5s > 0.6s) → Interrupts agent
+2. ✅ **Resume Layer:** Waits 1s for more speech, nothing comes → Resumes
+3. ✅ **Transcript Handler:** Classifies the transcript as a filler → Takes no further action
+4. ✅ **Result:** Brief pause (1s), then agent resumes
+
+### Scenario 3: User says "stop" (0.5s, quick command)
+1. ✅ **VAD Layer:** Too short (0.5s < 0.6s) → No interrupt
+2. ✅ **Transcript Handler:** Detects command → `session.interrupt()`
+3. ✅ **Result:** Agent stops immediately via manual interrupt
+
+### Scenario 4: User says "wait a second" (1.2s, clear command)
+1. ✅ **VAD Layer:** Long enough (1.2s > 0.6s) → Interrupts agent
+2. ✅ **Transcript Handler:** Detects command → Stays stopped
+3. ✅ **Result:** Agent stops, processes user's request
+
+---
+
+## Testing the Solution
+
+### Test Cases
+
+1. **Filler while speaking:**
+   - Say "yeah", "okay", "hmm" while agent is talking
+   - **Expected:** Agent continues without stopping
+
+2. **Command while speaking:**
+   - Say "wait", "stop", "hold on" while agent is talking
+   - **Expected:** Agent stops immediately
+
+3. **Mixed input:**
+   - Say "yeah wait" while agent is talking
+   - **Expected:** Agent stops (command wins)
+
+4. **Filler while silent:**
+   - Say "okay" when agent is idle
+   - **Expected:** Ignored, doesn't trigger new response
+
+5. **Normal conversation:**
+   - Ask questions when agent is idle
+   - **Expected:** Normal response flow
+
+### Logs to Watch For
+
+```
+🎤 KELLY STARTED SPEAKING
+📝 TRANSCRIPT: 'yeah' | Kelly speaking: True
+🔇 FILLER while speaking: 'yeah' - completely ignored
+```
+
+```
+📝 TRANSCRIPT: 'wait' | Kelly speaking: True  
+🛑 STOP COMMAND while speaking: 'wait' - forcing interrupt NOW
+```
+
+```
+⚠️ KELLY INTERRUPTED - waiting for transcript...
+📝 TRANSCRIPT: 'okay' | Just interrupted: True
+🔄 FALSE INTERRUPT: 'okay' was just a filler - should resume
+```
+
+---
+
+## Files Modified
+
+- **`basic_agent.py`** — Main implementation with all intelligent interruption logic
+
+## Dependencies
+
+No additional dependencies are required beyond the standard LiveKit Agents SDK.
+
+---
+
+## Limitations
+
+1. **Brief pause on slow fillers:** If the user says a filler slowly (>0.6s), there may be a ~1s pause before auto-resume
+2. **Language-specific:** Word lists are currently English-focused (though some Hindi words are included)
+3. **Context-unaware:** Doesn't understand semantic context (e.g., "no" as answer vs. "no" as stop command)
+
+---
+
+## Future Improvements
+
+1. **Intent classification:** Use an LLM to determine whether "no" is a stop command or an answer
+2. **Adaptive thresholds:** Learn user's speech patterns and adjust thresholds
+3. **Multi-language support:** Extended word lists for other languages
+4. **Prosody analysis:** Use tone/pitch to distinguish acknowledgments from commands
+
+---
+
+## Credits
+
+Implementation for the **LiveKit Intelligent Interruption Handling Challenge**.
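
Note on the README above (illustrative sketch, not part of the diff): the Layer 3 cases can be condensed into one pure function, which makes the decision table easy to unit-test without a live session. The name `decide_action`, the return labels, and the shortened word lists are assumptions made for this sketch. The patch itself uses `contains_command` and `is_filler_input`, with longer lists and position-based command matching.

```python
import re

# Shortened word lists for illustration only; the patch defines longer sets.
STOP_WORDS = {"wait", "stop", "hold", "pause", "halt"}
FILLER_WORDS = {"yeah", "okay", "ok", "hmm", "mhm", "right"}


def normalize(text: str) -> list:
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def decide_action(text: str, agent_speaking: bool) -> str:
    """Map a final transcript to 'interrupt', 'ignore', or 'process' (hypothetical labels)."""
    words = normalize(text)
    if not words:
        return "ignore"
    has_command = any(w in STOP_WORDS for w in words)
    only_fillers = all(w in FILLER_WORDS for w in words)
    if agent_speaking and has_command:
        return "interrupt"  # commands win, even after a filler ("yeah wait")
    if only_fillers:
        return "ignore"  # backchannel: keep talking, or stay idle
    if agent_speaking:
        return "interrupt"  # substantive input while the agent is talking
    return "process"  # agent is idle: treat as a normal user turn


if __name__ == "__main__":
    samples = [("Yeah.", True), ("yeah, wait", True),
               ("okay", False), ("What's the weather?", False)]
    for text, speaking in samples:
        print(f"{text!r:24} speaking={speaking!s:5} -> {decide_action(text, speaking)}")
```

In an event handler, `"interrupt"` would presumably map to `session.interrupt()`, while the other two labels would leave the session untouched.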

From 76fefa40ac2e8da1aabf7bebe5145549e07f976c Mon Sep 17 00:00:00 2001
From: Sirjan Singh <sirjan.singh036@gmail.com>
Date: Mon, 2 Feb 2026 23:56:21 +0530
Subject: [PATCH 2/5] Enhance hybrid interruption handling in basic agent

Refactor basic agent to implement a hybrid interruption strategy for better command recognition and filler handling.
---
 examples/voice_agents/basic_agent.py | 332 +++++++++++++++++++--------
 1 file changed, 231 insertions(+), 101 deletions(-)

diff --git a/examples/voice_agents/basic_agent.py b/examples/voice_agents/basic_agent.py
index f064dab5d7..472b198181 100644
--- a/examples/voice_agents/basic_agent.py
+++ b/examples/voice_agents/basic_agent.py
@@ -1,133 +1,263 @@
-import logging
+"""
+THE ACTUAL WORKING HYBRID SOLUTION
 
-from dotenv import load_dotenv
+Problem statement:
+- "okay" said slowly (1.5s) should NOT interrupt
+- "stop" said quickly (0.5s) SHOULD interrupt immediately
+- Can't use duration-based filtering because it fails one of these cases
+
+Solution: 
+- Use MEDIUM thresholds (catches most fillers, but some slip through)
+- Have transcript handler RESUME if interrupted by filler
+- Have transcript handler FORCE INTERRUPT if command detected but VAD didn't trigger
+
+This way:
+- Fast "stop" (0.5s) → Below threshold → VAD doesn't interrupt → Transcript handler forces interrupt ✅
+- Slow "okay" (1.5s) → Above threshold → VAD interrupts → Transcript handler RESUMES ✅
+- Fast "okay" (0.3s) → Below threshold → VAD doesn't interrupt → Transcript suppresses ✅
+"""
 
+import logging
+import re
+import asyncio
+from typing import Optional
+from dotenv import load_dotenv
 from livekit.agents import (
-    Agent,
-    AgentServer,
-    AgentSession,
-    JobContext,
-    JobProcess,
-    MetricsCollectedEvent,
-    RunContext,
-    cli,
-    metrics,
-    room_io,
+    Agent, AgentServer, AgentSession, JobContext, JobProcess,
+    cli, UserInputTranscribedEvent, AgentStateChangedEvent,
+    UserStateChangedEvent
 )
-from livekit.agents.llm import function_tool
-from livekit.plugins import silero
+from livekit.plugins import silero, deepgram, openai, cartesia
 from livekit.plugins.turn_detector.multilingual import MultilingualModel
 
-# uncomment to enable Krisp background voice/noise cancellation
-# from livekit.plugins import noise_cancellation
+logger = logging.getLogger("intelligent-kelly")
+logger.setLevel(logging.INFO)
+load_dotenv()
 
-logger = logging.getLogger("basic-agent")
+# CONFIGURATION
+STOP_WORDS = {"wait", "stop", "finish", "hold", "pause", "halt"}
+FILLER_WORDS = {
+    "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup",
+    "hmm", "right", "uh", "um", "ah", "gotit", "isee", "ok", "k",
+    "sure", "yes", "interesting", "really", "wow", "ohh", "ooh",
+    "aha", "mhmm", "gotcha", "nice", "oh", "all", "got", "it", "i", "see"
+}
+FILLER_PHRASES = {"all right", "got it", "i see", "uh huh", "oh okay", "oh ok"}
 
-load_dotenv()
+def is_filler_input(transcript: str) -> bool:
+    """Check if transcript is purely a filler acknowledgment"""
+    clean = transcript.lower().strip()
+    clean_no_punct = re.sub(r'[^\w\s]', '', clean)
+    
+    if clean_no_punct in FILLER_PHRASES:
+        return True
+    if clean_no_punct.replace(" ", "") in FILLER_WORDS:
+        return True
+    
+    words = clean_no_punct.split()
+    if words and all(word in FILLER_WORDS for word in words):
+        return True
+    return False
 
+def contains_command(transcript: str) -> bool:
+    """Check if transcript contains an explicit stop command"""
+    clean = transcript.lower().strip()
+    clean_no_punct = re.sub(r'[^\w\s]', '', clean)
+    words = clean_no_punct.split()
+    
+    if not words:
+        return False
+    
+    # Direct command (starts with stop word)
+    if words[0] in STOP_WORDS:
+        return True
+    
+    # Command after brief acknowledgment: "yeah wait", "okay stop"
+    if len(words) >= 2:
+        for i in range(len(words) - 1):
+            if words[i] in FILLER_WORDS and words[i + 1] in STOP_WORDS:
+                return True
+            if words[i] in {"but", "and"} and words[i + 1] in STOP_WORDS:
+                return True
+    
+    # Avoid false positives in longer sentences
+    # "I can't wait to hear more" should NOT be a command
+    if len(words) > 3 and any(w in STOP_WORDS for w in words):
+        # Only treat as command if a stop word is in the first 2 positions
+        return any(w in STOP_WORDS for w in words[:2])
+    
+    return False
 
-class MyAgent(Agent):
+class IntelligentAgent(Agent):
     def __init__(self) -> None:
         super().__init__(
-            instructions="Your name is Kelly. You would interact with users via voice."
-            "with that in mind keep your responses concise and to the point."
-            "do not use emojis, asterisks, markdown, or other special characters in your responses."
-            "You are curious and friendly, and have a sense of humor."
-            "you will speak english to the user",
+            instructions=(
+                "Your name is Kelly. Keep responses concise and witty. "
+                "When users say things like 'yeah' or 'okay' while you're speaking, "
+                "it means they're listening - keep going! "
+                "Only stop if they explicitly say 'wait', 'stop', or 'hold on'."
+            ),
         )
-
+        self.is_speaking = False
+        self.was_interrupted_by_vad = False
+        self.last_speech_content = ""
+        
     async def on_enter(self):
-        # when the agent is added to the session, it'll generate a reply
-        # according to its instructions
-        self.session.generate_reply()
-
-    # all functions annotated with @function_tool will be passed to the LLM when this
-    # agent is active
-    @function_tool
-    async def lookup_weather(
-        self, context: RunContext, location: str, latitude: str, longitude: str
-    ):
-        """Called when the user asks for weather related information.
-        Ensure the user's location (city or region) is provided.
-        When given a location, please estimate the latitude and longitude of the location and
-        do not ask the user for them.
-
-        Args:
-            location: The location they are asking for
-            latitude: The latitude of the location, do not ask user for it
-            longitude: The longitude of the location, do not ask user for it
-        """
-
-        logger.info(f"Looking up weather for {location}")
-
-        return "sunny with a temperature of 70 degrees."
-
+        await self.session.generate_reply()
 
 server = AgentServer()
 
-
 def prewarm(proc: JobProcess):
     proc.userdata["vad"] = silero.VAD.load()
 
-
 server.setup_fnc = prewarm
 
-
 @server.rtc_session()
 async def entrypoint(ctx: JobContext):
-    # each log entry will include these fields
-    ctx.log_context_fields = {
-        "room": ctx.room.name,
-    }
     session = AgentSession(
-        # Speech-to-text (STT) is your agent's ears, turning the user's speech into text that the LLM can understand
-        # See all available models at https://docs.livekit.io/agents/models/stt/
         stt="deepgram/nova-3",
-        # A Large Language Model (LLM) is your agent's brain, processing user input and generating a response
-        # See all available models at https://docs.livekit.io/agents/models/llm/
-        llm="openai/gpt-4.1-mini",
-        # Text-to-speech (TTS) is your agent's voice, turning the LLM's text into speech that the user can hear
-        # See all available models as well as voice selections at https://docs.livekit.io/agents/models/tts/
+        llm="openai/gpt-4o-mini",
         tts="cartesia/sonic-2:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
-        # VAD and turn detection are used to determine when the user is speaking and when the agent should respond
-        # See more at https://docs.livekit.io/agents/build/turns
-        turn_detection=MultilingualModel(),
         vad=ctx.proc.userdata["vad"],
-        # allow the LLM to generate a response while waiting for the end of turn
-        # See more at https://docs.livekit.io/agents/build/audio/#preemptive-generation
-        preemptive_generation=True,
-        # sometimes background noise could interrupt the agent session, these are considered false positive interruptions
-        # when it's detected, you may resume the agent's speech
-        resume_false_interruption=True,
-        false_interruption_timeout=1.0,
-    )
-
-    # log metrics as they are emitted, and total usage after session is over
-    usage_collector = metrics.UsageCollector()
-
-    @session.on("metrics_collected")
-    def _on_metrics_collected(ev: MetricsCollectedEvent):
-        metrics.log_metrics(ev.metrics)
-        usage_collector.collect(ev.metrics)
-
-    async def log_usage():
-        summary = usage_collector.get_summary()
-        logger.info(f"Usage: {summary}")
-
-    # shutdown callbacks are triggered when the session is over
-    ctx.add_shutdown_callback(log_usage)
-
-    await session.start(
-        agent=MyAgent(),
-        room=ctx.room,
-        room_options=room_io.RoomOptions(
-            audio_input=room_io.AudioInputOptions(
-                # uncomment to enable the Krisp BVC noise cancellation
-                # noise_cancellation=noise_cancellation.BVC(),
-            ),
-        ),
+        turn_detection=MultilingualModel(),
+        
+        # === HYBRID STRATEGY ===
+        # Medium thresholds: very short utterances never reach the VAD interrupt path
+        allow_interruptions=True,
+        min_interruption_duration=0.6,  # speech shorter than 0.6s does not interrupt
+        min_interruption_words=2,        # Require 2 words minimum
+        
+        # Enable auto-resume for false positives
+        false_interruption_timeout=1.0,  # Wait 1s for transcript
+        resume_false_interruption=True,  # Auto-resume if false positive
+        
+        preemptive_generation=False,
+        min_endpointing_delay=0.5,
+        max_endpointing_delay=2.5,
     )
-
+    
+    kelly = IntelligentAgent()
+    
+    logger.info("=" * 80)
+    logger.info("🚀 HYBRID INTELLIGENT INTERRUPTION HANDLER")
+    logger.info("⚙️  Strategy:")
+    logger.info("   - Medium VAD thresholds (0.6s, 2 words)")
+    logger.info("   - Auto-resume on false interruptions")
+    logger.info("   - Manual interrupt on commands that slip through")
+    logger.info("   - Transcript suppression for fillers")
+    logger.info("=" * 80)
+    
+    # Track interruption state
+    vad_just_interrupted = False
+    
+    @session.on("speech_created")
+    def on_speech_created(ev):
+        nonlocal vad_just_interrupted
+        kelly.is_speaking = True
+        kelly.was_interrupted_by_vad = False
+        vad_just_interrupted = False
+        
+        # Store what Kelly is saying for potential resume
+        if hasattr(ev, 'speech_handle') and hasattr(ev.speech_handle, 'text'):
+            kelly.last_speech_content = ev.speech_handle.text
+        
+        logger.info("🎤 KELLY STARTED SPEAKING")
+    
+    @session.on("agent_state_changed")
+    def on_agent_state_changed(ev):
+        nonlocal vad_just_interrupted
+        
+        logger.info(f"🎭 AGENT STATE: {ev.old_state} → {ev.new_state}")
+        
+        # Heuristic: speaking → listening also occurs when speech ends normally, not only on interrupts
+        if ev.old_state == "speaking" and ev.new_state == "listening":
+            if kelly.is_speaking:
+                kelly.was_interrupted_by_vad = True
+                vad_just_interrupted = True
+                logger.info("⚠️ KELLY INTERRUPTED - waiting for transcript to decide action...")
+        
+        if ev.new_state == "listening":
+            kelly.is_speaking = False
+    
+    @session.on("user_state_changed")
+    def on_user_state_changed(ev):
+        logger.info(f"👤 USER STATE: {ev.old_state} → {ev.new_state}")
+    
+    # Try to register false interruption handler
+    try:
+        @session.on("agent_false_interruption")
+        def on_false_interruption(ev):
+            if hasattr(ev, 'resumed') and ev.resumed:
+                logger.info("✅ FALSE INTERRUPTION AUTO-RESUMED by LiveKit")
+    except Exception:
+        logger.warning("⚠️ False interruption event not available in this LiveKit version")
+    
+    @session.on("user_input_transcribed")
+    def on_user_input_transcribed(ev):
+        nonlocal vad_just_interrupted
+        
+        if not ev.is_final or not ev.transcript:
+            return
+        
+        clean_text = re.sub(r'[^\w\s]', '', ev.transcript.lower()).strip()
+        
+        logger.info(f"📝 TRANSCRIPT: '{clean_text}' | Kelly speaking: {kelly.is_speaking} | Just interrupted: {vad_just_interrupted}")
+        
+        # === CASE 1: Kelly was just interrupted by VAD ===
+        if kelly.was_interrupted_by_vad or vad_just_interrupted:
+            
+            if contains_command(clean_text):
+                logger.info(f"🛑 REAL COMMAND after VAD interrupt: '{clean_text}' - staying stopped")
+                kelly.was_interrupted_by_vad = False
+                vad_just_interrupted = False
+                # Allow normal processing - the interrupt was correct
+                return
+            
+            elif is_filler_input(clean_text):
+                logger.info(f"🔄 FALSE INTERRUPT: '{clean_text}' was just a filler - should resume")
+                kelly.was_interrupted_by_vad = False
+                vad_just_interrupted = False
+                
+                # LiveKit's resume_false_interruption is expected to resume playback automatically.
+                # Note: returning here does not by itself remove the transcript from the LLM turn.
+                return
+            
+            else:
+                logger.info(f"✅ REAL INPUT after interrupt: '{clean_text}' - valid interruption")
+                kelly.was_interrupted_by_vad = False
+                vad_just_interrupted = False
+                # Allow normal processing
+                return
+        
+        # === CASE 2: Kelly is currently speaking (VAD didn't interrupt yet) ===
+        if kelly.is_speaking:
+            
+            if contains_command(clean_text):
+                logger.info(f"🛑 STOP COMMAND while speaking: '{clean_text}' - forcing interrupt NOW")
+                session.interrupt()
+                return
+            
+            elif is_filler_input(clean_text):
+                logger.info(f"🔇 FILLER while speaking: '{clean_text}' - completely ignored")
+                # Don't interrupt; note this return alone does not keep the text from the LLM
+                return
+            
+            else:
+                logger.info(f"💬 REAL INPUT while speaking: '{clean_text}' - allowing interrupt")
+                session.interrupt()
+                return
+        
+        # === CASE 3: Kelly is idle ===
+        if not kelly.is_speaking:
+            
+            if is_filler_input(clean_text):
+                logger.info(f"🍃 FILLER while idle: '{clean_text}' - suppressed")
+                return
+            
+            logger.info(f"✅ VALID INPUT while idle: '{clean_text}'")
+            # Normal processing
+    
+    await session.start(agent=kelly, room=ctx.room)
 
 if __name__ == "__main__":
     cli.run_app(server)
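
Note on the patch above (illustrative sketch, not part of the diff): the three checks in `is_filler_input` can be exercised in isolation to see how punctuation and multi-word acknowledgments are handled. The word lists below are copied from the patch. The condensed function body and the `SAMPLES` list are assumptions of this sketch, which is meant to behave the same as the patched function.

```python
import re

# Word lists as defined in the patch above.
FILLER_WORDS = {
    "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup",
    "hmm", "right", "uh", "um", "ah", "gotit", "isee", "ok", "k",
    "sure", "yes", "interesting", "really", "wow", "ohh", "ooh",
    "aha", "mhmm", "gotcha", "nice", "oh", "all", "got", "it", "i", "see",
}
FILLER_PHRASES = {"all right", "got it", "i see", "uh huh", "oh okay", "oh ok"}


def is_filler_input(transcript: str) -> bool:
    """Phrase match, joined-token match, then all-words match."""
    clean = re.sub(r"[^\w\s]", "", transcript.lower().strip())
    if clean in FILLER_PHRASES or clean.replace(" ", "") in FILLER_WORDS:
        return True
    words = clean.split()
    return bool(words) and all(w in FILLER_WORDS for w in words)


# Hypothetical sample inputs chosen to hit each branch.
SAMPLES = ["Got it.", "Uh-huh!", "yeah, wait", "I see it all", ""]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(f"{sample!r:16} -> {is_filler_input(sample)}")
```

One consequence worth noting: because single tokens such as "all", "got", "it", "i", and "see" are in `FILLER_WORDS`, a sentence like "I see it all" is classified as a filler. Whether that is desirable is a design decision for the word lists.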

From 8118f124e05b9c34105f172ac5146b218abe21d7 Mon Sep 17 00:00:00 2001
From: Sirjan Singh <sirjan.singh036@gmail.com>
Date: Tue, 3 Feb 2026 00:03:13 +0530
Subject: [PATCH 3/5] Update hybrid interruption handling strategy

Refine the hybrid interruption handling strategy to better distinguish between filler words and commands based on timing.
---
 examples/voice_agents/basic_agent.py | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/examples/voice_agents/basic_agent.py b/examples/voice_agents/basic_agent.py
index 472b198181..3a8298684c 100644
--- a/examples/voice_agents/basic_agent.py
+++ b/examples/voice_agents/basic_agent.py
@@ -1,20 +1,20 @@
 """
-THE ACTUAL WORKING HYBRID SOLUTION
+HYBRID INTERRUPTION HANDLING STRATEGY
 
-Problem statement:
-- "okay" said slowly (1.5s) should NOT interrupt
-- "stop" said quickly (0.5s) SHOULD interrupt immediately
-- Can't use duration-based filtering because it fails one of these cases
+Challenge:
+- Slow filler words (e.g., a 1.5s "okay") should NOT trigger an interruption.
+- Quick commands (e.g., a 0.5s "stop") MUST trigger an immediate interruption.
+- Pure duration-based filtering is insufficient as it cannot distinguish these cases reliably.
 
-Solution: 
-- Use MEDIUM thresholds (catches most fillers, but some slip through)
-- Have transcript handler RESUME if interrupted by filler
-- Have transcript handler FORCE INTERRUPT if command detected but VAD didn't trigger
+Implementation Strategy:
+- Configure VAD with MEDIUM sensitivity: Catches most valid speech but may allow some fillers.
+- Auto-Resume on Fillers: If a filler triggers an interruption, LiveKit's resume_false_interruption resumes the agent; the transcript handler only classifies and logs it.
+- Force Interrupt on Commands: If a quick command is missed by VAD, the transcript handler will enforce an interrupt.
 
-This way:
-- Fast "stop" (0.5s) → Below threshold → VAD doesn't interrupt → Transcript handler forces interrupt ✅
-- Slow "okay" (1.5s) → Above threshold → VAD interrupts → Transcript handler RESUMES ✅
-- Fast "okay" (0.3s) → Below threshold → VAD doesn't interrupt → Transcript suppresses ✅
+Outcome:
+- Quick "stop" (0.5s): Ignored by VAD (too short) → Transcript Handler detects command and interrupts. ✅
+- Slow "okay" (1.5s): Triggered by VAD → Transcript Handler identifies filler and resumes speech. ✅
+- Quick "okay" (0.3s): Ignored by VAD → Transcript Handler identifies filler and suppresses it. ✅
 """
 
 import logging

From cef1b4bbdaa4f80cc5c3f4dd9bdc39adb94f9c89 Mon Sep 17 00:00:00 2001
From: Sirjan Singh <sirjan.singh036@gmail.com>
Date: Tue, 3 Feb 2026 00:28:57 +0530
Subject: [PATCH 4/5] Refactor IntelligentAgent for better command handling

Refactor IntelligentAgent to improve command and filler detection logic, update logging for clarity, and streamline state management.
---
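Reviewer note: a condensed, standalone sketch of the classification order this patch introduces (command check first, then filler, else real input). It is meant for trying sample transcripts without a LiveKit session. The word sets are shortened subsets of the ones in the diff, the control flow is compressed, and `classify` is a hypothetical helper that does not exist in `basic_agent.py`.

```python
import re

# Shortened subsets of the sets defined in the patch (assumption: enough for a demo).
STOP_WORDS = {"wait", "stop", "finish", "hold", "pause", "halt", "enough", "quiet"}
STOP_PHRASES = {"holdon", "waitasec", "waitasecond", "waitaminute", "stopit", "onemoment"}
COMMAND_PREFIXES = {"no", "but", "and", "okay", "ok", "yeah", "yes", "hey", "please"}
FILLER_WORDS = {"okay", "ok", "yeah", "yes", "hmm", "uh", "um", "right", "no", "uhhuh"}
FILLER_PHRASES = {"all right", "got it", "i see", "uh huh", "sounds good"}


def normalize_text(transcript: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    clean = re.sub(r"[^\w\s]", "", transcript.lower())
    return re.sub(r"\s+", " ", clean).strip()


def contains_command(transcript: str) -> bool:
    text = normalize_text(transcript)
    words = text.split()
    if not words:
        return False
    joined = text.replace(" ", "")
    if any(joined.startswith(phrase) for phrase in STOP_PHRASES):
        return True
    # A stop word in the first three positions counts in short utterances;
    # in longer ones it must be in the first two positions.
    for i, word in enumerate(words[:3]):
        if word in STOP_WORDS and (len(words) <= 5 or i < 2):
            return True
    # Prefix immediately followed by a stop word near the start.
    return any(
        words[i] in COMMAND_PREFIXES and words[i + 1] in STOP_WORDS
        for i in range(min(3, len(words) - 1))
    )


def is_filler_input(transcript: str) -> bool:
    if contains_command(transcript):  # commands always win
        return False
    text = normalize_text(transcript)
    if not text or text in FILLER_PHRASES:
        return True
    words = text.split()
    if len(words) <= 3 and all(w in FILLER_WORDS for w in words):
        return True
    return text.replace(" ", "") in FILLER_WORDS


def classify(transcript: str) -> str:
    """Hypothetical helper: apply the checks in the patch's priority order."""
    if contains_command(transcript):
        return "command"
    if is_filler_input(transcript):
        return "filler"
    return "input"
```

With these subsets, `classify("Yeah, wait!")` is a command, `classify("uh-huh")` is a filler, and `classify("I think we should stop there for today")` is plain input because the stop word sits too late in a long sentence. A bare `classify("no")` also comes back as a filler, so a one-word "no" answer would be suppressed under this scheme.
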
 examples/voice_agents/basic_agent.py | 389 +++++++++++++++++----------
 1 file changed, 248 insertions(+), 141 deletions(-)

diff --git a/examples/voice_agents/basic_agent.py b/examples/voice_agents/basic_agent.py
index 3a8298684c..8bd2d34160 100644
--- a/examples/voice_agents/basic_agent.py
+++ b/examples/voice_agents/basic_agent.py
@@ -19,75 +19,144 @@
 
 import logging
 import re
-import asyncio
-from typing import Optional
 from dotenv import load_dotenv
 from livekit.agents import (
-    Agent, AgentServer, AgentSession, JobContext, JobProcess,
-    cli, UserInputTranscribedEvent, AgentStateChangedEvent,
-    UserStateChangedEvent
+    Agent, AgentServer, AgentSession, JobContext, JobProcess, cli
 )
-from livekit.plugins import silero, deepgram, openai, cartesia
+from livekit.plugins import silero
 from livekit.plugins.turn_detector.multilingual import MultilingualModel
 
 logger = logging.getLogger("intelligent-kelly")
 logger.setLevel(logging.INFO)
 load_dotenv()
 
-# CONFIGURATION
-STOP_WORDS = {"wait", "stop", "finish", "hold", "pause", "halt"}
+# =============================================================================
+# CONFIGURATION - Command and Filler Detection
+# =============================================================================
+
+# Single words that mean "stop" as a command
+STOP_WORDS = {"wait", "stop", "finish", "hold", "pause", "halt", "enough", "quiet"}
+
+# Multi-word command phrases (normalized, no spaces)
+STOP_PHRASES = {
+    "holdon", "holdonthat", "waitasec", "waitasecond", "waitaminute",
+    "stopit", "stopthat", "stopnow", "pausethat", "onemoment"
+}
+
+# Words that can precede a stop word to form a command
+COMMAND_PREFIXES = {"no", "but", "and", "okay", "ok", "yeah", "yes", "hey", "please"}
+
+# Filler/acknowledgment words (generic tokens like "i", "see", "all" removed to reduce false positives)
 FILLER_WORDS = {
     "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup",
-    "hmm", "right", "uh", "um", "ah", "gotit", "isee", "ok", "k",
-    "sure", "yes", "interesting", "really", "wow", "ohh", "ooh",
-    "aha", "mhmm", "gotcha", "nice", "oh", "all", "got", "it", "i", "see"
+    "hmm", "right", "uh", "um", "ah", "ok", "k", "sure", "yes",
+    "interesting", "really", "wow", "ohh", "ooh", "aha", "mhmm",
+    "gotcha", "nice", "oh", "no", "nah", "nope", "cool", "great"
 }
-FILLER_PHRASES = {"all right", "got it", "i see", "uh huh", "oh okay", "oh ok"}
 
-def is_filler_input(transcript: str) -> bool:
-    """Check if transcript is purely a filler acknowledgment"""
+# Multi-word filler phrases (normalized with spaces for matching)
+FILLER_PHRASES = {
+    "all right", "got it", "i see", "uh huh", "oh okay", "oh ok",
+    "oh really", "oh wow", "oh nice", "sounds good", "makes sense",
+    "i understand", "mm hmm"
+}
+
+
+def normalize_text(transcript: str) -> str:
+    """Normalize transcript for consistent matching."""
     clean = transcript.lower().strip()
-    clean_no_punct = re.sub(r'[^\w\s]', '', clean)
-    
-    if clean_no_punct in FILLER_PHRASES:
-        return True
-    if clean_no_punct.replace(" ", "") in FILLER_WORDS:
-        return True
-    
-    words = clean_no_punct.split()
-    if words and all(word in FILLER_WORDS for word in words):
-        return True
-    return False
+    clean = re.sub(r'[^\w\s]', '', clean)  # Remove punctuation
+    clean = re.sub(r'\s+', ' ', clean)      # Collapse whitespace
+    return clean.strip()
+
 
 def contains_command(transcript: str) -> bool:
-    """Check if transcript contains an explicit stop command"""
-    clean = transcript.lower().strip()
-    clean_no_punct = re.sub(r'[^\w\s]', '', clean)
-    words = clean_no_punct.split()
+    """
+    Check if transcript contains an explicit stop command.
+    MUST be checked BEFORE is_filler_input() to avoid false negatives.
+    """
+    text = normalize_text(transcript)
+    words = text.split()
     
     if not words:
         return False
     
-    # Direct command (starts with stop word)
+    # Check for exact stop phrase match (e.g., "hold on")
+    text_no_spaces = text.replace(" ", "")
+    if text_no_spaces in STOP_PHRASES:
+        return True
+    
+    # Check for stop phrase at start (e.g., "hold on a second please")
+    for phrase in STOP_PHRASES:
+        if text_no_spaces.startswith(phrase):
+            return True
+    
+    # Direct command: first word is a stop word (e.g., "stop", "wait")
     if words[0] in STOP_WORDS:
         return True
     
-    # Command after brief acknowledgment: "yeah wait", "okay stop"
-    if len(words) >= 2:
-        for i in range(len(words) - 1):
-            if words[i] in FILLER_WORDS and words[i + 1] in STOP_WORDS:
+    # Stop word near the start: "yeah wait", "okay stop", "no hold on", "but wait"
+    # Check the first 3 words for a stop word, whatever precedes it
+    for i in range(min(3, len(words))):
+        if words[i] in STOP_WORDS:
+            # If stop word is in first 3 positions, it's likely a command
+            # Unless it's a long sentence where stop word is incidental
+            if len(words) <= 5:
                 return True
-            if words[i] in {"but", "and"} and words[i + 1] in STOP_WORDS:
+            # For longer sentences, only count if stop word is in first 2 positions
+            if i < 2:
                 return True
     
-    # Avoid false positives in longer sentences
-    # "I have no idea" should NOT be a command
-    if len(words) > 3 and any(w in STOP_WORDS for w in words):
-        # Only treat as command if stop word is in first 2 positions
-        return any(words[i] in STOP_WORDS for i in range(min(2, len(words))))
+    # Pattern: prefix + stop word anywhere in first 4 words
+    # e.g., "okay wait a second", "no hold on please"
+    if len(words) >= 2:
+        for i in range(min(3, len(words) - 1)):
+            if words[i] in COMMAND_PREFIXES and words[i + 1] in STOP_WORDS:
+                return True
+    
+    return False
+
+
+def is_filler_input(transcript: str) -> bool:
+    """
+    Check if transcript is purely a filler acknowledgment.
+    Only returns True if it's DEFINITELY a filler (no command content).
+    """
+    text = normalize_text(transcript)
+    
+    # CRITICAL: Command always takes priority - check first!
+    if contains_command(transcript):
+        return False
+    
+    # Empty after normalization (e.g., punctuation only)
+    if not text:
+        return True
+    
+    # Exact filler phrase match
+    if text in FILLER_PHRASES:
+        return True
+    
+    # Single word in filler set
+    words = text.split()
+    if len(words) == 1 and words[0] in FILLER_WORDS:
+        return True
+    
+    # All words are fillers (e.g., "yeah yeah", "okay um", "oh really")
+    if len(words) <= 3 and all(word in FILLER_WORDS for word in words):
+        return True
+    
+    # Compound filler check: join the words and retry (e.g., "uh huh" -> "uhhuh")
+    text_no_spaces = text.replace(" ", "")
+    if text_no_spaces in FILLER_WORDS:
+        return True
     
     return False
 
+
+# =============================================================================
+# AGENT DEFINITION
+# =============================================================================
+
 class IntelligentAgent(Agent):
     def __init__(self) -> None:
         super().__init__(
@@ -98,20 +167,57 @@ def __init__(self) -> None:
                 "Only stop if they explicitly say 'wait', 'stop', or 'hold on'."
             ),
         )
-        self.is_speaking = False
-        self.was_interrupted_by_vad = False
-        self.last_speech_content = ""
-        
+        # Simplified state: only track if agent is currently speaking
+        self._is_speaking = False
+        # Track if VAD just interrupted (waiting for transcript to classify)
+        self._interrupted_by_vad = False
+    
+    @property
+    def is_speaking(self) -> bool:
+        return self._is_speaking
+    
+    @is_speaking.setter
+    def is_speaking(self, value: bool) -> None:
+        self._is_speaking = value
+    
+    @property
+    def interrupted_by_vad(self) -> bool:
+        return self._interrupted_by_vad
+    
+    @interrupted_by_vad.setter
+    def interrupted_by_vad(self, value: bool) -> None:
+        self._interrupted_by_vad = value
+
     async def on_enter(self):
-        await self.session.generate_reply()
+        # Wait for user to speak first (no preemptive greeting)
+        pass
+
+
+# =============================================================================
+# SERVER SETUP
+# =============================================================================
 
 server = AgentServer()
 
+
 def prewarm(proc: JobProcess):
     proc.userdata["vad"] = silero.VAD.load()
 
+
 server.setup_fnc = prewarm
 
+
+def try_clear_user_turn(session: AgentSession) -> bool:
+    """Safely attempt to clear user turn to suppress LLM processing."""
+    if hasattr(session, 'clear_user_turn'):
+        try:
+            session.clear_user_turn()
+            return True
+        except Exception as e:
+            logger.debug(f"clear_user_turn failed: {e}")
+    return False
+
+
 @server.rtc_session()
 async def entrypoint(ctx: JobContext):
     session = AgentSession(
@@ -122,142 +228,143 @@ async def entrypoint(ctx: JobContext):
         turn_detection=MultilingualModel(),
         
         # === HYBRID STRATEGY ===
-        # Medium-low threshold: Catches most fillers but allows quick commands
+        # Medium threshold: filters out most short fillers; quick commands are caught by the transcript handler
         allow_interruptions=True,
-        min_interruption_duration=0.6,  # 0.6s - faster than most fillers, slower than most commands
-        min_interruption_words=2,        # Require 2 words minimum
+        min_interruption_duration=0.6,   # speech shorter than 0.6s does not trigger an interruption
+        min_interruption_words=2,         # Require 2+ words
         
-        # Enable auto-resume for false positives
-        false_interruption_timeout=1.0,  # Wait 1s for transcript
-        resume_false_interruption=True,  # Auto-resume if false positive
+        # Enable auto-resume for false positives (LiveKit handles this)
+        false_interruption_timeout=1.0,
+        resume_false_interruption=True,
         
         preemptive_generation=False,
         min_endpointing_delay=0.5,
         max_endpointing_delay=2.5,
     )
     
-    kelly = IntelligentAgent()
-    
-    logger.info("=" * 80)
-    logger.info("🚀 HYBRID INTELLIGENT INTERRUPTION HANDLER")
-    logger.info("⚙️  Strategy:")
-    logger.info("   - Medium VAD thresholds (0.6s, 2 words)")
-    logger.info("   - Auto-resume on false interruptions")
-    logger.info("   - Manual interrupt on commands that slip through")
-    logger.info("   - Transcript suppression for fillers")
-    logger.info("=" * 80)
+    agent = IntelligentAgent()
     
-    # Track interruption state
-    vad_just_interrupted = False
+    logger.info("=" * 70)
+    logger.info("🚀 HYBRID INTELLIGENT INTERRUPTION HANDLER v2")
+    logger.info("   Strategy: VAD(0.6s, 2words) + Transcript Classification")
+    logger.info("=" * 70)
     
+    # -------------------------------------------------------------------------
+    # EVENT: Agent starts speaking
+    # -------------------------------------------------------------------------
     @session.on("speech_created")
     def on_speech_created(ev):
-        nonlocal vad_just_interrupted
-        kelly.is_speaking = True
-        kelly.was_interrupted_by_vad = False
-        vad_just_interrupted = False
-        
-        # Store what Kelly is saying for potential resume
-        if hasattr(ev, 'speech_handle') and hasattr(ev.speech_handle, 'text'):
-            kelly.last_speech_content = ev.speech_handle.text
-        
-        logger.info("🎤 KELLY STARTED SPEAKING")
+        agent.is_speaking = True
+        agent.interrupted_by_vad = False
+        logger.info("🎤 Agent started speaking")
     
+    # -------------------------------------------------------------------------
+    # EVENT: Agent state changes
+    # -------------------------------------------------------------------------
     @session.on("agent_state_changed")
     def on_agent_state_changed(ev):
-        nonlocal vad_just_interrupted
+        logger.debug(f"🎭 Agent: {ev.old_state} → {ev.new_state}")
         
-        logger.info(f"🎭 AGENT STATE: {ev.old_state} → {ev.new_state}")
-        
-        # Detect if Kelly was interrupted while speaking
+        # Detect VAD interruption: speaking → listening transition
         if ev.old_state == "speaking" and ev.new_state == "listening":
-            if kelly.is_speaking:
-                kelly.was_interrupted_by_vad = True
-                vad_just_interrupted = True
-                logger.info("⚠️ KELLY INTERRUPTED - waiting for transcript to decide action...")
+            if agent.is_speaking:
+                agent.interrupted_by_vad = True
+                logger.info("⚠️ VAD interrupted - waiting for transcript...")
         
-        if ev.new_state == "listening":
-            kelly.is_speaking = False
+        # Update speaking state
+        if ev.new_state in ("listening", "thinking"):
+            agent.is_speaking = False
+        elif ev.new_state == "speaking":
+            agent.is_speaking = True
     
+    # -------------------------------------------------------------------------
+    # EVENT: User state changes (for logging only)
+    # -------------------------------------------------------------------------
     @session.on("user_state_changed")
     def on_user_state_changed(ev):
-        logger.info(f"👤 USER STATE: {ev.old_state} → {ev.new_state}")
-    
-    # Try to register false interruption handler
-    try:
-        @session.on("agent_false_interruption")
-        def on_false_interruption(ev):
-            if hasattr(ev, 'resumed') and ev.resumed:
-                logger.info("✅ FALSE INTERRUPTION AUTO-RESUMED by LiveKit")
-    except:
-        logger.warning("⚠️ False interruption event not available in this LiveKit version")
+        logger.debug(f"👤 User: {ev.old_state} → {ev.new_state}")
     
+    # -------------------------------------------------------------------------
+    # EVENT: Transcript received - MAIN LOGIC
+    # -------------------------------------------------------------------------
     @session.on("user_input_transcribed")
     def on_user_input_transcribed(ev):
-        nonlocal vad_just_interrupted
-        
+        # Only process final transcripts
         if not ev.is_final or not ev.transcript:
             return
         
-        clean_text = re.sub(r'[^\w\s]', '', ev.transcript.lower()).strip()
+        text = normalize_text(ev.transcript)
+        if not text:
+            return
+        
+        # Classify the input
+        has_command = contains_command(text)
+        is_filler = is_filler_input(text)
         
-        logger.info(f"📝 TRANSCRIPT: '{clean_text}' | Kelly speaking: {kelly.is_speaking} | Just interrupted: {vad_just_interrupted}")
+        logger.info(
+            f"📝 '{text}' | speaking={agent.is_speaking} | "
+            f"vad_interrupted={agent.interrupted_by_vad} | "
+            f"cmd={has_command} | filler={is_filler}"
+        )
         
-        # === CASE 1: Kelly was just interrupted by VAD ===
-        if kelly.was_interrupted_by_vad or vad_just_interrupted:
+        # =================================================================
+        # CASE 1: VAD just interrupted - classify and decide
+        # =================================================================
+        if agent.interrupted_by_vad:
+            agent.interrupted_by_vad = False  # Reset flag
             
-            if contains_command(clean_text):
-                logger.info(f"🛑 REAL COMMAND after VAD interrupt: '{clean_text}' - staying stopped")
-                kelly.was_interrupted_by_vad = False
-                vad_just_interrupted = False
-                # Allow normal processing - the interrupt was correct
-                return
+            if has_command:
+                # Real command - interruption was correct, let LLM process
+                logger.info(f"🛑 COMMAND after VAD: '{text}' - valid interrupt")
+                return  # Allow normal LLM processing
             
-            elif is_filler_input(clean_text):
-                logger.info(f"🔄 FALSE INTERRUPT: '{clean_text}' was just a filler - should resume")
-                kelly.was_interrupted_by_vad = False
-                vad_just_interrupted = False
-                
-                # LiveKit's resume_false_interruption should handle this automatically
-                # But we still suppress the transcript from reaching LLM
+            if is_filler:
+                # False positive - LiveKit's resume_false_interruption handles resume
+                # Suppress transcript from LLM
+                logger.info(f"🔄 FILLER after VAD: '{text}' - suppressing")
+                try_clear_user_turn(session)
                 return
             
-            else:
-                logger.info(f"✅ REAL INPUT after interrupt: '{clean_text}' - valid interruption")
-                kelly.was_interrupted_by_vad = False
-                vad_just_interrupted = False
-                # Allow normal processing
-                return
+            # Real input (not command, not filler) - valid interruption
+            logger.info(f"✅ REAL INPUT after VAD: '{text}'")
+            return  # Allow normal LLM processing
         
-        # === CASE 2: Kelly is currently speaking (VAD didn't interrupt yet) ===
-        if kelly.is_speaking:
-            
-            if contains_command(clean_text):
-                logger.info(f"🛑 STOP COMMAND while speaking: '{clean_text}' - forcing interrupt NOW")
+        # =================================================================
+        # CASE 2: Agent is currently speaking (no VAD interrupt yet)
+        # =================================================================
+        if agent.is_speaking:
+            if has_command:
+                # Force interrupt on command that VAD missed
+                logger.info(f"🛑 COMMAND while speaking: '{text}' - forcing interrupt")
                 session.interrupt()
-                return
+                return  # Allow LLM to process the command
             
-            elif is_filler_input(clean_text):
-                logger.info(f"🔇 FILLER while speaking: '{clean_text}' - completely ignored")
-                # Don't interrupt, don't pass to LLM
+            if is_filler:
+                # Ignore filler - don't interrupt, don't pass to LLM
+                logger.info(f"🔇 FILLER while speaking: '{text}' - ignored")
+                try_clear_user_turn(session)
                 return
             
-            else:
-                logger.info(f"💬 REAL INPUT while speaking: '{clean_text}' - allowing interrupt")
-                session.interrupt()
-                return
+            # Real input - interrupt and let LLM process
+            logger.info(f"💬 INPUT while speaking: '{text}' - interrupting")
+            session.interrupt()
+            return
         
-        # === CASE 3: Kelly is idle ===
-        if not kelly.is_speaking:
-            
-            if is_filler_input(clean_text):
-                logger.info(f"🍃 FILLER while idle: '{clean_text}' - suppressed")
-                return
-            
-            logger.info(f"✅ VALID INPUT while idle: '{clean_text}'")
-            # Normal processing
+        # =================================================================
+        # CASE 3: Agent is idle (not speaking)
+        # =================================================================
+        if is_filler:
+            # Suppress lone fillers when idle
+            logger.info(f"🍃 FILLER while idle: '{text}' - suppressed")
+            try_clear_user_turn(session)
+            return
+        
+        # Normal input - let LLM process
+        logger.info(f"✅ INPUT while idle: '{text}'")
+        # Allow normal processing
     
-    await session.start(agent=kelly, room=ctx.room)
+    await session.start(agent=agent, room=ctx.room)
+
 
 if __name__ == "__main__":
     cli.run_app(server)

From 4ad08caa08e9a4b5a3305e3203e0e11043869e46 Mon Sep 17 00:00:00 2001
From: Sirjan Singh <sirjan.singh036@gmail.com>
Date: Tue, 3 Feb 2026 01:14:32 +0530
Subject: [PATCH 5/5] Enhance README with student info and clarification

Added student details and improved documentation on intelligent interruption handling in voice agents.
---
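Reviewer note: the README's "Key Logic Flow" can be exercised without LiveKit by routing pre-classified transcripts through a stub. The sketch below assumes that shape. `FakeSession` and `route_transcript` are hypothetical names, the stub only records calls and is not the real `AgentSession`, and the "VAD already interrupted" branch is omitted.

```python
class FakeSession:
    """Hypothetical stand-in that records calls; not the LiveKit AgentSession."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def interrupt(self) -> None:
        self.calls.append("interrupt")

    def clear_user_turn(self) -> None:
        self.calls.append("clear_user_turn")


def route_transcript(session, *, speaking: bool, has_command: bool, is_filler: bool) -> str:
    """Apply the README's order: commands first, then fillers, else real input."""
    if has_command:
        if speaking:
            session.interrupt()  # force stop if the threshold let it slip through
        return "command"
    if is_filler:
        # Same guard as try_clear_user_turn in the patch.
        if hasattr(session, "clear_user_turn"):
            session.clear_user_turn()
        return "filler"
    if speaking:
        session.interrupt()  # real input while speaking also ends the agent's turn
    return "input"
```

Calling `route_transcript(FakeSession(), speaking=True, has_command=False, is_filler=True)` records only `clear_user_turn`, which corresponds to Scenario 1 in the README: the agent keeps speaking and the LLM sees nothing.
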
 examples/voice_agents/README.md | 265 ++++++++++----------------------
 1 file changed, 84 insertions(+), 181 deletions(-)

diff --git a/examples/voice_agents/README.md b/examples/voice_agents/README.md
index 50bbe91911..a1d37bcc46 100644
--- a/examples/voice_agents/README.md
+++ b/examples/voice_agents/README.md
@@ -4,6 +4,12 @@
 
 This document explains the modifications made to `basic_agent.py` to implement intelligent interruption handling that distinguishes between **filler words** (acknowledgments like "yeah", "okay") and **command words** (interruptions like "stop", "wait").
 
+---
+## Student Details
+- **Name:** Sirjan Singh
+- **College Roll Number:** 23UCS715
+- **Demo Video Link:** [Drive Link](https://drive.google.com/drive/folders/1LXnojdfCtswc14PxWH60ZqynbLN03F3J?usp=sharing)
+  
 ---
 
 ## The Challenge
@@ -16,7 +22,7 @@ However, LiveKit's default Voice Activity Detection (VAD) treats ALL user speech
 1. **When agent is speaking + user says filler** → Agent continues uninterrupted
 2. **When agent is speaking + user says command** → Agent stops immediately  
 3. **When agent is silent** → All user speech is valid input
-4. **Mixed input** → Commands always take priority over fillers
+4. **Mixed input** → Commands always take priority over fillers (e.g., "yeah wait" is a command)
 
 ---
 
@@ -63,236 +69,133 @@ false_interruption_timeout=1.0,
 
 ---
 
-### Layer 3: Transcript-Based Manual Control
-The most important layer — our custom logic that analyzes transcripts:
+### Layer 3: Transcript-Based Classification (The Brain)
+The most important layer: our custom logic that analyzes transcripts. It checks in a fixed order: **commands first, then fillers**; anything else is treated as real input.
 
+#### Key Logic Flow:
 ```python
 @session.on("user_input_transcribed")
 def on_user_input_transcribed(ev):
-    # Analyze what the user actually said
+    text = normalize_text(ev.transcript)
+    
+    # 1. CHECK COMMANDS FIRST (Priority!)
     if contains_command(text):
-        session.interrupt()  # Force stop
-    elif is_filler_input(text):
-        return  # Ignore completely
-    else:
-        # Real input - allow processing
+        if agent.is_speaking:
+            session.interrupt()  # Force stop if VAD missed it
+        return  # Let LLM process the command
+
+    # 2. CHECK FILLERS SECOND
+    if is_filler_input(text):
+        # Suppress from LLM so agent doesn't respond to "yeah"
+        try_clear_user_turn(session)
+        return
+        
+    # 3. REAL INPUT (Questions, conversation)
+    # Process normally
 ```
 
 This handles three cases:
 
 #### Case 1: Agent Was Just Interrupted by VAD
-```python
-if kelly.was_interrupted_by_vad:
-    if contains_command(text):
-        # Real command - stay stopped
-    elif is_filler_input(text):
-        # False alarm - resume_false_interruption handles it
-    else:
-        # Real input - process normally
-```
+- **Command:** Valid interruption, let LLM respond.
+- **Filler:** False alarm! `resume_false_interruption` will auto-resume speech. We call `clear_user_turn()` so the LLM doesn't hear "yeah".
+- **Real Input:** Valid interruption.
 
 #### Case 2: Agent Is Currently Speaking (VAD Hasn't Triggered Yet)
-```python
-if kelly.is_speaking:
-    if contains_command(text):
-        session.interrupt()  # Force interrupt NOW
-    elif is_filler_input(text):
-        return  # Completely ignore
-    else:
-        session.interrupt()  # Real input - allow interrupt
-```
+- **Command:** Force immediate interrupt (`session.interrupt()`).
+- **Filler:** Ignore completely (`clear_user_turn()`).
+- **Real Input:** Allow interrupt (`session.interrupt()`).
 
 #### Case 3: Agent Is Idle
-```python
-if not kelly.is_speaking:
-    if is_filler_input(text):
-        return  # Suppress from LLM
-    # Otherwise process normally
-```
+- **Command/Real Input:** Process normally.
+- **Filler:** Suppress (don't wake up LLM for just "okay").
 
 ---
 
-## Key Code Changes
+## Key Code Changes (Refactored)
 
-### 1. Word Lists Configuration
+### 1. Robust Word Lists
 
-**Filler Words** (acknowledgments to ignore):
+**Command Detection** (Stop Phrases & Prefixes):
 ```python
-FILLER_WORDS = {
-    "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup",
-    "hmm", "right", "uh", "um", "ah", "gotit", "isee", "ok",
-    # ... more
-}
+# Single words
+STOP_WORDS = {"wait", "stop", "finish", "hold", "pause", "halt", ...}
 
-FILLER_PHRASES = {
-    "all right", "got it", "i see", "uh huh", "oh okay"
-}
+# Multi-word phrases (normalized)
+STOP_PHRASES = {"holdon", "waitasecond", "stopit", "waitaminute", ...}
+
+# Prefixes that can precede commands
+COMMAND_PREFIXES = {"no", "but", "and", "okay", "ok", "yeah", "yes", "hey", "please"}
 ```
+*Now catches:* `"no wait"`, `"hold on"`, `"wait a second"`, `"yeah stop"`
 
-**Command Words** (explicit stop requests):
+**Filler Words** (Strict filtering):
 ```python
-STOP_WORDS = {
-    "wait", "stop", "finish", "hold", "pause", "halt"
+FILLER_WORDS = {
+    "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup",
+    "hmm", "right", "uh", "um", "ah", "cool", "great", "no", "nah", ...
+    # Removed generic words like "i", "see", "all" to avoid false positives
 }
 ```
 
 ### 2. Detection Functions
 
-**`is_filler_input(transcript)`** — Returns `True` if input is purely acknowledgment:
-- Removes punctuation
-- Checks against filler word/phrase lists
-- Validates all words are filler tokens
-
-**`contains_command(transcript)`** — Returns `True` if input contains stop command:
-- Checks if sentence starts with stop word
-- Detects "filler + command" patterns ("yeah wait", "okay stop")
-- Avoids false positives in longer sentences
-
-### 3. State Tracking
+**`contains_command(transcript)`**:
+- Checks for multi-word phrases (`"hold on"`).
+- Checks for prefixes (`"no wait"`).
+- Treats a stop word within the first 3 words as a command (first 2 words for sentences longer than 5 words).
 
-```python
-class IntelligentAgent(Agent):
-    def __init__(self):
-        self.is_speaking = False           # Currently generating speech
-        self.was_interrupted_by_vad = False  # Just got interrupted by VAD
-        self.last_speech_content = ""      # Content being spoken
-```
+**`is_filler_input(transcript)`**:
+- **CRITICAL:** Calls `contains_command()` first! If it's a command, it is NOT a filler.
+- Only matches if input is *purely* filler words/phrases.
 
-### 4. Event Handlers
-
-**`on_speech_created`** — Tracks when agent starts speaking:
+### 3. Transcript Suppression
+We use a helper to prevent the LLM from responding to fillers:
 ```python
-@session.on("speech_created")
-def on_speech_created(ev):
-    kelly.is_speaking = True
-    kelly.was_interrupted_by_vad = False
+def try_clear_user_turn(session):
+    if hasattr(session, 'clear_user_turn'):
+        session.clear_user_turn()
 ```
 
-**`on_agent_state_changed`** — Detects interruptions:
-```python
-if ev.old_state == "speaking" and ev.new_state == "listening":
-    if kelly.is_speaking:
-        kelly.was_interrupted_by_vad = True
-```
-
-**`on_user_input_transcribed`** — Main interruption logic (see Layer 3 above)
-
 ---
 
-## Configuration Parameters
-
-### AgentSession Settings
-
-| Parameter | Value | Purpose |
-|-----------|-------|---------|
-| `allow_interruptions` | `True` | Enable VAD-based interruptions |
-| `min_interruption_duration` | `0.6` | Require 0.6s of speech to interrupt |
-| `min_interruption_words` | `2` | Require 2+ words to interrupt |
-| `resume_false_interruption` | `True` | Auto-resume after false interruptions |
-| `false_interruption_timeout` | `1.0` | Wait 1s before resuming |
-| `preemptive_generation` | `False` | Disabled for more predictable flow |
-| `min_endpointing_delay` | `0.5` | Min silence before turn ends |
-| `max_endpointing_delay` | `2.5` | Max silence before turn ends |
-
----
-
-## How It All Works Together
+## How It All Works Together (Examples)
 
 ### Scenario 1: User says "yeah" (0.3s, quick acknowledgment)
-1. ✅ **VAD Layer:** Too short (0.3s < 0.6s) → No interrupt
-2. ✅ **Transcript Handler:** Detects filler while speaking → Ignores
-3. ✅ **Result:** Agent continues speaking smoothly
+1. ✅ **VAD Layer:** Too short (< 0.6s) → No interrupt
+2. ✅ **Transcript Layer:** `is_filler_input` = True. `try_clear_user_turn()` called.
+3. ✅ **Result:** Agent continues speaking. LLM sees nothing.
 
 ### Scenario 2: User says "okaaaay" (1.5s, slow filler)
-1. ❌ **VAD Layer:** Long enough (1.5s > 0.6s) → Interrupts agent
-2. ✅ **Resume Layer:** Waits 1s for more speech, nothing comes → Resumes
-3. ✅ **Transcript Handler:** Marks as filler → Suppresses from LLM
-4. ✅ **Result:** Brief pause (1s), then agent resumes
-
-### Scenario 3: User says "stop" (0.5s, quick command)
-1. ✅ **VAD Layer:** Too short (0.5s < 0.6s) → No interrupt
-2. ✅ **Transcript Handler:** Detects command → `session.interrupt()`
-3. ✅ **Result:** Agent stops immediately via manual interrupt
-
-### Scenario 4: User says "wait a second" (1.2s, clear command)
-1. ✅ **VAD Layer:** Long enough (1.2s > 0.6s) → Interrupts agent
-2. ✅ **Transcript Handler:** Detects command → Stays stopped
-3. ✅ **Result:** Agent stops, processes user's request
-
----
-
-## Testing the Solution
-
-### Test Cases
-
-1. **Filler while speaking:**
-   - Say "yeah", "okay", "hmm" while agent is talking
-   - **Expected:** Agent continues without stopping
-
-2. **Command while speaking:**
-   - Say "wait", "stop", "hold on" while agent is talking
-   - **Expected:** Agent stops immediately
-
-3. **Mixed input:**
-   - Say "yeah wait" while agent is talking
-   - **Expected:** Agent stops (command wins)
-
-4. **Filler while silent:**
-   - Say "okay" when agent is idle
-   - **Expected:** Ignored, doesn't trigger new response
-
-5. **Normal conversation:**
-   - Ask questions when agent is idle
-   - **Expected:** Normal response flow
-
-### Logs to Watch For
-
-```
-🎤 KELLY STARTED SPEAKING
-📝 TRANSCRIPT: 'yeah' | Kelly speaking: True
-🔇 FILLER while speaking: 'yeah' - completely ignored
-```
-
-```
-📝 TRANSCRIPT: 'wait' | Kelly speaking: True  
-🛑 STOP COMMAND while speaking: 'wait' - forcing interrupt NOW
-```
-
-```
-⚠️ KELLY INTERRUPTED - waiting for transcript...
-📝 TRANSCRIPT: 'okay' | Just interrupted: True
-🔄 FALSE INTERRUPT: 'okay' was just a filler - should resume
-```
+1. ❌ **VAD Layer:** Long enough (> 0.6s) → Interrupts agent
+2. ✅ **Resume Layer:** Waits 1s, decides it's a false interrupt → Resumes
+3. ✅ **Transcript Layer:** `is_filler_input` = True. Suppresses transcript.
+4. ✅ **Result:** Brief pause (1s), then agent resumes.
+
+### Scenario 3: User says "no wait" (Quick command)
+1. ❌ **VAD Layer:** The utterance may be too short to trigger an interrupt, or VAD may miss it entirely.
+2. ✅ **Transcript Layer:** `contains_command` = True (catches "no" + "wait").
+3. ✅ **Action:** `session.interrupt()` forced immediately.
+4. ✅ **Result:** Agent stops. LLM processes "no wait".
+
+### Scenario 4: User says "I have a question"
+1. ✅ **Transcript Layer:** Not a command, not a filler.
+2. ✅ **Action:** Real input. Interrupts agent.
+3. ✅ **Result:** Standard conversation flow.
 
 ---
 
 ## Files Modified
 
-- **`basic_agent.py`** — Main implementation with all intelligent interruption logic
+- **`basic_agent.py`** — Main implementation with all intelligent interruption logic.
 
 ## Dependencies
 
-No additional dependencies required beyond standard LiveKit Agents SDK.
-
----
-
-## Limitations
-
-1. **Brief pause on slow fillers:** If user says a filler slowly (>0.6s), there may be a ~1s pause before auto-resume
-2. **Language-specific:** Word lists are currently English-focused (though some Hindi words are included)
-3. **Context-unaware:** Doesn't understand semantic context (e.g., "no" as answer vs. "no" as stop command)
+No additional dependencies are required. The implementation uses only the Python standard-library `re` module and the LiveKit Agents SDK.
 
 ---
 
 ## Future Improvements
 
-1. **Sentiment analysis:** Use LLM to determine if "no" is a stop command or an answer
-2. **Adaptive thresholds:** Learn user's speech patterns and adjust thresholds
-3. **Multi-language support:** Extended word lists for other languages
-4. **Prosody analysis:** Use tone/pitch to distinguish acknowledgments from commands
-
----
-
-## Credits
-
-Implementation for the **LiveKit Intelligent Interruption Handling Challenge**.
+1. **Semantic Analysis:** Use a small NLU model or LLM to determine whether "right" means "correct" (answer) or "continue" (filler).
+2. **Prosody Analysis:** Differentiate "stop?" (question) from "STOP!" (command) based on pitch/volume.