235 changes: 179 additions & 56 deletions examples/voice_agents/README.md
@@ -1,78 +1,201 @@
# Voice Agents Examples
# Intelligent Interruption Handling for LiveKit Voice Agent

This directory contains a comprehensive collection of voice-based agent examples demonstrating various capabilities and integrations with the LiveKit Agents framework.
## Overview

## 📋 Table of Contents
This document explains the modifications made to `basic_agent.py` to implement intelligent interruption handling that distinguishes between **filler words** (acknowledgments like "yeah", "okay") and **command words** (interruptions like "stop", "wait").
Comment on lines +1 to +5

Copilot AI Feb 2, 2026

This README now focuses solely on the intelligent interruption handler in basic_agent.py, but the examples/voice_agents directory still contains many other example scripts that were previously indexed here; consider restoring or relocating a lightweight table of contents so users can still discover the rest of the voice agent examples from this README.

### 🚀 Getting Started
---
## Student Details
- **Name:** Sirjan Singh
- **College Roll Number:** 23UCS715
- **Demo Video Link:** [Drive Link](https://drive.google.com/drive/folders/1LXnojdfCtswc14PxWH60ZqynbLN03F3J?usp=sharing)

---

- [`basic_agent.py`](./basic_agent.py) - A fundamental voice agent with metrics collection
## The Challenge

### 🛠️ Tool Integration & Function Calling
In a natural voice conversation, users often say acknowledgment words like "yeah", "okay", or "hmm" while the agent is speaking. These are **backchannel responses** that mean "I'm listening, continue" — not "stop talking."

- [`annotated_tool_args.py`](./annotated_tool_args.py) - Using Python type annotations for tool arguments
- [`dynamic_tool_creation.py`](./dynamic_tool_creation.py) - Creating and registering tools dynamically at runtime
- [`raw_function_description.py`](./raw_function_description.py) - Using raw JSON schema definitions for tool descriptions
- [`silent_function_call.py`](./silent_function_call.py) - Executing function calls without verbal responses to user
- [`long_running_function.py`](./long_running_function.py) - Handling long running function calls with interruption support
However, LiveKit's default Voice Activity Detection (VAD) treats ALL user speech as potential interruptions, causing the agent to stop mid-sentence when hearing these fillers.

### ⚡ Real-time Models
**Requirements:**
1. **When agent is speaking + user says filler** → Agent continues uninterrupted
2. **When agent is speaking + user says command** → Agent stops immediately
3. **When agent is silent** → All user speech is valid input

Copilot AI Feb 2, 2026

The requirements section states that "When agent is silent → All user speech is valid input" (point 3), but later in the document and in basic_agent.py filler input while idle is explicitly suppressed; the requirements text should be updated to match the actual designed behavior (idle fillers are ignored).

Suggested change
3. **When agent is silent** → All user speech is valid input
3. **When agent is silent** → Non-filler user speech is valid input; idle fillers may be ignored

4. **Mixed input** → Commands always take priority over fillers (e.g., "yeah wait" is a command)

- [`weather_agent.py`](./weather_agent.py) - OpenAI Realtime API with function calls for weather information
- [`realtime_video_agent.py`](./realtime_video_agent.py) - Google Gemini with multimodal video and voice capabilities
- [`realtime_joke_teller.py`](./realtime_joke_teller.py) - Amazon Nova Sonic real-time model with function calls
- [`realtime_load_chat_history.py`](./realtime_load_chat_history.py) - Loading previous chat history into real-time models
- [`realtime_turn_detector.py`](./realtime_turn_detector.py) - Using LiveKit's turn detection with real-time models
- [`realtime_with_tts.py`](./realtime_with_tts.py) - Combining external TTS providers with real-time models
---

### 🎯 Pipeline Nodes & Hooks
## The Core Problem: Timing

- [`fast-preresponse.py`](./fast-preresponse.py) - Generating quick responses using the `on_user_turn_completed` node
- [`flush_llm_node.py`](./flush_llm_node.py) - Flushing partial LLM output to TTS in `llm_node`
- [`structured_output.py`](./structured_output.py) - Structured data and JSON outputs from agent responses
- [`speedup_output_audio.py`](./speedup_output_audio.py) - Dynamically adjusting agent audio playback speed
- [`timed_agent_transcript.py`](./timed_agent_transcript.py) - Reading timestamped transcripts from `transcription_node`
- [`inactive_user.py`](./inactive_user.py) - Handling inactive users with the `user_state_changed` event hook
- [`resume_interrupted_agent.py`](./resume_interrupted_agent.py) - Resuming agent speech after false interruption detection
- [`toggle_io.py`](./toggle_io.py) - Dynamically toggling audio input/output during conversations
The fundamental challenge is **VAD interrupts BEFORE transcripts arrive**:

### 🤖 Multi-agent & AgentTask Use Cases
```
Time 0.0s: User starts saying "yeah"
Time 0.3s: VAD detects speech → Interrupts agent
Time 0.5s: User finishes saying "yeah"
Time 0.8s: Transcript arrives → "Yeah."
```

- [`restaurant_agent.py`](./restaurant_agent.py) - Multi-agent system for restaurant ordering and reservation management
- [`multi_agent.py`](./multi_agent.py) - Collaborative storytelling with multiple specialized agents
- [`email_example.py`](./email_example.py) - Using AgentTask to collect and validate email addresses
By the time we know it was a filler word, the agent has already stopped!

### 🔗 MCP & External Integrations
---

- [`web_search.py`](./web_search.py) - Integrating web search capabilities into voice agents
- [`langgraph_agent.py`](./langgraph_agent.py) - LangGraph integration
- [`mcp/`](./mcp/) - Model Context Protocol (MCP) integration examples
- [`mcp-agent.py`](./mcp/mcp-agent.py) - MCP agent integration
- [`server.py`](./mcp/server.py) - MCP server example
- [`zapier_mcp_integration.py`](./zapier_mcp_integration.py) - Automating workflows with Zapier through MCP
## The Solution: Hybrid Approach

### 💾 RAG & Knowledge Management
We use a **three-layer defense system**:

- [`llamaindex-rag/`](./llamaindex-rag/) - Complete RAG implementation with LlamaIndex
- [`chat_engine.py`](./llamaindex-rag/chat_engine.py) - Chat engine integration
- [`query_engine.py`](./llamaindex-rag/query_engine.py) - Query engine used in a function tool
- [`retrieval.py`](./llamaindex-rag/retrieval.py) - Document retrieval
### Layer 1: Medium VAD Thresholds
```python
min_interruption_duration=0.6, # Requires 0.6 seconds of speech
min_interruption_words=2, # Requires at least 2 words
```

### 🎵 Specialized Use Cases
**Purpose:** Filters out very quick, single-word fillers ("yeah!", "okay!")

- [`background_audio.py`](./background_audio.py) - Playing background audio or ambient sounds during conversations
- [`push_to_talk.py`](./push_to_talk.py) - Push-to-talk interaction
- [`tts_text_pacing.py`](./tts_text_pacing.py) - Pacing control for TTS requests
- [`speaker_id_multi_speaker.py`](./speaker_id_multi_speaker.py) - Multi-speaker identification
**Tradeoff:** Longer fillers (1.5s "okaaaay") can still slip through

### 📊 Tracing & Error Handling
---

- [`langfuse_trace.py`](./langfuse_trace.py) - LangFuse integration for conversation tracing
- [`error_callback.py`](./error_callback.py) - Error handling callback
- [`session_close_callback.py`](./session_close_callback.py) - Session lifecycle management
### Layer 2: Automatic Resume on False Interruptions
```python
resume_false_interruption=True,
false_interruption_timeout=1.0,
```

## 📖 Additional Resources
**Purpose:** If VAD interrupts the agent, LiveKit waits 1 second for more user speech. If nothing substantial comes, it automatically resumes the agent's speech.

- [LiveKit Agents Documentation](https://docs.livekit.io/agents/)
- [Agents Starter Example](https://github.com/livekit-examples/agent-starter-python)
- [More Agents Examples](https://github.com/livekit-examples/python-agents-examples)
**How it helps:** When a slow filler ("okaaaay") interrupts the agent, this mechanism resumes automatically within 1 second.
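
The Layer 1 and Layer 2 settings can be kept together so they are tuned as a pair. The sketch below is a minimal illustration and assumes, as the snippets above suggest, that all four names are keyword arguments accepted by `AgentSession`. The names `INTERRUPTION_KWARGS` and `check_interruption_kwargs` are hypothetical and do not come from `basic_agent.py`.

```python
# Hypothetical grouping of the four settings shown in Layers 1 and 2.
INTERRUPTION_KWARGS = {
    "min_interruption_duration": 0.6,   # Layer 1: seconds of speech required
    "min_interruption_words": 2,        # Layer 1: words required
    "resume_false_interruption": True,  # Layer 2: auto-resume after a false alarm
    "false_interruption_timeout": 1.0,  # Layer 2: seconds to wait before resuming
}


def check_interruption_kwargs(kwargs: dict) -> None:
    """Sanity-check the values before handing them to the session."""
    if kwargs["min_interruption_duration"] <= 0:
        raise ValueError("min_interruption_duration must be positive")
    if kwargs["min_interruption_words"] < 0:
        raise ValueError("min_interruption_words must not be negative")
    if kwargs["resume_false_interruption"] and kwargs["false_interruption_timeout"] <= 0:
        raise ValueError("false_interruption_timeout must be positive when resume is on")


check_interruption_kwargs(INTERRUPTION_KWARGS)

# Assumed usage (not executed here, requires the LiveKit Agents SDK):
# session = AgentSession(**INTERRUPTION_KWARGS)
```

Keeping the values in one mapping makes the tradeoff visible: raising `min_interruption_duration` filters more fillers at Layer 1, while `false_interruption_timeout` bounds the pause the user hears when a filler does get through.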

---

### Layer 3: Transcript-Based Classification (The Brain)
The most important layer — our custom logic that analyzes transcripts. This layer enforces strict priority: **Commands > Real Input > Fillers**.

#### Key Logic Flow:
```python
@session.on("user_input_transcribed")
def on_user_input_transcribed(ev):
    text = normalize_text(ev.transcript)

    # 1. CHECK COMMANDS FIRST (Priority!)
    if contains_command(text):
        if agent.is_speaking:
            session.interrupt()  # Force stop if VAD missed it
        return  # Let LLM process the command

    # 2. CHECK FILLERS SECOND
    if is_filler_input(text):
        # Suppress from LLM so agent doesn't respond to "yeah"
        try_clear_user_turn(session)
        return

    # 3. REAL INPUT (Questions, conversation)
    # Process normally
```

This handles three cases:

#### Case 1: Agent Was Just Interrupted by VAD
- **Command:** Valid interruption, let LLM respond.
- **Filler:** False alarm! `resume_false_interruption` will auto-resume speech. We call `clear_user_turn()` so the LLM doesn't hear "yeah".
- **Real Input:** Valid interruption.

#### Case 2: Agent Is Currently Speaking (VAD Hasn't Triggered Yet)
- **Command:** Force immediate interrupt (`session.interrupt()`).
- **Filler:** Ignore completely (`clear_user_turn()`).
- **Real Input:** Allow interrupt (`session.interrupt()`).

#### Case 3: Agent Is Idle
- **Command/Real Input:** Process normally.
- **Filler:** Suppress (don't wake up LLM for just "okay").
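
The three cases above amount to a lookup from (agent state, input kind) to an action. The sketch below restates them as a table so that each combination can be checked in isolation. The enum names and action strings are hypothetical labels chosen for illustration, not identifiers from `basic_agent.py`.

```python
from enum import Enum


class AgentState(Enum):
    INTERRUPTED = "interrupted"  # Case 1: VAD has already stopped the agent
    SPEAKING = "speaking"        # Case 2: agent is still talking
    IDLE = "idle"                # Case 3: agent is silent


class InputKind(Enum):
    COMMAND = "command"
    FILLER = "filler"
    REAL = "real"


# Action labels are illustrative; each row mirrors one bullet in the cases above.
ACTIONS = {
    (AgentState.INTERRUPTED, InputKind.COMMAND): "respond",
    (AgentState.INTERRUPTED, InputKind.FILLER): "suppress_and_resume",
    (AgentState.INTERRUPTED, InputKind.REAL): "respond",
    (AgentState.SPEAKING, InputKind.COMMAND): "interrupt",
    (AgentState.SPEAKING, InputKind.FILLER): "suppress",
    (AgentState.SPEAKING, InputKind.REAL): "interrupt",
    (AgentState.IDLE, InputKind.COMMAND): "respond",
    (AgentState.IDLE, InputKind.FILLER): "suppress",
    (AgentState.IDLE, InputKind.REAL): "respond",
}


def decide(state: AgentState, kind: InputKind) -> str:
    """Return the action label for one transcript event."""
    return ACTIONS[(state, kind)]


print(decide(AgentState.SPEAKING, InputKind.FILLER))
```

A table like this keeps the filler rows consistent across states: a filler is never forwarded to the LLM, whether the agent is speaking, interrupted, or idle.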

---

## Key Code Changes (Refactored)

### 1. Robust Word Lists

**Command Detection** (Stop Phrases & Prefixes):
```python
# Single words
STOP_WORDS = {"wait", "stop", "finish", "hold", "pause", "halt", ...}

# Multi-word phrases (normalized)
STOP_PHRASES = {"holdon", "waitasecond", "stopit", "waitaminute", ...}

# Prefixes that can precede commands
COMMAND_PREFIXES = {"no", "but", "and", "okay", "please", "hey"}
```
*Now catches:* `"no wait"`, `"hold on"`, `"wait a second"`, `"yeah stop"`

**Filler Words** (Strict filtering):
```python
FILLER_WORDS = {
    "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup",
    "hmm", "right", "uh", "um", "ah", "cool", "great", "no", "nah",
    # Removed generic words like "i", "see", "all" to avoid false positives
}
```

### 2. Detection Functions

**`contains_command(transcript)`**:
- Checks for multi-word phrases (`"hold on"`).
- Checks for prefixes (`"no wait"`).
- Checks priority positions (first 3 words).

**`is_filler_input(transcript)`**:
- **CRITICAL:** Calls `contains_command()` first! If it's a command, it is NOT a filler.
- Only matches if input is *purely* filler words/phrases.
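
One way the two detection functions could be written is sketched below. It is an approximation built only from the bullets above, not the code in `basic_agent.py`. The word lists contain just the entries shown earlier, since the full sets are elided, and the `tokenize` helper is a hypothetical stand-in for whatever normalization the agent actually performs.

```python
import re

# Only the entries listed in this README; the real sets are longer.
STOP_WORDS = {"wait", "stop", "finish", "hold", "pause", "halt"}
STOP_PHRASES = {"holdon", "waitasecond", "stopit", "waitaminute"}
COMMAND_PREFIXES = {"no", "but", "and", "okay", "please", "hey"}
FILLER_WORDS = {
    "uhhuh", "okay", "alright", "mhm", "yeah", "yep", "yup",
    "hmm", "right", "uh", "um", "ah", "cool", "great", "no", "nah",
}


def tokenize(transcript: str) -> list[str]:
    """Lowercase, join hyphenated words ("uh-huh" -> "uhhuh"), drop punctuation."""
    return re.findall(r"[a-z']+", transcript.lower().replace("-", ""))


def contains_command(transcript: str) -> bool:
    words = tokenize(transcript)

    # 1. Multi-word phrases: join each run of 2 or 3 consecutive words.
    for size in (2, 3):
        for start in range(len(words) - size + 1):
            if "".join(words[start:start + size]) in STOP_PHRASES:
                return True

    # 2. Prefixes: skip leading prefix words, then inspect the next word.
    index = 0
    while index < len(words) and words[index] in COMMAND_PREFIXES:
        index += 1
    if index < len(words) and words[index] in STOP_WORDS:
        return True

    # 3. Priority positions: any stop word among the first three words.
    return any(word in STOP_WORDS for word in words[:3])


def is_filler_input(transcript: str) -> bool:
    # Commands win over fillers, so "yeah wait" is not a filler.
    if contains_command(transcript):
        return False
    words = tokenize(transcript)
    # Empty input is not a filler; otherwise every word must be a filler.
    return bool(words) and all(word in FILLER_WORDS for word in words)
```

Matching phrases on whole-word runs, rather than on a substring of the joined transcript, avoids treating a word that merely contains a phrase (for example a word ending in "hold" followed by "on") as a command.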

### 3. Transcript Suppression
We use a helper to prevent the LLM from responding to fillers:
```python
def try_clear_user_turn(session):
    if hasattr(session, 'clear_user_turn'):
        session.clear_user_turn()
```
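
The guard can be exercised without a live session. The sketch below uses two hypothetical stand-in classes, one with `clear_user_turn` and one without, and a variant of the helper that returns whether the turn was cleared. The boolean return value is an addition for testability and is not part of the helper shown above.

```python
class StubSession:
    """Hypothetical stand-in for a session that supports clear_user_turn()."""

    def __init__(self) -> None:
        self.cleared = 0

    def clear_user_turn(self) -> None:
        self.cleared += 1


class LegacySession:
    """Hypothetical stand-in for a session without clear_user_turn()."""


def try_clear_user_turn(session) -> bool:
    """Clear the pending user turn if the session supports it."""
    clear = getattr(session, "clear_user_turn", None)
    if callable(clear):
        clear()
        return True
    return False  # Nothing to call; the filler is simply not suppressed.


stub = StubSession()
print(try_clear_user_turn(stub), stub.cleared)
print(try_clear_user_turn(LegacySession()))
```

Checking for the method at call time lets the same handler degrade gracefully when the session object does not expose `clear_user_turn`, at the cost of silently skipping suppression in that case.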

---

## How It All Works Together (Examples)

### Scenario 1: User says "yeah" (0.3s, quick acknowledgment)
1. ✅ **VAD Layer:** Too short (< 0.6s) → No interrupt
2. ✅ **Transcript Layer:** `is_filler_input` = True. `try_clear_user_turn()` called.
3. ✅ **Result:** Agent continues speaking. LLM sees nothing.

### Scenario 2: User says "okaaaay" (1.5s, slow filler)
1. ❌ **VAD Layer:** Long enough (> 0.6s) → Interrupts agent
2. ✅ **Resume Layer:** Waits 1s, decides it's a false interrupt → Resumes
3. ✅ **Transcript Layer:** `is_filler_input` = True. Suppresses transcript.
4. ✅ **Result:** Brief pause (1s), then agent resumes.

### Scenario 3: User says "no wait" (Quick command)
1. ❌ **VAD Layer:** Might be too short or missed.
2. ✅ **Transcript Layer:** `contains_command` = True (catches "no" + "wait").
3. ✅ **Action:** `session.interrupt()` forced immediately.
4. ✅ **Result:** Agent stops. LLM processes "no wait".

### Scenario 4: User says "I have a question"
1. ✅ **Transcript Layer:** Not a command, not a filler.
2. ✅ **Action:** Real input. Interrupts agent.
3. ✅ **Result:** Standard conversation flow.

---

## Files Modified

- **`basic_agent.py`** — Main implementation with all intelligent interruption logic.

## Dependencies

No additional dependencies are required. The implementation uses only the standard Python `re` module and the LiveKit Agents SDK.

---

## Future Improvements

1. **Semantic Analysis:** Use a small NLU model or LLM to determine whether "right" means "correct" (an answer) or "continue" (a filler).
2. **Prosody Analysis:** Differentiate "stop?" (question) from "STOP!" (command) based on pitch/volume.