An intelligent AI voice agent that conducts natural phone conversations to collect customer feedback
Customers are often reluctant to leave feedback online about a recent purchase. Consider how rarely people review a product on Amazon after using it.
With recent advances in LLMs, text-to-speech services, and RAG systems, we can build an AI caller that collects reviews and feedback from customers within a few seconds, along with their suggestions and sentiment.
- 3-Step LLM Processing: Analyze → Plan → Generate for structured, natural responses
- Smart Context Awareness: Remembers conversation history and adapts accordingly
- Emotional Intelligence: Detects sentiment and responds with appropriate empathy
- Role Consistency: Advanced role confusion detection prevents AI identity mix-ups
- Sarah Persona: Warm, professional customer service representative
- Structured Responses: Always follows Acknowledge → Empathize → Ask pattern
- Natural Pacing: Automatic pause insertion for human-like speech rhythm
- Topic Tracking: Avoids repetitive questions by remembering what's been discussed
- Real-time Processing: WebSocket-based bidirectional audio streaming
- Smart Format Handling: Automatic browser audio format detection and optimization
- Natural Voice: ElevenLabs Rachel voice with optimized settings for phone conversations
- Reliable STT: AssemblyAI with direct WebM support (no conversion needed)
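The three-step processing and topic tracking listed above can be illustrated with a small, rule-based sketch. This is not the project's implementation: the keyword sets, the `ConversationState` class, and the canned phrases are hypothetical stand-ins for what the LLM does in each step.

```python
from dataclasses import dataclass, field

# Hypothetical keyword lists; the real project delegates analysis to the LLM.
POSITIVE = {"great", "love", "comfortable", "good"}
NEGATIVE = {"broke", "bad", "disappointed", "poor"}
TOPICS = {"grip", "durability", "price", "delivery"}


@dataclass
class ConversationState:
    topics_covered: list = field(default_factory=list)
    customer_sentiment: str = "neutral"
    turn_count: int = 0


def analyze(text: str) -> dict:
    """Step 1: extract sentiment and topics from the customer's reply."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    if words & NEGATIVE:
        sentiment = "negative"
    elif words & POSITIVE:
        sentiment = "positive"
    else:
        sentiment = "neutral"
    return {"sentiment": sentiment, "topics": sorted(words & TOPICS)}


def plan(analysis: dict, state: ConversationState) -> dict:
    """Step 2: choose a tone and the next topic that has not come up yet."""
    covered = set(state.topics_covered) | set(analysis["topics"])
    remaining = [t for t in sorted(TOPICS) if t not in covered]
    return {
        "tone": "apologetic" if analysis["sentiment"] == "negative" else "warm",
        "next_topic": remaining[0] if remaining else None,
    }


def generate(step_plan: dict) -> str:
    """Step 3: build an Acknowledge -> Empathize -> Ask response."""
    acknowledge = "Thanks for sharing that."
    if step_plan["tone"] == "apologetic":
        empathize = "I'm sorry it fell short."
    else:
        empathize = "I'm glad to hear it."
    if step_plan["next_topic"]:
        ask = f"How has the {step_plan['next_topic']} been for you?"
    else:
        ask = "Is there anything else you'd like to add?"
    return f"{acknowledge} {empathize} {ask}"


def respond(text: str, state: ConversationState) -> str:
    analysis = analyze(text)
    step_plan = plan(analysis, state)
    # Record both what the customer mentioned and what we are about to ask,
    # so later turns do not repeat a question.
    for topic in [*analysis["topics"], step_plan["next_topic"]]:
        if topic and topic not in state.topics_covered:
            state.topics_covered.append(topic)
    state.customer_sentiment = analysis["sentiment"]
    state.turn_count += 1
    return generate(step_plan)
```

Calling `respond()` once per customer turn with the same `ConversationState` is what lets the plan step skip topics that were already discussed.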
    ┌─────────────────┐   WebSocket    ┌──────────────────┐
    │    Frontend     │◄──────────────►│     FastAPI      │
    │    (Browser)    │                │     Backend      │
    └─────────────────┘                └──────────────────┘
             │                                  │
             │ MediaRecorder                    │
             │ (WebM/Opus)                      │
             ▼                                  ▼
    ┌─────────────────┐                ┌──────────────────┐
    │  Web Audio API  │                │ 3-Step Pipeline  │
    │ Audio Playback  │                │   1. Analyze     │
    └─────────────────┘                │   2. Plan        │
                                       │   3. Generate    │
                                       └──────────────────┘
                                                │
                     ┌──────────────────────────┼───────────────────────────┐
                     ▼                          ▼                           ▼
              ┌──────────────┐         ┌──────────────────┐         ┌──────────────┐
              │  AssemblyAI  │         │  Google Gemini   │         │  ElevenLabs  │
              │    (STT)     │         │    2.0 Flash     │         │    (TTS)     │
              └──────────────┘         │      (LLM)       │         └──────────────┘
                                       └──────────────────┘
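One customer turn through this architecture can be sketched with `asyncio` as follows. The three service functions are hypothetical stubs standing in for the AssemblyAI, Gemini, and ElevenLabs calls, and the JSON field names are assumptions made for illustration; the actual handler in `app/api/agent_voice.py` may be organized differently.

```python
import asyncio
import json


async def transcribe(audio: bytes) -> str:
    """Stand-in for the speech-to-text call (hypothetical stub)."""
    await asyncio.sleep(0)
    return audio.decode("utf-8", errors="ignore")


async def generate_reply(transcript: str) -> str:
    """Stand-in for the 3-step LLM pipeline (hypothetical stub)."""
    await asyncio.sleep(0)
    return f"Thanks for telling me: {transcript}"


async def synthesize(text: str) -> bytes:
    """Stand-in for the text-to-speech call (hypothetical stub)."""
    await asyncio.sleep(0)
    return text.encode("utf-8")


async def handle_turn(audio: bytes) -> tuple[str, bytes]:
    """Run one customer turn: STT -> LLM -> TTS."""
    transcript = await transcribe(audio)
    reply = await generate_reply(transcript)
    speech = await synthesize(reply)
    # Field names below are assumptions, not the project's wire format.
    message = json.dumps({"customer": transcript, "agent": reply})
    return message, speech
```

The stages run strictly in sequence because each one needs the previous stage's output, which is why the per-turn latency is roughly the sum of the three service calls.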
- Python 3.10+
- Modern web browser (Chrome/Edge recommended)
- API Keys: Google AI, ElevenLabs, AssemblyAI
- Clone and Setup

      git clone https://github.com/jaibhasin/AI-Caller-Review-Collector.git
      cd AI-Caller-Review-Collector
      python3 -m venv venv
      source venv/bin/activate  # macOS/Linux
      pip install -r requirements.txt
- Configure Environment

      # Create .env file
      echo "SECRET_KEY_GOOGLE_AI=your_google_ai_key" > .env
      echo "ELEVEN_LABS_API_KEY=your_elevenlabs_key" >> .env
      echo "ASSEMBLYAI_API_KEY=your_assemblyai_key" >> .env
- Launch

      # Start backend
      uvicorn app.main:app --reload
      # Open frontend/index.html in browser
      # Click "Start Call" and begin conversation!
Sarah is designed to be the perfect customer service representative:
- Personality: Warm, professional, genuinely interested
- Communication Style: Natural, conversational, never rushed
- Intelligence: Understands context, emotions, and conversation flow
- Consistency: Always stays in character as the company representative
Sarah: "Hi there! This is Sarah calling from Lifelong. I hope you're
having a good day. I wanted to give you a quick call about the
pickleball set you got from us recently. Is this an okay time
to chat for just a minute?"
Customer: "Oh hi! Yeah, sure, I have a few minutes."
Sarah: "Wonderful! I'm so glad I caught you at a good time... How has
your experience been with the pickleball set so far?"
Customer: "It's been really great actually! The grip is so comfortable."
Sarah: "Oh that's fantastic to hear that you love the grip comfort!...
What specifically makes it feel so good to use?"
    # Every customer response goes through:
    1. ANALYZE → Extract sentiment, topic, keywords, emotion level
    2. PLAN → Decide acknowledgment style, empathy approach, follow-up
    3. GENERATE → Create natural Sarah response following structure
    4. POST-PROCESS → Fix role confusion, add natural pacing

    conversation_state = {
        "topics_covered": ["grip", "durability"],
        "customer_sentiment": "positive",
        "turn_count": 3,
        "last_analysis": {...},
        "last_plan": {...},
        "conversation_history": [...]
    }

- Browser Compatibility: Automatic format detection (WebM → MP4 → fallback)
- Natural Speech: Optimized ElevenLabs settings with pauses and pacing
- Reliable Processing: Direct WebM support, no ffmpeg conversion needed
- Quality Control: Phone-optimized voice settings for clear communication
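The post-processing step (fixing role confusion and adding natural pacing before the text reaches TTS) might look like the sketch below. The regular expressions and the choice of `...` as a pause marker are assumptions based on the sample conversation; the project's actual rules live in `agent_voice.py` and are not shown here.

```python
import re

# Hypothetical patterns for replies where the model echoes a speaker label
# or starts writing the customer's side of the conversation.
SPEAKER_LABEL = re.compile(r"^\s*(Sarah|Agent|Assistant)\s*:\s*", re.IGNORECASE)
CUSTOMER_LINE = re.compile(r"^\s*Customer\s*:", re.IGNORECASE)


def fix_role_confusion(reply: str) -> str:
    """Keep only the agent's own lines and drop speaker labels."""
    kept = []
    for line in reply.splitlines():
        if CUSTOMER_LINE.match(line):
            break  # the model began writing the customer's turn
        line = SPEAKER_LABEL.sub("", line).strip()
        if line:
            kept.append(line)
    return " ".join(kept)


def add_pauses(reply: str) -> str:
    """Insert a pause marker between the opening exclamation and the rest."""
    first, sep, rest = reply.partition("! ")
    if not sep or not rest:
        return reply
    return f"{first}!... {rest}"


def post_process(reply: str) -> str:
    return add_pauses(fix_role_confusion(reply))
```

Running role cleanup before pause insertion keeps the pause logic from acting on text that is about to be discarded.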
    AI-Caller-Review-Collector/
    ├── Core Application
    │   ├── app/
    │   │   ├── main.py                      # FastAPI entry point
    │   │   ├── api/agent_voice.py           # 3-step pipeline WebSocket handler
    │   │   └── services/
    │   │       └── simple_stt_service.py    # Optimized AssemblyAI integration
    │   └── frontend/
    │       ├── index.html                   # Modern UI with audio visualization
    │       ├── script.js                    # WebSocket + Web Audio API
    │       └── styles.css                   # Responsive design
    ├── Documentation
    │   ├── structured_response_pipeline.md
    │   ├── conversation_improvements.md
    │   └── role_confusion_fix.md
    ├── Testing
    │   ├── test_audio_formats.html
    │   └── conversation_example.md
    └── Configuration
        ├── requirements.txt
        ├── .env
        └── README.md
| Variable | Purpose | Example |
|---|---|---|
| `SECRET_KEY_GOOGLE_AI` | Gemini 2.0 Flash API access | `AIzaSy...` |
| `ELEVEN_LABS_API_KEY` | Rachel voice synthesis | `sk_...` |
| `ASSEMBLYAI_API_KEY` | Real-time speech recognition | `a13c86...` |
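A startup check for these three variables could be written as below. It is a sketch that assumes the `.env` values have already been loaded into the process environment by whatever loader the project uses; `missing_keys` and `require_keys` are hypothetical helper names, not functions from the repository.

```python
import os

REQUIRED_KEYS = (
    "SECRET_KEY_GOOGLE_AI",
    "ELEVEN_LABS_API_KEY",
    "ASSEMBLYAI_API_KEY",
)


def missing_keys(env=None) -> list:
    """Return the required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


def require_keys(env=None) -> None:
    """Fail at startup with a readable message instead of mid-call."""
    missing = missing_keys(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_keys()` before the server starts accepting calls surfaces a missing key immediately rather than on the first STT, LLM, or TTS request.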
- `GET /`: Health check and system status
- `GET /docs`: Interactive API documentation (Swagger UI)
- `WS /api/agent/voice`: Real-time voice conversation endpoint
  - Accepts: WebM/Opus audio chunks
  - Returns: JSON conversation data + MP3 audio chunks
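A client reading from the voice endpoint has to separate conversation data from audio. The sketch below assumes JSON arrives as text frames and MP3 audio as binary frames; the README does not specify the framing, so treat that split, the `CallTranscript` class, and the example field names as hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CallTranscript:
    events: list = field(default_factory=list)           # decoded JSON messages
    audio: bytearray = field(default_factory=bytearray)  # concatenated MP3 bytes


def handle_frame(frame, transcript: CallTranscript) -> None:
    """Route one frame: binary frames are audio, text frames are JSON."""
    if isinstance(frame, (bytes, bytearray)):
        transcript.audio.extend(frame)
    else:
        transcript.events.append(json.loads(frame))
```

Keeping the audio in a single buffer makes it simple to hand the full reply to a player once the last chunk for a turn has arrived.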
Microphone Not Working
- Check browser permissions
- Ensure the page is served over HTTPS or from localhost
- Verify Web Audio API support

WebSocket Connection Failed
- Verify backend is running: http://localhost:8000
- Check firewall settings
- Confirm CORS configuration

AI Role Confusion
- Role confusion is detected and fixed automatically
- Check console for "[DEBUG] Fixed role confusion" messages
- Review conversation_state in logs

Audio Quality Issues
- Test browser audio format support: open test_audio_formats.html
- Check ElevenLabs API quota
- Verify voice settings in agent_voice.py

    # In agent_voice.py, modify:
    VOICE_ID = "21m00Tcm4TlvDq8ikWAM"  # Rachel (default)
    # Or try: "pNInz6obpgDQGcFmaJgB"  # Adam

    # Modify BASE_PROMPT in agent_voice.py for different:
    # - Product types
    # - Company personas
    # - Conversation styles
    # - Response structures

    # Adjust LLM settings:
    temperature=0.8,  # Creativity (0.1-1.0)
    max_tokens=150,   # Response length
    top_p=0.9         # Response variety

- Response Time: ~2-3 seconds (STT + LLM Pipeline + TTS)
- Audio Quality: 16 kHz, optimized for voice clarity
- Conversation Length: Automatically managed (5-7 turns typical)
- Success Rate: 95%+ natural conversation flow
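To check the response-time figure on your own setup, each stage can be timed separately. The helper below is a generic sketch using only the standard library; `StageTimer` and the stage names are hypothetical and not part of the repository.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collect wall-clock durations for named pipeline stages."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate so a stage that runs more than once is summed.
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self) -> float:
        return sum(self.durations.values())
```

Wrapping the STT, LLM, and TTS calls in `with timer.stage("stt"):` style blocks shows which service dominates the latency of a turn.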
This is a portfolio project, but suggestions and improvements are welcome!
- Fork the repository
- Create a feature branch
- Make your improvements
- Submit a pull request
MIT License - Feel free to use this project for learning and development.
Jai Bhasin
- GitHub: @jaibhasin
- Email: bhasinjai@gmail.com
- Google AI - Gemini 2.0 Flash LLM
- ElevenLabs - Natural voice synthesis
- AssemblyAI - Real-time speech recognition
- FastAPI - Modern Python web framework
- LangChain - LLM application framework
Built with ❤️ for natural human-AI conversation