┌─────────────────────────────────────────────────────────────┐ │ FRONTEND (HTML/JS) │ │ - Microphone capture │ │ - WebSocket client │ │ - Audio playback │ │ - UI state management │ └────────────────────┬────────────────────────────────────────┘ │ WebSocket Connection │ (bidirectional audio + control messages) ▼ ┌─────────────────────────────────────────────────────────────┐ │ FASTAPI BACKEND (main.py) │ │ - WebSocket endpoint (/ws/voice) │ │ - REST endpoints (/roles, /experience-levels) │ │ - Session management │ │ - CORS middleware │ └─────────┬───────────────────────────────┬───────────────────┘ │ │ ▼ ▼ ┌──────────────────────┐ ┌──────────────────────┐ │ SERVICE LAYER │ │ MODELS LAYER │ │ │ │ │ │ • STT Service │ │ • RoleType (enum) │ │ • TTS Service │ │ • ExperienceLevel │ │ • LLM Service │ │ • InterviewPhase │ │ • Interview Service │ │ • InterviewSession │ │ │ │ • InterviewConfig │ └──────────┬───────────┘ │ • Message │ │ │ • Evaluation │ │ └──────────────────────┘ ▼ ┌─────────────────────────────────────────────┐ │ EXTERNAL APIs (3rd Party) │ │ │ │ • Assembly AI → Speech-to-Text │ │ • Cartesia AI → Text-to-Speech │ │ • Groq → LLM (Llama 3.3 70B) │ └─────────────────────────────────────────────┘
Phase 1: User Speaks (Frontend → Backend)
- User clicks "Click to Speak" button ↓
- Browser captures microphone audio (MediaRecorder API) ↓
- Audio chunks collected in buffer (Blob array) ↓
- User clicks end button ↓
- Audio chunks combined into single Blob ↓
- Blob converted to base64 string ↓
- Sent via WebSocket: { "type": "audio", "audio": "base64_encoded_audio_data" }
Phase 2: Speech-to-Text (Backend Processing)
- WebSocket handler receives message ↓
- Base64 decoded → raw audio bytes ↓
- Audio bytes sent to Assembly AI STT service ↓
- Assembly AI returns transcribed text { "text": "I am a full stack developer..." } ↓
- Text stored in session message history: Message(role="user", content="I am a full stack...")
Phase 3: LLM Processing (Brain)
- Interview Service receives user text ↓
- Context built:
- Current interview phase (intro/technical/behavioral)
- User's selected role (Backend/Frontend/etc.)
- Experience level (Junior/Mid/Senior)
- Previous conversation history ↓
- Prompt constructed: SYSTEM: "You are conducting a {role} interview..." HISTORY: [previous messages...] USER: "I am a full stack developer..." ↓
- Sent to Groq LLM (via LangChain ChatGroq) ↓
- LLM generates contextual response: "That's great! Can you tell me about a recent project where you integrated frontend and backend?" ↓
- Response stored in session: Message(role="assistant", content="That's great...")
Phase 4: Text-to-Speech (Backend Processing)
- AI response text sent to Cartesia TTS service ↓
- Cartesia generates audio:
- Model: sonic-english
- Voice ID: a0e99841-438c-4a64-b679-ae501e7d6091
- Encoding: PCM 32-bit float
- Sample rate: 44100 Hz ↓
- Audio returned as chunks (iterator) ↓
- Chunks collected into complete audio buffer
Phase 5: Response Delivery (Backend → Frontend)
- Audio buffer converted to base64 ↓
- Sent via WebSocket: { "type": "audio", "audio": "base64_audio_data" } ↓
- Also send transcript for display: { "type": "transcript", "text": "That's great! Can you tell...", "phase": "technical" }
Phase 6: Audio Playback (Frontend)
- Browser receives WebSocket messages ↓
- Base64 decoded → Blob ↓
- Blob converted to Object URL ↓
- Audio element created dynamically ↓
- Audio.play() → User hears AI voice ↓
- Transcript displayed in chat UI
voice-interview-agent/ │ ├── backend/ │ ├── app/ │ │ ├── main.py # FastAPI app, WebSocket endpoint │ │ ├── config.py # Environment variables, settings │ │ │ │ │ ├── models/ │ │ │ └── interview.py # Pydantic models (data validation) │ │ │ │ │ ├── services/ │ │ │ ├── stt_service.py # Assembly AI integration │ │ │ ├── tts_service.py # Cartesia AI integration │ │ │ ├── llm_service.py # Groq LLM integration │ │ │ └── interview_service.py # Business logic coordinator │ │ │ │ │ └── routes/ │ │ └── interview.py # REST API endpoints │ │ │ └── tests/ │ ├── test_stt.py # STT testing in isolation │ ├── test_tts.py # TTS testing in isolation │ └── test_voice_pipeline.py # Full pipeline test │ ├── test_voice_client.html # Frontend UI ├── .env # API keys └── pyproject.toml # Dependencies (uv)