voice interview agent follows a microservices-style architecture with clear separation of concerns

┌─────────────────────────────────────────────────────────────┐ │ FRONTEND (HTML/JS) │ │ - Microphone capture │ │ - WebSocket client │ │ - Audio playback │ │ - UI state management │ └────────────────────┬────────────────────────────────────────┘ │ WebSocket Connection │ (bidirectional audio + control messages) ▼ ┌─────────────────────────────────────────────────────────────┐ │ FASTAPI BACKEND (main.py) │ │ - WebSocket endpoint (/ws/voice) │ │ - REST endpoints (/roles, /experience-levels) │ │ - Session management │ │ - CORS middleware │ └─────────┬───────────────────────────────┬───────────────────┘ │ │ ▼ ▼ ┌──────────────────────┐ ┌──────────────────────┐ │ SERVICE LAYER │ │ MODELS LAYER │ │ │ │ │ │ • STT Service │ │ • RoleType (enum) │ │ • TTS Service │ │ • ExperienceLevel │ │ • LLM Service │ │ • InterviewPhase │ │ • Interview Service │ │ • InterviewSession │ │ │ │ • InterviewConfig │ └──────────┬───────────┘ │ • Message │ │ │ • Evaluation │ │ └──────────────────────┘ ▼ ┌─────────────────────────────────────────────┐ │ EXTERNAL APIs (3rd Party) │ │ │ │ • Assembly AI → Speech-to-Text │ │ • Cartesia AI → Text-to-Speech │ │ • Groq → LLM (Llama 3.3 70B) │ └─────────────────────────────────────────────┘

Complete Data Flow Pipeline

Phase 1: User Speaks (Frontend → Backend)

User clicks "Click to Speak" button ↓
Browser captures microphone audio (MediaRecorder API) ↓
Audio chunks collected in buffer (Blob array) ↓
User clicks end button ↓
Audio chunks combined into single Blob ↓
Blob converted to base64 string ↓
Sent via WebSocket: { "type": "audio", "audio": "base64_encoded_audio_data" }

Phase 2: Speech-to-Text (Backend Processing)

WebSocket handler receives message ↓
Base64 decoded → raw audio bytes ↓
Audio bytes sent to Assembly AI STT service ↓
Assembly AI returns transcribed text { "text": "I am a full stack developer..." } ↓
Text stored in session message history: Message(role="user", content="I am a full stack...")

Phase 3: LLM Processing (Brain)

Interview Service receives user text ↓
Context built:
- Current interview phase (intro/technical/behavioral)
- User's selected role (Backend/Frontend/etc.)
- Experience level (Junior/Mid/Senior)
- Previous conversation history ↓
Prompt constructed: SYSTEM: "You are conducting a {role} interview..." HISTORY: [previous messages...] USER: "I am a full stack developer..." ↓
Sent to Groq LLM (via LangChain ChatGroq) ↓
LLM generates contextual response: "That's great! Can you tell me about a recent project where you integrated frontend and backend?" ↓
Response stored in session: Message(role="assistant", content="That's great...")

Phase 4: Text-to-Speech (Backend Processing)

AI response text sent to Cartesia TTS service ↓
Cartesia generates audio:
- Model: sonic-english
- Voice ID: a0e99841-438c-4a64-b679-ae501e7d6091
- Encoding: PCM 32-bit float
- Sample rate: 44100 Hz ↓
Audio returned as chunks (iterator) ↓
Chunks collected into complete audio buffer

Phase 5: Response Delivery (Backend → Frontend)

Audio buffer converted to base64 ↓
Sent via WebSocket: { "type": "audio", "audio": "base64_audio_data" } ↓
Also send transcript for display: { "type": "transcript", "text": "That's great! Can you tell...", "phase": "technical" }

Phase 6: Audio Playback (Frontend)

Browser receives WebSocket messages ↓
Base64 decoded → Blob ↓
Blob converted to Object URL ↓
Audio element created dynamically ↓
Audio.play() → User hears AI voice ↓
Transcript displayed in chat UI

File Structure & Responsibilities

voice-interview-agent/ │ ├── backend/ │ ├── app/ │ │ ├── main.py # FastAPI app, WebSocket endpoint │ │ ├── config.py # Environment variables, settings │ │ │ │ │ ├── models/ │ │ │ └── interview.py # Pydantic models (data validation) │ │ │ │ │ ├── services/ │ │ │ ├── stt_service.py # Assembly AI integration │ │ │ ├── tts_service.py # Cartesia AI integration │ │ │ ├── llm_service.py # Groq LLM integration │ │ │ └── interview_service.py # Business logic coordinator │ │ │ │ │ └── routes/ │ │ └── interview.py # REST API endpoints │ │ │ └── tests/ │ ├── test_stt.py # STT testing in isolation │ ├── test_tts.py # TTS testing in isolation │ └── test_voice_pipeline.py # Full pipeline test │ ├── test_voice_client.html # Frontend UI ├── .env # API keys └── pyproject.toml # Dependencies (uv)

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
backend		backend
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
pipeline_test.wav		pipeline_test.wav
pyproject.toml		pyproject.toml
test_output.wav		test_output.wav
test_voice_client.html		test_voice_client.html
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

voice interview agent follows a microservices-style architecture with clear separation of concerns

Complete Data Flow Pipeline

File Structure & Responsibilities

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

voice interview agent follows a microservices-style architecture with clear separation of concerns

Complete Data Flow Pipeline

File Structure & Responsibilities

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages