A modular, voice-controlled AI agent that accepts audio input, converts speech to text, classifies intent using an LLM, executes local tools, and displays the full pipeline in a premium Streamlit UI.
┌──────────────┐     ┌──────────────┐     ┌───────────────────┐     ┌──────────────┐
│ Audio Input  │────▶│   Whisper    │────▶│ Intent Classifier │────▶│    Tools     │
│  (upload /   │     │    (STT)     │     │  (LLM / Ollama)   │     │  Execution   │
│  mic / text) │     └──────────────┘     └───────────────────┘     └──────┬───────┘
└──────────────┘                                                           │
                                                                           ▼
                                                                    ┌──────────────┐
                                                                    │ Streamlit UI │
                                                                    │  (Results)   │
                                                                    └──────────────┘
| Module | Purpose |
|---|---|
| `app.py` | Streamlit UI: upload, record, type, view results |
| `main.py` | Pipeline orchestrator: ties all stages together |
| `stt/whisper_stt.py` | Speech-to-text with local Whisper / Groq / OpenAI fallback |
| `llm/intent_classifier.py` | Intent classification + text generation via Ollama / APIs |
| `tools/file_ops.py` | Safe file & folder creation inside `/output` |
| `tools/code_generator.py` | Code generation via LLM + auto-save |
| `tools/summarizer.py` | Text summarization via LLM |
| `utils/helpers.py` | Logging, path safety, audio temp files, env config |
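The way these modules fit together can be sketched as a small orchestrator function. This is only an illustration of the flow described above, not the contents of `main.py`: the names `PipelineResult` and `run_pipeline`, and the idea of passing each stage in as a callable, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class PipelineResult:
    """Everything a UI would need to display for one request (hypothetical)."""
    transcript: str
    intents: List[str]
    outputs: List[str] = field(default_factory=list)


def run_pipeline(
    text_or_audio: Union[str, bytes],
    transcribe: Callable[[bytes], str],
    classify: Callable[[str], List[str]],
    tools: Dict[str, Callable[[str], str]],
) -> PipelineResult:
    # Typed text skips the speech-to-text stage; audio bytes go through it.
    if isinstance(text_or_audio, str):
        transcript = text_or_audio
    else:
        transcript = transcribe(text_or_audio)

    intents = classify(transcript)
    result = PipelineResult(transcript=transcript, intents=intents)

    # Run one tool per detected intent, in the order the classifier returned them.
    for intent in intents:
        handler = tools.get(intent)
        if handler is None:
            result.outputs.append(f"Unknown intent: {intent}")
        else:
            result.outputs.append(handler(transcript))
    return result
```

Injecting the stages as plain callables keeps the sketch testable without Whisper or an LLM: stub functions can stand in for each stage.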
- Python 3.10+
- Ollama installed and running locally (recommended). Install Ollama, then pull a model: `ollama pull llama3`
- OR: Set `GROQ_API_KEY` or `OPENAI_API_KEY` in a `.env` file for cloud fallback
- OR: Run with no setup at all. The system includes a Demo Mode that uses keyword-based intent classification and template code generation when no LLM backend is available
Hardware Note: If your machine cannot run Ollama or Whisper locally, the system gracefully falls back to cloud APIs (Groq/OpenAI) or demo mode. No external service is strictly required.
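The fallback order described above could look roughly like the following. This is a minimal sketch under stated assumptions: the function name `choose_llm_backend` is hypothetical, the caller is assumed to have already checked whether Ollama is reachable, and trying Groq before OpenAI is a guess rather than something this README specifies.

```python
import os
from typing import Mapping, Optional


def choose_llm_backend(
    ollama_running: bool,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return which backend to use: 'ollama', 'groq', 'openai', or 'demo'."""
    env = os.environ if env is None else env

    # Prefer the local model when it is available.
    if ollama_running:
        return "ollama"
    # Otherwise fall back to a cloud API if a key is configured
    # (the Groq-before-OpenAI order is an assumption).
    if env.get("GROQ_API_KEY"):
        return "groq"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    # No backend at all: keyword-based demo mode.
    return "demo"
```

Passing the environment in as a mapping makes the choice easy to check without touching real API keys, for example `choose_llm_backend(False, {})` returns `"demo"`.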
# 1. Clone the repository
git clone <your-repo-url>
cd voice_ai_agent
# 2. Create a virtual environment (recommended)
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # macOS/Linux
# 3. Install dependencies
pip install -r requirements.txt
# 4. (Optional) Create a .env file for API keys
echo GROQ_API_KEY=your_key_here > .env
echo OPENAI_API_KEY=your_key_here >> .env
# 5. Make sure Ollama is running (if using local LLM)
ollama serve

streamlit run app.py

The app will open at http://localhost:8501.
| Intent | Description | Example Command |
|---|---|---|
| `create_file` | Create a file or folder | "Create a file called notes.txt" |
| `write_code` | Generate code from description | "Write a Python function for retry logic" |
| `summarize` | Summarize provided text | "Summarize the following meeting notes..." |
| `chat` | General conversation | "What is the capital of France?" |
The agent supports compound commands — multiple intents in a single utterance:
"Create a Python file with a retry function"
→ Detects `write_code` + `create_file` → generates code → saves to `/output/`

"Summarize this text and save it to summary.txt"
→ Detects `summarize` + `create_file` → summarizes → saves output
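To illustrate how several intents might be picked out of one utterance, here is a keyword-based sketch in the spirit of Demo Mode. The keyword lists and the name `detect_intents` are invented for this example; the project's actual classifier (LLM-based or demo) may work quite differently.

```python
import re
from typing import List

# Hypothetical keyword lists. `create_file` comes last so that content is
# generated or summarized before it is saved.
INTENT_KEYWORDS = {
    "write_code": ("write", "function", "code", "script"),
    "summarize": ("summarize", "summary"),
    "create_file": ("create", "save", "file", "folder"),
}


def detect_intents(utterance: str) -> List[str]:
    """Return every intent whose keywords appear in the utterance."""
    # Match whole words so that e.g. 'profile' does not trigger 'file'.
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    matched = [
        intent
        for intent, keywords in INTENT_KEYWORDS.items()
        if words & set(keywords)
    ]
    # Anything that matches no keyword is treated as general conversation.
    return matched or ["chat"]


print(detect_intents("Create a Python file with a retry function"))
# ['write_code', 'create_file']
print(detect_intents("What is the capital of France?"))
# ['chat']
```

Keyword matching of this kind is brittle (a filename such as `summary.txt` counts as a summarize keyword), which is one reason an LLM-based classifier is the primary path.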
All file writes are sandboxed to the /output directory within the project. Path traversal attacks (e.g., ../../etc/passwd) are blocked by utils/helpers.py:safe_path().
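A common way to implement this kind of sandboxing is to resolve the requested path and verify that it still lies under the output directory. The sketch below shows that technique with a hypothetical helper, `resolve_in_sandbox`; it is not the code of `safe_path()` in `utils/helpers.py`, whose signature and behavior may differ.

```python
from pathlib import Path

# Assumed location of the sandbox, relative to the current working directory.
OUTPUT_DIR = Path("output")


def resolve_in_sandbox(user_path: str, base_dir: Path = OUTPUT_DIR) -> Path:
    """Resolve user_path inside base_dir, rejecting anything that escapes it."""
    base = base_dir.resolve()
    # resolve() collapses '..' segments and follows symlinks, so the check
    # below sees the real destination rather than the literal string.
    candidate = (base / user_path).resolve()
    # Path.is_relative_to requires Python 3.9+.
    if not candidate.is_relative_to(base):
        raise ValueError(f"Path escapes the output directory: {user_path!r}")
    return candidate
```

With this approach, `resolve_in_sandbox("notes.txt")` returns a path under `output/`, while `resolve_in_sandbox("../../etc/passwd")` raises `ValueError`. Absolute inputs are rejected too, because joining an absolute path onto `base` discards `base` entirely.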
- Compound Commands: Multiple intents from a single utterance are detected and chained
- Confirmation Step: Optional UI toggle to confirm before file creation
- Session Memory: Full session history tracked in `st.session_state`, visible in the sidebar
- Graceful Error Handling: Invalid audio, failed transcription, and unknown intents all produce user-friendly messages
| Component | Model / Tool | Why |
|---|---|---|
| STT | Whisper (local, base) | Free, runs on CPU, good accuracy for English |
| STT | Groq Whisper API | Cloud fallback — fast, generous free tier |
| LLM | Ollama (llama3) | Free, local, no data leaves machine |
| LLM | Groq / OpenAI | Cloud fallback if Ollama is unavailable |
| LLM | Demo Mode (keyword-based) | Zero-dependency fallback — works offline with no API keys |
| UI | Streamlit | Rapid prototyping, built-in audio/file widgets, session state |
- Local Whisper requires `ffmpeg` installed (`pip install openai-whisper` handles most cases, but `ffmpeg` must be on PATH)
- Ollama must be running as a service for local LLM inference
- Microphone recording requires browser permissions and HTTPS in production (works on localhost)
- Large audio files may be slow to transcribe on CPU-only machines (consider using Groq API fallback)
- Intent classification accuracy depends on the LLM used; larger models (e.g. `llama3:70b` in Ollama) are generally more accurate
- Demo mode uses keyword matching and template code — functional but less flexible than real LLM-powered responses
voice_ai_agent/
├── app.py # Streamlit UI
├── main.py # Pipeline orchestrator
├── requirements.txt # Python dependencies
├── README.md # This file
├── .env # (Optional) API keys
├── output/ # Sandboxed output directory
├── llm/
│ ├── __init__.py
│ └── intent_classifier.py # Intent classification + text gen
├── stt/
│ ├── __init__.py
│ └── whisper_stt.py # Speech-to-text engine
├── tools/
│ ├── __init__.py
│ ├── file_ops.py # Safe file operations
│ ├── code_generator.py # Code generation
│ └── summarizer.py # Text summarization
└── utils/
    ├── __init__.py
    └── helpers.py # Shared utilities
MIT License — see LICENSE for details.