🎙️ Voice-Controlled Local AI Agent

A modular, voice-controlled AI agent that accepts audio input, converts speech to text, classifies intent using an LLM, executes local tools, and displays each stage of the pipeline in a Streamlit UI.

📐 Architecture

┌──────────────┐     ┌──────────────┐     ┌───────────────────┐     ┌──────────────┐
│  Audio Input │────▶│   Whisper    │────▶│ Intent Classifier │────▶│    Tools     │
│  (upload /   │     │    (STT)     │     │   (LLM / Ollama)  │     │  Execution   │
│   mic / text)│     └──────────────┘     └───────────────────┘     └──────┬───────┘
└──────────────┘                                                           │
                                                                           ▼
                                                                  ┌──────────────┐
                                                                  │ Streamlit UI │
                                                                  │  (Results)   │
                                                                  └──────────────┘

Component Breakdown

| Module | Purpose |
| --- | --- |
| `app.py` | Streamlit UI: upload, record, type, view results |
| `main.py` | Pipeline orchestrator: ties all stages together |
| `stt/whisper_stt.py` | Speech-to-text with local Whisper / Groq / OpenAI fallback |
| `llm/intent_classifier.py` | Intent classification + text generation via Ollama / APIs |
| `tools/file_ops.py` | Safe file & folder creation inside `/output` |
| `tools/code_generator.py` | Code generation via LLM + auto-save |
| `tools/summarizer.py` | Text summarization via LLM |
| `utils/helpers.py` | Logging, path safety, audio temp files, env config |
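The module split above suggests a linear pipeline. Below is a minimal sketch of how an orchestrator such as `main.py` might chain the stages, assuming each stage is a plain callable. The names `run_pipeline` and `PipelineResult`, and the stub stages, are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PipelineResult:
    """Everything a UI would need to display one run (hypothetical shape)."""
    transcript: str
    intents: List[str]
    outputs: Dict[str, str] = field(default_factory=dict)
    error: str = ""


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    classify: Callable[[str], List[str]],
    tools: Dict[str, Callable[[str], str]],
) -> PipelineResult:
    """Run STT -> intent classification -> tool execution.

    Returns early if transcription fails, and stops at the first intent
    that has no registered tool.
    """
    try:
        transcript = transcribe(audio)
    except Exception as exc:
        return PipelineResult("", [], error=f"Transcription failed: {exc}")

    intents = classify(transcript)
    result = PipelineResult(transcript, intents)
    for intent in intents:
        tool = tools.get(intent)
        if tool is None:
            result.error = f"Unknown intent: {intent}"
            break
        result.outputs[intent] = tool(transcript)
    return result


# Stub stages stand in for Whisper, the LLM classifier, and the real tools.
demo = run_pipeline(
    b"fake-audio",
    transcribe=lambda _audio: "What is the capital of France?",
    classify=lambda _text: ["chat"],
    tools={"chat": lambda text: f"(reply to: {text})"},
)
print(demo.outputs)
```

Passing the stages in as callables keeps the orchestrator independent of which STT or LLM backend is active, which fits the fallback design described in this README.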

🚀 Setup Instructions

Prerequisites

- Python 3.10+
- Ollama installed and running locally (recommended); see the Ollama installation instructions
  - Pull a model: `ollama pull llama3`
- OR: set `GROQ_API_KEY` or `OPENAI_API_KEY` in a `.env` file for cloud fallback
- OR: run with no setup at all. The system includes a Demo Mode that uses keyword-based intent classification and template code generation when no LLM backend is available

Hardware Note: If your machine cannot run Ollama or Whisper locally, the system gracefully falls back to cloud APIs (Groq/OpenAI) or demo mode. No external service is strictly required.
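The fallback order described above can be sketched as a small selection function. This is an illustration under assumptions: the function name `choose_llm_backend` is hypothetical, and the README does not say whether Groq or OpenAI is tried first, so the ordering between those two is a guess.

```python
import os
from typing import Mapping


def choose_llm_backend(env: Mapping[str, str], ollama_available: bool) -> str:
    """Return the first usable backend: local Ollama, then a cloud API, then demo mode."""
    if ollama_available:
        return "ollama"
    if env.get("GROQ_API_KEY"):  # assumption: Groq is preferred over OpenAI
        return "groq"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return "demo"  # keyword-based fallback, needs no keys or services


# With no Ollama and no keys, the agent would end up in demo mode.
print(choose_llm_backend({}, ollama_available=False))
# In the app, the real environment would be passed instead:
backend = choose_llm_backend(os.environ, ollama_available=False)
```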

Installation

# 1. Clone the repository
git clone <your-repo-url>
cd voice_ai_agent

# 2. Create a virtual environment (recommended)
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux

# 3. Install dependencies
pip install -r requirements.txt

# 4. (Optional) Create a .env file for API keys
echo GROQ_API_KEY=your_key_here > .env
echo OPENAI_API_KEY=your_key_here >> .env

# 5. Make sure Ollama is running (if using local LLM)
ollama serve

Running the App

streamlit run app.py

The app will open at http://localhost:8501.

🎯 Supported Intents

| Intent | Description | Example Command |
| --- | --- | --- |
| `create_file` | Create a file or folder | "Create a file called notes.txt" |
| `write_code` | Generate code from description | "Write a Python function for retry logic" |
| `summarize` | Summarize provided text | "Summarize the following meeting notes..." |
| `chat` | General conversation | "What is the capital of France?" |

Compound Commands

The agent supports compound commands — multiple intents in a single utterance:

"Create a Python file with a retry function"

→ Detects write_code + create_file → generates code → saves to /output/

"Summarize this text and save it to summary.txt"

→ Detects summarize + create_file → summarizes → saves output
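To illustrate how the two examples above could resolve to an ordered list of intents, here is a keyword-based sketch in the spirit of Demo Mode. The keyword table and the names `INTENT_KEYWORDS`, `EXECUTION_ORDER`, and `detect_intents` are hypothetical; the project's actual classifier may work differently, especially when an LLM backend is active.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword table; the project's real demo-mode keywords may differ.
INTENT_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "summarize": ("summarize", "summary of"),
    "write_code": ("function", "code", "script", "program"),
    "create_file": ("file", "folder", "save"),
}

# Content-producing intents come before create_file so that their output
# exists by the time it is written to disk.
EXECUTION_ORDER = ["summarize", "write_code", "create_file"]


def detect_intents(utterance: str) -> List[str]:
    """Return every matching intent in execution order, or ["chat"] if none match."""
    text = utterance.lower()
    found = [
        intent
        for intent in EXECUTION_ORDER
        if any(keyword in text for keyword in INTENT_KEYWORDS[intent])
    ]
    return found or ["chat"]


print(detect_intents("Create a Python file with a retry function"))
print(detect_intents("Summarize this text and save it to summary.txt"))
```

Substring matching is crude (for example, "profile" contains "file"), which is one reason an LLM-based classifier is the preferred path.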

🛡️ Safety

All file writes are sandboxed to the /output directory within the project. Path traversal attacks (e.g., ../../etc/passwd) are blocked by utils/helpers.py:safe_path().
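A common way to implement this kind of sandbox check is to resolve the requested path and verify that it still lies under the output directory. The sketch below shows that technique; it is not the project's actual `safe_path()` code, and the signature and the `OUTPUT_DIR` constant are assumptions.

```python
from pathlib import Path

# Assumed location of the sandbox, relative to the working directory.
OUTPUT_DIR = Path("output").resolve()


def safe_path(user_path: str, base: Path = OUTPUT_DIR) -> Path:
    """Resolve user_path under base; raise ValueError if it escapes the sandbox."""
    base = base.resolve()
    # resolve() collapses ".." segments; joining an absolute user_path
    # discards base entirely, so that case is rejected by the check below too.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"Path escapes sandbox: {user_path}")
    return candidate


print(safe_path("notes.txt"))
try:
    safe_path("../../etc/passwd")
except ValueError as exc:
    print(exc)
```

Checking the resolved path, rather than searching the input string for "..", also covers absolute paths and symlinks that point outside the sandbox.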

✨ Bonus Features Implemented

- Compound Commands: multiple intents from a single utterance are detected and chained
- Confirmation Step: optional UI toggle to confirm before file creation
- Session Memory: full session history tracked in `st.session_state`, visible in the sidebar
- Graceful Error Handling: invalid audio, failed transcription, and unknown intents all produce user-friendly messages

🔧 Model Choices & Rationale

| Component | Model / Tool | Why |
| --- | --- | --- |
| STT | Whisper (local, base) | Free, runs on CPU, good accuracy for English |
| STT | Groq Whisper API | Cloud fallback: fast, generous free tier |
| LLM | Ollama (llama3) | Free, local, no data leaves machine |
| LLM | Groq / OpenAI | Cloud fallback if Ollama is unavailable |
| LLM | Demo Mode (keyword-based) | Zero-dependency fallback: works offline with no API keys |
| UI | Streamlit | Rapid prototyping, built-in audio/file widgets, session state |

⚠️ Limitations

- Local Whisper requires ffmpeg on PATH. `pip install openai-whisper` installs the Python package but not the ffmpeg binary, which must be installed separately.
- Ollama must be running as a service for local LLM inference.
- Microphone recording requires browser permissions, and HTTPS in production (plain HTTP works on localhost).
- Large audio files may be slow to transcribe on CPU-only machines; consider the Groq API fallback.
- Intent classification accuracy depends on the LLM used; larger models (e.g., the 70B variant of llama3) are generally more accurate.
- Demo mode uses keyword matching and template code: functional, but less flexible than LLM-powered responses.
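The ffmpeg requirement noted above can be checked up front rather than discovered through a failed transcription. The following is a small sketch using the standard library; the function name `ffmpeg_available` is hypothetical and is not part of the project.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    # Local Whisper is unlikely to work without ffmpeg; a cloud STT
    # fallback would be the alternative in that case.
    print("ffmpeg not found on PATH; local Whisper transcription will likely fail.")
```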

📁 Project Structure

voice_ai_agent/
├── app.py                    # Streamlit UI
├── main.py                   # Pipeline orchestrator
├── requirements.txt          # Python dependencies
├── README.md                 # This file
├── .env                      # (Optional) API keys
├── output/                   # Sandboxed output directory
├── llm/
│   ├── __init__.py
│   └── intent_classifier.py  # Intent classification + text gen
├── stt/
│   ├── __init__.py
│   └── whisper_stt.py        # Speech-to-text engine
├── tools/
│   ├── __init__.py
│   ├── file_ops.py           # Safe file operations
│   ├── code_generator.py     # Code generation
│   └── summarizer.py         # Text summarization
└── utils/
    ├── __init__.py
    └── helpers.py             # Shared utilities

📄 License

MIT License — see LICENSE for details.
