Ridanshi/Mem0AI

🎙️ Voice-Controlled Local AI Agent

A modular, voice-controlled AI agent that accepts audio input, converts speech to text, classifies intent using an LLM, executes local tools, and displays every stage of the pipeline in a Streamlit UI.


📐 Architecture

┌──────────────┐     ┌──────────────┐     ┌───────────────────┐     ┌──────────────┐
│  Audio Input │────▶│   Whisper    │────▶│ Intent Classifier │────▶│    Tools     │
│  (upload /   │     │    (STT)     │     │   (LLM / Ollama)  │     │  Execution   │
│   mic / text)│     └──────────────┘     └───────────────────┘     └──────┬───────┘
└──────────────┘                                                           │
                                                                           ▼
                                                                  ┌──────────────┐
                                                                  │ Streamlit UI │
                                                                  │  (Results)   │
                                                                  └──────────────┘

Component Breakdown

| Module | Purpose |
| --- | --- |
| app.py | Streamlit UI: upload, record, type, view results |
| main.py | Pipeline orchestrator: ties all stages together |
| stt/whisper_stt.py | Speech-to-text with local Whisper / Groq / OpenAI fallback |
| llm/intent_classifier.py | Intent classification + text generation via Ollama / APIs |
| tools/file_ops.py | Safe file & folder creation inside /output |
| tools/code_generator.py | Code generation via LLM + auto-save |
| tools/summarizer.py | Text summarization via LLM |
| utils/helpers.py | Logging, path safety, audio temp files, env config |
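
The way an orchestrator such as main.py might tie the stages together can be sketched as follows. This is a minimal illustration, not the project's actual code: `PipelineResult`, `run_pipeline`, and the `classify` / `tools` parameters are hypothetical names, and the speech-to-text stage is assumed to have already produced the text.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineResult:
    """Everything the UI would need to display for one request (hypothetical shape)."""
    transcript: str
    intents: list[str]
    outputs: list[str] = field(default_factory=list)


def run_pipeline(
    text: str,
    classify: Callable[[str], list[str]],
    tools: dict[str, Callable[[str], str]],
) -> PipelineResult:
    """Classify the transcribed text, then run one tool per detected intent."""
    intents = classify(text)
    result = PipelineResult(transcript=text, intents=intents)
    for intent in intents:
        tool = tools.get(intent)
        if tool is None:
            # Unknown intents produce a readable message instead of an exception.
            result.outputs.append(f"No tool registered for intent '{intent}'")
            continue
        result.outputs.append(tool(text))
    return result


if __name__ == "__main__":
    demo = run_pipeline(
        "hello",
        classify=lambda t: ["chat"],
        tools={"chat": lambda t: f"echo: {t}"},
    )
    print(demo.outputs)
```

Passing the classifier and tools in as plain callables keeps each stage swappable, which is one way to get the modularity described above.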

🚀 Setup Instructions

Prerequisites

  • Python 3.10+
  • Ollama installed and running locally (recommended) — Install Ollama
    • Pull a model: ollama pull llama3
  • OR: Set GROQ_API_KEY or OPENAI_API_KEY in a .env file for cloud fallback
  • OR: Run with no setup at all — the system includes a Demo Mode that uses keyword-based intent classification and template code generation when no LLM backend is available

Hardware Note: If your machine cannot run Ollama or Whisper locally, the system gracefully falls back to cloud APIs (Groq/OpenAI) or demo mode. No external service is strictly required.
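
The fallback order described above could be expressed roughly like this. It is a sketch under assumptions: the function name `choose_llm_backend` is hypothetical, the caller is assumed to have already checked whether Ollama is reachable, and trying Groq before OpenAI is a guess rather than something this README specifies.

```python
import os
from typing import Mapping


def choose_llm_backend(
    ollama_available: bool,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the first usable backend: local Ollama, then cloud keys, then demo mode."""
    if ollama_available:
        return "ollama"
    # Assumed priority: Groq before OpenAI. Empty values count as "not set".
    if env.get("GROQ_API_KEY"):
        return "groq"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    # Nothing configured: fall back to keyword-based demo mode.
    return "demo"


if __name__ == "__main__":
    print(choose_llm_backend(ollama_available=False, env={}))
```

Taking the environment as a parameter makes the selection logic easy to test without touching real API keys.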

Installation

# 1. Clone the repository
git clone <your-repo-url>
cd voice_ai_agent

# 2. Create a virtual environment (recommended)
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux

# 3. Install dependencies
pip install -r requirements.txt

# 4. (Optional) Create a .env file for API keys
echo GROQ_API_KEY=your_key_here > .env
echo OPENAI_API_KEY=your_key_here >> .env

# 5. Make sure Ollama is running (if using local LLM)
ollama serve

Running the App

streamlit run app.py

The app will open at http://localhost:8501.


🎯 Supported Intents

| Intent | Description | Example Command |
| --- | --- | --- |
| create_file | Create a file or folder | "Create a file called notes.txt" |
| write_code | Generate code from description | "Write a Python function for retry logic" |
| summarize | Summarize provided text | "Summarize the following meeting notes..." |
| chat | General conversation | "What is the capital of France?" |

Compound Commands

The agent supports compound commands — multiple intents in a single utterance:

"Create a Python file with a retry function"

→ Detects write_code + create_file → generates code → saves to /output/

"Summarize this text and save it to summary.txt"

→ Detects summarize + create_file → summarizes → saves output
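
To illustrate how several intents can be detected in one utterance, here is a minimal keyword-based sketch in the spirit of Demo Mode. The keyword table and the `classify_keywords` name are hypothetical; the project's real rules (and its LLM-based classifier) may differ.

```python
import re

# Hypothetical keyword table. Content-producing intents come first so that
# create_file runs last and can save what the earlier steps produced.
KEYWORDS: dict[str, tuple[str, ...]] = {
    "write_code": ("function", "code", "script", "program"),
    "summarize": ("summarize", "summary"),
    "create_file": ("file", "folder", "save", "directory"),
}


def classify_keywords(utterance: str) -> list[str]:
    """Return every intent whose keywords appear in the utterance, else ['chat']."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    intents = [
        intent
        for intent, keywords in KEYWORDS.items()
        if words.intersection(keywords)
    ]
    return intents or ["chat"]


if __name__ == "__main__":
    for command in (
        "Create a Python file with a retry function",
        "Summarize this text and save it to summary.txt",
        "What is the capital of France?",
    ):
        print(command, "->", classify_keywords(command))
```

Matching on whole words rather than substrings avoids false hits such as "profile" triggering create_file.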


🛡️ Safety

All file writes are sandboxed to the /output directory within the project. Path traversal attacks (e.g., ../../etc/passwd) are blocked by utils/helpers.py:safe_path().
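
A common way to implement this kind of sandboxing is to resolve the requested path and verify that it still lies under the output directory. The sketch below shows that general technique; it is not the actual body of safe_path() in utils/helpers.py, and the `base` parameter and `OUTPUT_DIR` constant are assumptions.

```python
from pathlib import Path

# Assumed location of the sandbox, relative to the working directory.
OUTPUT_DIR = Path("output")


def safe_path(user_path: str, base: Path = OUTPUT_DIR) -> Path:
    """Resolve user_path inside base, rejecting anything that escapes it."""
    base = base.resolve()
    # resolve() collapses ".." segments; an absolute user_path replaces base entirely.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"Path escapes the output directory: {user_path}")
    return candidate


if __name__ == "__main__":
    print(safe_path("notes.txt"))
    try:
        safe_path("../../etc/passwd")
    except ValueError as exc:
        print("blocked:", exc)
```

Checking the resolved path, rather than scanning the input string for "..", also covers absolute paths and symlinks that already exist on disk.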


✨ Bonus Features Implemented

  1. Compound Commands — Multiple intents from a single utterance are detected and chained
  2. Confirmation Step — Optional UI toggle to confirm before file creation
  3. Session Memory — Full session history tracked in st.session_state, visible in the sidebar
  4. Graceful Error Handling — Invalid audio, failed transcription, unknown intents all produce user-friendly messages
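
Session history of the kind described in item 3 can be kept as a list under a single key. The sketch below uses a plain dict as a stand-in for st.session_state, which supports the same dict-style access; the `record_interaction` helper and the entry fields are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, MutableMapping


def record_interaction(
    state: MutableMapping[str, Any],
    transcript: str,
    intents: list[str],
    result: str,
) -> None:
    """Append one pipeline run to the history list, creating it on first use."""
    if "history" not in state:
        state["history"] = []
    state["history"].append(
        {
            "time": datetime.now(timezone.utc).isoformat(),
            "transcript": transcript,
            "intents": intents,
            "result": result,
        }
    )


if __name__ == "__main__":
    state: dict[str, Any] = {}  # stand-in for st.session_state
    record_interaction(state, "Create a file called notes.txt", ["create_file"], "created")
    print(len(state["history"]))
```

In a Streamlit app the same function could be called with st.session_state, and the sidebar could iterate over the stored list to render past runs.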

🔧 Model Choices & Rationale

| Component | Model / Tool | Why |
| --- | --- | --- |
| STT | Whisper (local, base) | Free, runs on CPU, good accuracy for English |
| STT | Groq Whisper API | Cloud fallback: fast, generous free tier |
| LLM | Ollama (llama3) | Free, local, no data leaves machine |
| LLM | Groq / OpenAI | Cloud fallback if Ollama is unavailable |
| LLM | Demo Mode (keyword-based) | Zero-dependency fallback: works offline with no API keys |
| UI | Streamlit | Rapid prototyping, built-in audio/file widgets, session state |

⚠️ Limitations

  • Local Whisper requires ffmpeg. pip install openai-whisper installs the Python package but not ffmpeg itself, so ffmpeg must be installed separately and be on PATH
  • Ollama must be running as a service for local LLM inference
  • Microphone recording requires browser permissions and HTTPS in production (works on localhost)
  • Large audio files may be slow to transcribe on CPU-only machines (consider using Groq API fallback)
  • Intent classification accuracy depends on the LLM used; larger models (e.g., the 70B variant of llama3) are generally more accurate
  • Demo mode uses keyword matching and template code — functional but less flexible than real LLM-powered responses

📁 Project Structure

voice_ai_agent/
├── app.py                    # Streamlit UI
├── main.py                   # Pipeline orchestrator
├── requirements.txt          # Python dependencies
├── README.md                 # This file
├── .env                      # (Optional) API keys
├── output/                   # Sandboxed output directory
├── llm/
│   ├── __init__.py
│   └── intent_classifier.py  # Intent classification + text gen
├── stt/
│   ├── __init__.py
│   └── whisper_stt.py        # Speech-to-text engine
├── tools/
│   ├── __init__.py
│   ├── file_ops.py           # Safe file operations
│   ├── code_generator.py     # Code generation
│   └── summarizer.py         # Text summarization
└── utils/
    ├── __init__.py
    └── helpers.py            # Shared utilities

📄 License

MIT License — see LICENSE for details.
