A Python-based document chatbot that uses vector embeddings and retrieval-augmented generation (RAG) to answer questions about your documents. Built with LangChain, HuggingFace embeddings, and the Chroma vector database.
- Document Processing: Load and process Markdown and PDF documents
- Vector Embeddings: Uses HuggingFace sentence-transformers for free, high-quality embeddings
- Semantic Search: Find relevant document chunks using similarity search
- Question Answering: Generate contextual answers based on retrieved document content
- Local Processing: No API costs for embeddings (HuggingFace models run locally)
- Flexible LLM Support: Compatible with Groq API for free and fast generation
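
The retrieve-then-answer flow described in the features above can be sketched with the standard library alone. This is a toy illustration, not the project's code: `embed` uses word counts as a stand-in for a sentence-transformers embedding, and `retrieve` and `build_prompt` are hypothetical helper names.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    scored = [(cosine(q, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(question: str, hits: List[Tuple[float, str]]) -> str:
    """Assemble the retrieved chunks and the question into an LLM prompt."""
    context = "\n---\n".join(chunk for _, chunk in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In the real pipeline the embedding step would be a HuggingFace model and the similarity search would be handled by Chroma; only the overall shape (embed, rank, build a prompt) carries over.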
Document Chatbot/
├── createDatabase.py    # Document processing and vector database creation
├── query.py             # Query interface for asking questions
├── requirements.txt     # Python dependencies
├── .env                 # Environment variables (API keys)
├── .gitignore           # Git ignore rules
├── data/                # Document storage directory
│   └── *.md             # Markdown documents
└── chroma/              # Vector database (auto-generated)
git clone
cd "Document Chatbot"
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install "unstructured[md]"
Create a .env file in the project root:
GROQ_API_KEY=your_groq_api_key_here
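
As a rough sketch of how a script might pick up this key, the snippet below parses a `.env` file with the standard library. The project may well use a package such as `python-dotenv` instead; `load_env_file` and `get_groq_key` are hypothetical names used only for illustration.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines; skip blanks, comments, and malformed lines."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_groq_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GROQ_API_KEY") or load_env_file(path).get("GROQ_API_KEY")
    if not key or key == "your_groq_api_key_here":
        raise RuntimeError("Set GROQ_API_KEY in .env or in the environment")
    return key
```

Failing early when the placeholder value is still in place gives a clearer error than a rejected API request later on.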
Place your documents in the data/ directory:
- Supported formats: Markdown (.md), PDF (.pdf)
- The system will automatically create the directory if it doesn't exist
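
The directory handling described above could look roughly like the following. It is a minimal sketch, assuming a hypothetical `find_documents` helper and a recursive scan; the actual loader in `createDatabase.py` may differ.

```python
from pathlib import Path

# File extensions the README lists as supported.
SUPPORTED_SUFFIXES = {".md", ".pdf"}


def find_documents(data_dir="data"):
    """Create data_dir if it is missing, then list supported files inside it."""
    root = Path(data_dir)
    root.mkdir(parents=True, exist_ok=True)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Matching on the lowercased suffix means files such as `REPORT.PDF` are picked up as well.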
Process your documents and create the vector database:
python createDatabase.py
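
Before embedding, documents are typically split into chunks. The README does not state how `createDatabase.py` does this, so the following is only a sketch of one common approach, fixed-size character windows with overlap; `split_text` and its default sizes are assumptions.

```python
def split_text(text, chunk_size=300, overlap=50):
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval.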
Ask questions about your documents:
python query.py "What is the main character's name?"
python query.py "How does Alice meet the Mad Hatter?" --threshold 0.4
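
A command line like the one above could be parsed as follows. This is a sketch only: the README does not say what `--threshold` controls or what its default is, so the code assumes it is a minimum relevance score with a default of 0.5, and `parse_args` and `filter_hits` are hypothetical names.

```python
import argparse


def parse_args(argv=None):
    """Parse a positional question and an optional --threshold flag."""
    parser = argparse.ArgumentParser(
        description="Ask a question about the indexed documents."
    )
    parser.add_argument("question", help="question to ask, in quotes")
    parser.add_argument(
        "--threshold",
        type=float,
        default=0.5,  # assumed default; the README does not give one
        help="minimum relevance score a chunk needs to be used (assumed meaning)",
    )
    return parser.parse_args(argv)


def filter_hits(hits, threshold):
    """Keep only (score, chunk) pairs whose score meets the threshold."""
    return [(score, chunk) for score, chunk in hits if score >= threshold]
```

Under this reading, a lower threshold admits more loosely related chunks, while a higher one may leave nothing to answer from.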