cplegendre/rag-knowledge-base
# 🧠 Local RAG-Powered Personal Knowledge Base

A privacy-first, fully local RAG (Retrieval-Augmented Generation) system for querying your personal notes, PDFs, and Markdown files, powered by ChromaDB, SentenceTransformers, and Ollama.



## ✨ Features

| Feature | Details |
| --- | --- |
| 🔒 100% local | No cloud APIs; your data never leaves your machine |
| 📄 Multi-format ingestion | PDF, Markdown (`.md`), plain text (`.txt`) |
| 🔍 Hybrid search | BM25 + dense vector search fused with Reciprocal Rank Fusion (RRF) |
| 🔄 Query refinement loop | Iterative LLM-powered query rewriting for better retrieval |
| 🧩 Streamlit UI | Chat interface, one-off search, summarisation, and insights tabs |
| 💻 CLI | Full-featured command-line interface + REPL |
| 📦 Idempotent ingestion | Re-ingesting the same file never creates duplicate chunks |
| ⚙️ Fully configurable | All parameters in `.env` or environment variables |

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        User Interface                           β”‚
β”‚              CLI (cli.py)          Streamlit UI (app.py)        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      RAG Pipeline (src/rag_pipeline.py)         β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚  Ingestion   β”‚   β”‚  Query Rewriter   β”‚   β”‚  LLM Client   β”‚   β”‚
β”‚  β”‚ (ingestion.pyβ”‚   β”‚(query_rewriter.py)β”‚   β”‚   (llm.py)    β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚         β”‚                    β”‚                       β”‚           β”‚
β”‚         β–Ό                    β–Ό                       β”‚           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚           β”‚
β”‚  β”‚  Embeddings  β”‚   β”‚ Hybrid Retriever β”‚β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚
β”‚  β”‚(embeddings.pyβ”‚   β”‚  (retrieval.py)  β”‚                         β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                         β”‚
β”‚         β”‚                    β”‚                                   β”‚
β”‚         β–Ό                    β–Ό                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                            β”‚
β”‚  β”‚       Vector Store              β”‚                            β”‚
β”‚  β”‚     (vector_store.py)           β”‚                            β”‚
β”‚  β”‚         ChromaDB                β”‚                            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

### Data Flow

**Ingestion:**

```
Document (PDF/MD/TXT) → Chunk → Embed (SentenceTransformers) → Store (ChromaDB)
```
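The Chunk step can be sketched as a sliding word window matching the `CHUNK_SIZE`/`CHUNK_OVERLAP` defaults (512/64 words). This is an illustrative stand-in, not the repository's actual `ingestion.py`:

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of `chunk_size` words, each sharing
    `overlap` words with the previous chunk for context continuity."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap          # advance by 448 words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                        # last window already covers the tail
    return chunks
```

With the defaults, a 1,000-word document yields three chunks (words 0–511, 448–959, and 896–999), so every boundary is seen with surrounding context in at least one chunk.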

**Query:**

```
Query → [Refine loop] → Embed → Hybrid Search (BM25 + Vector + RRF) → LLM → Answer
```
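The RRF fusion step fits in a few lines: each document's score is the sum of 1/(k + rank) across the BM25 and dense rankings, with `k` the `RRF_K` smoothing constant. A generic illustration, not the code in `retrieval.py`:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
    rank 1-based. Larger k flattens the gap between top and lower ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25  = ["doc_a", "doc_b", "doc_c"]
dense = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([bm25, dense])  # → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Note that `doc_b` wins despite never ranking first in either list: appearing near the top of both rankings beats a single first place, which is exactly the behaviour RRF is chosen for.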

## 🚀 Quick Start

### 1. Prerequisites

```bash
# Python 3.10+
python --version

# Install Ollama from https://ollama.com
ollama serve
ollama pull llama3     # or mistral, phi3, gemma2, etc.
```

### 2. Install Dependencies

```bash
git clone https://github.com/YOUR_USERNAME/rag-knowledge-base
cd rag-knowledge-base

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate

pip install -r requirements.txt
```

### 3. Configure

```bash
cp .env.example .env
# Edit .env to set your preferred model and paths
```

### 4. Ingest Documents

```bash
# Ingest a directory of notes
python cli.py ingest ./docs_sample

# Or a single file
python cli.py ingest ./my_notes/important.pdf
```

### 5. Ask Questions

```bash
# CLI one-shot
python cli.py query "What are the main components of a RAG system?"

# Streaming response
python cli.py query "Explain hybrid search" --stream

# Streamlit UI
streamlit run app.py
```

## 📖 CLI Usage

```
usage: rag-kb [-h] [--db DB] [--model MODEL] {ingest,query,summarise,insights,stats,repl} ...

Commands:
  ingest      Ingest a file or directory
  query       Ask a question (with optional streaming & refinement)
  summarise   Summarise a topic from the knowledge base
  insights    Generate insights from the knowledge base
  stats       Show knowledge base statistics
  repl        Interactive REPL mode
```

### Examples

```bash
# Ingest
python cli.py ingest ./my_notes --no-recursive

# Query with options
python cli.py query "What is BM25?" --mode bm25 --stream
python cli.py query "Explain RAG" --no-refine

# Summarise
python cli.py summarise "data engineering best practices"

# Insights
python cli.py insights "machine learning"

# Stats
python cli.py stats

# Interactive REPL
python cli.py repl
# In REPL: /mode hybrid|semantic|bm25, /stats, /quit
```

## 🌐 Streamlit UI

```bash
streamlit run app.py
```

Opens at http://localhost:8501 with four tabs:

| Tab | Feature |
| --- | --- |
| 💬 Chat | Streaming Q&A with source attribution |
| 🔍 Search | Raw hybrid/semantic/BM25 search results |
| 📊 Insights | AI-generated insights from your notes |
| 📝 Summarise | Topic summarisation |

## 🔧 Configuration Reference

All settings can be set via `.env` or environment variables:

| Variable | Default | Description |
| --- | --- | --- |
| `CHROMA_DB_PATH` | `./chroma_db` | ChromaDB persistence directory |
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | SentenceTransformer model |
| `EMBEDDING_DEVICE` | `cpu` | `cpu` or `cuda` |
| `OLLAMA_MODEL` | `llama3` | Ollama model name |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `TOP_K_SEMANTIC` | `10` | Dense retrieval candidates |
| `TOP_K_BM25` | `10` | BM25 retrieval candidates |
| `TOP_K_FINAL` | `5` | Final chunks passed to the LLM |
| `RRF_K` | `60` | RRF smoothing constant |
| `CHUNK_SIZE` | `512` | Words per chunk |
| `CHUNK_OVERLAP` | `64` | Overlapping words between adjacent chunks |
| `MAX_REFINEMENT_LOOPS` | `2` | Maximum query refinement iterations |
| `REFINEMENT_SCORE_THRESHOLD` | `0.35` | Minimum relevance score needed to skip refinement |
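How `MAX_REFINEMENT_LOOPS` and `REFINEMENT_SCORE_THRESHOLD` interact can be sketched as a control loop over pluggable `retrieve`/`rewrite` callables. This illustrates the general pattern only; the actual logic lives in `src/rag_pipeline.py` and may differ:

```python
def retrieve_with_refinement(query, retrieve, rewrite,
                             max_loops: int = 2, threshold: float = 0.35):
    """Retry retrieval with an LLM-rewritten query until the best relevance
    score clears `threshold` or the `max_loops` budget is spent.
    `retrieve(query)` returns (chunk, score) pairs; `rewrite(query)` a new query."""
    results = retrieve(query)
    for _ in range(max_loops):
        best_score = max((score for _, score in results), default=0.0)
        if best_score >= threshold:
            break                      # retrieval is good enough: stop refining
        query = rewrite(query)         # LLM proposes a sharper query
        results = retrieve(query)
    return query, results
```

Raising the threshold makes the pipeline refine more aggressively (slower, sometimes better retrieval); setting `max_loops` to 0 disables refinement entirely.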

## 🧪 Running Tests

```bash
pytest tests/ -v
pytest tests/ --cov=src --cov-report=term-missing
```

## 📚 Supported Formats

| Format | Extension | Notes |
| --- | --- | --- |
| Markdown | `.md` | YAML front-matter is stripped |
| Plain text | `.txt` | UTF-8 |
| PDF | `.pdf` | Text extraction via pdfplumber |

πŸ—ΊοΈ Roadmap / Potential Extensions

  • Cross-encoder reranking β€” add a ms-marco-MiniLM reranker after retrieval
  • HyDE β€” Hypothetical Document Embeddings for better open-domain QA
  • Multi-modal β€” embed images from PDFs (CLIP embeddings)
  • Conversation memory β€” multi-turn chat with context window management
  • Document tagging β€” auto-tag chunks with topics for filtered search
  • Evaluation harness β€” RAGAS-based automated quality scoring
  • Web scraping ingestion β€” ingest URLs directly into the knowledge base
  • Export β€” export Q&A sessions as Markdown reports

## 🤝 Contributing

1. Fork the repo
2. Create a feature branch (`git checkout -b feat/cross-encoder-reranker`)
3. Commit your changes (`git commit -m "feat: add cross-encoder reranker"`)
4. Push and open a PR

## 📄 License

MIT; see LICENSE for details.


πŸ™ Acknowledgements
