ProteinScope is an end-to-end Retrieval-Augmented Generation (RAG) system focused on protein and nutrition research. The project combines data ingestion, semantic retrieval, FAISS-based indexing, evaluation workflows, and LLM-powered reasoning into a modular AI pipeline.
The system aggregates information from:
- Reddit discussions
- PubMed research papers
- Nutrition blogs and articles
It then processes, filters, embeds, indexes, and retrieves relevant knowledge for grounded question answering using local LLM inference via Ollama.
- End-to-end RAG architecture using semantic retrieval
- FAISS vector search for efficient similarity matching
- Sentence-transformer embeddings for contextual retrieval
- Retrieval grounding with source-aware responses
- Incremental scraping pipelines
- Append-only update strategy
- Deduplication workflows
- Structured corpus generation
Supported sources:
- Reddit
- PubMed
- Nutrition blogs
Reliability features:
- Out-of-scope query detection
- Fallback validation checks
- Timeout handling
- Response consistency improvements
- Verified-source filtering support
- Ollama-based local inference
- Llama 3.1 integration
- Modular chat backend abstraction
- Citation-aware generation workflow
- Streamlit-powered analytics interface
- Data insights and trending analysis
- Business-oriented nutrition insights
- Interactive chatbot experience
- Export functionality
Data Sources
├── Reddit
├── PubMed
└── Nutrition Blogs
│
▼
Scraping + Incremental Updates
│
▼
Corpus Cleaning & Filtering
│
▼
Chunking Pipeline
│
▼
Embedding Generation
(sentence-transformers/all-MiniLM-L6-v2)
│
▼
FAISS Vector Index
│
▼
Retriever
│
▼
LLM Response Generation (Ollama)
│
▼
Grounded Nutrition Answers
ProteinScope-main/
│
├── nutrition_insights/
│ ├── data/ # Corpus, FAISS index, metadata
│ ├── rag/ # Retrieval and indexing pipeline
│ ├── scrappers/ # Reddit, PubMed, blog scrapers
│ ├── phase3/ # Streamlit application
│ ├── scripts/ # Utility scripts
│ ├── llm_connection.py # Ollama/LLM integration
│ └── merge_scrapper.py # Corpus merging workflow
│
└── fix_combined_json.py
- PyTorch
- Sentence Transformers
- Ollama
- Llama 3.1
- FAISS
- Python
- Async processing workflows
- JSONL corpus pipelines
- Streamlit
- Reddit API workflows
- PubMed
- Blog scraping pipelines
The ingestion pipeline:
- Scrapes content from multiple sources
- Cleans and normalizes text
- Removes duplicates
- Applies filtering logic
- Stores structured corpus entries
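The cleaning, deduplication, and filtering steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function names, the `min_chars` length filter, and the `id`/`source`/`text` record fields are all assumptions made for the example.

```python
import hashlib
import json
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def content_key(text: str) -> str:
    """Hash the lowercased, normalized text so trivially different copies collide."""
    return hashlib.sha256(normalize_text(text).lower().encode("utf-8")).hexdigest()


def build_corpus(records, min_chars=20):
    """Clean, filter, and deduplicate raw records into corpus entries."""
    seen = set()
    corpus = []
    for rec in records:
        text = normalize_text(rec.get("text", ""))
        if len(text) < min_chars:  # hypothetical quality filter: drop very short texts
            continue
        key = content_key(text)
        if key in seen:  # duplicate of an entry already kept
            continue
        seen.add(key)
        corpus.append(
            {"id": key[:12], "source": rec.get("source", "unknown"), "text": text}
        )
    return corpus


def to_jsonl(entries) -> str:
    """Serialize entries as JSON Lines, one object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in entries)


# Placeholder records: the second is a near-copy of the first, the third is too short.
raw = [
    {"source": "reddit", "text": "Whey protein   digests quickly after a workout."},
    {"source": "blog", "text": "whey protein digests quickly after a workout."},
    {"source": "pubmed", "text": "Too short."},
]
corpus = build_corpus(raw)
print(to_jsonl(corpus))
```

Keying duplicates on a hash of the normalized text means the set of seen keys can be persisted between runs, which fits an append-only update strategy: new scrapes only add entries whose key has not been seen before.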
Documents are chunked using overlapping windows for improved retrieval quality.
Current configuration:
- Chunk Size: 1200 characters
- Overlap: 200 characters
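As a rough sketch of what overlapping character windows with these settings look like, consider the function below. The name `chunk_text` and its exact boundary handling are assumptions for illustration; the project's chunker may differ, for example by splitting on sentence boundaries.

```python
def chunk_text(text, chunk_size=1200, overlap=200):
    """Split text into character windows; each window repeats the last
    `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window reached the end of the text
            break
    return chunks


# A synthetic 3000-character document stands in for a real corpus entry.
document = "".join(chr(97 + i % 26) for i in range(3000))
chunks = chunk_text(document)
print([len(c) for c in chunks])
```

With a 1200-character window and a 200-character overlap, each new chunk starts 1000 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.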
ProteinScope uses sentence-transformers/all-MiniLM-L6-v2 for semantic vector generation.
FAISS is used for:
- Efficient nearest-neighbor retrieval
- Semantic similarity search
- Scalable retrieval operations
The system was benchmarked across retrieval quality and generation reliability.
| Metric | Result |
|---|---|
| Out-of-Scope Detection Accuracy | 90.1% |
| Average Latency | 7.42s |
| Response Consistency | 62.6% |
Clone the repository:

    git clone https://github.com/realanu0812/ProteinScope.git
    cd ProteinScope

Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate

Windows:

    venv\Scripts\activate

Install the dependencies:

    pip install -r requirements.txt

If using the Streamlit application:

    pip install -r nutrition_insights/phase3/requirements.txt

Install Ollama: https://ollama.ai

Pull the default model:

    ollama pull llama3.1:8b

Start the Ollama server:

    ollama serve

Run the indexing pipeline:

    python nutrition_insights/rag/build_index.py

Optional arguments:

    --normalize
    --verified-only
    --min-quality

Run the query CLI:

    python nutrition_insights/rag/query_cli.py

Example:

    python nutrition_insights/rag/query_cli.py --query "Best protein sources for muscle gain"

Launch the Streamlit dashboard:

    streamlit run nutrition_insights/phase3/run_app.py

The dashboard includes:
- Interactive chatbot
- Source analytics
- Nutrition insights
- Trending analysis
- Export tools
Located in:
nutrition_insights/scrappers/
Includes:
- Reddit scraper
- Journal scraper
- Blog scraper
nutrition_insights/rag/filter_corpus.py
Responsible for:
- Quality filtering
- Corpus cleaning
- Source normalization
nutrition_insights/rag/build_index.py
Responsible for:
- Chunk generation
- Embedding creation
- FAISS index construction
- Metadata serialization
ProteinScope includes several safeguards for stable retrieval and generation:
- Incremental scraping workflows
- Append-only updates
- Timeout handling
- Retrieval validation
- Verified-source filtering
- Deduplication logic
- Out-of-scope detection
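To illustrate how timeout handling and source-aware prompting might fit together, here is a hedged sketch that calls Ollama's local REST endpoint (`/api/generate` on the default port 11434) with only the standard library. The helper names, the prompt wording, the 60-second default timeout, and the chunk fields are assumptions for the example, not the project's actual `llm_connection.py`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_grounded_prompt(question, chunks):
    """Number each retrieved chunk so the model can cite sources as [1], [2], ..."""
    context = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below and cite sources by number. "
        "If the context does not cover the question, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def ask_ollama(prompt, model="llama3.1:8b", timeout_s=60):
    """Send a non-streaming generate request; return None on timeout or connection error."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return json.loads(response.read())["response"]
    except OSError:  # covers URLError, refused connections, and socket timeouts
        return None


# Placeholder chunks standing in for retriever output.
chunks = [
    {"source": "pubmed", "text": "Example abstract text about protein intake."},
    {"source": "reddit", "text": "Example discussion post about protein timing."},
]
prompt = build_grounded_prompt("How much protein do I need per day?", chunks)
print(prompt)
```

With `ollama serve` running, `ask_ollama(prompt)` returns the model's text; a `None` result lets the caller show a fallback message instead of hanging on a slow or unavailable model.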
Potential enhancements:
- Hybrid retrieval
- Reranking pipelines
- Multi-agent workflows
- Streaming responses
- Distributed vector storage
- Evaluation automation
- Fine-tuned nutrition models
- API deployment layer
Use cases:
- Nutrition research assistance
- Protein intake analysis
- Evidence-grounded dietary Q&A
- Research summarization
- Retrieval system experimentation
- RAG pipeline benchmarking
Anurag Mishra