Smart Medical Research Assistant powered by Retrieval-Augmented Generation (RAG)
This project builds a RAG pipeline for summarization and question-answering over scientific / medical papers (mainly from arXiv).
It integrates web scraping, embeddings, vector databases, and LLMs to deliver concise research summaries, relevant URLs, and detailed answers from PDFs.
medicalSearch/
│
├── Scraping/ # Web scraping modules for arXiv / papers
├── Rag_Summary/ # Summarization pipeline
├── Qa-Bot/ # Question-answering over PDFs
├── Neuro-Med-app/ # Application / API / frontend
├── Murag/ # Experiments / alternative implementations
├── Extracted_Images/ # Assets & diagrams
├── .env # Environment variables
├── .gitignore
└── README.md # Project documentation

- Web Scraping: Collects abstracts and PDFs from arXiv.
- Embedding Models: Supports sentence-transformers, BioBERT, and deepseek-embed.
- Vector Databases:
  - Elasticsearch → Abstracts & metadata storage.
  - ChromaDB → PDF embeddings & retrieval.
- Summarization (RAG): Retrieves abstracts and generates concise summaries with paper URLs.
- Q&A (RAG): Answers detailed questions by retrieving relevant PDF passages.
- LLM Augmentation: Uses LLaMA 3.2 and DeepSeek (medical / fine-tuned) for improved responses.
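
The summarization feature listed above pairs retrieved abstracts with their paper URLs before an LLM is called. The following is a minimal sketch of that augmentation step, assuming retrieval has already produced a list of records. The `Paper` dataclass, the `build_prompt` function, and the prompt wording are hypothetical illustrations, not the repository's actual code, and no model is invoked here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Paper:
    """Hypothetical record for one retrieved paper."""
    title: str
    abstract: str
    url: str


def build_prompt(question: str, papers: List[Paper], max_chars: int = 1500) -> str:
    """Number each retrieved abstract so the model can cite sources as [n]."""
    blocks = []
    for n, paper in enumerate(papers, start=1):
        # Truncate long abstracts so the prompt stays within a rough size budget.
        blocks.append(f"[{n}] {paper.title}\nURL: {paper.url}\n{paper.abstract[:max_chars]}")
    context = "\n\n".join(blocks)
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Keeping the URL next to each numbered source is one simple way to let the final answer point back to the papers it drew on.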
- Python ≥ 3.8
- pip / conda
- A running Elasticsearch instance
- ChromaDB (or another vector DB)
- GPU (recommended for embeddings & LLMs)
- Clone the repository:
    git clone https://github.com/sara-bm/medicalSearch.git
    cd medicalSearch
- Create a virtual environment:
    python -m venv venv
    source venv/bin/activate    # macOS/Linux
    # .\venv\Scripts\activate   # Windows
- Install dependencies:
    pip install -r requirements.txt
- Configure environment variables in a .env file:
    ELASTICSEARCH_URL=http://localhost:9200
    ELASTICSEARCH_USER=your_user
    ELASTICSEARCH_PASSWORD=your_password
    CHROMEDB_PATH=./chromedb
    OPENAI_API_KEY=your_api_key
    MODEL_PATH=./models
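
As a rough illustration of how a script might read these settings, the sketch below parses `KEY=VALUE` lines and falls back to defaults. The helper names `parse_env` and `load_settings` and the chosen defaults are assumptions made for this example; the project's own code may load the file differently (for instance with a dedicated dotenv library).

```python
import os
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values such as URLs stay intact.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(path: str = ".env") -> Dict[str, str]:
    """Hypothetical loader: defaults, then the file, then real environment variables."""
    settings = {
        "ELASTICSEARCH_URL": "http://localhost:9200",
        "CHROMEDB_PATH": "./chromedb",
    }
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            settings.update(parse_env(fh.read()))
    # Variables already exported in the shell take precedence over the file.
    for key in list(settings):
        if key in os.environ:
            settings[key] = os.environ[key]
    return settings
```

Letting exported variables override the file is a common convention that keeps secrets such as `ELASTICSEARCH_PASSWORD` out of version control on shared machines.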
- Scraping Papers: Run the scraper to fetch abstracts & PDFs:
    python Scraping/run_scraper.py
- Summarization: Get a summary + paper URLs for a query:
    python Rag_Summary/summarize.py --query "latest research on mRNA vaccines"
- Q&A: Answer detailed questions from PDFs:
    python Qa-Bot/qa.py --query "What side effects were reported in mRNA vaccine studies?"
- Run the App: Start the application server with uvicorn:
    cd Neuro-Med-app
    uvicorn app:app --reload
- Embeddings: Swap between BioBERT, SciBERT, or Sentence-BERT.
- LLMs: Replace or fine-tune LLaMA / DeepSeek.
- Retrieval: Adjust similarity thresholds & top-k results.
- PDF Splitting: Tune chunk size for document embeddings.
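
The retrieval and PDF-splitting knobs above can be illustrated with a small, dependency-free sketch. It assumes word-based chunks with overlap and cosine similarity over plain lists of floats; the function names `chunk_words` and `top_k` and their default values are hypothetical and stand in for whatever the vector database does internally.

```python
import math
from typing import List, Sequence, Tuple


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of chunk_size words that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]],
          k: int = 3, min_score: float = 0.0) -> List[Tuple[int, float]]:
    """Return up to k (index, score) pairs scoring at least min_score, best first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    kept = [(i, score) for i, score in scored if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:k]
```

Smaller chunks tend to give more precise matches but less context per passage, while a higher `min_score` trades recall for fewer off-topic passages reaching the LLM.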
Key libraries:
- transformers, sentence-transformers, bioBERT
- elasticsearch, chromadb
- uvicorn, fastapi (for API)
- pdfplumber / PyPDF2 (PDF parsing)
- scikit-learn or faiss (similarity search)
    pip install -r requirements.txt

- Scrape new papers from arXiv.
- Index abstracts in Elasticsearch.
- Embed full-text PDFs into ChromaDB.
- User asks: “What are the risks of long-term AI use in radiology?”
- Summarizer returns summary + URLs.
- QA module retrieves PDFs & answers in detail.
- Depends on scraped datasets (limited coverage).
- Risk of hallucinations from LLM.
- PDF parsing may introduce noise.
- Latency when processing large PDFs.
- Fork the project.
- Create your feature branch (git checkout -b feature/new-feature).
- Commit your changes (git commit -m 'Add new feature').
- Push to the branch (git push origin feature/new-feature).
- Open a Pull Request.
- MIT License
If you use this project in research, please cite:
NeuroMed / medicalSearch (2025).
"RAG pipeline for medical literature summarization & Q&A."
GitHub Repository: https://github.com/sara-bm/medicalSearch