
InsightPDF - RAG-Powered AI Chatbot

Project Overview

InsightPDF is a PDF-based Retrieval-Augmented Generation (RAG) chatbot that lets you interact with PDF documents intelligently. Instead of manually searching through pages, you simply ask questions; the chatbot extracts the relevant information from your PDFs and provides accurate, context-aware answers instantly.

The chatbot is built with Streamlit, LangChain, HuggingFace embeddings, FAISS, and the Groq LLM.

Live Demo: https://insightpdf---rag-powered-ai-chatbot.streamlit.app/

Screenshots

(Screenshots and a live demo video are available in the repository's outputs/ folder.)

Table of Contents

  1. Project Overview
  2. Screenshots
  3. Key Features
  4. Installation
  5. Usage
  6. Configuration
  7. Tech Stack
  8. Folder Structure
  9. Contributions
  10. License
  11. About

Key Features

  • Context-Aware PDF Intelligence: Engineered for deep, targeted querying, allowing users to extract precise insights instantly without manual skimming or document searching.

  • Intelligent Text Processing: Utilizes automated extraction and recursive chunking logic to preserve document hierarchy and maximize retrieval accuracy.

  • High-Speed Retrieval with FAISS: Implements local vector storage for optimized similarity searches, ensuring near-instant access to relevant document segments.

  • Ultra-Fast Inference via Groq: Delivers real-time, grounded responses by leveraging Groq’s Tensor Streaming Processor (TSP) architecture for industry-leading, deterministic low latency.

  • Semantic Precision with HuggingFace: Employs state-of-the-art embedding models to generate high-fidelity semantic representations, ensuring superior search relevance.
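The recursive chunking mentioned above can be sketched in plain Python. This is a simplified, dependency-free illustration of the strategy behind LangChain's RecursiveCharacterTextSplitter, not the project's actual code; the function name, chunk size, and separator list are illustrative:

```python
def recursive_chunk(text, max_len=200, seps=("\n\n", "\n", " ")):
    """Split text on the coarsest separator first, recursing to finer
    separators only for pieces that are still too large. This preserves
    paragraph and line boundaries where possible."""
    if len(text) <= max_len or not seps:
        return [text]
    chunks, current = [], ""
    for piece in text.split(seps[0]):
        candidate = (current + seps[0] + piece) if current else piece
        if len(candidate) <= max_len:
            current = candidate  # piece still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            if len(piece) > max_len:
                # Piece is too big on its own: recurse with finer separators
                chunks.extend(recursive_chunk(piece, max_len, seps[1:]))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Keeping chunks aligned to natural boundaries like paragraphs is what "preserving document hierarchy" means in practice: a chunk rarely cuts a sentence in half, so retrieved context stays coherent.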

Installation

Clone the Repository

git clone https://github.com/SruthiPuli/InsightPDF---RAG-Powered-AI-Chatbot.git
cd InsightPDF---RAG-Powered-AI-Chatbot

Optional: Create a Python Virtual Environment

To avoid package dependency issues, it is recommended to create a virtual environment before installing the required libraries. You can skip this step if you prefer installing packages globally.

# Create a virtual environment named 'my_venv'
python -m venv my_venv
# Activate the virtual environment
# On Windows
my_venv\Scripts\activate
# On macOS/Linux
source my_venv/bin/activate

Install Project Dependencies

Install all required dependencies using the following command:

# python packages
pip install -r requirements.txt

Usage

  • Once the setup is complete, start the Streamlit app by running:

# Run the chatbot
streamlit run app.py
# Open the URL shown in the terminal (usually http://localhost:8501)

  • Upload any PDF file using the file uploader in the Streamlit interface.

  • Once the document is processed and indexed, start asking questions through the chat input.

  • The chatbot retrieves relevant context from the PDF and answers your queries in real time.
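The "retrieves relevant context and answers" step boils down to stuffing the retrieved chunks into the LLM prompt. A minimal sketch of that grounding step (the function name and prompt wording here are illustrative, not this project's exact prompt):

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine the retrieved PDF chunks into a grounded prompt so the
    LLM answers from the document rather than from its own knowledge."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Restricting the model to the supplied context is what keeps the answers "grounded" in the uploaded PDF.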

Configuration

Configure Environment Variables

Ensure your .env file is set up with your Groq API key:

# To access Groq LLM
GROQ_API_KEY="your_actual_api_key_here"

Then, in your Python script, load it like this:

# Python
import os
from dotenv import load_dotenv

load_dotenv()
api_key = os.getenv("GROQ_API_KEY")

Streamlit Deployment (Important)

If you deploy this application on Streamlit Cloud, do not use the .env file.

  1. Go to your app dashboard on Streamlit Cloud.
  2. Open Settings → Secrets.
  3. Add your API key in the following format:
# Secrets
GROQ_API_KEY = "your_actual_api_key_here"
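Locally the key comes from .env; on Streamlit Cloud it comes from st.secrets. One way to support both in the same script is to try secrets first and fall back to the environment. This is a sketch under that assumption, not the app's actual code:

```python
import os


def get_groq_api_key():
    """Return the Groq API key from Streamlit secrets when deployed,
    otherwise from the environment (.env loaded via python-dotenv)."""
    try:
        import streamlit as st  # available only in a Streamlit context
        if "GROQ_API_KEY" in st.secrets:
            return st.secrets["GROQ_API_KEY"]
    except Exception:
        pass  # no Streamlit installed, or no secrets configured
    return os.getenv("GROQ_API_KEY")
```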

Tech Stack

  • Python – Core programming language used for application logic.
  • Streamlit – Interactive web framework for building the user interface.
  • LangChain – Provides end-to-end components for building Retrieval-Augmented Generation (RAG) pipelines, including document loading, text splitting, embedding integration, vector store management, and seamless LLM orchestration.
  • Groq LLM – Used for real-time response generation, leveraging Groq’s Tensor Streaming Processor (TSP) architecture to deliver ultra-fast, deterministic, low-latency inference for context-aware answers. The free tier allows 14,000+ requests per day.
  • HuggingFace Embeddings – Responsible for converting document text and user queries into semantic vector representations, enabling accurate similarity-based retrieval.
  • FAISS – High-performance vector database for efficient and fast similarity search over embedded document chunks.
  • pypdf (PdfReader) – Extracts and processes text from PDF documents.
  • Sentence-Transformers – Provides pre-trained embedding models; this project uses sentence-transformers/all-MiniLM-L6-v2 for lightweight, high-quality embeddings that balance speed and semantic accuracy.
  • Python-dotenv – Manages environment variables securely during local development.
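Conceptually, the FAISS step scores every stored chunk vector against the query vector and returns the top-k matches. The dependency-free, brute-force sketch below shows that idea only; FAISS itself uses optimized index structures, and the names here are illustrative:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query."""
    scored = sorted(
        zip(chunks, (cosine(query_vec, v) for v in chunk_vecs)),
        key=lambda t: t[1],
        reverse=True,
    )
    return [c for c, _ in scored[:k]]
```

In the real pipeline, the vectors come from the sentence-transformers/all-MiniLM-L6-v2 embedding model and the search is delegated to a FAISS index for speed.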

Folder Structure

pdf-rag-chatbot/
├─ outputs/               # Images, Live Demo Video
├─ .gitattributes         # Tells Git how to handle files
├─ app.py                 # Main Streamlit app
├─ requirements.txt       # Python dependencies
├─ sample_pdf             # Sample PDF to upload
├─ LICENSE                # MIT License
└─ README.md              # README File              
               

Contributions

Contributions are welcome! If you’d like to improve this project, feel free to fork the repository, create a new branch, and submit a pull request. Bug reports, feature requests, and documentation improvements are all appreciated.

License

This project is licensed under the MIT License. If you fork or use this project, please give credit by mentioning or pinging me: Sruthi Pulipati (GitHub: SruthiPuli).

About

This project is solely developed by Sruthi Pulipati (GitHub: SruthiPuli).
