A Retrieval-Augmented Generation (RAG) system that allows you to summarize PDF documents and ask questions about their content. The project combines document ingestion, vector-based retrieval, and large language models to provide concise summaries and accurate, context-aware answers. It is designed as a modular, extensible Python project, suitable for coursework, research, or further development.
The system utilizes:
- LangChain as the orchestration framework,
- OpenAI for embeddings and question answering,
- Chroma as the vector database,
- Transformers for optional local PDF summarization,
- Gradio for a simple web-based user interface.
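For orientation, here is a minimal sketch of how these components typically compose in LangChain. Import paths vary between LangChain versions, and the file names and parameters are illustrative rather than this project's actual code:

```python
# Sketch of the typical LangChain composition of this stack; paths,
# file names, and parameters are illustrative only.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

docs = PyPDFLoader("data/example.pdf").load()          # PDF ingestion
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)                # text chunking
store = Chroma.from_documents(chunks, OpenAIEmbeddings(),
                              persist_directory="db")  # vector database
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(),
                                 retriever=store.as_retriever())
print(qa.run("What is this document about?"))          # grounded answer
```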
- Clone the repository:

  ```bash
  git clone https://github.com/Flat-Earther/PDF-summary-RAG.git
  cd PDF-summary-RAG
  ```
- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On macOS/Linux
  # On Windows use `venv\Scripts\activate`
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
- Create a `.env` file in the root directory and add your OpenAI API key (a sketch of how the key is typically loaded follows these steps):

  ```
  OPENAI_API_KEY=your_openai_api_key
  ```
- Prepare your documents: create a `data/` directory in the project root and place your PDF files inside it.
- Run the application:

  ```bash
  python app.py
  ```
- Open your web browser and go to `http://localhost:7860` to access the Gradio interface.
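For reference, the key in your `.env` file is typically picked up at startup with `python-dotenv`. A minimal sketch, assuming the project does this in `config.py` (an assumption, not confirmed by the repository):

```python
# Minimal sketch of .env loading with python-dotenv (assumed; not
# necessarily how this project's config.py does it).
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY is missing; check your .env file")
```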
```
pdf_rag_project/
├── app.py              # Application entry point (Gradio UI)
├── config.py           # Global configuration
├── ingestion/          # PDF loading, chunking, vectorstore creation
├── summarization/      # PDF summarization logic
├── rag/                # Retrieval and question answering
├── utils/              # Prompts and PDF export utilities
├── data/               # Input PDF documents
├── db/                 # Chroma vector database
└── requirements.txt
```
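`app.py` is what serves the Gradio UI on port 7860. A minimal sketch of that kind of wiring; the `answer` function and its body are placeholders, not the project's actual code:

```python
# Hypothetical sketch of a Gradio entry point; the real app.py wires
# the question box to the RAG chain instead of this stub.
import gradio as gr

def answer(question: str) -> str:
    # Real implementation: retrieve relevant chunks from the Chroma DB
    # and ask the LLM to answer from that context.
    return "stub answer"

demo = gr.Interface(fn=answer, inputs="text", outputs="text",
                    title="PDF Summary RAG")
demo.launch(server_port=7860)  # matches the URL used above
```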
- 📄 PDF ingestion and text chunking
- 🧠 Vector-based document retrieval (RAG)
- ✍️ Automatic document summarization
- ❓ Question answering grounded in document content
- 🖥️ Simple web UI using Gradio
- 🧩 Modular architecture for easy extension
- Only PDF files are supported as input documents.
- Summarization can be done either:
  - locally using Transformers (the default), or
  - via an LLM fallback if configured (this pattern is sketched after these notes).
- Question answering always uses retrieved document context; if the answer is not present, the model is instructed to respond with “I don’t know” (see the example prompt below).
- The default language of responses depends on the document content and prompt configuration, and can be easily modified in `utils/prompts.py`.
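As an illustration of the local-summarization-with-fallback pattern described above, here is a minimal sketch; the model choice and the `llm_summarize` helper are assumptions, not the project's actual code:

```python
# Sketch: local summarization via Transformers, with an LLM fallback.
# The model name and llm_summarize() helper are illustrative only.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize(text: str) -> str:
    try:
        # Real code would split long documents first; Transformers models
        # have a fixed input-length limit.
        result = summarizer(text, max_length=150, min_length=40)
        return result[0]["summary_text"]
    except Exception:
        # Fall back to the configured LLM if local summarization fails.
        return llm_summarize(text)  # hypothetical fallback helper
```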
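And an example of the kind of grounding instruction mentioned above; the exact wording lives in `utils/prompts.py`, so this template is only illustrative:

```python
# Illustrative QA prompt template; not the actual text in utils/prompts.py.
QA_PROMPT = """Answer the question using only the context below.
If the answer is not contained in the context, reply with: "I don't know."

Context:
{context}

Question: {question}
Answer:"""
```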