
Multimodal PDF Agentic RAG Application

A full-stack multimodal Retrieval-Augmented Generation (RAG) prototype that lets users upload PDF documents and query them with text and images.
The system combines CLIP embeddings, a Pinecone vector database, and an agentic orchestration layer to reason over document content.

This project is built as a working prototype with a minimal frontend, focusing primarily on backend architecture, multimodal retrieval, and agent-based reasoning.


🚀 Key Features

  • PDF ingestion (text + images)
  • CLIP-based multimodal embeddings
  • Pinecone vector database for similarity search
  • Agentic RAG pipeline using LangGraph
  • Multimodal LLM reasoning (text + one image per query)
  • Namespace-based isolation per document
  • Lightweight Streamlit frontend (prototype)

🧠 Tech Stack

Backend

  • Python
  • FastAPI – API layer
  • LangGraph – Agent orchestration
  • LangChain – LLM integration
  • Hugging Face Transformers
    • CLIP (text + image embeddings)
    • Multimodal LLM (Gemma / similar)
  • Pinecone – Vector database
  • PyMuPDF (fitz) – PDF parsing
  • Pillow (PIL) – Image processing
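
As a sketch of how CLIP places text and images in one embedding space via Hugging Face Transformers (the checkpoint name, file name, and helper below are assumptions, not taken from the repository):

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    # Heavy: downloads model weights on first run.
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Text embedding
    inputs = processor(text=["a diagram of a pipeline"],
                       return_tensors="pt", padding=True)
    text_vec = model.get_text_features(**inputs).detach().numpy()[0]

    # Image embedding (lands in the same space, so one index serves both)
    image = Image.open("page_1_img_0.png")  # hypothetical extracted image
    inputs = processor(images=image, return_tensors="pt")
    img_vec = model.get_image_features(**inputs).detach().numpy()[0]

    print(cosine_sim(text_vec, img_vec))
```

Because both modalities share the space, a text query vector can retrieve image vectors directly from the same Pinecone index.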

Frontend

  • Streamlit (single-file UI)

Infrastructure

  • Hugging Face Inference API
  • Pinecone managed vector index
  • Environment-based configuration (.env)
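
A typical `.env` for this stack might look like the following (variable names are assumptions; the repository's actual names may differ):

```
# .env — example only, check the repository for the exact variable names
PINECONE_API_KEY=your-pinecone-key
HF_API_KEY=your-huggingface-token
```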

🏗️ High-Level Architecture

        ┌──────────────┐
        │   Frontend   │
        │  (Streamlit) │
        └──────┬───────┘
               │
               ▼
        ┌──────────────┐
        │   FastAPI    │
        │   Backend    │
        └──────┬───────┘
               │
┌──────────────┴──────────────┐
│                             │
▼                             ▼
┌──────────────┐      ┌──────────────┐
│     CLIP     │      │   Pinecone   │
│  Embeddings  │─────▶│ Vector Store │
│ (Text/Image) │      └──────┬───────┘
└──────────────┘             │
                             ▼
                  ┌────────────────────┐
                  │  LangGraph Agent   │
                  │ (Retriever + LLM)  │
                  └─────────┬──────────┘
                            ▼
                  ┌────────────────────┐
                  │  Multimodal LLM    │
                  │ (Text + 1 Image)   │
                  └────────────────────┘


🧩 Design Decisions (Important)

  • Only one image is passed to the LLM per query, respecting multimodal model constraints.
  • Pinecone stores only vectors and lightweight metadata (no large payloads).
  • Images are stored externally and referenced via IDs.
  • The agent explicitly controls retrieval, selection, and prompt construction.
  • Frontend is intentionally minimal to keep focus on backend reasoning.
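
The one-image constraint can be enforced with a small selection step over the retrieval results, keeping all text chunks but only the highest-scoring image reference (a minimal sketch; the names and metadata layout are illustrative):

```python
def select_context(matches, max_images=1):
    """Split retrieval matches into text chunks and at most `max_images` image IDs.

    `matches` are (score, metadata) pairs sorted by descending score, where the
    metadata carries a "type" of "text" or "image" and an external "id" —
    mirroring the design above, where Pinecone holds only vectors plus
    lightweight metadata and images live in external storage.
    """
    texts, image_ids = [], []
    for score, meta in matches:
        if meta["type"] == "text":
            texts.append(meta["chunk"])
        elif len(image_ids) < max_images:
            image_ids.append(meta["id"])  # resolve later to the stored image
    return texts, image_ids
```

The agent can then build the prompt from `texts` and attach the single resolved image, satisfying the multimodal model's input limit.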

⚠️ Frontend Note (Prototype)

The frontend is intentionally minimal and unstyled.

  • Purpose: demonstrate end-to-end functionality
  • Focus of this project is backend architecture, not UI/UX
  • Designed as a prototype, not a production UI

🛠️ How to Run on Another Device

1️⃣ Clone the Repository

git clone https://github.com/rikin-2911/Multimodal-PDF-Agentic-RAG-Application.git
cd Multimodal-PDF-Agentic-RAG-Application

About

A full-stack multimodal RAG application for PDF question answering and summarization, with visual-element understanding and presentation capabilities.
