
Multimodal PDF Agentic RAG Application

A full-stack multimodal Retrieval-Augmented Generation (RAG) prototype that lets users upload PDF documents and query them with text and images.
The system combines CLIP embeddings, a Pinecone vector database, and an agentic orchestration layer to reason over document content.

This project is built as a working prototype with a minimal frontend, focusing primarily on backend architecture, multimodal retrieval, and agent-based reasoning.


🚀 Key Features

  • PDF ingestion (text + images)
  • CLIP-based multimodal embeddings
  • Pinecone vector database for similarity search
  • Agentic RAG pipeline using LangGraph
  • Multimodal LLM reasoning (text + one image per query)
  • Namespace-based isolation per document
  • Lightweight Streamlit frontend (prototype)

🧠 Tech Stack

Backend

  • Python
  • FastAPI – API layer
  • LangGraph – Agent orchestration
  • LangChain – LLM integration
  • Hugging Face Transformers
    • CLIP (text + image embeddings)
    • Multimodal LLM (Gemma / similar)
  • Pinecone – Vector database
  • PyMuPDF (fitz) – PDF parsing
  • Pillow (PIL) – Image processing
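
As a sketch of how CLIP places text and images in one embedding space via Hugging Face Transformers (the checkpoint name, file name, and helper below are assumptions, not taken from the repository):

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    # Heavy: downloads model weights on first run.
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Text embedding
    inputs = processor(text=["a diagram of a pipeline"],
                       return_tensors="pt", padding=True)
    text_vec = model.get_text_features(**inputs).detach().numpy()[0]

    # Image embedding (lands in the same space, so one index serves both)
    image = Image.open("page_1_img_0.png")  # hypothetical extracted image
    inputs = processor(images=image, return_tensors="pt")
    img_vec = model.get_image_features(**inputs).detach().numpy()[0]

    print(cosine_sim(text_vec, img_vec))
```

Because both modalities share the space, a text query vector can retrieve image vectors directly from the same Pinecone index.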

Frontend

  • Streamlit (single-file UI)

Infrastructure

  • Hugging Face Inference API
  • Pinecone managed vector index
  • Environment-based configuration (.env)
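
A typical `.env` for this stack might look like the following (variable names are assumptions; the repository's actual names may differ):

```
# .env — example only, check the repository for the exact variable names
PINECONE_API_KEY=your-pinecone-key
HF_API_KEY=your-huggingface-token
```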

🏗️ High-Level Architecture

        ┌──────────────┐
        │   Frontend   │
        │  (Streamlit) │
        └──────┬───────┘
               │
               ▼
        ┌──────────────┐
        │   FastAPI    │
        │   Backend    │
        └──────┬───────┘
               │
┌──────────────┴──────────────┐
│                             │
▼                             ▼
┌──────────────┐      ┌──────────────┐
│     CLIP     │      │   Pinecone   │
│  Embeddings  │─────▶│ Vector Store │
│ (Text/Image) │      └──────┬───────┘
└──────────────┘             │
                             ▼
                  ┌────────────────────┐
                  │  LangGraph Agent   │
                  │ (Retriever + LLM)  │
                  └─────────┬──────────┘
                            ▼
                  ┌────────────────────┐
                  │  Multimodal LLM    │
                  │ (Text + 1 Image)   │
                  └────────────────────┘


🧩 Design Decisions (Important)

  • Only one image is passed to the LLM per query, respecting multimodal model constraints.
  • Pinecone stores only vectors and lightweight metadata (no large payloads).
  • Images are stored externally and referenced via IDs.
  • The agent explicitly controls retrieval, selection, and prompt construction.
  • Frontend is intentionally minimal to keep focus on backend reasoning.
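
The one-image constraint can be enforced with a small selection step over the retrieval results, keeping all text chunks but only the highest-scoring image reference (a minimal sketch; the names and metadata layout are illustrative):

```python
def select_context(matches, max_images=1):
    """Split retrieval matches into text chunks and at most `max_images` image IDs.

    `matches` are (score, metadata) pairs sorted by descending score, where the
    metadata carries a "type" of "text" or "image" and an external "id" —
    mirroring the design above, where Pinecone holds only vectors plus
    lightweight metadata and images live in external storage.
    """
    texts, image_ids = [], []
    for score, meta in matches:
        if meta["type"] == "text":
            texts.append(meta["chunk"])
        elif len(image_ids) < max_images:
            image_ids.append(meta["id"])  # resolve later to the stored image
    return texts, image_ids
```

The agent can then build the prompt from `texts` and attach the single resolved image, satisfying the multimodal model's input limit.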

⚠️ Frontend Note (Prototype)

The frontend is intentionally minimal and unstyled.

  • Purpose: demonstrate end-to-end functionality
  • Focus of this project is backend architecture, not UI/UX
  • Designed as a prototype, not a production UI

🛠️ How to Run on Another Device

1️⃣ Clone the Repository

git clone https://github.com/rikin-2911/Multimodal-PDF-Agentic-RAG-Application.git
cd Multimodal-PDF-Agentic-RAG-Application

About

A full-stack multimodal RAG application for PDF question answering and summarization, with visual-element understanding and presentation capabilities.
