# MultiModal RAG with ColPali

## Overview

This repository demonstrates how to integrate ColPali embeddings for advanced multimodal retrieval-augmented generation (RAG). We index a PDF for querying and combine it with a Llama 3.2 Vision-Language model for answer generation.

## ColPali Model

We incorporate the ColPali embedding model from Hugging Face, specifically `vidore/colpali-v1.2`, which provides robust embeddings for both text and page images. Byaldi's `RAGMultiModalModel` class is leveraged for indexing and retrieval.

## Installation Steps

1. Install Requirements

   ```bash
   pip install byaldi
   sudo apt-get install -y poppler-utils
   pip install huggingface_hub
   pip install -q together
   ```
2. Log in to Hugging Face

   Provide your Hugging Face access token (`HF_TOKEN`) to authenticate:

   ```python
   import os
   from huggingface_hub import login

   login(token=os.environ["HF_TOKEN"])  # or paste your actual token string here
   ```
3. Initialize the Model

   ```python
   from byaldi import RAGMultiModalModel

   model = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2")
   ```

## Index Creation

The PDF file `colpali.pdf` is downloaded and then passed to `model.index`, which builds a retrieval index over its pages. The `index_name` argument is set to `'colpali'`, as sketched below.
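
A minimal sketch of this step (the download URL is an assumption, since the repository does not state where `colpali.pdf` comes from; any local PDF works):

```python
import requests

# Download the ColPali paper as the document to index
# (this arXiv URL is an assumed source, not specified in the repo).
pdf_url = "https://arxiv.org/pdf/2407.01449"
with open("colpali.pdf", "wb") as f:
    f.write(requests.get(pdf_url).content)

# Build the page-level retrieval index with byaldi.
model.index(
    input_path="colpali.pdf",
    index_name="colpali",
    store_collection_with_index=True,  # keep base64 page images for the VLM step
    overwrite=True,
)
```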

## Querying the Model

After generating the index, we run a query such as:

```python
query = "What is ColPali's (late interaction) evaluation baseline score on DocQ and InfoQ?"
results = model.search(query, k=2)
```

The top retrievals, together with the original query, are then passed to the VLM to produce the best possible answer from the `colpali.pdf` content, as sketched below.
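
A sketch of that generation step, assuming the index was built with `store_collection_with_index=True` (so each result exposes a base64-encoded page image) and that the VLM is served by Together AI under the model ID `meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo` (an assumption; substitute whatever vision model endpoint you use):

```python
import os
from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

# Send the text query plus the top retrieved page image to the VLM.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo",  # assumed model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": query},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{results[0].base64}"},
            },
        ],
    }],
)
print(response.choices[0].message.content)
```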

## MultiModal RAG Flow

1. User Query
2. Text + Vision Embedding via ColPali
3. Index → retrieve relevant pages
4. Llama 3.2 VLM processes both the text query and the retrieved PDF content
5. Generated Answer