🔮 Project: Multi-Vector Retrieval (ColBERT)

**WHY**
Currently, VOD training only supports document-level embeddings: each document is represented by a single vector, which limits the granularity of the contextual information it can capture.

ColBERT introduced a finer-grained late interaction: it encodes each passage into a matrix of token-level embeddings and, at search time, embeds each query into another matrix. Passages are then retrieved by matching them against the query token by token, using scalable vector-similarity operators.
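For reference, the late-interaction relevance score defined in the ColBERT paper can be written as below, where $E_q$ and $E_d$ are the token-level embedding matrices of the query and the document. Each query token is matched to its most similar document token and the maxima are summed:

$$
S_{q,d} = \sum_{i \in [|E_q|]} \max_{j \in [|E_d|]} E_{q_i} \cdot E_{d_j}^{\top}
$$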

The rich interactions enabled by ColBERT have been shown to outperform single-vector models in retrieval quality. However, scaling it efficiently to large corpora is not trivial.

**HOW**
The project will address the aforementioned goals through the following means:

_Utilizing Fine-Grained Contextual Late Interaction:_
- Leverage ColBERT's ability to encode queries and passages into sequences of token-level embeddings.
- Improve vod's on-disk data structures to handle 3-dimensional tensors with variable shapes (e.g., shape `N x ? x H`)
- Implement ColBERT's `MaxSim` operator in the loss layer
- Implement ColBERT's two-stage retrieval
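
A minimal NumPy sketch of the pieces listed above, under stated assumptions: variable-length `N x ? x H` documents are stored as one flat buffer plus an offsets array, `MaxSim` is computed per query-document pair, and the second retrieval stage reranks a candidate list that some first stage (not shown) has already produced. The names `pack_ragged`, `get_doc`, `maxsim` and `rerank` are hypothetical and are not part of vod's API. An actual loss layer would need the same computation in a differentiable framework.

```python
import numpy as np


def pack_ragged(docs):
    """Pack variable-length (n_tokens_i, H) arrays into a flat buffer plus offsets."""
    lengths = np.array([d.shape[0] for d in docs], dtype=np.int64)
    offsets = np.concatenate([[0], np.cumsum(lengths)])
    flat = np.concatenate(docs, axis=0)  # shape: (total_tokens, H)
    return flat, offsets


def get_doc(flat, offsets, i):
    """Return the (n_tokens_i, H) token matrix of document i."""
    return flat[offsets[i]:offsets[i + 1]]


def maxsim(query, doc):
    """Sum over query tokens of the max dot product with any document token."""
    sim = query @ doc.T  # shape: (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())


def rerank(query, flat, offsets, candidate_ids):
    """Second stage: score the given candidates with MaxSim, best first."""
    scored = [(i, maxsim(query, get_doc(flat, offsets, i))) for i in candidate_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = 8
    # Three documents with 5, 2 and 7 tokens respectively.
    docs = [rng.normal(size=(n, hidden)) for n in (5, 2, 7)]
    flat, offsets = pack_ragged(docs)
    query = rng.normal(size=(4, hidden))
    print(rerank(query, flat, offsets, candidate_ids=[0, 1, 2]))
```

The flat-buffer layout avoids padding every document to the longest length, at the cost of an extra offsets lookup per document. A padded tensor with a mask is the usual alternative when batching on GPU.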

_Combine T5 Models with ColBERT:_
- Benchmark ColT5 against ColBERT
- Benchmark the end-to-end search latency in a search engine such as Raffle.

_Implement [XTR: ConteXtualized Token Retriever](https://arxiv.org/abs/2304.01982):_
- Implement XTR loss
- Implement XTR one-stage retrieval

_Refinements:_
- Investigate robust multi-hop reasoning at scale via condensed retrieval (Baleen).
- Explore effective and efficient retrieval via lightweight late interaction (e.g., ColBERTv2, PLAID).

**WHAT**
The anticipated outcomes of this project include:

1. State-of-the-art retrieval for RAG models (T5 + XTR)
2. A scalable solution capable of handling large corpora without compromising efficiency.

**References**
- [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://arxiv.org/abs/2004.12832) (SIGIR'20).
- [Relevance-guided Supervision for OpenQA with ColBERT](https://arxiv.org/abs/2007.00814) (TACL'21).
- [Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval](https://arxiv.org/abs/2101.00436) (NeurIPS'21).
- [ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction](https://arxiv.org/abs/2112.01488) (NAACL'22).
- [PLAID: An Efficient Engine for Late Interaction Retrieval](https://arxiv.org/abs/2205.09707) (CIKM'22).
- [Rethinking the Role of Token Retrieval in Multi-Vector Retrieval](https://arxiv.org/abs/2304.01982)
