WHY
Currently, VOD Training supports only document-level embeddings: each document is represented by a single vector, which limits the granularity of the contextual information that can be captured.
ColBERT introduced a finer-grained interaction by encoding each passage into a matrix of token-level embeddings. At search time, it embeds each query into another matrix and retrieves the passages that contextually match the query using scalable vector-similarity operators.
The rich interactions enabled by ColBERT have been shown to surpass the quality of single-vector representation models. Scaling this approach efficiently to large corpora, however, is not trivial.
HOW
The project will address these limitations through the following means:
Utilizing Fine-Grained Contextual Late Interaction:
- Leverage ColBERT's ability to encode queries and passages into sequences of token-level embeddings.
- Improve vod's on-disk data structures to handle 3-dimensional tensors with variable shapes (e.g., shape N x ? x H)
- Implement ColBERT's MaxSim operator in the loss layer
- Implement ColBERT's two-stage retrieval
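The first items in this list can be illustrated with a minimal NumPy sketch. It assumes that variable-length token embeddings (the N x ? x H case) are stored as one flat array plus an offsets array, and that MaxSim is the sum, over query tokens, of the maximum dot product against the document's tokens. The helper names `pack_ragged`, `get_doc`, and `maxsim` are hypothetical and are not part of vod or of ColBERT's codebase.

```python
import numpy as np


def pack_ragged(docs):
    """Pack a list of (L_i, H) arrays into a flat (sum L_i, H) array plus offsets.

    Hypothetical layout for the N x ? x H case: document i occupies rows
    offsets[i]:offsets[i + 1] of the flat array.
    """
    lengths = np.array([d.shape[0] for d in docs], dtype=np.int64)
    offsets = np.concatenate([[0], np.cumsum(lengths)])
    flat = np.concatenate(docs, axis=0)
    return flat, offsets


def get_doc(flat, offsets, i):
    """Return the (L_i, H) token matrix of document i."""
    return flat[offsets[i]:offsets[i + 1]]


def maxsim(query, doc):
    """Late-interaction score between a (Lq, H) query and a (Ld, H) document.

    Each query token is matched to its most similar document token, and the
    per-token maxima are summed.
    """
    sim = query @ doc.T  # (Lq, Ld) token-to-token similarities
    return float(sim.max(axis=1).sum())


if __name__ == "__main__":
    docs = [
        np.array([[1.0, 0.0], [0.0, 1.0]]),  # document 0: two tokens
        np.array([[1.0, 0.0]]),              # document 1: one token
    ]
    query = np.array([[1.0, 0.0], [0.0, 1.0]])
    flat, offsets = pack_ragged(docs)
    for i in range(len(docs)):
        print(i, maxsim(query, get_doc(flat, offsets, i)))
```

A flat array with offsets avoids padding every document to the longest length, which is one plausible way to keep on-disk storage proportional to the actual number of tokens. Other layouts (for example, padded tensors with a mask) would work with the same `maxsim` function.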
Combine T5 Models with ColBERT:
- Benchmark ColT5 against ColBERT
- Benchmark the end-to-end search latency in a search engine such as Raffle.
Implement XTR: ConteXtualized Token Retriever:
- Implement XTR loss
- Implement XTR one-stage retrieval
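The one-stage retrieval idea can be sketched as follows. This is a simplified illustration based on our reading of XTR, not a reference implementation: each query token retrieves its top-k document tokens, documents are scored using only those retrieved similarities, and a query token that retrieved nothing from a document contributes its lowest retrieved similarity as an imputed value. The function name `xtr_score`, the brute-force similarity search, and the averaging over query tokens are assumptions made for illustration.

```python
import numpy as np


def xtr_score(query, doc_tokens, doc_ids, k):
    """Score candidate documents from token retrieval alone (no re-gathering).

    query:      (Lq, H) query token embeddings
    doc_tokens: (T, H) embeddings of all tokens in the corpus
    doc_ids:    (T,) integer id of the document each token belongs to
    k:          number of document tokens retrieved per query token
    Returns a dict mapping document id to score.
    """
    sim = query @ doc_tokens.T               # (Lq, T); a real system would use an ANN index
    topk = np.argsort(-sim, axis=1)[:, :k]   # (Lq, k) indices of retrieved tokens
    n_q = query.shape[0]
    candidates = np.unique(doc_ids[topk])    # documents that had any token retrieved

    scores = {}
    for d in candidates:
        total = 0.0
        for i in range(n_q):
            retrieved = topk[i]
            retrieved_sim = sim[i, retrieved]
            hits = retrieved_sim[doc_ids[retrieved] == d]
            # Assumption: a missing similarity is imputed with the lowest
            # retrieved score for this query token.
            total += hits.max() if hits.size else retrieved_sim.min()
        scores[int(d)] = float(total / n_q)
    return scores


if __name__ == "__main__":
    query = np.array([[1.0, 0.0], [0.0, 1.0]])
    doc_tokens = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.2]])
    doc_ids = np.array([0, 0, 1, 1])
    print(xtr_score(query, doc_tokens, doc_ids, k=2))
```

Unlike the two-stage ColBERT pipeline, this sketch never loads the full token matrix of a candidate document. Everything needed for scoring comes from the token-retrieval step.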
Refinements:
- Investigate Robust Multi-Hop Reasoning at Scale via Condensed Retrieval.
- Explore effective and efficient retrieval via Lightweight Late Interaction (e.g., PLAID)
WHAT
The anticipated outcomes of this project include:
- State-of-the-art retrieval for RAG models (T5 + XTR)
- A scalable solution capable of handling large corpora without compromising efficiency.