This repository explores various chunking strategies for improving the efficiency and effectiveness of Retrieval-Augmented Generation (RAG) pipelines. Chunking determines how source documents are segmented before being embedded and retrieved, which can significantly affect retrieval quality and latency.
Chunk size and chunk boundaries trade off context preservation, retrieval precision, and inference cost. This project compares common and advanced chunking methods under a controlled evaluation framework.
chunking_mehtods.py: Contains implementations of chunking strategies such as:
- Fixed-size chunking
- Recursive chunking
- Sliding window chunking
- Topic-based chunking
- Semantic chunking
- Hybrid chunking
utils.py: Utility functions shared across modules.
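As a rough illustration of the two simplest strategies listed above, the sketch below splits text on whitespace into fixed-size chunks and into overlapping sliding-window chunks. The function names `fixed_size_chunks` and `sliding_window_chunks`, and the choice to count words rather than characters or tokens, are assumptions made for this example; they are not necessarily how the repository implements these strategies.

```python
def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = "a b c d e f g"
    print(fixed_size_chunks(sample, chunk_size=3))
    print(sliding_window_chunks(sample, chunk_size=4, overlap=2))
```

The overlap repeats words across neighbouring chunks, which is the source of both the higher recall and the added redundancy of the sliding-window approach.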
| Chunking Method | Strategy | Pros | Cons |
|---|---|---|---|
| Fixed-size | Uniform length split | Simple, fast | Can break semantic units |
| Recursive | Uses hierarchical splitting rules | Maintains structure | Slower, heuristic-based |
| Sliding window | Overlapping segments | High recall | Increases redundancy |
| Topic-based | Clusters sentences by semantic similarity | Groups text by meaningful topics | Requires embedding + clustering; variable chunk sizes |
| Semantic | Embedding-based or topic-aware | Semantic coherence | More complex to implement |
| Hybrid | Text-structure + semantic similarity | Balanced, readable and coherent | More complex logic and slower |
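To make the embedding-based rows of the table more concrete, here is a minimal sketch of one possible semantic chunking scheme: start a new chunk whenever the cosine similarity between adjacent sentence embeddings drops below a threshold. The names `semantic_chunks` and `toy_embed`, the `threshold` default, and the breakpoint rule itself are assumptions for illustration, not the repository's implementation. The `embed` argument is any callable that maps a sentence to a vector; in a real pipeline it could wrap a sentence-transformers model, while the toy keyword-count embedding below only keeps the example self-contained.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: Sequence[str],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.5,
) -> list[str]:
    """Group consecutive sentences, splitting where adjacent similarity is low."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, curr, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_similarity(prev, curr) < threshold:
            chunks.append([sentence])    # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend current chunk
    return [" ".join(chunk) for chunk in chunks]


def toy_embed(sentence: str) -> list[float]:
    """Hypothetical stand-in for a real model: counts two keyword stems."""
    words = sentence.lower().split()
    return [
        float(sum(w.startswith("cat") for w in words)),
        float(sum(w.startswith("stock") for w in words)),
    ]


if __name__ == "__main__":
    sentences = [
        "Cats sleep a lot.",
        "A cat purrs when happy.",
        "Stocks fell today.",
        "Stock markets recovered later.",
    ]
    for chunk in semantic_chunks(sentences, toy_embed):
        print(chunk)
```

Because chunk boundaries depend on the similarity scores rather than on a length limit, chunk sizes vary with the text, which matches the "variable chunk sizes" drawback noted in the table.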
- spacy
- nltk
- sentence-transformers
- numpy
- scikit-learn