Exploring and benchmarking chunking methods for Retrieval-Augmented Generation (RAG), including fixed-size, recursive, sliding, semantic, and hybrid chunking strategies.

gazelle93/Various-Chunking-Methods

Overview

This repository explores various chunking strategies for improving the efficiency and effectiveness of Retrieval-Augmented Generation (RAG) pipelines. Chunking determines how source documents are segmented before being embedded and retrieved, which can significantly affect retrieval quality and latency.

Motivation

Chunking plays a critical role in balancing context preservation, retrieval precision, and inference cost. This project compares common and advanced methods under a controlled evaluation framework.

Repository Structure

  • chunking_mehtods.py: Contains implementations of chunking strategies such as:
    • Fixed-size chunking
    • Recursive chunking
    • Sliding-window chunking
    • Topic-based chunking
    • Semantic chunking
    • Hybrid chunking
  • utils.py: Utility functions shared across modules.
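The two simplest strategies in the list can be sketched in a few lines. This is a minimal illustration, not the code in chunking_mehtods.py: the function names `fixed_size_chunks` and `sliding_window_chunks` are hypothetical, and chunk length is measured in characters here, whereas a real pipeline may count tokens or sentences instead.

```python
from typing import List


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def sliding_window_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Split text into chunks of at most `size` characters, where each chunk
    starts `size - overlap` characters after the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij"
    print(fixed_size_chunks(sample, 4))
    print(sliding_window_chunks(sample, 4, 2))
```

With `overlap=0` the sliding-window version reduces to fixed-size chunking. Larger overlaps repeat more text across neighbouring chunks, so the index grows accordingly.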

Methods Compared

| Chunking Method | Strategy | Pros | Cons |
|---|---|---|---|
| Fixed-size | Uniform length split | Simple, fast | Can break semantic units |
| Recursive | Uses hierarchical splitting rules | Maintains structure | Slower, heuristic-based |
| Sliding window | Overlapping segments | High recall | Increases redundancy |
| Topic-based | Clusters sentences by semantic similarity | Groups text by meaningful topics | Requires embedding + clustering; variable chunk sizes |
| Semantic | Embedding-based or topic-aware | Semantic coherence | More complex to implement |
| Hybrid | Text-structure + semantic similarity | Balanced, readable and coherent | More complex logic and slower |
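To make the "hierarchical splitting rules" of the recursive method concrete, here is one possible sketch. It is an assumption about how such a splitter could work, not the repository's implementation: the name `recursive_chunks`, the default separator order (paragraph, line, sentence, word), and the character-based length limit are all hypothetical choices.

```python
from typing import List, Sequence


def recursive_chunks(
    text: str,
    max_len: int,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Split on the coarsest separator first; recurse with finer separators
    only for pieces that are still longer than `max_len` characters."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No separators left: fall back to a hard fixed-size cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) <= 1:
        # This separator does not divide the text; try the next finer one.
        return recursive_chunks(text, max_len, finer)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            chunks.extend(recursive_chunks(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    doc = "Alpha beta gamma. Delta epsilon.\n\nZeta eta theta iota kappa."
    for chunk in recursive_chunks(doc, max_len=20):
        print(repr(chunk))
```

Note that a separator falling on a chunk boundary is dropped in this sketch, so a sentence split on ". " loses its trailing period. A production splitter would likely keep the delimiter attached to the preceding piece.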

Prerequisites

  • spacy
  • nltk
  • sentence-transformers
  • numpy
  • scikit-learn
