Text splitters break large documents into smaller text chunks suitable for retrieval. A well-designed splitting strategy improves the retrieval accuracy and generation quality of a RAG system: controlling chunk size and preserving semantic integrity ensures that each chunk carries meaningful context.
Below is an introduction to some commonly used components:
For more component usage details, see Text splitters.
MarkdownHeaderTextSplitter resides in different packages depending on the LangChain version:
- LangChain 1.x.x: `langchain-text-splitters` package
- LangChain 0.3.x: `langchain.text_splitter` module
After installing tRPC-Agent-Python, the relevant dependencies are installed automatically, so no further installation is required.
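To make the idea of header-based splitting concrete, here is a minimal plain-Python sketch. It is not LangChain's implementation: `split_by_headers` is a hypothetical helper written for illustration, and it ignores details a real splitter has to handle, such as `#` lines inside code blocks. It only shows the general idea of grouping body text under the headers in scope and recording those headers as metadata.

```python
def split_by_headers(text, headers_to_split_on):
    """Group Markdown lines under the headers currently in scope.

    Returns a list of (metadata, content) pairs, where metadata maps a
    header label (e.g. "Header 1") to the title of that header.
    Illustrative only; this is not how LangChain implements its splitter.
    """
    # Check longer prefixes first so "##" is not mistaken for "#".
    ordered = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    levels = {label: len(prefix) for prefix, label in headers_to_split_on}

    chunks = []
    active = {}   # label -> title of each header currently in scope
    buffer = []   # body lines collected for the current chunk

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            chunks.append((dict(active), body))
        buffer.clear()

    for line in text.splitlines():
        stripped = line.strip()
        for prefix, label in ordered:
            if stripped.startswith(prefix + " "):
                flush()
                # A new header closes headers at the same or a deeper level.
                for other in list(active):
                    if levels[other] >= len(prefix):
                        del active[other]
                active[label] = stripped[len(prefix):].strip()
                break
        else:
            # Not a configured header: treat the line as body text.
            buffer.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it."
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for metadata, content in split_by_headers(doc, headers):
    print(metadata, "->", content)
```

Each chunk keeps the titles of its enclosing headers, which is what lets a retriever return a passage together with the section it came from.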
- Create a `MarkdownHeaderTextSplitter` object

      # Import compatible with both LangChain 0.3.x and 1.x.x
      try:
          # Import for langchain v1.x.x
          from langchain_text_splitters import MarkdownHeaderTextSplitter
      except ImportError:
          # Import for langchain v0.3.x
          from langchain.text_splitter import MarkdownHeaderTextSplitter

      # Each tuple pairs a Markdown header prefix with the metadata key recorded for that header level.
      headers_to_split_on = [
          ("#", "Header 1"),
          ("##", "Header 2"),
          ("###", "Header 3"),
      ]
      markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)

- Construct a `LangchainKnowledge` object using this `markdown_splitter` object

      rag = LangchainKnowledge(
          ...,
          document_transformer=markdown_splitter,
          ...,
      )

- Create a `RecursiveCharacterTextSplitter` object

      # Import compatible with both LangChain 0.3.x and 1.x.x
      try:
          # Import for langchain v1.x.x
          from langchain_text_splitters import RecursiveCharacterTextSplitter
      except ImportError:
          # Import for langchain v0.3.x
          from langchain.text_splitter import RecursiveCharacterTextSplitter

      # chunk_size specifies the maximum number of characters per text chunk;
      # chunk_overlap specifies the number of overlapping characters between adjacent chunks.
      # Adjust these two parameters based on actual text length and use case to achieve optimal chunking results.
      text_splitter = RecursiveCharacterTextSplitter(separators=["\n"], chunk_size=10, chunk_overlap=0)

- Construct a `LangchainKnowledge` object using this `text_splitter` object

      # examples/knowledge_with_rag_agent/agent/tools.py
      from trpc_agent_sdk.server.knowledge.langchain_knowledge import LangchainKnowledge

      # rag_prompt, text_loader, embedder, and vectorstore are created elsewhere (not shown here).
      rag = LangchainKnowledge(
          prompt_template=rag_prompt,
          document_loader=text_loader,
          document_transformer=text_splitter,
          embedder=embedder,
          vectorstore=vectorstore,
      )

For the complete example, please refer to examples/knowledge_with_rag_agent/README.md.
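
To build intuition for how `chunk_size` and `chunk_overlap` interact, the sketch below uses a simple fixed-size sliding window in plain Python. This is a deliberately simplified stand-in, not the algorithm used by `RecursiveCharacterTextSplitter`, which splits on the configured separators and may therefore produce different boundaries. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters.

    Adjacent windows share chunk_overlap characters. Illustrative only:
    a plain sliding window, not a separator-aware splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=0))
# ['abcd', 'efgh', 'ij']
print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more context across neighbouring chunks, which can help retrieval when an answer straddles a boundary, at the cost of storing and embedding more text.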