Chunking Strategies
This is part of the AI Agents series. All code is at github.com/achintmehta/langchain.
Why chunking matters
LLMs have a context window: a hard limit on how much text they can see at once. A typical document (a PDF manual, a long article, a codebase file) is often larger than what you can fit alongside your system prompt and the user's question. Even when it fits, sending an entire 100-page manual for every query is wasteful and slow.
The solution is to split documents into smaller chunks, embed each chunk as a vector, and at query time retrieve only the chunks that are relevant to the question. The quality of this retrieval depends heavily on how well you chunked the document in the first place.
The full example code is in chunking/chunking_data.py.
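As a rough illustration of the split, embed, retrieve loop described above, here is a toy sketch that uses only the standard library. The word-count "embedding" and the sample manual_chunks are assumptions made for illustration; a real pipeline would use an embedding model and a vector store instead.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical, already-split chunks of a product manual.
manual_chunks = [
    "To reset the router, hold the reset button for ten seconds.",
    "The warranty covers manufacturing defects for two years.",
    "Firmware updates are installed from the admin panel.",
]

top = retrieve("How do I reset the router?", manual_chunks, k=1)
print(top[0])
```

Only the best-matching chunk is sent to the model, which is the point of chunking: the prompt carries the relevant passage rather than the whole manual.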
Loading documents
Before you can split, you need to load. LangChain has loaders for most common sources:
from langchain_community.document_loaders import TextLoader, PyPDFLoader, WebBaseLoader
# Plain text file
text_docs = TextLoader("./my_document.txt").load()
# PDF — each page becomes a Document
pdf_docs = PyPDFLoader("./manual.pdf").load()
# Web page
web_docs = WebBaseLoader("https://example.com/article").load()
Each loader returns a list of Document objects. A Document has two fields: page_content (the text) and metadata (a dict with source, page number, URL, etc.). That metadata travels with the chunk all the way through the pipeline — when you retrieve a chunk later, you still know where it came from.
RecursiveCharacterTextSplitter
The most commonly used splitter in LangChain is RecursiveCharacterTextSplitter. It tries a list of separators in order: paragraph breaks (\n\n) first, then line breaks (\n), then spaces, and finally individual characters if nothing else brings a piece under the size limit. As a result, chunks tend to end at paragraph or line boundaries, and only rarely in the middle of a word. Sentence-level separators such as ". " are not in the default list, but you can supply your own through the separators argument.
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # target size in characters
    chunk_overlap=50,  # characters of overlap between consecutive chunks
)
chunks = splitter.split_documents(text_docs)
for i, chunk in enumerate(chunks[:3]):
    print(f"--- Chunk {i} ({len(chunk.page_content)} chars) ---")
    print(chunk.page_content[:200])
    print()
Two parameters dominate the behaviour: chunk_size sets the maximum length of each chunk (measured in characters by default), and chunk_overlap makes each chunk repeat some of the text from the end of the previous one. Overlap exists to reduce information loss at boundaries: if a key sentence lands right at the edge of a chunk, the overlap makes it more likely that the sentence appears whole in at least one chunk, together with some of its surrounding context.
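To make the "try coarse separators first, fall back to finer ones" idea concrete, here is a simplified sketch in plain Python. It is not LangChain's implementation: it ignores overlap, drops the separators it splits on, and the function name recursive_split is made up for this example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitting (no overlap; separators are dropped).

    The separators tuple is assumed to end with "" as the last resort.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate      # piece still fits: keep merging
            continue
        if current:
            chunks.append(current)   # flush what has been merged so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
pieces = recursive_split(sample, chunk_size=40)
for p in pieces:
    print(len(p), repr(p))
```

The short paragraphs survive intact, while the long middle paragraph is broken at a space rather than mid-word. That is the behaviour the real splitter aims for, with overlap handling added on top.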
Choosing chunk size and overlap
There is no universally correct chunk size. It depends on your model's context window, the density of your documents, and your query patterns. A few rules of thumb:
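Chunk size should be small enough that several chunks plus your system prompt and question all fit in the context window comfortably. With a model that has a 4K context window, targeting 300–500 tokens per chunk is reasonable. With 32K+ context, you can afford larger chunks. Keep the units straight: context windows are measured in tokens, while chunk_size in the examples here counts characters. For English text, one token is very roughly four characters, so 300–500 tokens corresponds to something like 1,200–2,000 characters.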
Overlap is typically 10–20% of the chunk size. More overlap means more redundancy and larger storage costs, but also better recall at chunk boundaries. For short chunks (< 300 chars), overlap matters more because boundaries are more frequent.
Smaller chunks retrieve more precisely but give the LLM less surrounding context. Larger chunks give the LLM more context but may include irrelevant content that dilutes the answer.
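The sizing rule of thumb above can be checked with simple arithmetic. The sketch below estimates how many chunks fit in a context window. The helper name, the token counts in the example, and the figure of about four characters per token are all assumptions for illustration; measure with your model's tokenizer if you need real numbers.

```python
import math


def max_chunks_in_context(
    context_tokens: int,
    prompt_tokens: int,
    question_tokens: int,
    answer_tokens: int,
    chunk_chars: int,
    chars_per_token: float = 4.0,  # rough heuristic for English text (assumption)
) -> int:
    """Estimate how many chunks of chunk_chars characters fit in the window."""
    chunk_tokens = math.ceil(chunk_chars / chars_per_token)
    available = context_tokens - prompt_tokens - question_tokens - answer_tokens
    return max(available // chunk_tokens, 0)


# Hypothetical budget: 4K window, 300-token system prompt, 100-token question,
# 500 tokens reserved for the answer, 2,000-character chunks.
n = max_chunks_in_context(4096, 300, 100, 500, chunk_chars=2000)
print(n)
```

If the result is smaller than the number of chunks you want to retrieve per query, either shrink the chunks or retrieve fewer of them.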
Language-aware splitting
If your documents are source code, RecursiveCharacterTextSplitter can be configured to split on language-appropriate boundaries — function definitions, class declarations, and so on — rather than just characters. This is much better than naively splitting Python code at arbitrary character counts.
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=60,
    chunk_overlap=0,
)
python_code = """
def hello_world():
    print("Hello, world!")
def add(a, b):
    return a + b
"""
chunks = python_splitter.create_documents([python_code])
for chunk in chunks:
    print(repr(chunk.page_content))
LangChain has language-aware splitters for Python, JavaScript, TypeScript, Markdown, HTML, Go, Ruby, Rust, Scala, Swift, and more. The Markdown splitter, in particular, is useful for documentation corpora: it prefers heading boundaries, so chunks tend to line up with logical sections.
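The idea behind language-aware splitting can be sketched without LangChain: cut the source just before each top-level def or class, so every chunk holds a whole definition. The function below is a hypothetical stand-in written with a regular expression. The real splitter works from an ordered separator list and also enforces chunk_size, which this sketch does not.

```python
import re


def split_top_level_defs(source: str) -> list[str]:
    """Split Python source before each top-level 'def' or 'class' line."""
    # Zero-width match at the start of any line that begins with def/class.
    parts = re.split(r"^(?=(?:def|class) )", source, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]


example_source = """import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r
"""

for part in split_top_level_defs(example_source):
    print(repr(part))
```

The indented method stays inside its class chunk because only definitions at the start of a line trigger a split.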
Splitting strategies at a glance
Different document types call for different strategies. Here is a summary:
| Strategy | Best for | LangChain tool |
|---|---|---|
| Fixed-size with overlap | Logs, transcripts, uniform text | RecursiveCharacterTextSplitter |
| Sentence / paragraph boundary | Articles, how-to guides, policies | RecursiveCharacterTextSplitter (default) |
| Structure-aware (heading/section) | API docs, manuals, textbooks | MarkdownHeaderTextSplitter |
| Language-aware | Source code | RecursiveCharacterTextSplitter.from_language(...) |
| Per-page (PDF) | Scanned documents, PDFs with figures | PyPDFLoader (one Document per page) |
For most text documents, RecursiveCharacterTextSplitter with a chunk size of 400–600 characters and 10–15% overlap is a reasonable starting point. Start there, then adjust based on retrieval quality.
Metadata is as important as content
Every chunk inherits the metadata of its parent document — the source file path, URL, page number, or whatever the loader provided. You can also add your own metadata before or after splitting:
for doc in text_docs:
    doc.metadata["category"] = "user-manual"
    doc.metadata["version"] = "3.2"
chunks = splitter.split_documents(text_docs)
# Each chunk now has category and version in its metadata
When you store these chunks in a vector database (covered in the next part), the metadata comes along and can be used to filter searches. For example, you can restrict retrieval to chunks from a specific document version or category without having to re-embed anything.
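Metadata filtering can be pictured with plain dictionaries. In this sketch, dicts stand in for Document objects, and filter_by_metadata is a hypothetical helper. Vector databases expose their own filter syntax, so treat this only as an illustration of the idea.

```python
# Plain dicts stand in for Document objects; the field names mirror
# page_content and metadata as described earlier.
stored_chunks = [
    {"page_content": "Hold the reset button for ten seconds.",
     "metadata": {"category": "user-manual", "version": "3.2"}},
    {"page_content": "Hold the reset button for five seconds.",
     "metadata": {"category": "user-manual", "version": "3.1"}},
    {"page_content": "Release notes for the 3.2 firmware.",
     "metadata": {"category": "changelog", "version": "3.2"}},
]


def filter_by_metadata(chunks, **required):
    """Keep chunks whose metadata matches every required key/value pair."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in required.items())
    ]


current_manual = filter_by_metadata(stored_chunks, category="user-manual", version="3.2")
print([c["page_content"] for c in current_manual])
```

Because the filter looks only at metadata, the chunk text and its embedding are untouched; narrowing the search to one version or category needs no re-embedding.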
What's next
Once you have chunks, you need to turn them into vectors and store them somewhere you can search efficiently. The next part covers embedding models and vector databases.