Most RAG pipelines start with fixed-size chunking, which means splitting the text every 500 or 1,000 characters. It is simple, fast, and easy to implement, but it has one important weakness: it does not care about meaning. A chunk can end in the middle of an explanation, while the next one begins with the rest of the idea or even with a different topic. When that happens, the embedding no longer represents one clear concept, retrieval quality drops, and the LLM ends up working with incomplete or mixed context.
A simple example makes this easier to see. Imagine a document that covers authentication and rate limiting:
With fixed-size chunking, the result might look like this:
Chunk 2 now contains two unrelated topics. If the user asks about rate limiting, the retriever may still return a chunk polluted with OAuth content. That is where weak retrieval starts to affect the final answer.
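To see how a mixed chunk like this comes about, here is a minimal sketch of fixed-size chunking. The document text, the function name fixed_size_chunks, and the 50-character size are all made up for illustration (real pipelines typically use 500 to 1,000 characters, as mentioned earlier), so treat it as a toy reproduction of the problem rather than a real pipeline.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive slices of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical two-topic document, invented for this illustration.
doc = (
    "OAuth tokens grant access to the API. "
    "OAuth tokens expire after one hour. "
    "Rate limits cap requests per minute. "
    "Rate limits reset every minute."
)

# 50 is an arbitrary demo size chosen so the cut lands mid-topic.
chunks = fixed_size_chunks(doc, size=50)
for number, chunk in enumerate(chunks, start=1):
    print(f"Chunk {number}: {chunk!r}")
```

With these inputs, the second chunk holds the tail of the token-expiry sentence together with the start of the first rate-limit sentence, which is exactly the kind of mixed context described above.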
Semantic chunking takes a different approach. Instead of splitting by character count, it splits when the topic changes. It creates an embedding for each sentence, compares nearby sentences, and starts a new chunk when the similarity drops enough to suggest that the text has moved to a different idea.
For the same document, the result looks much cleaner:
Now each chunk contains one complete idea. That leads to better embeddings, more precise retrieval, and more grounded answers from the model. In practice, that is the real value of semantic chunking: it gives the retriever cleaner building blocks to work with.
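The split-on-topic-change idea can be sketched without any ML dependencies. In the example below, toy_embed is a deliberately crude stand-in for a real embedding model (it only counts words), and the sentences, the function names, and the 0.1 similarity cutoff are assumptions chosen for the demo. A real setup would use learned embeddings and a tuned threshold, but the control flow is the same: embed each sentence, compare neighbors, and cut when similarity drops.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], min_similarity: float) -> list[str]:
    """Start a new chunk whenever two adjacent sentences fall below min_similarity."""
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev, curr, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, curr) < min_similarity:
            groups.append([])  # similarity dropped: treat it as a topic change
        groups[-1].append(sentence)
    return [" ".join(group) for group in groups]


# Hypothetical sentences, invented for this illustration.
sentences = [
    "OAuth tokens grant access to the API.",
    "OAuth tokens expire after one hour.",
    "Rate limits cap requests per minute.",
    "Rate limits reset every minute.",
]

chunks = semantic_chunks(sentences, min_similarity=0.1)
for number, chunk in enumerate(chunks, start=1):
    print(f"Chunk {number}: {chunk}")
```

Here the only adjacent pair with no shared vocabulary is the jump from the token-expiry sentence to the first rate-limit sentence, so the sketch produces two chunks, one per topic.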
Semantic chunking sounds advanced, but the implementation with LangChain is short. We use all-MiniLM-L6-v2 as the embedding model because it is lightweight and runs locally.
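A minimal version of that setup might look like the sketch below. It assumes the langchain-experimental, langchain-huggingface, and sentence-transformers packages are installed; the article does not say which embedding integration to use, so HuggingFaceEmbeddings is our assumption here. The names CHUNKER_SETTINGS and semantic_split are illustrative and not part of LangChain.

```python
# Threshold settings discussed in this article; tune them for your own corpus.
CHUNKER_SETTINGS = {
    "breakpoint_threshold_type": "percentile",
    "breakpoint_threshold_amount": 70,
}


def semantic_split(text: str) -> list[str]:
    """Split text with LangChain's SemanticChunker and a local MiniLM model."""
    # Imported inside the function so the embedding model is only
    # loaded when a split is actually requested.
    from langchain_experimental.text_splitter import SemanticChunker
    from langchain_huggingface import HuggingFaceEmbeddings

    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    splitter = SemanticChunker(embeddings, **CHUNKER_SETTINGS)
    return splitter.split_text(text)
```

Calling semantic_split with the text of a document returns a list of chunk strings. The first call downloads the model if it is not already cached locally.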
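With LangChain's default percentile threshold type, a breakpoint_threshold_amount of 70 means a new chunk starts wherever the embedding distance between neighboring sentences exceeds the 70th percentile of all such distances in the document. In other words, roughly the largest 30% of jumps become chunk boundaries, and raising the value produces fewer, larger chunks. That value is not universal, so it is worth testing different numbers depending on your documents. Once the chunks are created, they can be stored in any vector database and compared against a fixed-size baseline using similarity_search.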
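To make the percentile setting more concrete, here is a small standalone sketch of the rule as we understand it from LangChain's percentile mode: the splitter measures the distance between each pair of adjacent sentences, and a gap becomes a breakpoint when its distance is above the chosen percentile of all gaps. The distance values and the helper names percentile and breakpoints are made up for this example, and the sketch does not call LangChain itself.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """q-th percentile using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def breakpoints(distances: list[float], amount: float) -> list[int]:
    """Indices of gaps whose distance is above the `amount`-th percentile."""
    if not distances:
        return []
    threshold = percentile(distances, amount)
    return [i for i, d in enumerate(distances) if d > threshold]


# Made-up distances between 11 consecutive sentences (10 gaps).
distances = [0.10, 0.15, 0.80, 0.12, 0.75, 0.20, 0.11, 0.14, 0.90, 0.13]

print(breakpoints(distances, 70))
print(breakpoints(distances, 95))
```

A breakpoint at index i means the text is cut between sentence i and sentence i + 1. With these numbers, an amount of 70 cuts at the three largest gaps (indices 2, 4, and 8), while 95 keeps only the single largest one (index 8), which is why higher values lead to fewer and larger chunks.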
Semantic chunking works best when documents contain multiple topics, the content is written in normal paragraphs, and users ask specific questions that need focused answers. It is especially useful for internal documentation, API guides, and knowledge bases where preserving meaning matters more than preprocessing speed.
Fixed-size chunking is still a valid option for quick prototyping, well-structured content, or cases where speed matters more than retrieval precision. The two approaches are not mutually exclusive, and many teams start with fixed-size chunking and move to semantic chunking once retrieval quality becomes a real bottleneck.
One of the biggest lessons for us was that chunking is not just a preprocessing detail. It directly affects the quality of the whole RAG system. After switching to semantic chunking, the retrieved context became much clearer and the model stopped mixing unrelated ideas.
If your RAG system feels inconsistent, look at the chunks before tuning the prompt or switching models. Sometimes the real problem is not how the model answers, but what context it receives.