🦙✂️ Text Splitters: Smart Text Division with LlamaIndex

Precision Text Splitting Made Easy with LlamaIndex

Gustavo Espíndola
3 min read · May 22, 2024

Dividing a long document into pieces a model can process, without breaking its meaning, is a real challenge. LlamaIndex’s text splitters tackle it head-on, offering a range of algorithms that divide text while preserving its integrity. From character-based splitting to token-based approaches, LlamaIndex provides a toolkit for tailoring the splitting process to your specific needs.

Text Splitters

Text Splitters are tools that divide text into smaller fragments with semantic meaning, often corresponding to sentences. But here’s where the intelligence lies: it’s not just about splitting; it’s about combining these fragments strategically. They work as follows:

  1. Divide the text into small fragments with semantic meaning, such as sentences.
  2. Then, combine these small fragments into a larger fragment until a certain size is reached, usually measured by some function.
  3. Once that size is reached, that fragment becomes its own unit of text. Then, the creation of a new text fragment begins, with some overlap to maintain context between the fragments.

This means that Text Splitters are highly customizable in two fundamental aspects:

  • How the text is divided: You can define division rules based on characters, words, or tokens.
  • How the fragment size is measured: You can choose the function that measures a fragment (for example, counting characters or tokens) and set the target size according to your specific needs.

Figure: Split examples
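
The three steps and the two customization points above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not LlamaIndex’s implementation: `split_sentences`, `count_words`, and `merge_fragments` are hypothetical helpers, sentences are detected with a naive regex, and size is measured in whitespace-separated words.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Step 1: naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_words(text: str) -> int:
    """Size function (an assumption): whitespace-separated words."""
    return len(text.split())


def merge_fragments(
    fragments: List[str],
    chunk_size: int,
    chunk_overlap: int,
    size_fn: Callable[[str], int] = len,
) -> List[str]:
    """Steps 2 and 3: pack fragments into chunks, carrying some overlap."""
    chunks: List[str] = []
    current: List[str] = []
    for fragment in fragments:
        if current and size_fn(" ".join(current + [fragment])) > chunk_size:
            # The chunk is full: emit it as its own unit of text.
            chunks.append(" ".join(current))
            # Carry trailing fragments into the next chunk as overlap,
            # as long as they fit within chunk_overlap.
            carried: List[str] = []
            for previous in reversed(current):
                if size_fn(" ".join([previous] + carried)) > chunk_overlap:
                    break
                carried.insert(0, previous)
            current = carried
            # Drop carried fragments that would push the new chunk over the limit.
            while current and size_fn(" ".join(current + [fragment])) > chunk_size:
                current.pop(0)
        # A single fragment larger than chunk_size is kept whole (oversized chunk).
        current.append(fragment)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Text splitters divide documents. They group sentences into chunks. "
    "Each chunk has a size limit. Neighbors share some overlap."
)
chunks = merge_fragments(
    split_sentences(text), chunk_size=12, chunk_overlap=6, size_fn=count_words
)
for chunk in chunks:
    print(count_words(chunk), "words:", chunk)
```

Passing `len` as `size_fn` would measure chunks in characters instead, and a tokenizer’s counting function would measure them in tokens; the packing logic stays the same.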

Chunking

Chunking is the process of breaking down content into smaller, more manageable parts, making it easier to handle from a computational perspective. The size of these fragments determines how much meaningful content each unit contains.

Overlap, on the other hand, lets adjacent fragments share some common information. Think of the “previously on” recap at the start of a new episode of your favorite TV series, which replays key scenes from the last one, and of the “coming soon” teaser at the end.

So, the overlap is the specific amount of content that neighboring fragments share. It is expressed either as an absolute size or as a percentage of the fragment size. If your fragments are 512 characters long, your overlap could be 50 characters, which is roughly 10% of each fragment.

Now, let’s illustrate this with an example
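
A minimal sketch of such an example, assuming plain fixed-size character windows rather than any of LlamaIndex’s own splitters. `chunk_by_characters` is a hypothetical helper, and the 1,200-character input is synthetic; the 512 and 50 values are the ones used above.

```python
from typing import List


def chunk_by_characters(
    text: str, chunk_size: int = 512, chunk_overlap: int = 50
) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    # Each new window starts chunk_size - chunk_overlap characters after the last.
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# Synthetic 1,200-character document made of repeating digits.
text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_by_characters(text, chunk_size=512, chunk_overlap=50)

overlap_percent = 50 / 512 * 100
print([len(chunk) for chunk in chunks])
print(f"overlap is {overlap_percent:.1f}% of each full chunk")
# The tail of one chunk is repeated at the head of the next.
print(chunks[0][-50:] == chunks[1][:50])
```

With these settings each window starts 462 characters after the previous one, so the last 50 characters of a fragment reappear as the first 50 of the next.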

This overlap gives adjacent fragments some continuity and shared context, which is useful for narrative text, where meaning carries across fragment boundaries, and it increases the likelihood of delivering relevant data to the model.

Why is it important?
Creating good chunks is essential in semantic search and RAG (Retrieval-Augmented Generation). Effective content division ensures that we maintain coherence and context in the response. If we divide a story into unrelated fragments, we could lose the ability to create a coherent response.

What is the best strategy for chunks and overlap?
The best strategy for a text will largely depend on the nature of the document and the purpose of its analysis.

For unstructured text documents, I personally recommend the “Sentence Splitter” strategy.

As the name suggests, SentenceSplitter aims to divide text while respecting sentence boundaries. By keeping sentences and paragraphs together, it minimizes the risk of fragmented or incomplete information, making it an excellent choice for preserving context and coherence.

Additional Resources

Example Project
Get the code

Written by Gustavo Espíndola

Maker & Senior Product Designer — Co-founder of CodeGPT by Judini AI
