Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens

Overview

text-splitter

Docs Licence Crates.io codecov

Large language models (LLMs) can be used for many tasks, but they often have a limited context size that can be smaller than the documents you want to work with. To use longer documents, you often have to split your text into chunks that fit within this context size.

This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.

Get Started

By Number of Characters

use text_splitter::{Characters, TextSplitter};

// Maximum number of characters in a chunk
let max_characters = 1000;
// Default implementation uses character count for chunk size
let splitter = TextSplitter::default()
    // Optionally can also have the splitter trim whitespace for you
    .with_trim_chunks(true);

let chunks = splitter.chunks("your document text", max_characters);

By Tokens

use text_splitter::TextSplitter;
// Can also use tiktoken-rs, or anything that implements the TokenCount
// trait from the text_splitter crate.
use tokenizers::Tokenizer;

let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();
let max_tokens = 1000;
let splitter = TextSplitter::new(tokenizer)
    // Optionally can also have the splitter trim whitespace for you
    .with_trim_chunks(true);

let chunks = splitter.chunks("your document text", max_tokens);

Method

To preserve as much semantic meaning within a chunk as possible, a recursive approach is used, starting with the largest semantic units and, if a unit is still too large, breaking it up into the next-smaller unit. Here is an example of the steps used (a simplified sketch of the merge step follows the list):

  1. Split the text by a given level
  2. For each section, does it fit within the chunk size?
    • Yes. Merge as many of these neighboring sections into a chunk as possible to maximize chunk length.
    • No. Split by the next level and repeat.
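
As a rough illustration of the merge step in point 2 (not the crate's actual internals), here is a minimal sketch, assuming a list of already-split sections and a simple character budget; text-splitter's real logic also recurses into smaller semantic levels when a single section is still too large:

/// Greedily merge neighboring sections into chunks of at most `max_chars`
/// characters. Purely illustrative; not the crate's real implementation.
fn merge_sections(sections: &[&str], max_chars: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();

    for section in sections {
        // If adding this section would exceed the budget, close the
        // current chunk and start a new one.
        if !current.is_empty()
            && current.chars().count() + section.chars().count() > max_chars
        {
            chunks.push(current.clone());
            current.clear();
        }
        current.push_str(section);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

// For example, with a budget of 30 characters:
let chunks = merge_sections(&["First paragraph. ", "Second paragraph. ", "Third."], 30);
assert_eq!(chunks.len(), 2);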

The boundaries used by the top-level split method to split the text, in order of decreasing size:

  1. 2 or more newlines (Newline is \r\n, \n, or \r)
  2. 1 newline
  3. Unicode Sentence Boundaries
  4. Unicode Word Boundaries
  5. Unicode Grapheme Cluster Boundaries
  6. Characters

Splitting doesn't occur below the character level; otherwise you could end up with partial bytes of a char, which would not be a valid Unicode str.

Note on sentences: there are many methods of determining sentence breaks, all with varying degrees of accuracy, and many require ML models to do so. Rather than trying to find the perfect sentence breaks, we rely on the Unicode sentence-boundary rules, which in most cases are good enough for finding a decent semantic breaking point if a paragraph is too large, and which avoid the performance penalties of many other methods.
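
Since these lower levels lean on the unicode-segmentation crate (thanked below), here is a minimal sketch of the boundary iterators involved, assuming unicode-segmentation is added as a direct dependency just for the example:

// Requires the unicode-segmentation crate as a dependency.
use unicode_segmentation::UnicodeSegmentation;

let text = "The first sentence. The second one!";

// Unicode sentence boundaries (UAX #29), used when a paragraph is too large.
let sentences: Vec<&str> = text.unicode_sentences().collect();

// Unicode word boundaries, used when a sentence is too large.
let words: Vec<&str> = text.split_word_bounds().collect();

// Extended grapheme clusters (user-perceived characters), the level above raw chars.
let graphemes: Vec<&str> = "e\u{301}xample".graphemes(true).collect();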

Inspiration

This crate was inspired by LangChain's TextSplitter. But looking into the implementation suggested there was potential for better performance as well as better semantic chunking.

A big thank you to the unicode-rs team for their unicode-segmentation crate that manages a lot of the complexity of matching the Unicode rules for words and sentences.

You might also like...
🦀 Stupid simple presentation of the number of words, characters and lines on your clipboard.

clipcount: Counting words from the clipboard content Why does this exist? Do you find yourself often needing to count the number of words in a piece o

A small Rust library that lets you get position and size of the active window on Windows and MacOS

active-win-pos-rs A small Rust library that lets you get position and size of the active window on Windows and MacOS Build % git clone https://github

An implementation of Piet's text interface using cosmic-text

piet-cosmic-text Implements piet's Text interface using the cosmic-text crate. License piet-cosmic-text is free software: you can redistribute it and/

Count your code by tokens, types of syntax tree nodes, and patterns in the syntax tree. A tokei/scc/cloc alternative.

tcount (pronounced "tee-count") Count your code by tokens, types of syntax tree nodes, and patterns in the syntax tree. Quick Start Simply run tcount

Tiny Rust library to draw pretty line graphs using ascii characters.

rasciigraph Tiny Rust library to draw pretty line graphs using ascii characters. Usage Add this to your Cargo.toml [dependencies] rasciigraph = "0.1.1

`boxy` - declarative box-drawing characters

boxy - declarative box-drawing characters Box-drawing characters are used extensively in text user interfaces software for drawing lines, boxes, and o

Alexander Mongus is a state-of-the-art filter to sneak amogus characters in pictures

A. Mongus Go to: http://www.lortex.org/amogu/ This is a client-side, Webassembly-based filter to hide amongus characters in your images. Example:

omekasy is a command line application that converts alphanumeric characters in your input to various styles defined in Unicode.

omekasy is a command line application that converts alphanumeric characters in your input to various styles defined in Unicode. omekasy means "dress up" in Japanese.

A terminal clock that uses 7-segment display characters

Seven-segment clock (7clock) This is a clock for terminals that uses the Unicode seven-segment display characters added in Unicode 13.0.

Comments
  • Iterative Approach

    Iterative Approach

    As is always the case in Rust, the "elegant" recursive approach causes issues.

    When building on top of this for Markdown Splitting, I ran into stack overflow issues, as well as compile time issues, which I think comes from lots of impl Iterator return types the compiler has to recursively figure out.

    I'm moving to an iterative approach with a custom impl Iterator for walking through the text str, and can hopefully use a similar approach for walking through the Markdown events.
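
    For a sense of what that looks like, here is a bare-bones sketch of a hand-rolled chunk iterator; `ChunkIter` is a made-up name for illustration, not a type from this crate:

    // Purely illustrative: a hand-rolled iterator that yields chunks lazily
    // instead of building them up through recursive calls.
    struct ChunkIter<'a> {
        remaining: &'a str,
        max_chars: usize,
    }

    impl<'a> Iterator for ChunkIter<'a> {
        type Item = &'a str;

        fn next(&mut self) -> Option<Self::Item> {
            if self.remaining.is_empty() {
                return None;
            }
            // Split no further than `max_chars` characters in, always on a
            // char boundary; short remainders are yielded whole.
            let split = self
                .remaining
                .char_indices()
                .nth(self.max_chars)
                .map(|(byte_idx, _)| byte_idx)
                .unwrap_or(self.remaining.len());
            let (chunk, rest) = self.remaining.split_at(split);
            self.remaining = rest;
            Some(chunk)
        }
    }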

    opened by benbrandt 1
Releases(v0.3.0)
  • v0.3.0(May 19, 2023)

    What's Changed

    Breaking Changes

    • Match feature names for tokenizer crates to prevent conflicts in the future.
      • huggingface -> tokenizers
      • tiktoken -> tiktoken-rs

    Features

    • Moved from recursive approach to iterative approach to avoid stack overflow issues by @benbrandt in https://github.com/benbrandt/text-splitter/pull/7
    • Relax MSRV to 1.60.0

    Full Changelog: https://github.com/benbrandt/text-splitter/compare/v0.2.2...v0.3.0

    Source code(tar.gz)
    Source code(zip)
  • v0.2.2(May 8, 2023)

  • v0.2.1(May 8, 2023)

    New Features

    • impl Default for TextSplitter using Characters. Character count is used for chunk length by default.
    • Specify the current MSRV (1.62.1)

    Full Changelog: https://github.com/benbrandt/text-splitter/compare/v0.2.0...v0.2.1

    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(May 8, 2023)

    v0.2.0

    Breaking Changes

    Simpler Chunking API

    Simplified API for the main use case. TextSplitter now only exposes two chunking methods:

    • chunks
    • chunk_indices

    The other methods are now private. They would likely have caused confusion, since they don't return the semantic units themselves, but merged versions of them.

    You also specify chunk size directly in these methods to allow reusing the TextSplitter for different chunk sizes.
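
    A short sketch of the simplified API, reusing one splitter for multiple chunk sizes (this uses the character-count default shown in Get Started above, and assumes chunk_indices pairs each chunk with its byte offset in the original text):

    use text_splitter::TextSplitter;

    let splitter = TextSplitter::default();

    // Chunk size is now an argument, so the same splitter can be reused.
    let coarse = splitter.chunks("your document text", 1000);
    let fine = splitter.chunks("your document text", 100);

    // chunk_indices also yields where each chunk starts in the original text.
    for (offset, chunk) in splitter.chunk_indices("your document text", 100) {
        println!("{offset}: {chunk}");
    }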

    Allow passing in tokenizers directly

    Rather than wrapping a tokenizer in another struct, you can instead just pass a tokenizer directly into TextSplitter::new.

    Bug Fixes

    Better handling of recursive paragraph chunking when both double and single newline splits are used.

    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(May 5, 2023)

Owner
Ben Brandt
Lead UX Engineer at Aleph Alpha
decode a byte stream of varint length-encoded messages into a stream of chunks

length-prefixed-stream decode a byte stream of varint length-encoded messages into a stream of chunks This crate is similar to and compatible with the

James Halliday 4 Feb 26, 2022
Keybinder to type diacritical characters without needing to hack the layout itself. Supports bindings to the left Alt + letter

Ďíáǩříťíǩád I just thought that it's a shame the word diakritika does not have any diacritics in it. Key points diakritika is a simple Windows daemon

null 4 Feb 26, 2024
A tool to convert old and outdated "characters" into the superior Rustcii-Encoding.

rustcii A tool to convert old and outdated "characters" into the superior Rustcii-Encoding. Speak your mind. Blazingly fast. Github | cr

null 8 Nov 16, 2022
a FREE and MODERN split-screen tetris game WITHOUT ADS

tetr:: A ✨ modern ✨ Tetris game made in OpenGL and Rust Gameplay tetr:: is an implementation of modern Tetris, and essentially a clone of tetr.io. This

Adam Harmansky 3 Sep 10, 2022
Binary Ninja plugin written in Rust to automatically apply symbol information from split debug info on Linux.

Load Symbols Binary Ninja plugin written in Rust to automatically apply symbol information from split debug info on Linux. Requirements Last tested wi

null 4 Jul 20, 2022
Semantic find-and-replace using tree-sitter-based macro expansion

Isaac Clayton 15 Nov 10, 2022
Blockoli is a high-performance tool for code indexing, embedding generation, and semantic search for use with LLMs.

blockoli Blockoli is a high-performance tool for code indexing, embedding generation and semantic search tool for use with LLMs. blockoli is buil

Asterisk 76 Jul 24, 2024
A tree-sitter based AST difftool to get meaningful semantic diffs

diffsitter Disclaimer diffsitter is very much a work in progress and nowhere close to production ready (yet). Contributions are always welcome! Summar

Afnan Enayet 1.3k Jan 8, 2023
Integrate a Rust project with semantic-release

semantic-release-cargo semantic-release-cargo integrates a cargo-based Rust project with semantic-release. This solves two use cases: publishing to cr

null 5 Jan 16, 2023
Small CLI for escaping and unescaping characters in strings

esc Small CLI for escaping characters in strings. Install cargo install esc Usage cat LICENSE-MIT | esc escape | pbcopy pbpaste | esc unescape | pb

Seth 1 Nov 26, 2021