Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens

Overview

text-splitter


Large language models (LLMs) can be used for many tasks, but often have a limited context size that can be smaller than documents you might want to use. To use documents of larger length, you often have to split your text into chunks to fit within this context size.

This crate provides methods for splitting longer pieces of text into smaller chunks, aiming to maximize a desired chunk size, but still splitting at semantically sensible boundaries whenever possible.

Get Started

By Number of Characters

use text_splitter::{Characters, TextSplitter};

// Maximum number of characters in a chunk
let max_characters = 1000;
// Default implementation uses character count for chunk size
let splitter = TextSplitter::default()
    // Optionally can also have the splitter trim whitespace for you
    .with_trim_chunks(true);

let chunks = splitter.chunks("your document text", max_characters);
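
Assuming chunks lazily yields string slices (an assumption for this sketch; the exact return type is not shown above), the result can be collected when a Vec is needed:

let collected: Vec<&str> = splitter
    .chunks("your document text", max_characters)
    .collect();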

By Tokens

use text_splitter::TextSplitter;
// Can also use tiktoken-rs, or anything that implements the TokenCount
// trait from the text_splitter crate.
use tokenizers::Tokenizer;

let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();
let max_tokens = 1000;
let splitter = TextSplitter::new(tokenizer)
    // Optionally can also have the splitter trim whitespace for you
    .with_trim_chunks(true);

let chunks = splitter.chunks("your document text", max_tokens);
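
The same pattern should also work with tiktoken-rs. The following is a hedged sketch, assuming the crate's tiktoken-rs feature is enabled and using cl100k_base from the tiktoken-rs crate as the tokenizer:

use text_splitter::TextSplitter;
use tiktoken_rs::cl100k_base;

// cl100k_base is the BPE used by several OpenAI models.
let tokenizer = cl100k_base().unwrap();
let max_tokens = 1000;
let splitter = TextSplitter::new(tokenizer)
    // Optionally can also have the splitter trim whitespace for you
    .with_trim_chunks(true);

let chunks = splitter.chunks("your document text", max_tokens);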

Method

To preserve as much semantic meaning within a chunk as possible, a recursive approach is used: start at the largest semantic units and, if a unit is too large, break it up into the next largest unit. Here is an example of the steps used, followed by a simplified code sketch:

  1. Split the text by a given level
  2. For each section, does it fit within the chunk size?
    • Yes. Merge as many of these neighboring sections into a chunk as possible to maximize chunk length.
    • No. Split by the next level and repeat.
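
To make the merge-or-recurse step concrete, here is a simplified, illustrative sketch. It is not the crate's actual implementation (the crate uses an iterative approach), the split functions passed in are hypothetical stand-ins for the boundary levels listed below, and length is measured in bytes for brevity, whereas the crate measures characters or tokens.

fn chunk_recursive<'text>(
    text: &'text str,
    max_len: usize,
    // Each level splits text into smaller semantic sections,
    // e.g. paragraphs, then sentences, then words, ...
    levels: &[fn(&'text str) -> Vec<&'text str>],
) -> Vec<String> {
    let (split, lower_levels) = match levels.split_first() {
        Some(pair) => pair,
        // No lower level left: keep the text as-is.
        None => return vec![text.to_string()],
    };
    let mut chunks = Vec::new();
    let mut current = String::new();
    for section in split(text) {
        if section.len() > max_len {
            // Section is still too large: flush the current chunk and
            // split this section by the next level down.
            if !current.is_empty() {
                chunks.push(std::mem::take(&mut current));
            }
            chunks.extend(chunk_recursive(section, max_len, lower_levels));
        } else if current.len() + section.len() <= max_len {
            // Merge neighboring sections into the chunk while they still fit.
            current.push_str(section);
        } else {
            // The current chunk is full: start a new one with this section.
            chunks.push(std::mem::take(&mut current));
            current.push_str(section);
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}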

The boundaries used to split the text when using the top-level split method, in descending order of length:

  1. 2 or more newlines (Newline is \r\n, \n, or \r)
  2. 1 newline
  3. Unicode Sentence Boundaries
  4. Unicode Word Boundaries
  5. Unicode Grapheme Cluster Boundaries
  6. Characters

Splitting doesn't occur below the character level; otherwise you could get partial bytes of a char, which may not be a valid Unicode str.

Note on sentences: There are lots of methods of determining sentence breaks, all with varying degrees of accuracy, and many requiring ML models to do so. Rather than trying to find the perfect sentence breaks, we rely on the Unicode sentence boundary rules, which in most cases are good enough for finding a decent semantic breaking point if a paragraph is too large, and avoid the performance penalties of many other methods.
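
For instance, a minimal sketch of those sentence boundaries using the unicode-segmentation crate (assuming it is added as a direct dependency):

use unicode_segmentation::UnicodeSegmentation;

let text = "The first sentence. The second sentence.";
// Iterate over substrings delimited by UAX #29 sentence boundaries.
for sentence in text.unicode_sentences() {
    println!("{sentence:?}");
}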

Inspiration

This crate was inspired by LangChain's TextSplitter. But, looking into the implementation, there appeared to be potential for better performance as well as better semantic chunking.

A big thank you to the unicode-rs team for their unicode-segmentation crate that manages a lot of the complexity of matching the Unicode rules for words and sentences.


Comments
  • Iterative Approach

    As is always the case in Rust, the "elegant" recursive approach causes issues.

    When building on top of this for Markdown splitting, I ran into stack overflow issues, as well as compile-time issues, which I think come from the many impl Iterator return types the compiler has to recursively figure out.

    This moves to an iterative approach with a custom impl Iterator for walking through the text str. A similar approach can then hopefully be used for walking through the Markdown events.

    opened by benbrandt 1
Releases (v0.3.0)
  • v0.3.0 (May 19, 2023)

    What's Changed

    Breaking Changes

    • Match feature names for tokenizer crates to prevent conflicts in the future.
      • huggingface -> tokenizers
      • tiktoken -> tiktoken-rs

    Features

    • Moved from recursive approach to iterative approach to avoid stack overflow issues by @benbrandt in https://github.com/benbrandt/text-splitter/pull/7
    • Relax MSRV to 1.60.0

    Full Changelog: https://github.com/benbrandt/text-splitter/compare/v0.2.2...v0.3.0

  • v0.2.2 (May 8, 2023)

  • v0.2.1 (May 8, 2023)

    New Features

    • impl Default for TextSplitter using Characters. Character count is used for chunk length by default.
    • Specify the current MSRV (1.62.1)

    Full Changelog: https://github.com/benbrandt/text-splitter/compare/v0.2.0...v0.2.1

  • v0.2.0 (May 8, 2023)

    Breaking Changes

    Simpler Chunking API

    Simplified API for the main use case. TextSplitter now only exposes two chunking methods:

    • chunks
    • chunk_indices

    The other methods are now private. It was likely that the other methods would have caused confusion, since they don't return the semantic units themselves, but merged versions of them.

    You also specify the chunk size directly in these methods, which allows reusing the same TextSplitter for different chunk sizes.
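
    For illustration, a hedged sketch of how the two methods might be called; the item type yielded by chunk_indices is an assumption here, modeled on str::char_indices:

    use text_splitter::TextSplitter;

    let splitter = TextSplitter::default();
    // chunks yields the chunk text itself.
    for chunk in splitter.chunks("your document text", 1000) {
        println!("{chunk}");
    }
    // chunk_indices is assumed to also yield the offset of each chunk
    // within the original text.
    for (offset, chunk) in splitter.chunk_indices("your document text", 1000) {
        println!("{offset}: {chunk}");
    }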

    Allow passing in tokenizers directly

    Rather than wrapping a tokenizer in another struct, you can instead just pass a tokenizer directly into TextSplitter::new.

    Bug Fixes

    Better handling of recursive paragraph chunking when both double and single newline splits are used.

  • v0.1.0 (May 5, 2023)

Owner
Ben Brandt
Lead UX Engineer at Aleph Alpha