A rule-based sentence segmentation library.

Last update: Jul 29, 2022

Overview

cutters

🚧 This library is experimental. 🚧

Features

Full UTF-8 support.
Robust parsing.
Language-specific rules (each language is defined by its own parsing expression grammar, or PEG).
Fast and memory-efficient parsing via the pest library.
Sentences can contain quotes, which in turn can contain subsentences.
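
To illustrate why language-specific rules matter, here is a minimal hand-written sketch. It is not the library's PEG grammar or implementation: the function split_with_abbreviations and the abbreviation list are hypothetical, and the only rule is that a terminal followed by whitespace ends a sentence unless the word it closes is a known abbreviation.

```rust
/// Hypothetical helper (not part of cutters): splits `text` at '.', '!' or '?'
/// when the terminal is followed by whitespace or the end of the input,
/// unless the word ending at the terminal is listed in `abbreviations`.
fn split_with_abbreviations<'a>(text: &'a str, abbreviations: &[&str]) -> Vec<&'a str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    let mut chars = text.char_indices().peekable();

    while let Some((i, c)) = chars.next() {
        if !matches!(c, '.' | '!' | '?') {
            continue;
        }
        let end = i + c.len_utf8();

        // A terminal glued to the next character (e.g. "prof.dr.") is not a boundary.
        let at_boundary = match chars.peek() {
            None => true,
            Some(&(_, next)) => next.is_whitespace(),
        };
        if !at_boundary {
            continue;
        }

        // Skip the boundary if the last word is a known abbreviation.
        let candidate = &text[start..end];
        let last_word = candidate.split_whitespace().last().unwrap_or("");
        if abbreviations.contains(&last_word) {
            continue;
        }

        let trimmed = candidate.trim();
        if !trimmed.is_empty() {
            sentences.push(trimmed);
        }
        start = end;
    }

    // Keep any trailing text that has no terminal.
    let rest = text[start..].trim();
    if !rest.is_empty() {
        sentences.push(rest);
    }
    sentences
}

fn main() {
    // Assumed, illustrative abbreviation list; a real rule set would be much larger.
    let abbreviations = ["St.", "prof."];
    let text = "St. Louis 9LX je događaj u svijetu šaha. To je prof. Ivan Horvat.";
    for sentence in split_with_abbreviations(text, &abbreviations) {
        println!("{sentence}");
    }
}
```

A fixed list like this breaks down quickly (an abbreviation such as "itd." can also end a sentence), which is one reason to express each language's rules as a full grammar.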

Bindings

Besides native Rust, bindings for the following programming languages are available:

Python

Supported languages

Croatian (standard)
English (standard)

There is also an additional Baseline "language" that simply splits the text on sentence terminals as defined by Unicode (UTF-8 is only the encoding and does not define them). It is intended for benchmarking.
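
A baseline of this kind can be approximated in a few lines. The sketch below is an illustration under stated assumptions, not the library's Baseline implementation: the name baseline_split is hypothetical, and it hard-codes only three terminals ('.', '!', '?') instead of the complete set of sentence terminals.

```rust
/// Hypothetical baseline (not the cutters implementation): cut after every
/// '.', '!' or '?' and trim surrounding whitespace from each piece.
fn baseline_split(text: &str) -> Vec<&str> {
    text.split_inclusive(|c: char| matches!(c, '.' | '!' | '?'))
        .map(str::trim)
        .filter(|piece| !piece.is_empty())
        .collect()
}

fn main() {
    let text = "Petar Krešimir IV. je vladao od 1058. do 1074.";
    // The ordinals "IV." and "1058." are wrongly treated as sentence ends.
    println!("{:?}", baseline_split(text));
}
```

On this input the sketch yields three fragments ("Petar Krešimir IV.", "je vladao od 1058.", "do 1074.") where the Croatian rules yield a single sentence, which shows why such a splitter is useful only as a reference point.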

Example

After adding cutters as a dependency in your Cargo.toml file, usage is simple: pass the text and a language to cutters::cut.

fn main() {
    // Croatian sample text containing ordinals, abbreviations, and a quotation.
    let text = r#"Petar Krešimir IV. je vladao od 1058. do 1074. St. Louis 9LX je događaj u svijetu šaha. To je prof.dr.sc. Ivan Horvat. Volim rock, punk, funk, pop itd. Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.""#;

    // Segment the text using the Croatian rules.
    let sentences = cutters::cut(text, cutters::Language::Croatian);

    // Pretty-print the resulting sentences.
    println!("{:#?}", sentences);
}

This results in the following output (note that the str fields of the structs are of type &str).

[
    Sentence {
        str: "Petar Krešimir IV. je vladao od 1058. do 1074. ",
        quotes: [],
    },
    Sentence {
        str: "St. Louis 9LX je događaj u svijetu šaha.",
        quotes: [],
    },
    Sentence {
        str: "To je prof.dr.sc. Ivan Horvat.",
        quotes: [],
    },
    Sentence {
        str: "Volim rock, punk, funk, pop itd.",
        quotes: [],
    },
    Sentence {
        str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
        quotes: [
            Quote {
                str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
                sentences: [
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            },
        ],
    },
]
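
The nested result can be walked like any other Rust data structure. The sketch below does not depend on the crate: it defines local stand-in types whose field names mirror the debug output above (an assumption, not the crate's actual definitions) and uses a hypothetical helper, quoted_subsentences, to collect every subsentence found inside a quote.

```rust
// Stand-in types mirroring the field names in the debug output.
// They are assumptions for illustration, not the definitions from cutters.
#[derive(Debug)]
struct Quote<'a> {
    str: &'a str,
    sentences: Vec<&'a str>,
}

#[derive(Debug)]
struct Sentence<'a> {
    str: &'a str,
    quotes: Vec<Quote<'a>>,
}

/// Hypothetical helper: flattens all subsentences contained in quotes.
fn quoted_subsentences<'a>(sentences: &[Sentence<'a>]) -> Vec<&'a str> {
    sentences
        .iter()
        .flat_map(|sentence| sentence.quotes.iter())
        .flat_map(|quote| quote.sentences.iter().copied())
        .collect()
}

fn main() {
    // Hand-built data shaped like the last two sentences of the output above.
    let sentences = vec![
        Sentence {
            str: "Volim rock, punk, funk, pop itd.",
            quotes: vec![],
        },
        Sentence {
            str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
            quotes: vec![Quote {
                str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
                sentences: vec![
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            }],
        },
    ];

    for sentence in &sentences {
        println!("sentence: {}", sentence.str);
        for quote in &sentence.quotes {
            println!("  quote: {}", quote.str);
        }
    }

    println!("{:#?}", quoted_subsentences(&sentences));
}
```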


