cutters

A rule-based sentence segmentation library.

🚧 This library is experimental. 🚧

Features

  • Full UTF-8 support.
  • Robust parsing.
  • Language-specific rules (each defined by its own PEG).
  • Fast and memory-efficient parsing via the pest library.
  • Sentences can contain quotes which can contain subsentences.

Bindings

Besides native Rust, bindings for the following programming languages are available:

Supported languages

  • Croatian (standard)
  • English (standard)

There is also an additional Baseline "language" that simply splits the text on sentence terminals as defined by Unicode (UTF-8 is only the encoding). It is intended for benchmarking.
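To illustrate what splitting purely on sentence terminals means, here is a minimal sketch using only the standard library. It is not the crate's Baseline implementation: the function name `naive_split` and the small hand-picked terminal set are assumptions made for this example.

```rust
/// Hypothetical helper, not part of the cutters API: cuts `text` after every
/// run of sentence terminals taken from a small, hand-picked set.
fn naive_split(text: &str) -> Vec<&str> {
    // Assumption: a tiny subset of terminals chosen for illustration only.
    const TERMINALS: [char; 4] = ['.', '!', '?', '…'];

    let mut sentences = Vec::new();
    let mut start = 0;
    let mut chars = text.char_indices().peekable();

    while let Some((i, c)) = chars.next() {
        if !TERMINALS.contains(&c) {
            continue;
        }
        // Keep consecutive terminals ("?!", "...") in the same sentence.
        let mut end = i + c.len_utf8();
        while let Some(&(j, next)) = chars.peek() {
            if !TERMINALS.contains(&next) {
                break;
            }
            end = j + next.len_utf8();
            chars.next();
        }
        let sentence = text[start..end].trim();
        if !sentence.is_empty() {
            sentences.push(sentence);
        }
        start = end;
    }

    // Trailing text without a terminal still counts as a sentence.
    let rest = text[start..].trim();
    if !rest.is_empty() {
        sentences.push(rest);
    }
    sentences
}

fn main() {
    let text = "St. Louis 9LX je događaj u svijetu šaha. Volim rock!";
    for sentence in naive_split(text) {
        println!("{}", sentence);
    }
}
```

A splitter like this cuts after the abbreviation "St.", which is the kind of mistake language-specific rules are meant to avoid.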

Example

After adding the cutters dependency to your Cargo.toml file, usage is simple.

fn main() {
    // Raw string literal, so the inner double quotes need no escaping.
    let text = r#"Petar Krešimir IV. je vladao od 1058. do 1074. St. Louis 9LX je događaj u svijetu šaha. To je prof.dr.sc. Ivan Horvat. Volim rock, punk, funk, pop itd. Tolstoj je napisao: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.""#;

    // Segment the text using the Croatian rule set.
    let sentences = cutters::cut(text, cutters::Language::Croatian);

    // Pretty-print the resulting sentences.
    println!("{:#?}", sentences);
}

This produces the following output (note that every str field is a &str).

[
    Sentence {
        str: "Petar Krešimir IV. je vladao od 1058. do 1074. ",
        quotes: [],
    },
    Sentence {
        str: "St. Louis 9LX je događaj u svijetu šaha.",
        quotes: [],
    },
    Sentence {
        str: "To je prof.dr.sc. Ivan Horvat.",
        quotes: [],
    },
    Sentence {
        str: "Volim rock, punk, funk, pop itd.",
        quotes: [],
    },
    Sentence {
        str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
        quotes: [
            Quote {
                str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
                sentences: [
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            },
        ],
    },
]
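
The nested shape of this output can be walked with ordinary iterators. The sketch below collects every subsentence found inside quotes. The `Sentence` and `Quote` types here are stand-ins written to mirror the printed output, not imports from the crate, and `quoted_subsentences` and `sample` are hypothetical helpers; the crate's real definitions may differ.

```rust
// Stand-in types mirroring the Debug output above. They are NOT imported
// from cutters; the crate's real definitions may differ.
#[allow(dead_code)]
#[derive(Debug)]
struct Quote<'a> {
    str: &'a str,
    sentences: Vec<&'a str>,
}

#[allow(dead_code)]
#[derive(Debug)]
struct Sentence<'a> {
    str: &'a str,
    quotes: Vec<Quote<'a>>,
}

/// Hypothetical helper: gathers every subsentence found inside quotes.
fn quoted_subsentences<'a>(sentences: &[Sentence<'a>]) -> Vec<&'a str> {
    sentences
        .iter()
        .flat_map(|s| s.quotes.iter())
        .flat_map(|q| q.sentences.iter().copied())
        .collect()
}

/// Hand-built sample data copied from the last two entries of the output.
fn sample() -> Vec<Sentence<'static>> {
    vec![
        Sentence {
            str: "Volim rock, punk, funk, pop itd.",
            quotes: vec![],
        },
        Sentence {
            str: "Tolstoj je napisao: \"Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.\"",
            quotes: vec![Quote {
                str: "Sve sretne obitelji nalik su jedna na drugu. Svaka nesretna obitelj nesretna je na svoj način.",
                sentences: vec![
                    "Sve sretne obitelji nalik su jedna na drugu.",
                    "Svaka nesretna obitelj nesretna je na svoj način.",
                ],
            }],
        },
    ]
}

fn main() {
    for sub in quoted_subsentences(&sample()) {
        println!("{}", sub);
    }
}
```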