Semantic text segmentation. For sentence boundary detection, compound splitting and more.

Overview

NNSplit

PyPI Crates.io npm CI License

A tool to split text using a neural network. The main application is sentence boundary detection, but, e.g., compound splitting for German is also supported.

Features

  • Robust: Not reliant on proper punctuation, spelling, and case. See the metrics.
  • Small: NNSplit uses a byte-level LSTM, so weights are small (< 4 MB) and models can be trained for every Unicode-encodable language.
  • Portable: NNSplit is written in Rust with bindings for Rust, Python, and JavaScript (browser and Node.js). See how to get started in the usage section.
  • Fast: Up to 2x faster than spaCy sentencization; see the benchmark.
  • Multilingual: NNSplit currently has models for 7 different languages (German, English, French, Norwegian, Swedish, Simplified Chinese, Turkish). Try them in the demo.

Documentation has moved to the NNSplit website: https://bminixhofer.github.io/nnsplit.
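As a quick illustration of the Python bindings, here is a minimal usage sketch. It assumes `pip install nnsplit` and is guarded so it degrades gracefully when the package (or the model download) is unavailable:

```python
# Minimal usage sketch for the Python bindings (assumes `pip install nnsplit`).
# The import and model load are guarded so the snippet also runs where
# nnsplit is absent or the pretrained model cannot be fetched.
def load_splitter(model_name="en"):
    try:
        from nnsplit import NNSplit
        # `load` fetches the pretrained model on first use, so network or
        # cache errors are also treated as "unavailable" here.
        return NNSplit.load(model_name)
    except Exception:
        return None

splitter = load_splitter()
if splitter is not None:
    # `split` takes a list of texts; each result iterates over sentences,
    # which stringify via `str(...)`.
    for sentence in splitter.split(["This is a test This is another test."])[0]:
        print(str(sentence))
```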

License

NNSplit is licensed under the MIT license.

Comments
  • Porting to Android

    Hi, I am trying to run the ONNX model on Android and have started with the steps described here: https://github.com/onnx/tutorials/blob/master/tutorials/PytorchCaffe2MobileSqueezeNet.ipynb

    import onnx
    import caffe2.python.onnx.backend
    from onnx import helper
    
    # Load the ONNX GraphProto object. Graph is a standard Python protobuf object
    model = onnx.load("model.onnx")
    

    Unfortunately, I receive an error:

    ---------------------------------------------------------------------------
    DecodeError                               Traceback (most recent call last)
    <ipython-input-8-0e15f43f99e0> in <module>()
          1 # Load the ONNX GraphProto object. Graph is a standard Python protobuf object
    ----> 2 model = onnx.load("model.onnx")
          3 
    
    2 frames
    /usr/local/lib/python3.6/dist-packages/onnx/__init__.py in _deserialize(s, proto)
         95                          '\ntype is {}'.format(type(proto)))
         96 
    ---> 97     decoded = cast(Optional[int], proto.ParseFromString(s))
         98     if decoded is not None and decoded != len(s):
         99         raise google.protobuf.message.DecodeError(
    
    DecodeError: Error parsing message
    

    Could you please tell me what the issue could be? I use the EN model and Google Colab.
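    A protobuf DecodeError from onnx.load usually means the bytes on disk are not the actual ONNX model, e.g. a saved HTML error page, or a Git LFS pointer file from a raw download. A quick sanity check (a hedged, stdlib-only sketch; the path is illustrative):

```python
# Sketch: guess why an .onnx file fails to parse as protobuf. A common
# culprit is having downloaded a Git LFS pointer or an HTML error page
# instead of the real model binary.
def sniff_model_file(path):
    with open(path, "rb") as f:
        head = f.read(256)
    if head.startswith(b"version https://git-lfs.github.com/spec/v1"):
        return "git-lfs-pointer"  # raw download of an LFS-tracked file
    if head.lstrip().lower().startswith((b"<!doctype", b"<html")):
        return "html-page"  # an error page was saved in place of the model
    return "unknown"  # may still be truncated or otherwise corrupt
```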

    opened by UncleLiaoNN 30
  • More language support.


    Hi, many thanks for your project.

    In the README, it says:

    Alternatively, you can also load your own model.

    Where can I find models for languages other than English and German? Or could you tell me how to train my own model for other languages, step by step? I'm happy to contribute more models.

    Thank you, Guangrui Wang

    enhancement 
    opened by aguang-xyz 12
  • Publish 0.3.x python wheels for Linux/non-macOS platforms


    After running into #13, I tried to use the Python bindings instead. It worked, but I noticed that it installed version 0.2.2 (I saw it didn't match up with the documentation in the README).

    After digging into it a little bit, I saw that 0.2.2 was the last release with a platform-agnostic wheel available. All 0.3.x wheels seem to be built specifically for macOS, and are not installable on my Linux/Ubuntu machine.

    I'm wondering if there are some easy adjustments that could be made to make publishing wheels for all platforms again possible (or at least Linux/Ubuntu :innocent: )?

    bug 
    opened by hobofan 9
  • Use ONNX models everywhere due to TorchScript instability


    Hey, there! I was trying to run the Rust example from the README, but got the following error on a cargo run:

    Error: Compat { error: TorchError { c_error: "The following operation failed in the TorchScript interpreter.\nTraceback of TorchScript, serialized code (most recent call last):\n  File \"code/__torch__/torch/nn/quantized/dynamic/modules/rnn.py\", line 195, in __setstate__\n    state: Tuple[Tuple[Tensor, Optional[Tensor]], bool]) -> None:\n    _72, _73, = (state)[0]\n    _74 = ops.quantized.linear_prepack(_72, _73)\n          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE\n    self.param = _74\n    self.training = (state)[1]\n\nTraceback of TorchScript, original code (most recent call last):\n  File \"/usr/local/lib/python3.6/dist-packages/torch/nn/quantized/dynamic/modules/rnn.py\", line 29, in __setstate__\n    @torch.jit.export\n    def __setstate__(self, state):\n        self.param = torch.ops.quantized.linear_prepack(*state[0])\n                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE\n        self.training = state[1]\nRuntimeError: Didn\'t find engine for operation quantized::linear_prepack NoQEngine\n" } }
    

    Let me know if there is any more info you need for debugging!

    bug 
    opened by hobofan 6
  • PanicException with 0.4.*


    After installing the new version, I get the following exception when running NNSplit.load("en").

    PanicException: called `Result::unwrap()` on an `Err` value: PyErr { type: Py(0x5632cc1b1140, PhantomData) }
    

    This occurs both with and without onnxruntime-gpu installed.

    bug 
    opened by MiniXC 5
  • Update session create


    onnxruntime has deprecated creating a session without setting a provider at creation time (the latest version fails). This PR fixes the session creation.
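    For reference, the fix amounts to passing an explicit providers list when creating the session. A hedged sketch (the helper name and model path are illustrative; the import is guarded since onnxruntime may not be installed):

```python
# Sketch: newer onnxruntime versions require an explicit `providers` list
# when creating an InferenceSession. `model_path` is illustrative.
def create_session(model_path, use_gpu=False):
    try:
        import onnxruntime as ort
    except ImportError:
        return None
    providers = ["CPUExecutionProvider"]
    if use_gpu:
        # Try CUDA first; onnxruntime falls back through the list.
        providers.insert(0, "CUDAExecutionProvider")
    return ort.InferenceSession(model_path, providers=providers)
```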

    opened by lvaughn 3
  • ImportError in Python (NNSplit)


    Hi, I was trying the simple example in Python from the documentation and I'm getting an ImportError:

    from nnsplit import NNSplit
    splitter = NNSplit.load("en")
    
    # returns `Split` objects
    splits = splitter.split(["This is a test This is another test."])[0]
    
    # a `Split` can be iterated over to yield smaller splits or stringified with `str(...)`.
    for sentence in splits:
       print(sentence)
    

    When executing this example I'm getting the following error:

    Traceback (most recent call last):
      File "nnsplit.py", line 1, in <module>
        from nnsplit import NNSplit
      File "G:\OneDrive\projects\s\nnsplit.py", line 1, in <module>
        from nnsplit import NNSplit
    ImportError: cannot import name 'NNSplit' from partially initialized module 'nnsplit' (most likely due to a circular import) (G:\OneDrive\projects\s\nnsplit.py)
    

    I have installed the packages in a new conda environment, executing pip list installed I have:

    pip list installed
    Package         Version
    --------------- -------------------
    certifi         2020.12.5
    nnsplit         0.5.7.post0
    numpy           1.20.3
    onnxruntime     1.7.0
    onnxruntime-gpu 1.7.0
    pip             21.1.1
    protobuf        3.17.1
    setuptools      52.0.0.post20210125
    six             1.16.0
    tqdm            4.61.0
    wheel           0.36.2
    wincertstore    0.2
    
    opened by albertovilla 3
  • Missing file in NPM package?


    I'm trying to import nnsplit in a JavaScript project, and webpack is failing with:

    ./node_modules/nnsplit/nnsplit.bundle/nnsplit_javascript_bg.wasm
    Module not found: Can't resolve './nnsplit_javascript_bg.js' in '/tmp/experiment/node_modules/nnsplit/nnsplit.bundle'
    

    Looking in node_modules/nnsplit/nnsplit.bundle, indeed the file nnsplit_javascript_bg.js is referenced by package.json, but missing from the filesystem.

    (Not sure though whether that's the real culprit, as the nodejs example seems to work as intended.)

    bug 
    opened by bard 3
  • Build Python 3.9 Wheels


    When trying to install into Python 3.9, it will not install a version later than 0.2.2. I am not certain, but I believe this is because wheels are only built for versions 3.6, 3.7, and 3.8. Would it be possible to add wheels for 3.9?

    opened by QuantumEntangledAndy 3
  • Unable to use own trained onnx models


    Hello and first of all: thank you for a great library!

    I've tried to train my own model on an unusual input data format, following the training Python notebook you've provided. However, after training, when trying to load the custom model via the NNSplit.load("en/model.onnx") call in the Python bindings, I get this:

    nnsplit.ResourceError: model not found: "en/model.onnx"

    I may be wrong, but it seems the current logic of model_loader.rs does not allow custom local paths, only the ones listed in models.csv:

    https://github.com/bminixhofer/nnsplit/blob/a5a15815382029bf5c3438fd4753f644847d4dbf/nnsplit/src/model_loader.rs#L59

    This effectively limits the available models to the pretrained ones.

    opened by synweap15 2
  • Security: update version of tract-onnx


    This security vulnerability:

    https://rustsec.org/advisories/RUSTSEC-2021-0073.html

    is fixed in prost==0.8.0, which is included in a recent new release of tract-onnx: https://github.com/sonos/tract/releases/tag/0.15.2

    Would it be possible to do a new release with the tract-onnx dependency bumped?

    opened by cjrh 2
  • `AttributeError: 'InferenceSession' object has no attribute '_providers' Segmentation fault (core dumped)`


    I was trying to segment sentences for my transcribing program, but I ran into this error the first time I tried using it.

    Full Error

    Traceback (most recent call last):
      File "/home/runner/Voice-Synthasizer/venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 280, in __init__
        self._create_inference_session(providers, provider_options)
      File "/home/runner/Voice-Synthasizer/venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 307, in _create_inference_session
        sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
    RuntimeError: /onnxruntime_src/onnxruntime/core/platform/posix/env.cc:142 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "speech.py", line 436, in <module>
        transcript, source_align_data = transcript_audio(input_path, True, transcript_path, granularity=granularity)
      File "speech.py", line 271, in transcript_audio
        sentence_segmenter = NNSplit.load("en")
      File "backend.py", line 6, in create_session
      File "/home/runner/Voice-Synthasizer/venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 283, in __init__
        print("EP Error using {}".format(self._providers))
    AttributeError: 'InferenceSession' object has no attribute '_providers'
    Segmentation fault (core dumped)
    
    opened by ErrorBot1122 0
  • Hi, pip install nnsplit doesn't work


    Hello, first of all, nnsplit is really cool, it's really great stuff. :) I'd really like to run nnsplit on my local computer, but an error occurs when I try to pip install nnsplit:

    ERROR: Could not find a version that satisfies the requirement nnsplit (from versions: 0.0.1, 0.1.0, 0.1.1, 0.1.2, 0.1.3, 0.1.4, 0.2.0, 0.2.1, 0.2.2)
    ERROR: No matching distribution found for nnsplit

    Can I get some help?

    opened by tartaron 1
  • Control where the model is downloaded to?


    Hi, this is more of a minor feature request. I'm trying to use NNSplit in a container, which has a read-only file system except for the /tmp dir. It would be groovy if one could provide a local path to load the model from / download it to. Perhaps this is in the Python interface already, but I couldn't see it.

    I know you can specify a path when calling NNSplit(), but this gets more complicated as I'm including it in a module that then gets included in another project.

    Anyway, nice work and thanks!

    opened by awhillas 3
  • ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found


    I run it on CentOS 7 with Python 3.8.3, and I just want to run it on CPU, not GPU. I get the following error:

    >>> import nnsplit
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /home/zyb/miniconda3/lib/python3.8/site-packages/nnsplit.cpython-38-x86_64-linux-gnu.so)
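    The error means the wheel's native extension requires glibc 2.27, while CentOS 7 ships glibc 2.17, so the import fails at load time. The runtime glibc can be checked from Python before installing (a stdlib-only sketch):

```python
import platform

def glibc_version():
    """Return the runtime glibc version string, or None on non-glibc systems."""
    lib, ver = platform.libc_ver()
    return ver if lib == "glibc" else None

# CentOS 7 ships glibc 2.17; a wheel requiring 2.27 cannot load there,
# so a newer base image (or building from source) is needed.
```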
    
    opened by v-yunbin 0
  • Simplified Chinese model does not detect sentence boundaries correctly


    Hi,

    I have tried the Simplified Chinese model on the demo page, and it seems that sentence boundary and token detection are not correct.

    I have 2 ideas why that could happen:

    1. Period in Chinese is 。
    2. There are no white spaces between words. Possibly it is better to use something like https://github.com/voidism/pywordseg to split on words as a preprocessing step

    It looks like issue 2 also causes tokens to be detected incorrectly. I have compared with https://github.com/voidism/pywordseg results and they do not match. But I am not sure here, because I have compared spaCy, pywordseg, and the Stanford Word Segmenter, and all of them give different results.
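    As a quick way to probe idea 1, a naive rule-based baseline that splits after Chinese sentence-final punctuation can be compared against the model's output (a sketch for comparison only, not what NNSplit does internally):

```python
import re

def split_zh_naive(text):
    # Naive baseline: cut after Chinese sentence-final punctuation 。！？
    parts = re.split(r"(?<=[。！？])", text)
    return [p for p in parts if p]
```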

    opened by marlon-br 0
Releases(0.5.8)
Owner: Benjamin Minixhofer, AI student at JKU Linz