Semantic text segmentation. For sentence boundary detection, compound splitting and more.

Benjamin Minixhofer

Last update: Dec 29, 2022

Related tags

Text processing javascript python rust machine-learning deep-learning pretrained-models sentence-boundary-detection compound-splitting

Overview

NNSplit

A tool to split text using a neural network. The main application is sentence boundary detection, but other tasks, such as compound splitting for German, are also supported.

Features

Robust: Not reliant on proper punctuation, spelling and case. See the metrics.
Small: NNSplit uses a byte-level LSTM, so weights are small (< 4MB) and models can be trained for every language that can be encoded in Unicode.
Portable: NNSplit is written in Rust with bindings for Rust, Python, and JavaScript (Browser and Node.js). See how to get started in the usage section.
Fast: Up to 2x faster than spaCy sentencization, see the benchmark.
Multilingual: NNSplit currently has models for 7 different languages (German, English, French, Norwegian, Swedish, Simplified Chinese, Turkish). Try them in the demo.

Documentation has moved to the NNSplit website: https://bminixhofer.github.io/nnsplit.

License

NNSplit is licensed under the MIT license.

Comments

Porting to Android

Hi, I am trying to run the ONNX model on Android and have started with the steps described here: https://github.com/onnx/tutorials/blob/master/tutorials/PytorchCaffe2MobileSqueezeNet.ipynb

import onnx
import caffe2.python.onnx.backend
from onnx import helper

# Load the ONNX GraphProto object. Graph is a standard Python protobuf object
model = onnx.load("model.onnx")

Unfortunately, I receive an error:

---------------------------------------------------------------------------
DecodeError                               Traceback (most recent call last)
<ipython-input-8-0e15f43f99e0> in <module>()
      1 # Load the ONNX GraphProto object. Graph is a standard Python protobuf object
----> 2 model = onnx.load("model.onnx")
      3 

2 frames
/usr/local/lib/python3.6/dist-packages/onnx/__init__.py in _deserialize(s, proto)
     95                          '\ntype is {}'.format(type(proto)))
     96 
---> 97     decoded = cast(Optional[int], proto.ParseFromString(s))
     98     if decoded is not None and decoded != len(s):
     99         raise google.protobuf.message.DecodeError(

DecodeError: Error parsing message

Could you please tell me what the issue could be? I am using the EN model and Google Colab.
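One thing worth ruling out before debugging further, offered as an assumption and not as a diagnosis of this report, is that the file on disk is not an ONNX protobuf at all. The sketch below looks at the first bytes of a file and guesses whether it is an HTML page, a Git LFS pointer, or a zip archive (the container format TorchScript uses). The function name describe_model_file is hypothetical and not part of NNSplit or onnx.

```python
from pathlib import Path


def describe_model_file(path):
    """Return a rough guess about what kind of file `path` is.

    Hypothetical helper for illustration; it only inspects the first bytes.
    """
    data = Path(path).read_bytes()[:256]
    if not data:
        return "empty file"
    head = data.lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "HTML page (the download probably saved a web page)"
    if head.startswith(b"version https://git-lfs"):
        return "Git LFS pointer (the real weights were not fetched)"
    if data.startswith(b"PK\x03\x04"):
        return "zip archive (for example a TorchScript file)"
    return "binary data (may be a valid ONNX protobuf)"
```

Calling describe_model_file("model.onnx") before onnx.load narrows down whether the parse error comes from the download step or from the model itself.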

opened by UncleLiaoNN 30

More language support.

Hi, many thanks for your project.

In the README, it says:

Alternatively, you can also load your own model.

Where can I find models for languages other than English and German? Or could you tell me, step by step, how to train my own model for other languages? I'm happy to contribute by providing more models.

Thank you, Guangrui Wang

enhancement

opened by aguang-xyz 12

Publish 0.3.x python wheels for Linux/non-macOS platforms

After running into #13, I tried to use the Python bindings instead. It worked, but I noticed that the installed version was 0.2.2 (it didn't match the documentation in the README).

After digging into it a little bit, I saw that 0.2.2 was the last release with a platform-agnostic wheel available. All 0.3.x wheels seem to be built specifically for macOS, and are not installable on my Linux/Ubuntu machine.

I'm wondering if there are some easy adjustments that could be made to make publishing wheels for all platforms again possible (or at least Linux/Ubuntu :innocent: )?
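To make the distinction between a platform-agnostic wheel and a platform-specific one concrete, here is a minimal sketch that reads the tags from a wheel filename, following the standard wheel naming scheme (name-version-python tag-ABI tag-platform tag). The helper names and the example filenames in the usage note are illustrative assumptions, not the actual files published for NNSplit.

```python
def parse_wheel_tags(filename):
    """Return (python_tag, abi_tag, platform_tag) from a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: %r" % filename)
    stem = filename[: -len(".whl")]
    # The last three dash-separated fields are always the compatibility tags.
    parts = stem.rsplit("-", 3)
    if len(parts) != 4:
        raise ValueError("unexpected wheel filename: %r" % filename)
    _, python_tag, abi_tag, platform_tag = parts
    return python_tag, abi_tag, platform_tag


def is_platform_agnostic(filename):
    """True if the wheel declares no ABI and no platform restriction."""
    _, abi_tag, platform_tag = parse_wheel_tags(filename)
    return abi_tag == "none" and platform_tag == "any"
```

For a hypothetical file example-0.2.2-py3-none-any.whl this reports a platform-agnostic wheel, while a hypothetical example-0.3.0-cp38-cp38-macosx_10_7_x86_64.whl is tied to CPython 3.8 on macOS, which pip on Linux will skip.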
bug

opened by hobofan 9

Use ONNX models everywhere due to TorchScript instability

Hey, there! I was trying to run the Rust example from the README, but got the following error on a cargo run:

Error: Compat { error: TorchError { c_error: "The following operation failed in the TorchScript interpreter.\nTraceback of TorchScript, serialized code (most recent call last):\n  File \"code/__torch__/torch/nn/quantized/dynamic/modules/rnn.py\", line 195, in __setstate__\n    state: Tuple[Tuple[Tensor, Optional[Tensor]], bool]) -> None:\n    _72, _73, = (state)[0]\n    _74 = ops.quantized.linear_prepack(_72, _73)\n          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE\n    self.param = _74\n    self.training = (state)[1]\n\nTraceback of TorchScript, original code (most recent call last):\n  File \"/usr/local/lib/python3.6/dist-packages/torch/nn/quantized/dynamic/modules/rnn.py\", line 29, in __setstate__\n    @torch.jit.export\n    def __setstate__(self, state):\n        self.param = torch.ops.quantized.linear_prepack(*state[0])\n                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE\n        self.training = state[1]\nRuntimeError: Didn\'t find engine for operation quantized::linear_prepack NoQEngine\n" } }

Let me know if there is any more info you need for debugging!

bug

opened by hobofan 6

PanicException with 0.4.*

After installing the new version, I get the following exception when running NNSplit.load("en").

PanicException: called `Result::unwrap()` on an `Err` value: PyErr { type: Py(0x5632cc1b1140, PhantomData) }

This occurs both with and without onnxruntime-gpu installed.

bug

opened by MiniXC 5

Update session create

onnxruntime has deprecated creating a session without setting a provider at creation time (the latest version fails). This PR fixes the session creation.
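As an illustration of what passing providers at creation time looks like, here is a minimal sketch. It is not the code from this PR: pick_providers and create_session are hypothetical helpers, and the preference order (CUDA first, then CPU) is an assumption. The onnxruntime calls used are get_available_providers() and the providers argument of InferenceSession.

```python
def pick_providers(available,
                   preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Return the preferred providers that are actually available, in order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError("none of %r available in %r" % (preferred, available))
    return chosen


def create_session(model_path):
    """Create an InferenceSession with an explicit provider list."""
    # Imported lazily so that pick_providers can be used without onnxruntime.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

On a machine with only the CPU build installed, pick_providers returns just the CPU provider, so the same call works with and without onnxruntime-gpu.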

opened by lvaughn 3

ImportError in Python (NNSplit)

Hi, I was trying the simple example in Python from the documentation and I'm getting an ImportError:

from nnsplit import NNSplit
splitter = NNSplit.load("en")

# returns `Split` objects
splits = splitter.split(["This is a test This is another test."])[0]

# a `Split` can be iterated over to yield smaller splits or stringified with `str(...)`.
for sentence in splits:
   print(sentence)

When executing this example I'm getting the following error:

Traceback (most recent call last):
  File "nnsplit.py", line 1, in <module>
    from nnsplit import NNSplit
  File "G:\OneDrive\projects\s\nnsplit.py", line 1, in <module>
    from nnsplit import NNSplit
ImportError: cannot import name 'NNSplit' from partially initialized module 'nnsplit' (most likely due to a circular import) (G:\OneDrive\projects\s\nnsplit.py)

I have installed the packages in a new conda environment. Running pip list installed gives:

pip list installed
Package         Version
--------------- -------------------
certifi         2020.12.5
nnsplit         0.5.7.post0
numpy           1.20.3
onnxruntime     1.7.0
onnxruntime-gpu 1.7.0
pip             21.1.1
protobuf        3.17.1
setuptools      52.0.0.post20210125
six             1.16.0
tqdm            4.61.0
wheel           0.36.2
wincertstore    0.2
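The traceback above shows that the script being run is itself called nnsplit.py, so "from nnsplit import NNSplit" resolves to the script and not to the installed package. A minimal sketch for spotting this kind of name clash follows; find_shadowing_file is a hypothetical helper, not something shipped with NNSplit.

```python
from pathlib import Path


def find_shadowing_file(module_name, directory="."):
    """Return a local file that would shadow `module_name`, or None.

    Checks for `<module_name>.py` and `<module_name>/__init__.py` in
    `directory`, which Python finds before an installed package when the
    script is run from that directory.
    """
    base = Path(directory)
    candidates = [
        base / (module_name + ".py"),
        base / module_name / "__init__.py",
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None
```

If find_shadowing_file("nnsplit") returns a path, renaming that file (and deleting any stale __pycache__ entry for it) lets the import reach the real package.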

opened by albertovilla 3

Missing file in NPM package?

I'm trying to import nnsplit in a JavaScript project, and webpack is failing with:

./node_modules/nnsplit/nnsplit.bundle/nnsplit_javascript_bg.wasm Module not found: Can't resolve './nnsplit_javascript_bg.js' in '/tmp/experiment/node_modules/nnsplit/nnsplit.bundle'

Looking in node_modules/nnsplit/nnsplit.bundle, indeed the file nnsplit_javascript_bg.js is referenced by package.json, but missing from the filesystem.

(Not sure though whether that's the real culprit, as the nodejs example seems to work as intended.)

bug

opened by bard 3

Build Python 3.9 Wheels

When trying to install into Python 3.9, pip will not install a version later than 0.2.2. I am not certain, but I believe this is because wheels are only built for versions 3.6, 3.7 and 3.8. Would it be possible to add wheels for version 3.9?

opened by QuantumEntangledAndy 3

Unable to use own trained onnx models

Hello and first of all: thank you for a great library!

I've tried to train my own model using an unusual input data format following the train Python notebook you've provided. However, after the training, when trying to load the custom model via NNSplit.load("en/model.onnx") call in python bindings, I get this:

nnsplit.ResourceError: model not found: "en/model.onnx"

I may be wrong, but it seems the current logic of model_loader.rs does not allow custom local paths, only the ones that are listed in the models.csv:

https://github.com/bminixhofer/nnsplit/blob/a5a15815382029bf5c3438fd4753f644847d4dbf/nnsplit/src/model_loader.rs#L59

Effectively limiting the available models to the pretrained ones.

opened by synweap15 2

Security: update version of tract-onnx

This security vulnerability:

https://rustsec.org/advisories/RUSTSEC-2021-0073.html

is fixed in prost==0.8.0, which is included in a recent new release of tract-onnx: https://github.com/sonos/tract/releases/tag/0.15.2

Would it be possible to do a new release with the tract-onnx dependency bumped?

opened by cjrh 2

`AttributeError: 'InferenceSession' object has no attribute '_providers' Segmentation fault (core dumped)`

I was trying to segment sentences for my transcription program, but I ran into this error the first time I tried using it.

Full Error

Traceback (most recent call last):
  File "/home/runner/Voice-Synthasizer/venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 280, in __init__
    self._create_inference_session(providers, provider_options)
  File "/home/runner/Voice-Synthasizer/venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 307, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
RuntimeError: /onnxruntime_src/onnxruntime/core/platform/posix/env.cc:142 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "speech.py", line 436, in <module>
    transcript, source_align_data = transcript_audio(input_path, True, transcript_path, granularity=granularity)
  File "speech.py", line 271, in transcript_audio
    sentence_segmenter = NNSplit.load("en")
  File "backend.py", line 6, in create_session
  File "/home/runner/Voice-Synthasizer/venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 283, in __init__
    print("EP Error using {}".format(self._providers))
AttributeError: 'InferenceSession' object has no attribute '_providers'
Segmentation fault (core dumped)

opened by ErrorBot1122 0

Hi, pip install nnsplit doesn't work

Hello, first of all, nnsplit is really cool, it's really great stuff. :) I'd really like to run nnsplit on my local computer, but an error occurs when I try to pip install nnsplit:

ERROR: Could not find a version that satisfies the requirement nnsplit (from versions: 0.0.1, 0.1.0, 0.1.1, 0.1.2, 0.1.3, 0.1.4, 0.2.0, 0.2.1, 0.2.2)
ERROR: No matching distribution found for nnsplit

Can I get some help?

opened by tartaron 1

Control where the model is downloaded to?

Hi, this is more of a minor feature request. I'm trying to use NNSplit in a container, which has a read-only file system except for the /tmp dir. It would be groovy if one could provide a local path to load the model from and download it to. Perhaps this is in the Python interface already, but I couldn't see it.

I know you can specify a path when calling NNSplit(), but this gets more complicated as I'm including it in a module that then gets included in another project.

Anyway, nice work and thanks!

opened by awhillas 3

ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found

I am running it on CentOS 7 with Python 3.8.3, and I just want to run it on the CPU, not the GPU. I get the following error:

>>> import nnsplit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /home/zyb/miniconda3/lib/python3.8/site-packages/nnsplit.cpython-38-x86_64-linux-gnu.so)
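The error says the compiled extension needs glibc 2.27 or newer, and CentOS 7 ships an older glibc. A minimal sketch for checking the local C library version from Python, using the standard library call platform.libc_ver(); the helper names are hypothetical and the required version is taken from the error message above.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.27' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_is_at_least(required, libc=None):
    """Return True/False for glibc systems, or None if glibc is not detected.

    `libc` defaults to platform.libc_ver(); it can be passed explicitly,
    for example ("glibc", "2.17"), to check a version seen on another host.
    """
    name, version = libc if libc is not None else platform.libc_ver()
    if name != "glibc" or not version:
        return None
    return parse_version(version) >= parse_version(required)
```

If glibc_is_at_least("2.27") returns False, the prebuilt binary cannot load on that system, independent of whether the CPU or GPU is used.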

opened by v-yunbin 0

Simplified Chinese model does not detect sentence boundaries correctly

Hi,

I have tried the Simplified Chinese model on the demo page, and it seems that the detection of sentence boundaries and tokens is not correct.

I have 2 ideas why that could happen:

1. The period in Chinese is 。

2. There are no white spaces between words. Possibly it is better to use something like https://github.com/voidism/pywordseg to split into words as a preprocessing step.

It looks like idea 2 also explains why tokens are not detected correctly. I have compared with the https://github.com/voidism/pywordseg results and they do not match. But I am not sure here, because I have compared spaCy, pywordseg and Stanford Word Segmenter, and all of them provide different results.
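For comparing the model's output against something simple, a punctuation-only baseline can be sketched as below. This is not how NNSplit works (it is designed not to rely on punctuation); it is a naive reference splitter that cuts after the full-width period, exclamation mark and question mark. The function name is hypothetical.

```python
import re

# One or more non-terminator characters followed by any run of
# full-width sentence terminators (。 ！ ？).
_SENTENCE = re.compile(r"[^。！？]+[。！？]*")


def split_on_fullwidth_punctuation(text):
    """Naive baseline: split Chinese text after 。, ！ or ？."""
    return [s for s in _SENTENCE.findall(text) if s.strip()]
```

Running both this baseline and the model on the same punctuated text shows quickly whether the disagreement is about sentence boundaries or only about token boundaries.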
opened by marlon-br 0

Releases(0.5.8)

0.5.8(Jul 23, 2021)
Updates tract to version 0.15.2

0.5.7(Mar 16, 2021)
Adds Python 3.9 support (#25)

CI failed spuriously for Release 0.5.6.

0.5.6(Mar 16, 2021)
Adds Python 3.9 support (#25)

0.5.5(Feb 13, 2021)
Fixes missing nnsplit_javascript_bg.js (#27)

0.5.4(Feb 6, 2021)
Fixes Release 0.5.3 not being uploaded to PyPI

0.5.3(Feb 6, 2021)
Updates Rust dependencies

Adds support for Russian (ru) and Ukrainian (uk)

0.5.2(Nov 1, 2020)
Split sequence data is now stored in the ONNX file instead of being hardcoded: https://github.com/bminixhofer/nnsplit/pull/21

Added verbose argument to the split(..) method of the Python bindings to display a progress bar

Retrained Chinese model with properly removed punctuation

Retrained German model with Compound Splitting as additional split level

docs.rs documentation now has all features enabled

Added methods to get the levels of the current models:

Python: splitter.get_levels()
JS: splitter.getLevels()
Rust: splitter.logic().split_sequence().get_levels()

NNSplit now has a website with demo, benchmarks and metrics! https://bminixhofer.github.io/nnsplit/

0.5.1(Oct 20, 2020)

Introduce model versioning: With the new model architecture, old Rust releases broke because models were always fetched from the master branch. Sorry! Now they are versioned along with the library so this won't happen again. Please upgrade to this version to use the new models.

Update German and English models.

0.5.0(Oct 18, 2020)
Add five new languages:

Norwegian

Swedish

Turkish

Chinese

French

Retrain all models with a new downsampling trick, which improves accuracy significantly at roughly the same speed.

0.4.12(Sep 22, 2020)

Add missing sigmoid to JS.

0.4.10(Sep 21, 2020)

0.4.9(Sep 21, 2020)

Testing release CI.

0.4.8(Sep 21, 2020)

Testing release CI.