nlprule

Overview

A fast, low-resource Natural Language Processing and Error Correction library written in Rust. nlprule implements a rule- and lookup-based approach to NLP using resources from LanguageTool.

Python Usage

Install: pip install nlprule

Use:

from nlprule import Tokenizer, Rules

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer)
rules.correct("He wants that you send him an email.")
# returns: 'He wants you to send him an email.'

rules.correct("I can due his homework.")
# returns: 'I can do his homework.'

for s in rules.suggest("She was not been here since Monday."):
    print(s.start, s.end, s.replacements, s.source, s.message)
# prints:
# 4 16 ['was not', 'has not been'] WAS_BEEN.1 Did you mean was not or has not been?

for sentence in tokenizer.pipe("A brief example is shown."):
    for token in sentence:
        print(
            repr(token.text).ljust(10),
            repr(token.span).ljust(10),
            repr(token.tags).ljust(24),
            repr(token.lemmas).ljust(24),
            repr(token.chunks).ljust(24),
        )
# prints:
# ''         (0, 0)     ['SENT_START']           []                       []                      
# 'A'        (0, 1)     ['DT']                   ['A', 'a']               ['B-NP-singular']       
# 'brief'    (2, 7)     ['JJ']                   ['brief']                ['I-NP-singular']       
# 'example'  (8, 15)    ['NN:UN']                ['example']              ['E-NP-singular']       
# 'is'       (16, 18)   ['VBZ']                  ['be', 'is']             ['B-VP']                
# 'shown'    (19, 24)   ['VBN']                  ['show', 'shown']        ['I-VP']                
# '.'        (24, 25)   ['.', 'PCT', 'SENT_END'] ['.']                    ['O']

Rust Usage

Recommended setup:

Cargo.toml

[dependencies]
nlprule = "<version>"

[build-dependencies]
nlprule-build = "<version>" # must be the same as the nlprule version!

build.rs

fn main() {
    println!("cargo:rerun-if-changed=build.rs");

    nlprule_build::BinaryBuilder::new(
        &["en"],
        std::env::var("OUT_DIR").expect("OUT_DIR is set when build.rs is running"),
    )
    .build()
    .validate();
}

src/main.rs

use nlprule::{Rules, Tokenizer, tokenizer_filename, rules_filename};

fn main() {
    let mut tokenizer_bytes: &'static [u8] = include_bytes!(concat!(
        env!("OUT_DIR"),
        "/",
        tokenizer_filename!("en")
    ));
    let mut rules_bytes: &'static [u8] = include_bytes!(concat!(
        env!("OUT_DIR"),
        "/",
        rules_filename!("en")
    ));

    let tokenizer = Tokenizer::from_reader(&mut tokenizer_bytes).expect("tokenizer binary is valid");
    let rules = Rules::from_reader(&mut rules_bytes).expect("rules binary is valid");

    assert_eq!(
        rules.correct("She was not been here since Monday.", &tokenizer),
        String::from("She was not here since Monday.")
    );
}
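
If you prefer not to embed the binaries at compile time, they can also be loaded from disk at runtime. A minimal sketch, assuming previously built binaries at placeholder paths; suggest returns the raw suggestions instead of applying them:

use nlprule::{Rules, Tokenizer};

fn main() {
    // `Tokenizer::new` / `Rules::new` read a binary from a path at runtime.
    let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin").expect("tokenizer binary is valid");
    let rules = Rules::new("path/to/en_rules.bin").expect("rules binary is valid");

    // Inspect suggestions instead of applying them directly.
    for suggestion in rules.suggest("She was not been here since Monday.", &tokenizer) {
        println!("{:?}", suggestion);
    }
}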

nlprule and nlprule-build versions are kept in sync.

Main features

  • Rule-based Grammatical Error Correction through several thousand rules.
  • A text processing pipeline for sentence segmentation, part-of-speech tagging, lemmatization, chunking and disambiguation.
  • Support for English, German and Spanish.
  • Spellchecking (in progress).

Goals

  • A single place to apply spellchecking and grammatical error correction for a downstream task.
  • Fast, low-resource NLP suited for running:
    1. as a pre-/postprocessing step for more sophisticated (i.e. ML) approaches.
    2. in the background of another application with low overhead.
    3. client-side in the browser via WebAssembly.
  • 100% Rust code and dependencies.

Comparison to LanguageTool

| Language | Disambiguation rules | Grammar rules | LT version | nlprule time | LanguageTool time |
|----------|----------------------|---------------|------------|--------------|-------------------|
| English  | 843 (100%)           | 3725 (~ 85%)  | 5.2        | 1            | 1.7 - 2.0         |
| German   | 486 (100%)           | 2970 (~ 90%)  | 5.2        | 1            | 2.4 - 2.8         |

Spanish has experimental support and is not fully tested yet.

See the benchmark issue for details.

Projects using nlprule

  • prosemd: a proofreading and linting language server for markdown files with VSCode integration.
  • cargo-spellcheck: a tool to check all your Rust documentation for spelling and grammar mistakes.

Please submit a PR to add your project!

Acknowledgements

All credit for the resources used in nlprule goes to LanguageTool who have made a Herculean effort to create high-quality resources for Grammatical Error Correction and broader NLP.

License

nlprule is licensed under the MIT license or Apache-2.0 license, at your option.

The nlprule binaries (*.bin) are derived from LanguageTool v5.2 and licensed under the LGPLv2.1 license. nlprule statically and dynamically links to these binaries. Under LGPLv2.1 §6(a) this does not have any implications on the license of nlprule itself.

Comments
  • API to include the correct binaries at compile time

    Hey, nice library! I am currently checking what would be needed to replace the current LanguageTool backend in https://github.com/drahnr/cargo-spellcheck.

    There are a few things which would need to be addressed; the most important is to avoid the need for https://github.com/bminixhofer/nlprule/blob/master/scripts/build.sh. A compile feature could gate a build.rs file which would prep the data, which in turn could be included via include_bytes!. That way, one can locally source the data at compile time and include the source files within the binary, with optional external overrides. Another thing that would be nice is documentation on how to obtain the referenced dumps.

    Looking forward :100:

    documentation enhancement P2 
    opened by drahnr 23
  • refactor/transform: introduce transform and improve error handling

    Note that this PR still lacks the required test adjustments, and transform is not covered yet.

    Changes:

    • improve the error type
    • add fs-err for better errors without much chore
    • introduce fn transform for transformations before artifacts hit the cache_dir (this is not quite correct now, tbd)
    • introduce a type alias type Result<T> to avoid boilerplate
    • use BufReader and BufWriter instead of intermediate Vec where possible
    • migrate a few elements to more idiomatic expressions
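
    For illustration, the kind of alias and buffered I/O the list refers to (a sketch; the names and the error type are assumptions, not the PR's actual code):

    use std::fs::File;
    use std::io::{BufReader, BufWriter};

    // One crate-wide alias avoids repeating the error type in every signature.
    pub type Result<T> = std::result::Result<T, std::io::Error>;

    // Stream from reader to writer instead of collecting into an intermediate Vec.
    pub fn copy_binary(src: &str, dst: &str) -> Result<u64> {
        let mut reader = BufReader::new(File::open(src)?);
        let mut writer = BufWriter::new(File::create(dst)?);
        std::io::copy(&mut reader, &mut writer)
    }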
    opened by drahnr 9
  • Support for older glibc

    Hi, first off thank you for this library, it's the only non-java languagetool alternative I've found.

    Unfortunately, I am receiving an error when trying to use it: ImportError: /lib/x86_64-linux-gnu/libm.so.6: version 'GLIBC_2.27' not found (required by python/lib/python3.8/site-packages/nlprule.cpython-38-x86_64-linux-gnu.so)

    I'm on a hosting environment where I don't have access to upgrade system libraries, so I can't just upgrade glibc. The current version is glibc 2.19.

    Is glibc 2.27 a hard requirement, or is there a way to specify an older version of glibc?

    I have a feeling this is a Rust-specific issue, but I am new to Rust and not familiar with its environment.

    Thanks

    bug 
    opened by dvwright 8
  • Cache the compressed artifacts

    In order to include the .bin artifacts in a repository and craft releases / publish with cargo, the sources may not be larger than 10 MB, or failures like:

    error: api errors (status 200 OK): max upload size is: 10485760
    

    will pop up.

    The simplest path is to cache the compressed artifacts rather than the uncompressed and decompress at runtime. An optional builder API could be used to load compressed or decompressed .bin variants.
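
    A sketch of the runtime-decompression half, assuming a gzip-compressed rules binary (the file name and the flate2 dependency are assumptions, not an existing nlprule API):

    use flate2::read::GzDecoder;
    use nlprule::Rules;

    fn load_rules() -> Rules {
        // The compressed binary stays under the upload limit;
        // it is decompressed on the fly while deserializing.
        let compressed: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/en_rules.bin.gz"));
        let mut decoder = GzDecoder::new(compressed);
        Rules::from_reader(&mut decoder).expect("valid rules binary")
    }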

    enhancement good first issue 
    opened by drahnr 8
  • postprocess has different semantic than anticipated

    So the semantics for my use case as defined in #27 are reversed.

    nlprule-data/0.4.4/en/en_tokenizer.bin target/debug/build/cargo-spellcheck-2b832a17a2fec7ef/out/en_tokenizer.bin.brotli

    What the use case described in #27 would require is the ability to apply compression before storing the artifact in the cache dir, and then to uncompress it for the target/debug/....

    Reasoning: when uploading with cargo, it picks a subset of the git tree; the size of the binary is not relevant.

    I think adding a secondary fn cache_preprocess() would work, so I can compress the artifact before it is stored to $cache_dir, and then decompress it as part of the current fn postprocess() so it ends up only as bincode-encoded data in $OUT_DIR, from where it can be included in the binary. A sketch follows below.
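
    A sketch of that flow (cache_preprocess / postprocess are the hooks proposed here, not an existing nlprule-build API; compression via the flate2 crate is an assumption):

    use flate2::{read::GzDecoder, write::GzEncoder, Compression};
    use std::io::{Read, Write};

    // Compress before the artifact is stored in $cache_dir...
    fn cache_preprocess(raw: &[u8]) -> std::io::Result<Vec<u8>> {
        let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
        encoder.write_all(raw)?;
        encoder.finish()
    }

    // ...and decompress in postprocess, so only the plain bincode-encoded
    // binary ends up in $OUT_DIR.
    fn postprocess(compressed: &[u8]) -> std::io::Result<Vec<u8>> {
        let mut raw = Vec::new();
        GzDecoder::new(compressed).read_to_end(&mut raw)?;
        Ok(raw)
    }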

    opened by drahnr 7
  • switch regex engine from oniguruma to fancy-regex

    I would like to switch from rust-onig to fancy-regex.

    This would probably come with a speedup and remove the last non-Rust dependency. This is nice in general and would enable compiling to WebAssembly.

    Changing this in nlprule would be easy, but it is currently blocked by https://github.com/fancy-regex/fancy-regex/issues/59 and https://github.com/fancy-regex/fancy-regex/issues/49.

    enhancement good first issue P2 
    opened by bminixhofer 7
  • Improve tagger: Return iterators over `WordData`, remove groups, parallelize deserialization

    I had another look at the tagger today. This PR:

    • Changes all the get_tags_* methods to return iterators instead of Vec.
    • Removes the groups. These were only used by nlprule in the PosReplacer which wasn't used anywhere as it is not fully implemented. Some of the currently unimplemented rules might need the groups in some form though, but we can probably get away with search + caching since the groups are only needed if a rule actually matches there.
    • Iterates over the FST in parallel in chunks with disjoint words, this allows populating the tags without synchronization.
    • Replaces the word HashMap with a Vec since the IDs go from zero to n_words anyway, so we don't need to hash anything.
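
    A generic illustration of the last point (illustrative types, not nlprule's actual ones): when IDs are dense integers from 0 to n_words, indexing a Vec replaces hashing entirely.

    // Word IDs are assigned densely from zero, so the ID itself is the index.
    struct TagStore(Vec<Vec<String>>);

    impl TagStore {
        fn tags(&self, word_id: usize) -> &[String] {
            &self.0[word_id]
        }
    }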

    I see another ~ 30% speedup in loading the Tokenizer. This could also have a positive impact on rule-checking speed, but there's some weird behavior in the local benchmark on my PC, so I have to double-check.

    @drahnr you might be interested in this PR. It would also be great if you could double check the speedup.

    opened by bminixhofer 6
  • oob access since 0.5.3

    Since attempting to upgrade to 0.5.3, it consistently segfaults in https://github.com/bminixhofer/nlprule/blob/main/nlprule/src/rule/engine/composition.rs#L345-L347

    See https://ci.spearow.io/teams/main/pipelines/cargo-spellcheck/jobs/pr-validate/builds/26

    But the bug is in line 344, where you should push the length in chars, not in bytes.
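
    The distinction in question, illustratively (for multi-byte UTF-8, byte length and char count diverge):

    let s = "naïve";
    assert_eq!(s.len(), 6);           // length in bytes ("ï" is 2 bytes in UTF-8)
    assert_eq!(s.chars().count(), 5); // length in chars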

    opened by drahnr 6
  • Token as returned by pipe() is relative to the sentence boundaries

    // Token<'_>
        pub char_span: (usize, usize),
        pub byte_span: (usize, usize),
    

    Using fn pipe() returns a set of tokens that includes spans relative to the sentence, but there seems to be no trivial way of retrieving the spans within the original text provided to pipe.

    Suggestion: use a Range<usize> instead of a tuple for the relevant range of bytes/characters for easier usage, and make it relative to the input text.

    For single sentences there is no change in semantics; for multi-sentence inputs there is.

    It would also make sense to add the respective bounds in bytes and chars of the sentence (or replace the sentence entirely).

    pub sentence: &'t str,
    

    Related cargo spellcheck issue https://github.com/drahnr/cargo-spellcheck/pull/162
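
    Put together, the proposed shape might look like this (a sketch of the suggestion, not the actual nlprule type):

    pub struct Token<'t> {
        // Ranges relative to the full input text, not the sentence.
        pub char_span: std::ops::Range<usize>,
        pub byte_span: std::ops::Range<usize>,
        // The sentence this token belongs to.
        pub sentence: &'t str,
    }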

    opened by drahnr 6
  • panic in `Regex::regex()`

    thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /tmp/build/56ca5ece/git-pull-request-resource/../cargo/registry/src/github.com-1ecc6299db9ec823/nlprule-0.6.2/src/utils/regex.rs:78:33
    stack backtrace:
       0:     0x560ee78bdfd0 - std::backtrace_rs::backtrace::libunwind::trace::h5e9d00f0cdf4f57e
                                   at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
       1:     0x560ee78bdfd0 - std::backtrace_rs::backtrace::trace_unsynchronized::hd5302bd66215dab9
                                   at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5

    There is an .unwrap() on the regex.borrow() call which panics.

    https://ci.spearow.io/teams/main/pipelines/cargo-spellcheck/jobs/pr-validate/builds/45

    opened by drahnr 5
  • License of extracted rules

    I had a brief look into the licensing of LanguageTool's rules, specifically whether they are permitted to be distributed under licenses other than that of the LanguageTool library itself, which is LGPLv2.1.

    Mostly in relation to #12, which would render the whole idea of including it at compile time rather pointless for most applications.

    opened by drahnr 5
  • Single Or Plural

    Hi! Thanks for the great project. I'm working with code generation, so I need further grammar corrections on the generated code. I found that this toolkit is unable to respond to such simple grammatical knowledge as whether a noun is in singular or plural form.

    opened by MT010104 0
  • Be more responsible about network requests

    When I tried entering an invalid language code to confirm that there's a Python exception I need to handle if the language code selected in my existing Enchant-based infrastructure isn't supported by nlprule, I got this very surprising error message:

    ValueError: HTTP status client error (404 Not Found) for url (https://github.com/bminixhofer/nlprule/releases/download/0.6.4/ef_tokenizer.bin.gz)
    

    Personally, I consider it very irresponsible to not warn people that a dependency is going to perform network requests under some circumstances, nor to provide an obvious way to handle things offline.

    I highly recommend you change this and, for my own use, since I tend to incorporate PyO3-based stuff into my PyQt apps anyway, I think I'll probably switch to writing my own nlprule wrapper so I can trust that, if no network libraries show up in the Cargo.lock, and the author isn't being actively malicious, then what I build will work on an airgapped machine or in a networkless sandbox.

    (Seriously. Sandboxes like Flatpak are becoming more and more common. Just assuming applications will have network access is not cool.)

    opened by ssokolow 2
  • Document how to load custom rulesets

    I have a project where I'd prefer not to reinvent nlprule for applying my custom grammar rules (common validly-spelled typos I see in fanfiction), but the documentation is very unclear on how to do anything with custom rules.

    1. In a PyQt application, how do I specify files by path like with the Rust API?
    2. How do I go from the raw LanguageTool XML to the .bin files?
    3. Do I need to do multiple passes with different nlprule instances if I also want to check regular grammar stuff or is there a way to merge rulesets?
    opened by ssokolow 4
  • Clarify license statement

    Can you clarify this phrasing?

    The nlprule binaries (*.bin) are derived from LanguageTool v5.2 and licensed under the LGPLv2.1 license. nlprule statically and dynamically links to these binaries. Under LGPLv2.1 §6(a) this does not have any implications on the license of nlprule itself.

    ...because:

    1. I don't see any sign of static or dynamic linking in the sense the LGPL considers... just aggregating assets, similar to how you can use runtime-loaded CC-BY-SA art assets in a game with GPLed code without there being a license conflict, as long as you don't embed the art assets inside the binary or otherwise make the binary unavoidably dependent on the assets (e.g. compiling in a hash check that will fail if someone swaps in new .png files).

    2. When people see "statically and dynamically links" and "LGPL", they get concerned, because Rust statically links all its code; so, if you statically link your LGPLed stuff into nlprule and you statically link nlprule into a Rust binary, then that Rust binary must be distributed in accordance with the LGPL's requirement that it be possible to swap out the LGPLed components with modified versions... and Rust doesn't have a stable ABI to facilitate that without sharing the source.

    I've actually seen people warn other people away from nlprule in favour of some more recent bindings for the LanguageTool HTTP API because "nlprule statically links to LGPLed stuff, which means your Rust binaries must be released under the LGPL, GPL, or AGPL".

    opened by ssokolow 5
  • Bump lxml from 4.6.3 to 4.9.1 in /build

    Bumps lxml from 4.6.3 to 4.9.1.

    Changelog

    Sourced from lxml's changelog.

    4.9.1 (2022-07-01)

    Bugs fixed

    • A crash was resolved when using iterwalk() (or canonicalize()) after parsing certain incorrect input. Note that iterwalk() can crash on valid input parsed with the same parser after failing to parse the incorrect input.

    4.9.0 (2022-06-01)

    Bugs fixed

    • GH#341: The mixin inheritance order in lxml.html was corrected. Patch by xmo-odoo.

    Other changes

    • Built with Cython 0.29.30 to adapt to changes in Python 3.11 and 3.12.

    • Wheels include zlib 1.2.12, libxml2 2.9.14 and libxslt 1.1.35 (libxml2 2.9.12+ and libxslt 1.1.34 on Windows).

    • GH#343: Windows-AArch64 build support in Visual Studio. Patch by Steve Dower.

    4.8.0 (2022-02-17)

    Features added

    • GH#337: Path-like objects are now supported throughout the API instead of just strings. Patch by Henning Janssen.

    • The ElementMaker now supports QName values as tags, which always override the default namespace of the factory.

    Bugs fixed

    • GH#338: In lxml.objectify, the XSI float annotation "nan" and "inf" were spelled in lower case, whereas XML Schema datatypes define them as "NaN" and "INF" respectively.

    ... (truncated)

    Commits
    • d01872c Prevent parse failure in new test from leaking into later test runs.
    • d65e632 Prepare release of lxml 4.9.1.
    • 86368e9 Fix a crash when incorrect parser input occurs together with usages of iterwa...
    • 50c2764 Delete unused Travis CI config and reference in docs (GH-345)
    • 8f0bf2d Try to speed up the musllinux AArch64 build by splitting the different CPytho...
    • b9f7074 Remove debug print from test.
    • b224e0f Try to install 'xz' in wheel builds, if available, since it's now needed to e...
    • 897ebfa Update macOS deployment target version from 10.14 to 10.15 since 10.14 starts...
    • 853c9e9 Prepare release of 4.9.0.
    • d3f77e6 Add a test for https://bugs.launchpad.net/lxml/+bug/1965070 leaving out the a...
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
  • Support for AnnotatedText

    Hey, thanks for this awesome project! Do you consider adding AnnotatedText support?

    This would allow nlprule to be used to spell-check markdown/word/html/etc. documents converted to the AnnotatedText format (supported by LanguageTool).

    Right now I'm thinking about how it could be done, but I can't quite figure out how LanguageTool can spellcheck ignoring the markup but then map the ranges back to the original document.
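
    One generic way to do this (a sketch, not LanguageTool's actual implementation): while stripping markup, record for each byte of the plain text the byte offset it came from in the original document, then translate suggestion ranges back through that table. This assumes stripping only removes bytes, so each plain byte maps to exactly one original byte.

    // Illustrative only: keep an offset table alongside the stripped text.
    struct Annotated {
        plain: String,
        // orig_offset[i] = byte offset in the original document of plain byte i.
        orig_offset: Vec<usize>,
    }

    impl Annotated {
        // Map a half-open byte range in the plain text back to the original.
        fn map_range(&self, start: usize, end: usize) -> (usize, usize) {
            (self.orig_offset[start], self.orig_offset[end - 1] + 1)
        }
    }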

    opened by mishushakov 10