nlprule

Overview

A fast, low-resource Natural Language Processing and Error Correction library written in Rust. nlprule implements a rule- and lookup-based approach to NLP using resources from LanguageTool.

Python Usage

Install: pip install nlprule

Use:

from nlprule import Tokenizer, Rules

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer)
rules.correct("He wants that you send him an email.")
# returns: 'He wants you to send him an email.'

rules.correct("I can due his homework.")
# returns: 'I can do his homework.'

for s in rules.suggest("She was not been here since Monday."):
    print(s.start, s.end, s.replacements, s.source, s.message)
# prints:
# 4 16 ['was not', 'has not been'] WAS_BEEN.1 Did you mean was not or has not been?

for sentence in tokenizer.pipe("A brief example is shown."):
    for token in sentence:
        print(
            repr(token.text).ljust(10),
            repr(token.span).ljust(10),
            repr(token.tags).ljust(24),
            repr(token.lemmas).ljust(24),
            repr(token.chunks).ljust(24),
        )
# prints:
# ''         (0, 0)     ['SENT_START']           []                       []                      
# 'A'        (0, 1)     ['DT']                   ['A', 'a']               ['B-NP-singular']       
# 'brief'    (2, 7)     ['JJ']                   ['brief']                ['I-NP-singular']       
# 'example'  (8, 15)    ['NN:UN']                ['example']              ['E-NP-singular']       
# 'is'       (16, 18)   ['VBZ']                  ['be', 'is']             ['B-VP']                
# 'shown'    (19, 24)   ['VBN']                  ['show', 'shown']        ['I-VP']                
# '.'        (24, 25)   ['.', 'PCT', 'SENT_END'] ['.']                    ['O']

Rust Usage

Recommended setup:

Cargo.toml

[dependencies]
nlprule = "<version>"

[build-dependencies]
nlprule-build = "<version>" # must be the same as the nlprule version!

build.rs

fn main() {
    println!("cargo:rerun-if-changed=build.rs");

    nlprule_build::BinaryBuilder::new(
        &["en"],
        std::env::var("OUT_DIR").expect("OUT_DIR is set when build.rs is running"),
    )
    .build()
    .validate();
}

src/main.rs

use nlprule::{Rules, Tokenizer, tokenizer_filename, rules_filename};

fn main() {
    let mut tokenizer_bytes: &'static [u8] = include_bytes!(concat!(
        env!("OUT_DIR"),
        "/",
        tokenizer_filename!("en")
    ));
    let mut rules_bytes: &'static [u8] = include_bytes!(concat!(
        env!("OUT_DIR"),
        "/",
        rules_filename!("en")
    ));

    let tokenizer = Tokenizer::from_reader(&mut tokenizer_bytes).expect("tokenizer binary is valid");
    let rules = Rules::from_reader(&mut rules_bytes).expect("rules binary is valid");

    assert_eq!(
        rules.correct("She was not been here since Monday.", &tokenizer),
        String::from("She was not here since Monday.")
    );
}
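
If you prefer not to embed the binaries at compile time, they can also be loaded from disk at runtime. A minimal sketch, assuming previously built binaries at placeholder paths; suggest returns the raw suggestions instead of applying them:

use nlprule::{Rules, Tokenizer};

fn main() {
    // `Tokenizer::new` / `Rules::new` read a binary from a path at runtime.
    let tokenizer = Tokenizer::new("path/to/en_tokenizer.bin").expect("tokenizer binary is valid");
    let rules = Rules::new("path/to/en_rules.bin").expect("rules binary is valid");

    // Inspect suggestions instead of applying them directly.
    for suggestion in rules.suggest("She was not been here since Monday.", &tokenizer) {
        println!("{:?}", suggestion);
    }
}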

nlprule and nlprule-build versions are kept in sync.

Main features

  • Rule-based Grammatical Error Correction through several thousand rules.
  • A text processing pipeline for sentence segmentation, part-of-speech tagging, lemmatization, chunking and disambiguation.
  • Support for English, German and Spanish.
  • Spellchecking (in progress).

Goals

  • A single place to apply spellchecking and grammatical error correction for a downstream task.
  • Fast, low-resource NLP suited for running:
    1. as a pre-/postprocessing step for more sophisticated (i.e. ML) approaches.
    2. in the background of another application with low overhead.
    3. client-side in the browser via WebAssembly.
  • 100% Rust code and dependencies.

Comparison to LanguageTool

| Language | Disambiguation rules | Grammar rules | LT version | nlprule time | LanguageTool time |
|----------|----------------------|---------------|------------|--------------|-------------------|
| English  | 843 (100%)           | 3725 (~ 85%)  | 5.2        | 1            | 1.7 - 2.0         |
| German   | 486 (100%)           | 2970 (~ 90%)  | 5.2        | 1            | 2.4 - 2.8         |

Spanish has experimental support and is not fully tested yet.

See the benchmark issue for details.

Projects using nlprule

  • prosemd: a proofreading and linting language server for markdown files with VSCode integration.
  • cargo-spellcheck: a tool to check all your Rust documentation for spelling and grammar mistakes.

Please submit a PR to add your project!

Acknowledgements

All credit for the resources used in nlprule goes to LanguageTool who have made a Herculean effort to create high-quality resources for Grammatical Error Correction and broader NLP.

License

nlprule is licensed under the MIT license or Apache-2.0 license, at your option.

The nlprule binaries (*.bin) are derived from LanguageTool v5.2 and licensed under the LGPLv2.1 license. nlprule statically and dynamically links to these binaries. Under LGPLv2.1 §6(a) this does not have any implications on the license of nlprule itself.

Comments
  • API to include the correct binaries at compile time

    Hey, nice library! I am currently checking what would be needed to replace the current LanguageTool backend in https://github.com/drahnr/cargo-spellcheck.

    There are a few things which would need to be addressed; the most important is to avoid the need for https://github.com/bminixhofer/nlprule/blob/master/scripts/build.sh. A compile feature could gate a build.rs file which would prep the data, which in turn could be included via include_bytes!. That way, one can locally source the data at compile time and include the source files within the binary, with optional external overrides. Another thing that would be nice is documentation on how to obtain the referenced dumps.

    Looking forward :100:

    documentation enhancement P2 
    opened by drahnr 23
  • refactor/transform: introduce transform and improve error handling

    Note that this PR still lacks the required test adjustments, and transform is not covered yet.

    Changes:

    • improve the error type
    • add fs-err for better errors without much chore
    • introduce fn transform for transformations before artifacts hit the cache_dir (this is not quite correct now, tbd)
    • introduce a type alias type Result<T> to avoid boilerplate
    • use BufReader and BufWriter instead of intermediate Vec where possible
    • migrate a few elements to more idiomatic expressions
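
    For illustration, the kind of alias and buffered I/O the list refers to (a sketch; the names and the error type are assumptions, not the PR's actual code):

    use std::fs::File;
    use std::io::{BufReader, BufWriter};

    // One crate-wide alias avoids repeating the error type in every signature.
    pub type Result<T> = std::result::Result<T, std::io::Error>;

    // Stream from reader to writer instead of collecting into an intermediate Vec.
    pub fn copy_binary(src: &str, dst: &str) -> Result<u64> {
        let mut reader = BufReader::new(File::open(src)?);
        let mut writer = BufWriter::new(File::create(dst)?);
        std::io::copy(&mut reader, &mut writer)
    }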
    opened by drahnr 9
  • Support for older glibc

    Hi, first off thank you for this library, it's the only non-java languagetool alternative I've found.

    Unfortunately, I am receiving an error when trying to use it: ImportError: /lib/x86_64-linux-gnu/libm.so.6: version 'GLIBC_2.27' not found (required by python/lib/python3.8/site-packages/nlprule.cpython-38-x86_64-linux-gnu.so)

    I'm on a hosting environment where I don't have access to upgrade system libraries, so I can't just upgrade glibc. The current version is glibc 2.19.

    Is glibc 2.27 a hard requirement, or is there a way to specify an older version of glibc?

    I have a feeling this is a Rust-specific issue, but I am new to Rust and not familiar with its environment.

    Thanks

    bug 
    opened by dvwright 8
  • Cache the compressed artifacts

    In order to include the .bin artifacts in a repository and craft releases / publish with cargo, the sources may not be larger than 10 MB, or failures like:

    error: api errors (status 200 OK): max upload size is: 10485760
    

    will pop up.

    The simplest path is to cache the compressed artifacts rather than the uncompressed and decompress at runtime. An optional builder API could be used to load compressed or decompressed .bin variants.
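
    A sketch of the runtime-decompression half, assuming a gzip-compressed rules binary (the file name and the flate2 dependency are assumptions, not an existing nlprule API):

    use flate2::read::GzDecoder;
    use nlprule::Rules;

    fn load_rules() -> Rules {
        // The compressed binary stays under the upload limit;
        // it is decompressed on the fly while deserializing.
        let compressed: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/en_rules.bin.gz"));
        let mut decoder = GzDecoder::new(compressed);
        Rules::from_reader(&mut decoder).expect("valid rules binary")
    }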

    enhancement good first issue 
    opened by drahnr 8
  • postprocess has different semantic than anticipated

    So the semantics for my use case as defined in #27 are reversed.

    nlprule-data/0.4.4/en/en_tokenizer.bin target/debug/build/cargo-spellcheck-2b832a17a2fec7ef/out/en_tokenizer.bin.brotli

    What the use case described in #27 would require is the ability to apply compression before storing the artifact in the cache dir, and then to uncompress it for the target/debug/....

    Reasoning: when uploading with cargo, it picks a subset of the git tree; the size of the binary is not relevant.

    I think adding a secondary fn cache_preprocess() would work, so I can compress the artifact before it is stored to $cache_dir, and then decompress it as part of the current fn postprocess() so it ends up only as bincode-encoded data in $OUT_DIR, from where it can be included in the binary. A sketch follows below.
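
    A sketch of that flow (cache_preprocess / postprocess are the hooks proposed here, not an existing nlprule-build API; compression via the flate2 crate is an assumption):

    use flate2::{read::GzDecoder, write::GzEncoder, Compression};
    use std::io::{Read, Write};

    // Compress before the artifact is stored in $cache_dir...
    fn cache_preprocess(raw: &[u8]) -> std::io::Result<Vec<u8>> {
        let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
        encoder.write_all(raw)?;
        encoder.finish()
    }

    // ...and decompress in postprocess, so only the plain bincode-encoded
    // binary ends up in $OUT_DIR.
    fn postprocess(compressed: &[u8]) -> std::io::Result<Vec<u8>> {
        let mut raw = Vec::new();
        GzDecoder::new(compressed).read_to_end(&mut raw)?;
        Ok(raw)
    }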

    opened by drahnr 7
  • switch regex engine from oniguruma to fancy-regex

    I would like to switch from rust-onig to fancy-regex.

    This would probably come with a speedup and remove the last non-Rust dependency. This is nice in general and would enable compiling to WebAssembly.

    Changing this in nlprule would be easy, but it is currently blocked by https://github.com/fancy-regex/fancy-regex/issues/59 and https://github.com/fancy-regex/fancy-regex/issues/49.

    enhancement good first issue P2 
    opened by bminixhofer 7
  • Improve tagger: Return iterators over `WordData`, remove groups, parallelize deserialization

    I had another look at the tagger today. This PR:

    • Changes all the get_tags_* methods to return iterators instead of Vec.
    • Removes the groups. These were only used by nlprule in the PosReplacer which wasn't used anywhere as it is not fully implemented. Some of the currently unimplemented rules might need the groups in some form though, but we can probably get away with search + caching since the groups are only needed if a rule actually matches there.
    • Iterates over the FST in parallel in chunks with disjoint words, this allows populating the tags without synchronization.
    • Replaces the word HashMap with a Vec since the IDs go from zero to n_words anyway, so we don't need to hash anything.
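
    A generic illustration of the last point (illustrative types, not nlprule's actual ones): when IDs are dense integers from 0 to n_words, indexing a Vec replaces hashing entirely.

    // Word IDs are assigned densely from zero, so the ID itself is the index.
    struct TagStore(Vec<Vec<String>>);

    impl TagStore {
        fn tags(&self, word_id: usize) -> &[String] {
            &self.0[word_id]
        }
    }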

    I see another ~ 30% speedup in loading the Tokenizer. This could also have a positive impact on rule-checking speed, but there's some weird behavior in the local benchmark on my PC, so I have to double-check.

    @drahnr you might be interested in this PR. It would also be great if you could double check the speedup.

    opened by bminixhofer 6
  • oob access since 0.5.3

    Since attempting to upgrade to 0.5.3, it consistently segfaults in https://github.com/bminixhofer/nlprule/blob/main/nlprule/src/rule/engine/composition.rs#L345-L347

    See https://ci.spearow.io/teams/main/pipelines/cargo-spellcheck/jobs/pr-validate/builds/26

    But the bug is in line 344, where you should push the length in chars, not in bytes.
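
    The distinction in question, illustratively (for multi-byte UTF-8, byte length and char count diverge):

    let s = "naïve";
    assert_eq!(s.len(), 6);           // length in bytes ("ï" is 2 bytes in UTF-8)
    assert_eq!(s.chars().count(), 5); // length in chars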

    opened by drahnr 6
  • Token as returned by pipe() is relative to the sentence boundaries

    // Token<'_>
        pub char_span: (usize, usize),
        pub byte_span: (usize, usize),
    

    Using fn pipe() returns a set of tokens that includes spans relative to the sentence, but there seems to be no trivial way of retrieving the spans within the original text provided to pipe.

    Suggestion: use a Range<usize> instead of a tuple for the relevant range of bytes/characters for easier usage, and make it relative to the input text.

    For single sentences there is no change in semantics; for multi-sentence inputs there is.

    It would also make sense to add the respective bounds in bytes and chars of the sentence (or replace the sentence entirely).

    pub sentence: &'t str,
    

    Related cargo spellcheck issue https://github.com/drahnr/cargo-spellcheck/pull/162
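
    Put together, the proposed shape might look like this (a sketch of the suggestion, not the actual nlprule type):

    pub struct Token<'t> {
        // Ranges relative to the full input text, not the sentence.
        pub char_span: std::ops::Range<usize>,
        pub byte_span: std::ops::Range<usize>,
        // The sentence this token belongs to.
        pub sentence: &'t str,
    }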

    opened by drahnr 6
  • panic in `Regex::regex()`

    thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /tmp/build/56ca5ece/git-pull-request-resource/../cargo/registry/src/github.com-1ecc6299db9ec823/nlprule-0.6.2/src/utils/regex.rs:78:33
    stack backtrace:
       0:     0x560ee78bdfd0 - std::backtrace_rs::backtrace::libunwind::trace::h5e9d00f0cdf4f57e
                                   at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
       1:     0x560ee78bdfd0 - std::backtrace_rs::backtrace::trace_unsynchronized::hd5302bd66215dab9
                                   at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5

    There is an .unwrap() on the regex.borrow() call which panics.

    https://ci.spearow.io/teams/main/pipelines/cargo-spellcheck/jobs/pr-validate/builds/45

    opened by drahnr 5
  • License of extracted rules

    I had a brief look into the licensing of LanguageTool's rules, specifically whether they are permitted to be distributed under licenses other than that of the LanguageTool library itself, which is LGPLv2.1.

    Mostly in relation to #12, which would render the whole idea of including it at compile time rather pointless for most applications.

    opened by drahnr 5
  • Single Or Plural

    Hi! Thanks for the great project. I'm working with code generation, so I need further grammar corrections on the generated code. I found that this toolkit is unable to respond to such simple grammatical knowledge as whether a noun is in singular or plural form.

    opened by MT010104 0
  • Be more responsible about network requests

    When I tried entering an invalid language code to confirm that there's a Python exception I need to handle if the language code selected in my existing Enchant-based infrastructure isn't supported by nlprule, I got this very surprising error message:

    ValueError: HTTP status client error (404 Not Found) for url (https://github.com/bminixhofer/nlprule/releases/download/0.6.4/ef_tokenizer.bin.gz)
    

    Personally, I consider it very irresponsible to not warn people that a dependency is going to perform network requests under some circumstances, nor to provide an obvious way to handle things offline.

    I highly recommend you change this and, for my own use, since I tend to incorporate PyO3-based stuff into my PyQt apps anyway, I think I'll probably switch to writing my own nlprule wrapper so I can trust that, if no network libraries show up in the Cargo.lock, and the author isn't being actively malicious, then what I build will work on an airgapped machine or in a networkless sandbox.

    (Seriously. Sandboxes like Flatpak are becoming more and more common. Just assuming applications will have network access is not cool.)

    opened by ssokolow 2
  • Document how to load custom rulesets

    I have a project where I'd prefer not to reinvent nlprule for applying my custom grammar rules (common validly-spelled typos I see in fanfiction), but the documentation is very unclear on how to do anything with custom rules.

    1. In a PyQt application, how do I specify files by path like with the Rust API?
    2. How do I go from the raw LanguageTool XML to the .bin files?
    3. Do I need to do multiple passes with different nlprule instances if I also want to check regular grammar stuff or is there a way to merge rulesets?
    opened by ssokolow 4
  • Clarify license statement

    Can you clarify this phrasing?

    The nlprule binaries (*.bin) are derived from LanguageTool v5.2 and licensed under the LGPLv2.1 license. nlprule statically and dynamically links to these binaries. Under LGPLv2.1 §6(a) this does not have any implications on the license of nlprule itself.

    ...because:

    1. I don't see any sign of static or dynamic linking in the sense the LGPL considers... just aggregating assets, similar to how you can use runtime-loaded CC-BY-SA art assets in a game with GPLed code without there being a license conflict, as long as you don't embed the art assets inside the binary or otherwise make the binary unavoidably dependent on the assets (e.g. compiling in a hash check that will fail if someone swaps in new .png files).

    2. When people see "statically and dynamically links" and "LGPL", they get concerned, because Rust statically links all its code; so, if you statically link your LGPLed stuff into nlprule and you statically link nlprule into a Rust binary, then that Rust binary must be distributed in accordance with the LGPL's requirement that it be possible to swap out the LGPLed components with modified versions... and Rust doesn't have a stable ABI to facilitate that without sharing the source.

    I've actually seen people warn other people away from nlprule in favour of some more recent bindings for the LanguageTool HTTP API because "nlprule statically links to LGPLed stuff, which means your Rust binaries must be released under the LGPL, GPL, or AGPL".

    opened by ssokolow 5
  • Bump lxml from 4.6.3 to 4.9.1 in /build

    Bumps lxml from 4.6.3 to 4.9.1.

    Changelog

    Sourced from lxml's changelog.

    4.9.1 (2022-07-01)

    Bugs fixed

    • A crash was resolved when using iterwalk() (or canonicalize()) after parsing certain incorrect input. Note that iterwalk() can crash on valid input parsed with the same parser after failing to parse the incorrect input.

    4.9.0 (2022-06-01)

    Bugs fixed

    • GH#341: The mixin inheritance order in lxml.html was corrected. Patch by xmo-odoo.

    Other changes

    • Built with Cython 0.29.30 to adapt to changes in Python 3.11 and 3.12.

    • Wheels include zlib 1.2.12, libxml2 2.9.14 and libxslt 1.1.35 (libxml2 2.9.12+ and libxslt 1.1.34 on Windows).

    • GH#343: Windows-AArch64 build support in Visual Studio. Patch by Steve Dower.

    4.8.0 (2022-02-17)

    Features added

    • GH#337: Path-like objects are now supported throughout the API instead of just strings. Patch by Henning Janssen.

    • The ElementMaker now supports QName values as tags, which always override the default namespace of the factory.

    Bugs fixed

    • GH#338: In lxml.objectify, the XSI float annotation "nan" and "inf" were spelled in lower case, whereas XML Schema datatypes define them as "NaN" and "INF" respectively.

    ... (truncated)

    Commits
    • d01872c Prevent parse failure in new test from leaking into later test runs.
    • d65e632 Prepare release of lxml 4.9.1.
    • 86368e9 Fix a crash when incorrect parser input occurs together with usages of iterwa...
    • 50c2764 Delete unused Travis CI config and reference in docs (GH-345)
    • 8f0bf2d Try to speed up the musllinux AArch64 build by splitting the different CPytho...
    • b9f7074 Remove debug print from test.
    • b224e0f Try to install 'xz' in wheel builds, if available, since it's now needed to e...
    • 897ebfa Update macOS deployment target version from 10.14 to 10.15 since 10.14 starts...
    • 853c9e9 Prepare release of 4.9.0.
    • d3f77e6 Add a test for https://bugs.launchpad.net/lxml/+bug/1965070 leaving out the a...
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
  • Support for AnnotatedText

    Hey, thanks for this awesome project! Do you consider adding AnnotatedText support?

    This would allow nlprule to be used to spell-check markdown/word/html/etc. documents converted to the AnnotatedText format (supported by LanguageTool).

    Right now I'm thinking about how it could be done, but I can't quite figure out how LanguageTool can spellcheck ignoring the markup but then map the ranges back to the original document.
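
    One generic way to do this (a sketch, not LanguageTool's actual implementation): while stripping markup, record for each byte of the plain text the byte offset it came from in the original document, then translate suggestion ranges back through that table. This assumes stripping only removes bytes, so each plain byte maps to exactly one original byte.

    // Illustrative only: keep an offset table alongside the stripped text.
    struct Annotated {
        plain: String,
        // orig_offset[i] = byte offset in the original document of plain byte i.
        orig_offset: Vec<usize>,
    }

    impl Annotated {
        // Map a half-open byte range in the plain text back to the original.
        fn map_range(&self, start: usize, end: usize) -> (usize, usize) {
            (self.orig_offset[start], self.orig_offset[end - 1] + 1)
        }
    }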

    opened by mishushakov 10