Natural Language Processing for Rust

Overview

rs-natural

Build Status

Natural language processing library written in Rust. Still very much a work in progress. Basically an experiment, but hey maybe something cool will come out of it.

Currently working:

  • Jaro-Winkler Distance
  • Levenshtein Distance
  • Tokenizing
  • NGrams (with and without padding)
  • Phonetics (Soundex)
  • Naive-Bayes classification
    • Serialization via Serde
  • Term Frequency-Inverse Document Frequency (tf-idf)
    • Serialization via Serde

Near-term goals:

  • Logistic regression classification
  • Optimize naive-bayes (currently pretty slow)
  • Plural/Singular inflector

How to use

Use at your own risk. Some functionality is missing, and some of it is slow as molasses because it isn't optimized yet. I'm targeting master, and don't offer backward compatibility.

Setup

It's a crate with a Cargo.toml. Add this to your Cargo.toml:

[dependencies]
natural = "0.3.0"

# Or enable Serde support
natural = { version = "0.4.0", features = ["serde_support"]}
serde = "1.0"

Distance

extern crate natural;
use natural::distance::jaro_winkler_distance;
use natural::distance::levenshtein_distance;

assert_eq!(levenshtein_distance("kitten", "sitting"), 3);
assert_eq!(jaro_winkler_distance("dixon", "dicksonx"), 0.767); 

Note: don't actually assert_eq! on jaro_winkler_distance, since it returns an f64 and exact floating-point comparison is brittle. To test, I actually use:

fn f64_eq(a: f64, b: f64) {
  assert!((a - b).abs() < 0.01);
}
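For intuition about what the crate computes, here is a minimal self-contained sketch of the classic Levenshtein dynamic-programming recurrence. This is an illustration only, not the crate's actual implementation:

```rust
// Minimal single-row Levenshtein distance sketch (illustration only).
// dp[j] holds the edit distance between a's processed prefix and b[..j].
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut dp: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        let mut prev = dp[0]; // dp[i-1][j-1]
        dp[0] = i;
        for j in 1..=b.len() {
            let cur = dp[j]; // dp[i-1][j], needed as next prev
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            // min of deletion, insertion, and substitution/match
            dp[j] = (dp[j] + 1).min(dp[j - 1] + 1).min(prev + cost);
            prev = cur;
        }
    }
    dp[b.len()]
}

fn main() {
    assert_eq!(levenshtein("kitten", "sitting"), 3);
    println!("ok");
}
```

The single-row formulation keeps memory at O(|b|) instead of materializing the full matrix.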

Phonetics

There are two ways to use the Soundex algorithm in this library: a simple soundex function that accepts two &str parameters and returns a boolean, or the SoundexWord struct. I will show both here.

use natural::phonetics::soundex;
use natural::phonetics::SoundexWord;

assert!(soundex("rupert", "robert"));


let s1 = SoundexWord::new("rupert");
let s2 = SoundexWord::new("robert");
assert!(s1.sounds_like(s2));
assert!(s1.sounds_like_str("robert"));
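Soundex reduces a word to a letter plus three digits, so "rupert" and "robert" both map to the same code. Here is a simplified, self-contained sketch of the idea (illustration only, not the crate's implementation; it skips the full H/W adjacency rule of the official algorithm):

```rust
// Simplified Soundex sketch (illustration only; the official algorithm's
// H/W adjacency rule is omitted for brevity).
fn soundex_code(word: &str) -> String {
    fn digit(c: char) -> Option<char> {
        match c {
            'b' | 'f' | 'p' | 'v' => Some('1'),
            'c' | 'g' | 'j' | 'k' | 'q' | 's' | 'x' | 'z' => Some('2'),
            'd' | 't' => Some('3'),
            'l' => Some('4'),
            'm' | 'n' => Some('5'),
            'r' => Some('6'),
            _ => None, // vowels and h/w/y produce no digit
        }
    }
    let lower = word.to_lowercase();
    let mut chars = lower.chars();
    let first = match chars.next() {
        Some(c) => c,
        None => return String::new(),
    };
    let mut code = first.to_ascii_uppercase().to_string();
    let mut last = digit(first);
    for c in chars {
        let d = digit(c);
        if let Some(d) = d {
            if Some(d) != last {
                code.push(d);
                if code.len() == 4 {
                    break;
                }
            }
        }
        last = d;
    }
    while code.len() < 4 {
        code.push('0'); // pad short codes to the standard 4 characters
    }
    code
}

fn main() {
    assert_eq!(soundex_code("rupert"), "R163");
    assert_eq!(soundex_code("rupert"), soundex_code("robert"));
    println!("ok");
}
```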

Tokenization

extern crate natural;
use natural::tokenize::tokenize;

assert_eq!(tokenize("hello, world!"), vec!["hello", "world"]);
assert_eq!(tokenize("My dog has fleas."), vec!["My", "dog", "has", "fleas"]);

NGrams

You can create n-grams with or without padding, e.g.:

extern crate natural;

use natural::ngram::get_ngram;
use natural::ngram::get_ngram_with_padding;

assert_eq!(get_ngram("hello my darling", 2), vec![vec!["hello", "my"], vec!["my", "darling"]]);

assert_eq!(get_ngram_with_padding("my fleas", 2, "----"), vec![
  vec!["----", "my"], vec!["my", "fleas"], vec!["fleas", "----"]]);
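Under the hood, an n-gram is just a sliding window over the token stream; the padded variant prepends and appends n-1 copies of the pad token before windowing. A minimal self-contained sketch using slice::windows (illustration only, not the crate's implementation; assumes n >= 1):

```rust
// Sketch: n-grams as sliding windows over whitespace tokens (n >= 1).
fn ngrams(text: &str, n: usize) -> Vec<Vec<&str>> {
    let tokens: Vec<&str> = text.split_whitespace().collect();
    tokens.windows(n).map(|w| w.to_vec()).collect()
}

// Padded variant: n-1 pad tokens on each side, then the same windowing.
fn ngrams_padded<'a>(text: &'a str, n: usize, pad: &'a str) -> Vec<Vec<&'a str>> {
    let mut tokens: Vec<&str> = vec![pad; n - 1];
    tokens.extend(text.split_whitespace());
    tokens.extend(std::iter::repeat(pad).take(n - 1));
    tokens.windows(n).map(|w| w.to_vec()).collect()
}

fn main() {
    assert_eq!(
        ngrams("hello my darling", 2),
        vec![vec!["hello", "my"], vec!["my", "darling"]]
    );
    assert_eq!(
        ngrams_padded("my fleas", 2, "----"),
        vec![vec!["----", "my"], vec!["my", "fleas"], vec!["fleas", "----"]]
    );
    println!("ok");
}
```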

Classification

extern crate natural;
use natural::classifier::NaiveBayesClassifier;

let mut nbc = NaiveBayesClassifier::new();

nbc.train(STRING_TO_TRAIN, LABEL);
nbc.train(STRING_TO_TRAIN, LABEL);
nbc.train(STRING_TO_TRAIN, LABEL);
nbc.train(STRING_TO_TRAIN, LABEL);

nbc.guess(STRING_TO_GUESS); //returns a label with the highest probability
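A naive Bayes text classifier counts word frequencies per label during training, then guesses by picking the label with the highest (log-)probability for the input words. Here is a minimal self-contained sketch of the idea with add-one smoothing; it is an illustration only, not the crate's implementation, and class priors are omitted (uniform over labels):

```rust
use std::collections::{HashMap, HashSet};

// Minimal multinomial naive Bayes sketch (illustration only).
struct NaiveBayes {
    word_counts: HashMap<String, HashMap<String, usize>>, // label -> word -> count
    totals: HashMap<String, usize>,                       // label -> total words
    vocab: HashSet<String>,
}

impl NaiveBayes {
    fn new() -> Self {
        NaiveBayes {
            word_counts: HashMap::new(),
            totals: HashMap::new(),
            vocab: HashSet::new(),
        }
    }

    fn train(&mut self, text: &str, label: &str) {
        for w in text.split_whitespace() {
            *self
                .word_counts
                .entry(label.to_string())
                .or_default()
                .entry(w.to_string())
                .or_insert(0) += 1;
            *self.totals.entry(label.to_string()).or_insert(0) += 1;
            self.vocab.insert(w.to_string());
        }
    }

    fn guess(&self, text: &str) -> String {
        let v = self.vocab.len() as f64;
        let mut best: Option<(String, f64)> = None;
        for (label, counts) in &self.word_counts {
            let total = self.totals[label] as f64;
            // Sum of log P(word | label) with add-one (Laplace) smoothing.
            let mut score = 0.0;
            for w in text.split_whitespace() {
                let c = *counts.get(w).unwrap_or(&0) as f64;
                score += ((c + 1.0) / (total + v)).ln();
            }
            if best.as_ref().map_or(true, |(_, s)| score > *s) {
                best = Some((label.clone(), score));
            }
        }
        best.map(|(l, _)| l).unwrap_or_default()
    }
}

fn main() {
    let mut nbc = NaiveBayes::new();
    nbc.train("rust is fast and safe", "rust");
    nbc.train("erlang is concurrent and fault tolerant", "erlang");
    assert_eq!(nbc.guess("fast and safe"), "rust");
    println!("ok");
}
```

Working in log space avoids multiplying many small probabilities into a floating-point underflow.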

Tf-Idf

extern crate natural;
use natural::tf_idf::TfIdf;

let mut tf_idf = TfIdf::new();

tf_idf.add("this document is about rust.");
tf_idf.add("this document is about erlang.");
tf_idf.add("this document is about erlang and rust.");
tf_idf.add("this document is about rust. it has rust examples");

println!("{}", tf_idf.get("rust")); //0.2993708f32
println!("{}", tf_idf.get("erlang")); //0.13782766f32

//average of multiple terms
println!("{}", tf_idf.get("rust erlang")); //0.21859923

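For reference, tf-idf weighs a term's frequency in one document against how many documents contain it. Here is a self-contained sketch of the standard formula; the crate's exact weighting may differ, so the numbers printed above won't necessarily match this sketch:

```rust
// Standard tf-idf sketch (illustration only; the crate's weighting may differ).
fn tf_idf(term: &str, doc: &str, corpus: &[&str]) -> f64 {
    let words: Vec<&str> = doc.split_whitespace().collect();
    // Term frequency: share of the document's words that are `term`.
    let tf = words.iter().filter(|w| **w == term).count() as f64 / words.len() as f64;
    // Inverse document frequency: rarer terms across the corpus score higher.
    let containing = corpus
        .iter()
        .filter(|d| d.split_whitespace().any(|w| w == term))
        .count() as f64;
    let idf = (corpus.len() as f64 / containing.max(1.0)).ln();
    tf * idf
}

fn main() {
    let corpus = ["this document is about rust", "this document is about erlang"];
    // "this" appears in every document, so its idf (and hence tf-idf) is zero.
    assert!(tf_idf("this", corpus[0], &corpus).abs() < 1e-12);
    // "rust" appears in only one document, so it scores strictly higher.
    assert!(tf_idf("rust", corpus[0], &corpus) > tf_idf("this", corpus[0], &corpus));
    println!("ok");
}
```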
Comments
  • Allow contributors to help maintain

    Hey @christophertrml ,

    I've contributed a few times to the project, and I'd really like to continue to do so. I don't see a ton of options for NLP in Rust. I'd really like to make the library more uniform, and overall more flexible. Currently the tf-idf works but isn't easy to use in a meaningful way. The Naive-Bayes classifier has been in need of some work for a while now, and a logistic regression classifier has been a near-sighted goal since I first learned of the project. I could help handle issues and PRs, as well as expanding the feature set, and improving already existing features of the project if you give me permissions for the repository.

    Let me know what you think,

    Travis Sturzl

    opened by tsturzl 6
  • Need crates.io access to publish crate

    Hey @christophertrml,

    In order to publish changes I need access on crates.io. When you have time, could you add me? See: https://doc.rust-lang.org/cargo/reference/publishing.html#cargo-owner

    Thanks!

    opened by tsturzl 3
  • passing "ies" to natural::stem::get panics

    I get the message:

    thread 'main' panicked at 'index out of bounds: the len is 3 but the index is 18446744073709551615', /c
    heckout/src/libcore/slice/mod.rs:2041:10
    

    The value seems to indicate an unsigned value dropping below zero.

    bug 
    opened by mt-caret 2
  • Allow exporting/storing support

    Hi,

    Would be cool to be able to serialize/store the trained model somehow. It would probably be a good idea to provide an API exposing the internal structures for store/restore, so that independent serialization/storage engines are possible.

    Regards

    enhancement 
    opened by tglman 2
  • Publish on crate.io?

    I use rs-natural in a few of my hobby projects, and I notice that this project isn't on crates.io. This creates a problem if I myself would like to publish one of my projects that uses rs-natural to crates.io. Is there any reason that it is not? I know rust-stem is also not on crates.io, but issue #3 seems to address this. If this is the only thing holding it back I'd be more than happy to make a PR where rust-stem is included. There shouldn't be any licensing issues since both projects are MIT licensed.

    opened by tsturzl 2
  • Fix bug stemming the word "ion"

    When attempting to stem the word "ion," I got this panic:

    thread 'main' panicked at 'attempt to subtract with overflow', /home/paul/.cargo/registry/src/github.com-1ecc6299db9ec823/natural-0.3.0/src/stem.rs:273:29
    
    opened by pwoolcoc 1
  • Fix right padding on padded ngrams

    When trying to use padded ngrams I realized that the vectors weren't symmetrical. It seems like the tokens were appended in the wrong order. Here is an example:

    get_ngram_with_padding("This is a test thing for ngram padding.", 5, "-")
    

    Output:

    [
      [
        ["-", "-", "-", "-", "This"], 
        ["-", "-", "-", "This", "is"], 
        ["-", "-", "This", "is", "a"], 
        ["-", "This", "is", "a", "test"],
        ["This", "is", "a", "test", "thing"],
        ["is", "a", "test", "thing", "for"], 
        ["a", "test", "thing", "for", "ngram"],
        ["test", "thing", "for", "ngram", "padding"], 
        ["thing", "for", "ngram", "padding", "-", "-", "-", "-"], 
        ["for", "ngram", "padding", "-", "-", "-"], 
        ["ngram", "padding", "-", "-"], 
        ["padding", "-"]
      ]
    ]
    

    With this PR, the output is now:

    [
        [
            ["-", "-", "-", "-", "This"],
            ["-", "-", "-", "This", "is"],
            ["-", "-", "This", "is", "a"],
            ["-", "This", "is", "a", "test"],
            ["This", "is", "a", "test", "thing"],
            ["is", "a", "test", "thing", "for"],
            ["a", "test", "thing", "for", "ngram"],
            ["test", "thing", "for", "ngram", "padding"],
            ["thing", "for", "ngram", "padding", "-"],
            ["for", "ngram", "padding", "-", "-"],
            ["ngram", "padding", "-", "-", "-"],
            ["padding", "-", "-", "-", "-"],
        ]
    ]
    
    opened by Roughsketch 1
  • Classifier: remove `label_probabilities` hashmap, and just store highest running probability

    This isn't a major optimization, but you can remove the HashMap label_probabilities from the classifier and just track the highest running probability. Store result_label and result_probability, updating them whenever a probability exceeds the stored result_probability; that way you need neither a HashMap nor a second loop.

    Another option I looked at is a Vec<(String, f32)> preallocated (constructed with with_capacity) to exactly the number of documents (self.count) to prevent reallocation, since the labels are never accessed by key and no HashMap-specific methods are used. But even that is unnecessary, because only the highest-probability label is returned rather than all labels and probabilities. Tracking the running maximum avoids heap-allocating a HashMap, storing unused values, and the overhead of a second loop.

    All tests are passing on my local.

    opened by tsturzl 1
  • Add Levenshtein distance

    Added Levenshtein implementation based on a public domain implementation[1]. Added a couple of base tests and made the existing ones pass. I'm open to feedback :)

    [1] http://hetland.org/coding/python/levenshtein.py

    opened by Madrigal 0
  • Enhancement: Add probability of belonging to a classification

    This crate could be highly valuable to the Data Privacy world. If we could teach our systems to flag data that contains sensitive content (e.g., PII, NPPI, PCI), we could use Privacy by Design to implement guardrails that avoid privacy breaches or misuse of data.

    A suggested enhancement for such an application would be to calculate the chance of a data string belonging to a classification.

    Example

    // returns 0 to 100
    pub fn match_classification(data_to_validate: String, category: String) -> u8 {
        let chance = 0; 
    
        /*
       calculating ...
       */
    
        chance
    }
    
    let chance = NaiveBayesClassifier::match_classification("ssn: 003-43-7621", "NPPI");
    println!("This data is {} likely to be in the NPPI category.", chance); 
    
    opened by dsietz 0
  • cleanup warnings

    • replace deprecated range pattern
    • remove unused imports
    • make normalized_prob not mutable
    • replace deprecated trim_right
    • add auto-generated comments to Cargo.lock
    opened by jeremyandrews 0
  • Allow empty padding for ngrams

    As of now, it is impossible to have blank-padded ngrams since internally it assumes an empty string means no padding.

    I've made this possible in this branch by making the pad member of NGram an Option<&'a str>. If this feature is wanted, I can make a PR.

    opened by Roughsketch 0
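A side note on the stemming panics reported above: the huge index 18446744073709551615 is usize::MAX on 64-bit targets, i.e. what 0usize - 1 wraps around to. Rust's checked_sub surfaces the underflow explicitly instead of panicking. A minimal self-contained illustration:

```rust
fn main() {
    // 18446744073709551615 is u64::MAX; on 64-bit targets usize is 64 bits,
    // so subtracting 1 from 0usize wraps around to exactly this value.
    assert_eq!(u64::MAX, 18446744073709551615);
    let i: usize = 0;
    assert_eq!(i.wrapping_sub(1), usize::MAX);

    // checked_sub returns None on underflow instead of panicking
    // (debug builds) or silently wrapping (release builds).
    assert_eq!(i.checked_sub(1), None);
    assert_eq!(5usize.checked_sub(1), Some(4));
    println!("ok");
}
```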