Whatlang

Natural language detection library for Rust, with a focus on simplicity and performance. Try the demo online: https://www.greyblake.com/whatlang/

Features

  • Supports 78 languages
  • 100% written in Rust
  • Lightweight, fast and simple
  • Recognizes not only a language, but also a script (Latin, Cyrillic, etc)
  • Provides reliability information

Get started

Add to your Cargo.toml:

[dependencies]

whatlang = "0.11.1"

Example:

extern crate whatlang;

use whatlang::{detect, Lang, Script};

fn main() {
    let text = "Ĉu vi ne volas eklerni Esperanton? Bonvolu! Estas unu de la plej bonaj aferoj!";

    let info = detect(text).unwrap();
    assert_eq!(info.lang(), Lang::Epo);
    assert_eq!(info.script(), Script::Latin);
    assert_eq!(info.confidence(), 1.0);
    assert!(info.is_reliable());
}

For more details (e.g. how to blacklist some languages) please check the documentation.
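
For instance, restricting detection to a subset of languages might look like this (a minimal sketch, assuming the Detector::with_whitelist constructor documented for this version; check the API docs for the exact signature):

use whatlang::{Detector, Lang};

fn main() {
    // Only consider English and Russian as candidates.
    let whitelist = vec![Lang::Eng, Lang::Rus];
    let detector = Detector::with_whitelist(whitelist);

    let lang = detector.detect_lang("There is no reason not to learn Esperanto.");
    assert_eq!(lang, Some(Lang::Eng));
}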

Feature toggles

Feature    Description
enum-map   Lang and Script implement the Enum trait from the enum-map crate
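
To enable the feature, turn it on in Cargo.toml (a sketch; this assumes the feature flag is available in the version you pin):

[dependencies]
whatlang = { version = "0.11.1", features = ["enum-map"] }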

How does it work?

How does the language recognition work?

The algorithm is based on trigram language models, a particular case of n-gram models. To understand the idea, please check the original whitepaper: Cavnar and Trenkle '94, "N-Gram-Based Text Categorization".
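
For a rough intuition, a character trigram is just a sliding window of three characters over the text. A toy sketch (illustrative only; whatlang's internal representation also normalizes characters and counts trigram frequencies):

// Illustrative only: extract character trigrams from a text.
fn trigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    chars
        .windows(3)
        .map(|window| window.iter().collect())
        .collect()
}

fn main() {
    // "hello" -> ["hel", "ell", "llo"]
    println!("{:?}", trigrams("hello"));
}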

How is is_reliable calculated?

It is based on the following factors:

  • How many unique trigrams are in the given text
  • How big the difference is between the first and the second (not returned) detected languages. This metric is called rate in the code base.

Therefore, the decision can be represented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas. This threshold function is a hyperbola and looks like the following:

(Figure: the reliability threshold curve separating the "Reliable" and "Not reliable" areas.)

For more details, please check a blog article Introduction to Rust Whatlang Library and Natural Language Identification Algorithms.
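
To make the idea concrete, here is a toy sketch of such a threshold function (the shape follows the description above, but the constants are made up for illustration; whatlang's actual formula and values differ):

// Illustrative sketch of a hyperbola-shaped reliability threshold.
fn is_reliable(trigram_count: f64, rate: f64) -> bool {
    // The required lead over the runner-up shrinks as more trigrams are available.
    let threshold = 0.05 + 20.0 / trigram_count;
    rate > threshold
}

fn main() {
    // Short text: even a decent lead over the second candidate is not enough.
    assert!(!is_reliable(20.0, 0.3));
    // Long text: a modest lead is already convincing.
    assert!(is_reliable(500.0, 0.1));
}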

Running benchmarks

This is mostly useful to test performance optimizations.

cargo bench

Comparison with alternatives

                          Whatlang   CLD2        CLD3
Implementation language   Rust       C++         C++
Languages                 87         83          107
Algorithm                 trigrams   quadgrams   neural network
Supported encoding        UTF-8      UTF-8       ?
HTML support              no         yes         ?

Ports and clones

Derivation

Whatlang is a derivative work from Franc (JavaScript, MIT) by Titus Wormer.

License

MIT © Sergey Potapov

Contributors

Comments
  • Upgrade Hashbrown dependency

    Upgrade Hashbrown dependency

    test detect::tests::test_detect_lang_ukrainian ... ok
    test detect::tests::test_detect_with_options_with_whitelist ... ok
    test detect::tests::test_detect_with_options_with_whitelist_mandarin_japanese ... ok
    test detect::tests::test_detect_with_options_with_blacklist_mandarin_japanese ... ok
    test detect::tests::test_detect_with_options_with_blacklist_none ... ok
    test lang::tests::test_code ... ok
    test detector::tests::test_detect_script ... ok
    test lang::tests::test_from_code ... ok
    test lang::tests::test_name ... ok
    test script::tests::test_detect_script ... ok
    test lang::tests::test_eng_name ... ok
    test script::tests::test_is_bengali ... ok
    test detect::tests::test_detect_spanish ... ok
    test script::tests::test_is_ethiopic ... ok
    test script::tests::test_is_cyrillic ... ok
    test script::tests::test_is_georgian ... ok
    test detector::tests::test_detect ... ok
    test script::tests::test_is_gurmukhi ... ok
    test script::tests::test_is_greek ... ok
    test script::tests::test_is_hangul ... ok
    test script::tests::test_is_gujarati ... ok
    test script::tests::test_is_oriya ... ok
    test script::tests::test_is_kannada ... ok
    test script::tests::test_is_katakana ... ok
    test script::tests::test_is_latin ... ok
    test script::tests::test_is_hiragana ... ok
    test script::tests::test_is_tamil ... ok
    test script::tests::test_is_telugu ... ok
    test script::tests::test_script_name ... ok
    test detector::tests::test_detect_lang ... ok
    test detect::tests::test_detect_with_options_with_blacklist ... ok
    test trigrams::tests::test_count ... ok
    test script::tests::test_is_thai ... ok
    test trigrams::tests::test_get_trigrams_with_positions ... ok
    test trigrams::tests::test_to_trigram_char ... ok
    test utils::tests::test_is_top_char ... ok
    test detect::tests::test_detect_with_random_text ... ok
    
    test result: ok. 37 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
    
         Running target/debug/deps/detect-e62082f7a21f59b1
    
    running 2 tests
    test test_with_russian_text ... ok
    test test_with_multiple_examples ... ok
    
    test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
    
         Running target/debug/deps/proptests-b01aa78ab877cbd6
    
    running 1 test
    test proptest_detect_does_not_crash ... ok
    
    test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
    
       Doc-tests whatlang
    
    running 11 tests
    test src/lang.rs - lang::Lang::eng_name (line 5032) ... ok
    test src/lang.rs - lang::Lang::name (line 5021) ... ok
    test src/detector.rs - detector::Detector (line 13) ... ok
    test src/lang.rs - lang::Lang::code (line 5010) ... ok
    test src/detector.rs - detector::Detector (line 24) ... ok
    test src/lang.rs - lang::Lang::from_code (line 4999) ... ok
    test src/detect.rs - detect::detect (line 13) ... ok
    test src/detect.rs - detect::detect_lang (line 27) ... ok
    test src/lib.rs -  (line 24) ... ok
    test src/script.rs - script::detect_script (line 77) ... ok
    test src/lib.rs -  (line 9) ... ok
    
    test result: ok. 11 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
    
    opened by batisteo 8
  • New API

    New API

    Here is the proposal for a new API. The main advantage is that it is easier to use: the desired result can be obtained in one line, while still allowing additional options (whitelist/blacklist) to be passed.

    let text = "Bla bla bla";
    let whitelist = [Lang::Epo, Lang::Spa];
    
    // get Option<Result>
    let result = whatlang::new(text).detect();
    
    // get only language, Option<Lang>
    let lang = whatlang::new(text).detect_lang();
    
    // detect only the script, Option<Script>
    let script = whatlang::new(text).detect_script();
    
    // with whitelist specified (same syntax for blacklist)
    let result = whatlang::new(text).whitelist(&whitelist).detect();
    
    opened by greyblake 7
  • Failed to compile

    Failed to compile

        |
    259 |     Lat = 84,
        |     --- not covered
        |
        = help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
        = note: the matched value is of type `whatlang::Lang`
    

    Using the Dockerfile from https://github.com/valeriansaliou/sonic/blob/master/Dockerfile (same error locally on a MacBook Pro with Homebrew).

    opened by AlexMikhalev 6
  • Generate language list without Tera

    Generate language list without Tera

    Tera pulls in quite a lot of dependencies and is only used to generate one file during the build. I removed it and replaced it with a pure Rust implementation. When running cargo test, the dependency count went from 132 to 96.

    I can understand if you refuse this change, as it may be less readable than the current version, but it makes build times shorter (I'm using whatlang in a project with 400 dependencies, and it takes 20 minutes to build, so avoiding Tera and its dependencies would be great).

    Another solution would be to generate this file once and for all, as it is probably not updated very often.

    opened by elegaanz 6
  • Implemented Lang::from_code() via procedural macros

    Implemented Lang::from_code() via procedural macros

    The to_code() / from_code() implementation looked a bit weird to me, and I wanted to get some experience in Rust language programming, so I've made this pull request :)

    The pattern matching in from_code() is now generated via procedural macros (docs: https://doc.rust-lang.org/book/procedural-macros.html).

    I wanted to make the from_string() function universal, so that EnumFromString could be applied to any enum; that's why there is from_string() in the EnumFromString trait and from_code() in the concrete implementation for the Lang enum.

    I also wanted to move EnumFromString inside the whatlang-derive library, but proc-macro crates can only export functions for now:

    error: proc-macro crate types cannot export any items other than functions tagged with #[proc_macro_derive] currently
    

    as for benchmarks:

    before change:

    running 2 tests
    test bench_detect        ... bench:  29,613,380 ns/iter (+/- 307,791)
    test bench_detect_script ... bench:     227,694 ns/iter (+/- 689)
    

    after change:

    running 2 tests
    test bench_detect        ... bench:  29,496,420 ns/iter (+/- 387,359)
    test bench_detect_script ... bench:     224,283 ns/iter (+/- 1,048)
    

    The results are almost unchanged.

    opened by alekseyl1992 6
  • Optimize latin languages detection

    Optimize latin languages detection

    Summary

    Optimize the alphabet_calculate_scores function used during Latin language detection.

    Compute the score in two steps:

    • iterate over the text and score characters based on their frequency
    • then sum the characters' scores into the languages' scores

    This avoids nested loops that made the computational complexity quadratic.
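
    A toy sketch of this two-step idea (illustrative only, not the actual whatlang code; the alphabets here are hypothetical and heavily truncated):

    use std::collections::HashMap;

    // Step 1: count how often each character occurs in the text.
    // Step 2: add each character's count to every language whose alphabet contains it.
    fn alphabet_scores(text: &str, alphabets: &[(&str, &str)]) -> HashMap<String, usize> {
        let mut char_counts: HashMap<char, usize> = HashMap::new();
        for ch in text.to_lowercase().chars().filter(|c| c.is_alphabetic()) {
            *char_counts.entry(ch).or_insert(0) += 1;
        }

        let mut scores = HashMap::new();
        for (lang, alphabet) in alphabets {
            let score: usize = char_counts
                .iter()
                .filter(|(ch, _)| alphabet.contains(**ch))
                .map(|(_, count)| *count)
                .sum();
            scores.insert(lang.to_string(), score);
        }
        scores
    }

    fn main() {
        let alphabets = [
            ("eng", "abcdefghijklmnopqrstuvwxyz"),
            ("deu", "abcdefghijklmnopqrstuvwxyzäöüß"),
        ];
        println!("{:?}", alphabet_scores("Straße und Haus", &alphabets));
    }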

    For now, I didn't touch the trigrams part; its behavior is more complicated to understand. 😅 But I will probably try to optimize it in another PR.

    Whatlang benchmarks

    main branch

    test bench_detect        ... bench:   8,120,987 ns/iter (+/- 119,807)
    test bench_detect_script ... bench:     102,829 ns/iter (+/- 1,600)
    
    test latin alphabet only ... bench:   3,533,305 ns/iter (+/- 102,068)
    

    Commits

    Replace sort_by

    test bench_detect        ... bench:   7,984,651 ns/iter (+/- 82,686)
    test bench_detect_script ... bench:      93,662 ns/iter (+/- 2,249)
    
    test latin alphabet only ... bench:   3,542,116 ns/iter (+/- 45,714)
    

    Use inverted mapping between char and Lang

    test bench_detect        ... bench:   7,642,256 ns/iter (+/- 91,700)
    test bench_detect_script ... bench:      94,344 ns/iter (+/- 2,501)
    
    test latin alphabet only ... bench:   3,221,397 ns/iter (+/- 32,611)
    

    Clamp score in normalization loop instead of creating intermediate vec

    test bench_detect        ... bench:   7,594,489 ns/iter (+/- 95,791)
    test bench_detect_script ... bench:      98,816 ns/iter (+/- 2,039)
    
    test latin alphabet only ... bench:   3,333,266 ns/iter (+/- 29,036)
    

    Increment a common score when a common char is found

    test bench_detect        ... bench:   5,190,589 ns/iter (+/- 73,054)
    test bench_detect_script ... bench:      98,731 ns/iter (+/- 1,395)
    
    test latin alphabet only ... bench:     846,625 ns/iter (+/- 13,893)
    

    Use binary search instead of iter find

    test bench_detect        ... bench:   4,992,537 ns/iter (+/- 77,899)
    test bench_detect_script ... bench:      99,535 ns/iter (+/- 5,190)
    
    test latin alphabet only ... bench:     575,347 ns/iter (+/- 7,259)
    

    Fix returned raw score

    test bench_detect        ... bench:   4,999,493 ns/iter (+/- 69,303)
    test bench_detect_script ... bench:      93,847 ns/iter (+/- 1,339)
    
    test latin alphabet only ... bench:     580,022 ns/iter (+/- 7,469)
    

    Use intermediate char score

    test bench_detect        ... bench:   4,802,885 ns/iter (+/- 73,186)
    test bench_detect_script ... bench:      93,860 ns/iter (+/- 2,912)
    
    test latin alphabet only ... bench:     388,278 ns/iter (+/- 4,427)
    

    Count Max score in char score Loop

    test bench_detect        ... bench:   4,837,791 ns/iter (+/- 71,111)
    test bench_detect_script ... bench:      94,130 ns/iter (+/- 715)
    
    test latin alphabet only ... bench:     354,250 ns/iter (+/- 11,778)
    

    Make lang score access O(1) when iterating over char scores

    test bench_detect        ... bench:   4,769,341 ns/iter (+/- 78,060)
    test bench_detect_script ... bench:      93,748 ns/iter (+/- 706)
    
    test latin alphabet only ... bench:     308,811 ns/iter (+/- 10,769)
    
    opened by ManyTheFish 5
  • Slovak language support

    Slovak language support

    Hello there!

    Using whatlang as part of sonic's language detection system. It works great overall; thanks a lot for your work, and for adding Latin recently, which has been implemented in sonic.

    I've got a user on my end requesting Slovak to be added to sonic. Do you think this is possible from whatlang? Is there any reason it's not there? (I see that Slovene is supported, while Slovak is not.)

    Ref: https://github.com/valeriansaliou/sonic/issues/178

    opened by valeriansaliou 5
  • feature request : add function to return language/script direction

    feature request : add function to return language/script direction

    Most western languages are read from left to right; however, some languages (Persian, Arabic, ...) are read from right to left, which impacts how they should be displayed. I would find it useful if this crate could, in addition to detecting languages, map a Lang or a Script to the direction it should be displayed in (which could be a simple enum with Direction::Ltr and Direction::Rtl variants, for instance). A possible shape is sketched below.
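
    One possible shape for the requested API, as a purely hypothetical sketch (Direction, Ltr, Rtl and direction_of are illustrative names, not part of whatlang):

    // Hypothetical sketch; not part of whatlang.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum Direction {
        Ltr, // left to right
        Rtl, // right to left
    }

    // Toy mapping from a script name to its writing direction.
    fn direction_of(script_name: &str) -> Direction {
        match script_name {
            "Arabic" | "Hebrew" => Direction::Rtl,
            _ => Direction::Ltr,
        }
    }

    fn main() {
        assert_eq!(direction_of("Arabic"), Direction::Rtl);
        assert_eq!(direction_of("Latin"), Direction::Ltr);
    }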

    opened by trinity-1686a 5
  • Ruby codegen

    Ruby codegen

    This is an alternative to #34 that generates the languages list with the Ruby helper in misc. It just needs to be run after a language is added to the list, as before. It removes the need for almost all build dependencies, since all CSV and JSON parsing is now done in Ruby.

    opened by elegaanz 5
  • Using generic function to compute alphabet score for Cyrillic and latin

    Using generic function to compute alphabet score for Cyrillic and latin

    #111

    It degrades performance a bit, mostly because we re-compute the alphabet-to-language map instead of having a static variable. It's probably possible to optimize this.

    On my PC, cargo bench went from about 5,000,000 ns to about 8,000,000 ns.
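
    One possible way to avoid recomputing the map on every call, as a hedged sketch (using std::sync::OnceLock; this is just one option, and the real map construction in whatlang differs):

    use std::collections::HashMap;
    use std::sync::OnceLock;

    // Build the char -> languages map once and reuse it on later calls.
    fn alphabet_lang_map() -> &'static HashMap<char, Vec<&'static str>> {
        static MAP: OnceLock<HashMap<char, Vec<&'static str>>> = OnceLock::new();
        MAP.get_or_init(|| {
            let mut map: HashMap<char, Vec<&'static str>> = HashMap::new();
            // Hypothetical, truncated alphabet data just for the example.
            for ch in "abcdefghijklmnopqrstuvwxyz".chars() {
                map.entry(ch).or_default().push("eng");
            }
            for ch in "абвгдежзийклмнопрстуфхцчшщъыьэюя".chars() {
                map.entry(ch).or_default().push("rus");
            }
            map
        })
    }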

    opened by egorgrachev 4
  • bump dependencies

    bump dependencies

    Instead of getting rid of the hashbrown dependency (#98), this upgrades it to the latest release.

    Also upgraded enum-map, serde_json and proptest to their latest releases.

    opened by jqnatividad 4
  • Can distinguish between Simplified Chinese and Japanese Kanji?

    Can distinguish between Simplified Chinese and Japanese Kanji?

    I'm from the Meilisearch community. (related: meilisearch/meilisearch/issues/2403)

    Wouldn't it be possible to distinguish between Simplified Chinese (Mandarin) and Japanese Kanji when the string consists only of Hanzi/Kanji?

    For example

    whatlang v0.16.0 detects...

    | Word | Cmn | Jpn | Meaning in English |
    | 東京 | ⭕ | ❌ | "Tokyo" in Kanji |
    | 东京 | ⭕ | ❌ | "Tokyo" in Simplified Chinese |
    | 大阪 | ⭕ | ❌ | "Osaka" in both Kanji and Simplified Chinese |
    | 会員 | ⭕ | ❌ | "member, customer" in Kanji |
    | 会员 | ⭕ | ❌ | "member, customer" in Simplified Chinese |
    | 関西国際空港 | ⭕ | ❌ | "Kansai International Airport" in Kanji |
    | 関西国际空港 | ⭕ | ❌ | "Kansai International Airport" in Simplified Chinese |

    My expected result is...

    | Word | Cmn | Jpn | Meaning in English |
    | 東京 | ❌ | ⭕ | "Tokyo" in Kanji |
    | 东京 | ⭕ | ❌ | "Tokyo" in Simplified Chinese |
    | 大阪 | ⭕ | ⭕ | "Osaka" in both Kanji and Simplified Chinese |
    | 会員 | ❌ | ⭕ | "member, customer" in Kanji |
    | 会员 | ⭕ | ❌ | "member, customer" in Simplified Chinese |
    | 関西国際空港 | ❌ | ⭕ | "Kansai International Airport" in Kanji |
    | 関西国际空港 | ⭕ | ❌ | "Kansai International Airport" in Simplified Chinese |

    References

    opened by miiton 5
  • exposing raw_detect_script()

    exposing raw_detect_script()

    I have to say I really love the work put into this very nice crate. The optimizations are very thoughtful.

    I'm working on a little side/pet project that analyzes commit messages and code comments in git repositories, and for my use case I'm interested in all the scripts found in the text. raw_detect_script() is perfect for me, as it returns the whole array of detected scripts and I can calculate their ratios, etc.

    This can probably benefit others as well, but of course I do not pretend to speak for anyone else.

    If preferable, I can send this as a pull request; I did not know what would be more appropriate here.

    opened by thed0ct0r 1
  • Evaluate other language identification methods.

    Evaluate other language identification methods.

    This issue is a reminder for myself.

    Possible options:

    • Character frequencies
    • 2-grams?
    • The most frequent words (100 or 1000)?
    • Smart/complex resolution between LangA and LangB by identifying traits that are present in one language and absent in the other. This could help when two languages have very similar statistical characteristics.
    • Řehůřek and Kolkus (2009)

    See:

    opened by greyblake 0
  • Adding Punctuation in Devnagari Script (Hindi) reduces 'Confidence'.

    Adding Punctuation in Devnagari Script (Hindi) reduces 'Confidence'.

    I tried giving it a test for Hindi (in Devnagari Script).

    With a random sentence the confidence was 39.x%. So I tested it with the sentence "It is What Language.", which I guessed might be close to home for this project.

    In Devnagari script, the full stop is written as '|'. When I tested the sentence without it, the confidence was 100%, but on including it the confidence dropped a few points.

    I'm a n00b when it comes to Rust, but I can try fixing it if you don't have time and can point me in the right direction.

    Also, let me know if you need any help training models and have a guide I can follow.

    PS: Thanks for this. I was looking for an interesting project to try restart Rust journey.


    opened by abhishekkr 2
  • Make crate no_std compatible

    Make crate no_std compatible

    This patch makes the crate compatible with no_std (embedded) environments. The API stays the same. If possible, please publish a new version after merging.

    opened by fschutt 4
Releases (v0.16.2)
  • v0.16.2 (Oct 23, 2022)

  • v0.7.0 (Mar 3, 2019)

    A new version of Whatlang (a library for natural language recognition in Rust) has been released.

    Changes

    • Support Afrikaans language (afr)
    • Get rid of build dependencies: installation is much faster now
  • v0.6.0 (Mar 3, 2019)

    • Use hashbrown instead of fnv (detect() is 30% faster)
    • Use array on stack instead of vector for detect_script (1-2% faster)
    • Use build.rs to generate lang.rs file
    • Add property based testing
  • v0.5.0 (Mar 3, 2019)

    • (breaking) Rename Lang::to_code(&self) to Lang::code(&self)
    • (fix) Fix bug with zero division in confidence calculation
    • (fix) Confidence cannot exceed 1.0
    • Implement Lang::eng_name(&self) -> &str function
    • Implement Lang::name(&self) -> &str function
    • Implement Script::name(&self) -> &str function
    • Implement Display trait for Script
    • Implement Display trait for Lang