Whatlang

Natural language detection library for Rust, with a focus on simplicity and performance. Try the demo online: https://www.greyblake.com/whatlang/

Features

  • Supports 78 languages
  • 100% written in Rust
  • Lightweight, fast and simple
  • Recognizes not only a language, but also a script (Latin, Cyrillic, etc)
  • Provides reliability information

Get started

Add to your Cargo.toml:

[dependencies]

whatlang = "0.11.1"

Example:

extern crate whatlang;

use whatlang::{detect, Lang, Script};

fn main() {
    let text = "Ĉu vi ne volas eklerni Esperanton? Bonvolu! Estas unu de la plej bonaj aferoj!";

    let info = detect(text).unwrap();
    assert_eq!(info.lang(), Lang::Epo);
    assert_eq!(info.script(), Script::Latin);
    assert_eq!(info.confidence(), 1.0);
    assert!(info.is_reliable());
}

For more details (e.g. how to blacklist some languages) please check the documentation.
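
For instance, restricting detection to a subset of languages might look like this (a minimal sketch, assuming the Detector::with_whitelist constructor documented for this version; check the API docs for the exact signature):

use whatlang::{Detector, Lang};

fn main() {
    // Only consider English and Russian as candidates.
    let whitelist = vec![Lang::Eng, Lang::Rus];
    let detector = Detector::with_whitelist(whitelist);

    let lang = detector.detect_lang("There is no reason not to learn Esperanto.");
    assert_eq!(lang, Some(Lang::Eng));
}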

Feature toggles

Feature    Description
enum-map   Lang and Script implement the Enum trait from the enum-map crate
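
To enable the feature, turn it on in Cargo.toml (a sketch; this assumes the feature flag is available in the version you pin):

[dependencies]
whatlang = { version = "0.11.1", features = ["enum-map"] }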

How does it work?

How does the language recognition work?

The algorithm is based on trigram language models, a particular case of n-gram models. To understand the idea, please check the original whitepaper: Cavnar and Trenkle '94, "N-Gram-Based Text Categorization".
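
For a rough intuition, a character trigram is just a sliding window of three characters over the text. A toy sketch (illustrative only; whatlang's internal representation also normalizes characters and counts trigram frequencies):

// Illustrative only: extract character trigrams from a text.
fn trigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    chars
        .windows(3)
        .map(|window| window.iter().collect())
        .collect()
}

fn main() {
    // "hello" -> ["hel", "ell", "llo"]
    println!("{:?}", trigrams("hello"));
}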

How is is_reliable calculated?

It is based on the following factors:

  • How many unique trigrams are in the given text
  • How big the difference is between the first and the second (not returned) detected languages. This metric is called rate in the code base.

Therefore, the decision can be represented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas. This threshold function is a hyperbola and looks like the following:

(Figure: the reliability threshold curve separating the "Reliable" and "Not reliable" areas.)

For more details, please check a blog article Introduction to Rust Whatlang Library and Natural Language Identification Algorithms.
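
To make the idea concrete, here is a toy sketch of such a threshold function (the shape follows the description above, but the constants are made up for illustration; whatlang's actual formula and values differ):

// Illustrative sketch of a hyperbola-shaped reliability threshold.
fn is_reliable(trigram_count: f64, rate: f64) -> bool {
    // The required lead over the runner-up shrinks as more trigrams are available.
    let threshold = 0.05 + 20.0 / trigram_count;
    rate > threshold
}

fn main() {
    // Short text: even a decent lead over the second candidate is not enough.
    assert!(!is_reliable(20.0, 0.3));
    // Long text: a modest lead is already convincing.
    assert!(is_reliable(500.0, 0.1));
}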

Running benchmarks

This is mostly useful to test performance optimizations.

cargo bench

Comparison with alternatives

                          Whatlang   CLD2        CLD3
Implementation language   Rust       C++         C++
Languages                 87         83          107
Algorithm                 trigrams   quadgrams   neural network
Supported encoding        UTF-8      UTF-8       ?
HTML support              no         yes         ?

Ports and clones

Derivation

Whatlang is a derivative work from Franc (JavaScript, MIT) by Titus Wormer.

License

MIT © Sergey Potapov

Contributors

Comments
  • Upgrade Hashbrown dependency

    Upgrade Hashbrown dependency

    test detect::tests::test_detect_lang_ukrainian ... ok
    test detect::tests::test_detect_with_options_with_whitelist ... ok
    test detect::tests::test_detect_with_options_with_whitelist_mandarin_japanese ... ok
    test detect::tests::test_detect_with_options_with_blacklist_mandarin_japanese ... ok
    test detect::tests::test_detect_with_options_with_blacklist_none ... ok
    test lang::tests::test_code ... ok
    test detector::tests::test_detect_script ... ok
    test lang::tests::test_from_code ... ok
    test lang::tests::test_name ... ok
    test script::tests::test_detect_script ... ok
    test lang::tests::test_eng_name ... ok
    test script::tests::test_is_bengali ... ok
    test detect::tests::test_detect_spanish ... ok
    test script::tests::test_is_ethiopic ... ok
    test script::tests::test_is_cyrillic ... ok
    test script::tests::test_is_georgian ... ok
    test detector::tests::test_detect ... ok
    test script::tests::test_is_gurmukhi ... ok
    test script::tests::test_is_greek ... ok
    test script::tests::test_is_hangul ... ok
    test script::tests::test_is_gujarati ... ok
    test script::tests::test_is_oriya ... ok
    test script::tests::test_is_kannada ... ok
    test script::tests::test_is_katakana ... ok
    test script::tests::test_is_latin ... ok
    test script::tests::test_is_hiragana ... ok
    test script::tests::test_is_tamil ... ok
    test script::tests::test_is_telugu ... ok
    test script::tests::test_script_name ... ok
    test detector::tests::test_detect_lang ... ok
    test detect::tests::test_detect_with_options_with_blacklist ... ok
    test trigrams::tests::test_count ... ok
    test script::tests::test_is_thai ... ok
    test trigrams::tests::test_get_trigrams_with_positions ... ok
    test trigrams::tests::test_to_trigram_char ... ok
    test utils::tests::test_is_top_char ... ok
    test detect::tests::test_detect_with_random_text ... ok
    
    test result: ok. 37 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
    
         Running target/debug/deps/detect-e62082f7a21f59b1
    
    running 2 tests
    test test_with_russian_text ... ok
    test test_with_multiple_examples ... ok
    
    test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
    
         Running target/debug/deps/proptests-b01aa78ab877cbd6
    
    running 1 test
    test proptest_detect_does_not_crash ... ok
    
    test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
    
       Doc-tests whatlang
    
    running 11 tests
    test src/lang.rs - lang::Lang::eng_name (line 5032) ... ok
    test src/lang.rs - lang::Lang::name (line 5021) ... ok
    test src/detector.rs - detector::Detector (line 13) ... ok
    test src/lang.rs - lang::Lang::code (line 5010) ... ok
    test src/detector.rs - detector::Detector (line 24) ... ok
    test src/lang.rs - lang::Lang::from_code (line 4999) ... ok
    test src/detect.rs - detect::detect (line 13) ... ok
    test src/detect.rs - detect::detect_lang (line 27) ... ok
    test src/lib.rs -  (line 24) ... ok
    test src/script.rs - script::detect_script (line 77) ... ok
    test src/lib.rs -  (line 9) ... ok
    
    test result: ok. 11 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
    
    opened by batisteo 8
  • New API

    New API

    Here is the proposal for a new API. The main advantage is that it is easier to use: the desired result can be obtained in one line, while still allowing additional options (whitelist/blacklist) to be passed.

    let text = "Bla bla bla";
    let whitelist = [Lang::Epo, Lang::Spa];
    
    // get Option<Result>
    let result = whatlang::new(text).detect();
    
    // get only language, Option<Lang>
    let lang = whatlang::new(text).detect_lang();
    
    // detect only the script, Option<Script>
    let script = whatlang::new(text).detect_script();
    
    // with whitelist specified (same syntax for blacklist)
    let result = whatlang::new(text).whitelist(&whitelist).detect();
    
    opened by greyblake 7
  • Failed to compile

    Failed to compile

        |
    259 |     Lat = 84,
        |     --- not covered
        |
        = help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
        = note: the matched value is of type `whatlang::Lang`
    

    Using the Dockerfile from https://github.com/valeriansaliou/sonic/blob/master/Dockerfile (same error locally on a MacBook Pro with Homebrew).

    opened by AlexMikhalev 6
  • Generate language list without Tera

    Generate language list without Tera

    Tera pulls in quite a lot of dependencies and is only used to generate one file during the build. I removed it and replaced it with a pure Rust implementation. When running cargo test, the dependency count went from 132 to 96.

    I can understand if you refuse this change, as it may be less readable than the current version, but it makes build times shorter (I'm using whatlang in a project with 400 dependencies, and it takes 20 minutes to build, so avoiding Tera and its dependencies would be great).

    Another solution would be to generate this file once and for all, as it is probably not updated very often.

    opened by elegaanz 6
  • Implemented Lang::from_code() via procedural macros

    Implemented Lang::from_code() via procedural macros

    The to_code() / from_code() implementation looked a bit weird to me, and I wanted to get some experience in Rust language programming, so I've made this pull request :)

    The pattern matching in from_code() is now generated via procedural macros (docs: https://doc.rust-lang.org/book/procedural-macros.html).

    I wanted to make the from_string() function universal, so that EnumFromString could be applied to any enum; that's why there is from_string() in the EnumFromString trait and from_code() in the concrete implementation for the Lang enum.

    I also wanted to move EnumFromString inside the whatlang-derive library, but proc-macro crates can only export functions for now:

    error: proc-macro crate types cannot export any items other than functions tagged with #[proc_macro_derive] currently
    

    as for benchmarks:

    before change:

    running 2 tests
    test bench_detect        ... bench:  29,613,380 ns/iter (+/- 307,791)
    test bench_detect_script ... bench:     227,694 ns/iter (+/- 689)
    

    after change:

    running 2 tests
    test bench_detect        ... bench:  29,496,420 ns/iter (+/- 387,359)
    test bench_detect_script ... bench:     224,283 ns/iter (+/- 1,048)
    

    The results are almost unchanged.

    opened by alekseyl1992 6
  • Optimize latin languages detection

    Optimize latin languages detection

    Summary

    Optimize the alphabet_calculate_scores function used during Latin language detection.

    Compute the score in two steps:

    • iterate over the text and score characters based on their frequency
    • then sum the characters' scores into the languages' scores

    This avoids nested loops that made the computational complexity quadratic.
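
    A toy sketch of this two-step idea (illustrative only, not the actual whatlang code; the alphabets here are hypothetical and heavily truncated):

    use std::collections::HashMap;

    // Step 1: count how often each character occurs in the text.
    // Step 2: add each character's count to every language whose alphabet contains it.
    fn alphabet_scores(text: &str, alphabets: &[(&str, &str)]) -> HashMap<String, usize> {
        let mut char_counts: HashMap<char, usize> = HashMap::new();
        for ch in text.to_lowercase().chars().filter(|c| c.is_alphabetic()) {
            *char_counts.entry(ch).or_insert(0) += 1;
        }

        let mut scores = HashMap::new();
        for (lang, alphabet) in alphabets {
            let score: usize = char_counts
                .iter()
                .filter(|(ch, _)| alphabet.contains(**ch))
                .map(|(_, count)| *count)
                .sum();
            scores.insert(lang.to_string(), score);
        }
        scores
    }

    fn main() {
        let alphabets = [
            ("eng", "abcdefghijklmnopqrstuvwxyz"),
            ("deu", "abcdefghijklmnopqrstuvwxyzäöüß"),
        ];
        println!("{:?}", alphabet_scores("Straße und Haus", &alphabets));
    }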

    For now, I didn't touch the trigrams part; its behavior is more complicated to understand. 😅 But I will probably try to optimize it in another PR.

    Whatlang benchmarks

    main branch

    test bench_detect        ... bench:   8,120,987 ns/iter (+/- 119,807)
    test bench_detect_script ... bench:     102,829 ns/iter (+/- 1,600)
    
    test latin alphabet only ... bench:   3,533,305 ns/iter (+/- 102,068)
    

    Commits

    Replace sort_by

    test bench_detect        ... bench:   7,984,651 ns/iter (+/- 82,686)
    test bench_detect_script ... bench:      93,662 ns/iter (+/- 2,249)
    
    test latin alphabet only ... bench:   3,542,116 ns/iter (+/- 45,714)
    

    Use inverted mapping between char and Lang

    test bench_detect        ... bench:   7,642,256 ns/iter (+/- 91,700)
    test bench_detect_script ... bench:      94,344 ns/iter (+/- 2,501)
    
    test latin alphabet only ... bench:   3,221,397 ns/iter (+/- 32,611)
    

    Clamp score in normalization loop instead of creating intermediate vec

    test bench_detect        ... bench:   7,594,489 ns/iter (+/- 95,791)
    test bench_detect_script ... bench:      98,816 ns/iter (+/- 2,039)
    
    test latin alphabet only ... bench:   3,333,266 ns/iter (+/- 29,036)
    

    Increment a common score when a common char is found

    test bench_detect        ... bench:   5,190,589 ns/iter (+/- 73,054)
    test bench_detect_script ... bench:      98,731 ns/iter (+/- 1,395)
    
    test latin alphabet only ... bench:     846,625 ns/iter (+/- 13,893)
    

    Use binary search instead of iter find

    test bench_detect        ... bench:   4,992,537 ns/iter (+/- 77,899)
    test bench_detect_script ... bench:      99,535 ns/iter (+/- 5,190)
    
    test latin alphabet only ... bench:     575,347 ns/iter (+/- 7,259)
    

    Fix returned raw score

    test bench_detect        ... bench:   4,999,493 ns/iter (+/- 69,303)
    test bench_detect_script ... bench:      93,847 ns/iter (+/- 1,339)
    
    test latin alphabet only ... bench:     580,022 ns/iter (+/- 7,469)
    

    Use intermediate char score

    test bench_detect        ... bench:   4,802,885 ns/iter (+/- 73,186)
    test bench_detect_script ... bench:      93,860 ns/iter (+/- 2,912)
    
    test latin alphabet only ... bench:     388,278 ns/iter (+/- 4,427)
    

    Count Max score in char score Loop

    test bench_detect        ... bench:   4,837,791 ns/iter (+/- 71,111)
    test bench_detect_script ... bench:      94,130 ns/iter (+/- 715)
    
    test latin alphabet only ... bench:     354,250 ns/iter (+/- 11,778)
    

    Make lang score access O(1) when iterating over char scores

    test bench_detect        ... bench:   4,769,341 ns/iter (+/- 78,060)
    test bench_detect_script ... bench:      93,748 ns/iter (+/- 706)
    
    test latin alphabet only ... bench:     308,811 ns/iter (+/- 10,769)
    
    opened by ManyTheFish 5
  • Slovak language support

    Slovak language support

    Hello there!

    Using whatlang as part of sonic's language detection system. It works great overall; thanks a lot for your work, and for adding Latin recently, which has been implemented in sonic.

    I've got a user on my end requesting Slovak to be added to sonic. Do you think this is possible from whatlang? Is there any reason it's not there? (I see that Slovene is supported, while Slovak is not.)

    Ref: https://github.com/valeriansaliou/sonic/issues/178

    opened by valeriansaliou 5
  • feature request : add function to return language/script direction

    feature request : add function to return language/script direction

    Most western languages are read from left to right; however, some languages (Persian, Arabic, ...) are read from right to left, which impacts how they should be displayed. I would find it useful if this crate could, in addition to detecting languages, map a Lang or a Script to the direction it should be displayed in (which could be a simple enum with Direction::Ltr and Direction::Rtl variants, for instance). A possible shape is sketched below.
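
    One possible shape for the requested API, as a purely hypothetical sketch (Direction, Ltr, Rtl and direction_of are illustrative names, not part of whatlang):

    // Hypothetical sketch; not part of whatlang.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum Direction {
        Ltr, // left to right
        Rtl, // right to left
    }

    // Toy mapping from a script name to its writing direction.
    fn direction_of(script_name: &str) -> Direction {
        match script_name {
            "Arabic" | "Hebrew" => Direction::Rtl,
            _ => Direction::Ltr,
        }
    }

    fn main() {
        assert_eq!(direction_of("Arabic"), Direction::Rtl);
        assert_eq!(direction_of("Latin"), Direction::Ltr);
    }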

    opened by trinity-1686a 5
  • Ruby codegen

    Ruby codegen

    This is an alternative to #34 that generates the languages list with the Ruby helper in misc. It just needs to be run after a language is added to the list, as before. It removes the need for almost all build dependencies, since all CSV and JSON parsing is now done in Ruby.

    opened by elegaanz 5
  • Using generic function to compute alphabet score for Cyrillic and latin

    Using generic function to compute alphabet score for Cyrillic and latin

    #111

    It degrades performance a bit, mostly because we re-compute the alphabet-to-language map instead of having a static variable. It's probably possible to optimize this.

    On my PC, cargo bench went from about 5,000,000 ns to about 8,000,000 ns.
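
    One possible way to avoid recomputing the map on every call, as a hedged sketch (using std::sync::OnceLock; this is just one option, and the real map construction in whatlang differs):

    use std::collections::HashMap;
    use std::sync::OnceLock;

    // Build the char -> languages map once and reuse it on later calls.
    fn alphabet_lang_map() -> &'static HashMap<char, Vec<&'static str>> {
        static MAP: OnceLock<HashMap<char, Vec<&'static str>>> = OnceLock::new();
        MAP.get_or_init(|| {
            let mut map: HashMap<char, Vec<&'static str>> = HashMap::new();
            // Hypothetical, truncated alphabet data just for the example.
            for ch in "abcdefghijklmnopqrstuvwxyz".chars() {
                map.entry(ch).or_default().push("eng");
            }
            for ch in "абвгдежзийклмнопрстуфхцчшщъыьэюя".chars() {
                map.entry(ch).or_default().push("rus");
            }
            map
        })
    }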

    opened by egorgrachev 4
  • bump dependencies

    bump dependencies

    Instead of getting rid of the hashbrown dependency (#98), this upgrades it to the latest release.

    Also upgraded enum-map, serde_json and proptest to their latest releases.

    opened by jqnatividad 4
  • Can distinguish between Simplified Chinese and Japanese Kanji?

    Can distinguish between Simplified Chinese and Japanese Kanji?

    I'm from the Meilisearch community. (related: meilisearch/meilisearch/issues/2403)

    Wouldn't it be possible to distinguish between Simplified Chinese (Mandarin) and Japanese Kanji when the string consists only of Hanzi/Kanji?

    For example

    whatlang v0.16.0 detects...

    | Word | Cmn | Jpn | Meaning in English |
    | 東京 | ⭕ | ❌ | "Tokyo" in Kanji |
    | 东京 | ⭕ | ❌ | "Tokyo" in Simplified Chinese |
    | 大阪 | ⭕ | ❌ | "Osaka" in both Kanji and Simplified Chinese |
    | 会員 | ⭕ | ❌ | "member, customer" in Kanji |
    | 会员 | ⭕ | ❌ | "member, customer" in Simplified Chinese |
    | 関西国際空港 | ⭕ | ❌ | "Kansai International Airport" in Kanji |
    | 関西国际空港 | ⭕ | ❌ | "Kansai International Airport" in Simplified Chinese |

    My expected result is...

    | Word | Cmn | Jpn | Meaning in English |
    | 東京 | ❌ | ⭕ | "Tokyo" in Kanji |
    | 东京 | ⭕ | ❌ | "Tokyo" in Simplified Chinese |
    | 大阪 | ⭕ | ⭕ | "Osaka" in both Kanji and Simplified Chinese |
    | 会員 | ❌ | ⭕ | "member, customer" in Kanji |
    | 会员 | ⭕ | ❌ | "member, customer" in Simplified Chinese |
    | 関西国際空港 | ❌ | ⭕ | "Kansai International Airport" in Kanji |
    | 関西国际空港 | ⭕ | ❌ | "Kansai International Airport" in Simplified Chinese |

    References

    opened by miiton 5
  • exposing raw_detect_script()

    exposing raw_detect_script()

    I have to say I really love the work put into this very nice crate. The optimizations are very thoughtful.

    I'm working on a little side/pet project that analyzes commit messages and code comments in git repositories, and for my use case I'm interested in all the scripts found in the text. raw_detect_script() is perfect for me, as it returns the whole array of detected scripts and I can calculate their ratios, etc.

    This can probably benefit others as well, but of course I do not pretend to speak for anyone else.

    If preferable, I can send this as a pull request; I did not know what would be more appropriate here.

    opened by thed0ct0r 1
  • Evaluate other language identification methods.

    Evaluate other language identification methods.

    This issue is a reminder for myself.

    Possible options:

    • Character frequencies
    • 2-grams?
    • The most frequent words (100 or 1000)?
    • Smart/complex resolution between LangA and LangB by identifying traits that are present in one language and absent in the other. This could help when two languages have very similar statistical characteristics.
    • Řehůřek and Kolkus (2009)

    See:

    opened by greyblake 0
  • Adding Punctuation in Devnagari Script (Hindi) reduces 'Confidence'.

    Adding Punctuation in Devnagari Script (Hindi) reduces 'Confidence'.

    I tried giving it a test for Hindi (in Devnagari Script).

    With a random sentence the confidence was 39.x%. So I tested it with the sentence "It is What Language.", which I guessed might be close to home for this project.

    In Devnagari script, the full stop is written as '|'. When I tested the sentence without it, the confidence was 100%, but on including it the confidence dropped a few points.

    I'm a n00b when it comes to Rust, but I can try fixing it if you don't have time and can point me in the right direction.

    Also, let me know if you need any help training models and have a guide I can follow.

    PS: Thanks for this. I was looking for an interesting project to try restart Rust journey.


    opened by abhishekkr 2
  • Make crate no_std compatible

    Make crate no_std compatible

    This patch makes the crate compatible with no_std (embedded) environments. The API stays the same. If possible, please publish a new version after merging.

    opened by fschutt 4
Releases (v0.16.2)
  • v0.16.2 (Oct 23, 2022)

  • v0.7.0 (Mar 3, 2019)

    A new version of Whatlang (a library for natural language recognition in Rust) has been released.

    Changes

    • Support Afrikaans language (afr)
    • Get rid of build dependencies: installation is much faster now
  • v0.6.0 (Mar 3, 2019)

    • Use hashbrown instead of fnv (detect() is 30% faster)
    • Use array on stack instead of vector for detect_script (1-2% faster)
    • Use build.rs to generate lang.rs file
    • Add property based testing
  • v0.5.0 (Mar 3, 2019)

    • (breaking) Rename Lang::to_code(&self) to Lang::code(&self)
    • (fix) Fix bug with zero division in confidence calculation
    • (fix) Confidence cannot exceed 1.0
    • Implement Lang::eng_name(&self) -> &str function
    • Implement Lang::name(&self) -> &str function
    • Implement Script::name(&self) -> &str function
    • Implement Display trait for Script
    • Implement Display trait for Lang