👄 The most accurate natural language detection library in the Rust ecosystem, suitable for long and short text alike

Peter M. Stahl

Last update: Jan 3, 2023

Related tags

Text processing rust natural-language-processing language-detection rust-library language-recognition rust-crate language-classification

Overview

What does this library do?
Why does this library exist?
Which languages are supported?
How good is it?
Why is it better than other libraries?
Test report generation
How to add it to your project?
How to build?
How to use?
What's next for version 1.2.0?
Contributions

1. What does this library do?

Its task is simple: It tells you which language some provided textual data is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Other use cases, for instance, might include routing e-mails to the right geographically located customer service department, based on the e-mails' languages.

2. Why does this library exist?

Language detection is often done as part of large machine learning frameworks or natural language processing applications. In cases where you don't need the full-fledged functionality of those systems or don't want to learn the ropes of those, a small flexible library comes in handy.

So far, the only other comprehensive open source libraries in the Rust ecosystem for this task are CLD2 and Whatlang. Unfortunately, they have two major drawbacks:

- Detection only works with quite lengthy text fragments. For very short text snippets such as Twitter messages, they do not provide adequate results.
- The more languages take part in the decision process, the less accurate the detection results are.

Lingua aims at eliminating these problems. It needs almost no configuration and yields pretty accurate results on both long and short text, even on single words and phrases. It draws on both rule-based and statistical methods but does not use any dictionaries of words. It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline.

3. Which languages are supported?

Compared to other language detection libraries, Lingua's focus is on quality over quantity, that is, getting detection right for a small set of languages first before adding new ones. Currently, the following 75 languages are supported:

A
- Afrikaans
- Albanian
- Arabic
- Armenian
- Azerbaijani
B
- Basque
- Belarusian
- Bengali
- Norwegian Bokmal
- Bosnian
- Bulgarian
C
- Catalan
- Chinese
- Croatian
- Czech
D
- Danish
- Dutch
E
- English
- Esperanto
- Estonian
F
- Finnish
- French
G
- Ganda
- Georgian
- German
- Greek
- Gujarati
H
- Hebrew
- Hindi
- Hungarian
I
- Icelandic
- Indonesian
- Irish
- Italian
J
- Japanese
K
- Kazakh
- Korean
L
- Latin
- Latvian
- Lithuanian
M
- Macedonian
- Malay
- Maori
- Marathi
- Mongolian
N
- Norwegian Nynorsk
P
- Persian
- Polish
- Portuguese
- Punjabi
R
- Romanian
- Russian
S
- Serbian
- Shona
- Slovak
- Slovene
- Somali
- Sotho
- Spanish
- Swahili
- Swedish
T
- Tagalog
- Tamil
- Telugu
- Thai
- Tsonga
- Tswana
- Turkish
U
- Ukrainian
- Urdu
V
- Vietnamese
W
- Welsh
X
- Xhosa
Y
- Yoruba
Z
- Zulu

4. How good is it?

Lingua is able to report accuracy statistics for some bundled test data available for each supported language. The test data for each language is split into three parts:

- a list of single words with a minimum length of 5 characters
- a list of word pairs with a minimum length of 10 characters
- a list of complete grammatical sentences of various lengths

Both the language models and the test data have been created from separate documents of the Wortschatz corpora offered by Leipzig University, Germany. Data crawled from various news websites have been used for training, each corpus comprising one million sentences. For testing, corpora made of arbitrarily chosen websites have been used, each comprising ten thousand sentences. From each test corpus, a random unsorted subset of 1000 single words, 1000 word pairs and 1000 sentences has been extracted, respectively.

Given the generated test data, I have compared the detection results of Lingua, CLD2 and Whatlang running over the data of Lingua's supported 75 languages. Languages that are not supported by CLD2 or Whatlang are simply ignored for the respective library during the detection process.

The box plot below shows the distribution of the averaged accuracy values for all three performed tasks: single word detection, word pair detection and sentence detection. Lingua clearly outperforms its contenders. Bar plots for each language and further box plots for the separate detection tasks can be found in the file ACCURACY_PLOTS.md. Detailed statistics including mean, median and standard deviation values for each language and classifier are available in the file ACCURACY_TABLE.md.

5. Why is it better than other libraries?

Every language detector uses a probabilistic n-gram model trained on the character distribution in some training corpus. Most libraries only use n-grams of size 3 (trigrams), which is satisfactory for detecting the language of longer text fragments consisting of multiple sentences. For short phrases or single words, however, trigrams are not enough. The shorter the input text is, the fewer n-grams are available. The probabilities estimated from so few n-grams are not reliable. This is why Lingua makes use of n-grams of sizes 1 up to 5, which results in much more accurate prediction of the correct language.
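To make the n-gram argument concrete, here is a minimal sketch of character n-gram extraction for sizes 1 up to 5. The function names char_ngrams and all_ngrams are hypothetical and not part of Lingua's API; the sketch only illustrates how few trigrams a short word yields compared to the full set of n-grams.

```rust
/// Returns all character n-grams of length `n` found in `text`.
fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // `windows` panics for a size of zero, and short texts have no n-grams.
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars
        .windows(n)
        .map(|window| window.iter().collect())
        .collect()
}

/// Collects the n-grams of sizes 1 up to 5, the range described above.
fn all_ngrams(text: &str) -> Vec<String> {
    (1..=5).flat_map(|n| char_ngrams(text, n)).collect()
}
```

For the four-letter word "haus", char_ngrams("haus", 3) yields only two trigrams, whereas all_ngrams("haus") yields ten n-grams in total, which gives a statistical model considerably more evidence to work with.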

A second important difference is that Lingua does not only use such a statistical model, but also a rule-based engine. This engine first determines the alphabet of the input text and searches for characters which are unique in one or more languages. If exactly one language can be reliably chosen this way, the statistical model is not necessary anymore. In any case, the rule-based engine filters out languages that do not satisfy the conditions of the input text. Only then, in a second step, is the probabilistic n-gram model taken into consideration. This makes sense because loading fewer language models means less memory consumption and better runtime performance.
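The two-step idea can be sketched with a toy rule engine. Everything here is an assumption made for illustration: the enum Lang, the function filter_by_rules, the use of Unicode block ranges to recognize a script, and the single "unique character" rule for ß. Lingua's actual rules are more elaborate and are not reproduced here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lang {
    English,
    German,
    Russian,
    Ukrainian,
    Greek,
}

/// Narrows down `candidates` before any statistical model would be consulted.
fn filter_by_rules(text: &str, candidates: &[Lang]) -> Vec<Lang> {
    // Toy "unique character" rule: among the candidates, only German uses 'ß'.
    if text.contains('ß') && candidates.contains(&Lang::German) {
        return vec![Lang::German];
    }
    // Toy script detection based on the basic Cyrillic and Greek Unicode blocks.
    let has_cyrillic = text.chars().any(|c| ('\u{0400}'..='\u{04FF}').contains(&c));
    let has_greek = text.chars().any(|c| ('\u{0370}'..='\u{03FF}').contains(&c));
    candidates
        .iter()
        .copied()
        .filter(|lang| match lang {
            Lang::Russian | Lang::Ukrainian => has_cyrillic,
            Lang::Greek => has_greek,
            Lang::English | Lang::German => !has_cyrillic && !has_greek,
        })
        .collect()
}
```

When the filter returns exactly one language, a statistical model is not needed at all. Otherwise only the models of the remaining languages would have to be loaded.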

In general, it is always a good idea to restrict the set of languages to be considered in the classification process using the respective API methods. If you know beforehand that certain languages are never to occur in an input text, do not let those take part in the classification process. The filtering mechanism of the rule-based engine is quite good; however, filtering based on your own knowledge of the input text is always preferable.

6. Test report generation

If you want to reproduce the accuracy results above, you can generate the test reports yourself for all classifiers and all languages by running:

cargo run --release --example accuracy_reports

It is important to use the --release flag here because loading the language models in debug mode takes too much time. For each detector and language, a test report file is then written into /accuracy-reports, to be found next to the src directory. As an example, here is the current output of the Lingua German report:

##### German #####

>>> Accuracy on average: 89.1%

>> Detection of 1000 single words (average length: 9 chars)
Accuracy: 73.6%
Erroneously classified as Dutch: 2.3%, Danish: 2.1%, English: 2.1%, Latin: 2%, Bokmal: 1.6%, Basque: 1.2%, French: 1.2%, Italian: 1.2%, Esperanto: 1.1%, Swedish: 1%, Afrikaans: 0.8%, Tsonga: 0.7%, Nynorsk: 0.6%, Portuguese: 0.6%, Estonian: 0.5%, Finnish: 0.5%, Sotho: 0.5%, Welsh: 0.5%, Yoruba: 0.5%, Icelandic: 0.4%, Irish: 0.4%, Polish: 0.4%, Spanish: 0.4%, Swahili: 0.4%, Tswana: 0.4%, Bosnian: 0.3%, Catalan: 0.3%, Tagalog: 0.3%, Albanian: 0.2%, Croatian: 0.2%, Indonesian: 0.2%, Lithuanian: 0.2%, Romanian: 0.2%, Slovak: 0.2%, Xhosa: 0.2%, Zulu: 0.2%, Latvian: 0.1%, Malay: 0.1%, Slovene: 0.1%, Somali: 0.1%, Turkish: 0.1%

>> Detection of 1000 word pairs (average length: 18 chars)
Accuracy: 94%
Erroneously classified as Dutch: 0.9%, Latin: 0.8%, English: 0.7%, Swedish: 0.6%, Danish: 0.5%, French: 0.4%, Bokmal: 0.3%, Irish: 0.2%, Swahili: 0.2%, Tagalog: 0.2%, Afrikaans: 0.1%, Esperanto: 0.1%, Estonian: 0.1%, Finnish: 0.1%, Icelandic: 0.1%, Italian: 0.1%, Nynorsk: 0.1%, Somali: 0.1%, Tsonga: 0.1%, Turkish: 0.1%, Welsh: 0.1%, Zulu: 0.1%

>> Detection of 1000 sentences (average length: 112 chars)
Accuracy: 99.7%
Erroneously classified as Dutch: 0.2%, Latin: 0.1%
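The percentages in such a report can be derived from raw predictions as sketched below. This is a minimal illustration and not the code of the accuracy_reports example; the helper names round1 and accuracy_report are hypothetical. A prediction of None stands for "no language reliably detected" and counts neither as correct nor as a specific misclassification.

```rust
use std::collections::BTreeMap;

/// Rounds to one decimal place, as in the report above.
fn round1(value: f64) -> f64 {
    (value * 10.0).round() / 10.0
}

/// Returns the accuracy in percent and the share of each wrongly
/// predicted language in percent.
fn accuracy_report<'a>(
    expected: &str,
    predictions: &[Option<&'a str>],
) -> (f64, BTreeMap<&'a str, f64>) {
    if predictions.is_empty() {
        return (0.0, BTreeMap::new());
    }
    let total = predictions.len() as f64;
    let mut correct = 0usize;
    let mut errors: BTreeMap<&'a str, usize> = BTreeMap::new();
    for prediction in predictions {
        match prediction {
            Some(language) if *language == expected => correct += 1,
            Some(language) => *errors.entry(*language).or_insert(0) += 1,
            None => {}
        }
    }
    let percent = |count: usize| round1(count as f64 / total * 100.0);
    let error_shares = errors
        .into_iter()
        .map(|(language, count)| (language, percent(count)))
        .collect();
    (percent(correct), error_shares)
}
```

The average at the top of the German report is consistent with the plain mean of the three task accuracies: round1((73.6 + 94.0 + 99.7) / 3.0) gives 89.1.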

7. How to add it to your project?

Add Lingua to your Cargo.toml file like so:

[dependencies]
lingua = "1.1.0"

8. How to build?

In order to build the source code yourself, you need the stable Rust toolchain installed on your machine so that cargo, the Rust package manager, is available.

git clone https://github.com/pemistahl/lingua-rs.git
cd lingua-rs
cargo build

The source code is accompanied by an extensive unit test suite. To run it, simply say:

cargo test --lib

9. How to use?

9.1 Basic usage

use lingua::{Language, LanguageDetector, LanguageDetectorBuilder};
use lingua::Language::{English, French, German, Spanish};

let languages = vec![English, French, German, Spanish];
let detector: LanguageDetector = LanguageDetectorBuilder::from_languages(&languages).build();
let detected_language: Option<Language> = detector.detect_language_of("languages are awesome");

assert_eq!(detected_language, Some(English));

All instances of LanguageDetector within a single application share the same language models and have synchronized access to them. So you can safely have multiple instances without worrying about consuming too much memory.
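How such sharing can work in principle is sketched below using only the standard library. This is not Lingua's implementation: the type alias Model, the static MODELS and the functions build_model and shared_model are hypothetical stand-ins, and OnceLock requires Rust 1.70 or newer.

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock, RwLock};

/// A toy language model: n-gram -> probability.
type Model = HashMap<String, f64>;

/// Process-wide cache that every detector instance would consult.
static MODELS: OnceLock<RwLock<HashMap<&'static str, Arc<Model>>>> = OnceLock::new();

/// Placeholder for the expensive loading step (hypothetical).
fn build_model(language: &str) -> Model {
    let mut model = Model::new();
    model.insert(format!("{}-dummy", language), 1.0);
    model
}

/// Returns the shared model for `language`, loading it at most once.
fn shared_model(language: &'static str) -> Arc<Model> {
    let cache = MODELS.get_or_init(|| RwLock::new(HashMap::new()));
    // Fast path: many readers may hold the read lock at the same time.
    if let Some(model) = cache.read().unwrap().get(language) {
        return Arc::clone(model);
    }
    // Slow path: two threads may both miss above, but `entry` under the
    // write lock makes sure that only one model is ever stored.
    let mut models = cache.write().unwrap();
    let model = models
        .entry(language)
        .or_insert_with(|| Arc::new(build_model(language)));
    Arc::clone(model)
}
```

Two calls to shared_model("en") return handles to the same allocation, so a second detector adds only a reference count, not a second copy of the model.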

9.2 Minimum relative distance

By default, Lingua returns the most likely language for a given input text. However, there are certain words that are spelled the same in more than one language. The word prologue, for instance, is both a valid English and French word. Lingua would output either English or French which might be wrong in the given context. For cases like that, it is possible to specify a minimum relative distance that the logarithmized and summed up probabilities for each possible language have to satisfy. It can be stated in the following way:

use lingua::LanguageDetectorBuilder;
use lingua::Language::{English, French, German, Spanish};

let detector = LanguageDetectorBuilder::from_languages(&[English, French, German, Spanish])
    .with_minimum_relative_distance(0.25) // minimum: 0.00 maximum: 0.99 default: 0.00
    .build();
let detected_language = detector.detect_language_of("languages are awesome");

assert_eq!(detected_language, None);

Be aware that the distance between the language probabilities is dependent on the length of the input text. The longer the input text, the larger the distance between the languages. So if you want to classify very short text phrases, do not set the minimum relative distance too high. Otherwise None will be returned most of the time as in the example above. This is the return value for cases where language detection is not reliably possible.
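The effect of the threshold can be sketched as follows. The function most_likely is hypothetical, and the sketch assumes that the distance is measured between the two best relative confidence values; the text above does not spell out the exact computation, so treat this as an illustration only. The input values are the rounded confidence values from the example in section 9.3.

```rust
/// Picks the top language unless the runner-up is too close.
/// `confidences` must be sorted by confidence in descending order.
fn most_likely<'a>(
    confidences: &[(&'a str, f64)],
    minimum_relative_distance: f64,
) -> Option<&'a str> {
    match confidences {
        [] => None,
        [(language, _)] => Some(*language),
        [(first, best), (_, second), ..] => {
            if best - second < minimum_relative_distance {
                None // too close to call
            } else {
                Some(*first)
            }
        }
    }
}
```

With the values (English, 1.0) and (French, 0.79), the distance is 0.21. A threshold of 0.25 therefore yields None, while the default of 0.0 yields English.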

9.3 Confidence values

Knowing about the most likely language is nice, but how reliable is the computed likelihood? And how much less likely are the other examined languages in comparison to the most likely one? These questions can be answered as well:

use lingua::{LanguageDetectorBuilder, Language};
use lingua::Language::{English, French, German, Spanish};
use float_cmp::approx_eq;

let languages = vec![English, French, German, Spanish];
let detector = LanguageDetectorBuilder::from_languages(&languages).build();
let confidence_values: Vec<(Language, f64)> = detector.compute_language_confidence_values(
    "languages are awesome"
);

// The more readable version of the assertions below:
// assert_eq!(
//     confidence_values,
//     vec![(English, 1.0), (French, 0.79), (German, 0.75), (Spanish, 0.72)]
// );

assert_eq!(confidence_values[0], (English, 1.0_f64));

assert_eq!(confidence_values[1].0, French);
assert!(approx_eq!(f64, confidence_values[1].1, 0.7917282993701181, ulps = 2));

assert_eq!(confidence_values[2].0, German);
assert!(approx_eq!(f64, confidence_values[2].1, 0.7532048914992281, ulps = 2));

assert_eq!(confidence_values[3].0, Spanish);
assert!(approx_eq!(f64, confidence_values[3].1, 0.7229637749926444, ulps = 2));

In the example above, a vector of all possible languages is returned, sorted by their confidence value in descending order. The values that the detector computes are part of a relative confidence metric, not of an absolute one. Each value is a number between 0.0 and 1.0. The most likely language is always returned with value 1.0. All other languages get values assigned which are lower than 1.0, denoting how much less likely those languages are in comparison to the most likely language.

The vector returned by this method does not necessarily contain all languages which the calling instance of LanguageDetector was built from. If the rule-based engine decides that a specific language is truly impossible, then it will not be part of the returned vector. Likewise, if no n-gram probabilities can be found within the detector's languages for the given input text, the returned vector will be empty. The confidence value for each language not being part of the returned vector is assumed to be 0.0.
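One simple way to obtain a relative metric with these properties is sketched below. It is an assumption made for illustration: the function relative_confidences is hypothetical, and whether Lingua computes its values exactly this way is not stated here. The input is a summed log-probability per language, which is assumed to be strictly negative.

```rust
use std::cmp::Ordering;

/// Turns summed log-probabilities (all strictly negative) into relative
/// confidence values, sorted in descending order.
fn relative_confidences(mut scores: Vec<(&'static str, f64)>) -> Vec<(&'static str, f64)> {
    // The score closest to zero belongs to the most likely language.
    scores.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal));
    let best = match scores.first() {
        Some(&(_, score)) => score,
        None => return scores, // no candidates: empty result
    };
    // best / score is 1.0 for the top language and smaller for all others,
    // because every other score has a larger magnitude.
    scores
        .into_iter()
        .map(|(language, score)| (language, best / score))
        .collect()
}
```

For the scores English = -9.0, French = -12.0 and German = -18.0, the result is English 1.0, French 0.75 and German 0.5. The values only express how the languages relate to each other, not how likely each one is in absolute terms.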

9.4 Methods to build the LanguageDetector

There might be classification tasks where you know beforehand that your language data is definitely not written in Latin, for instance (what a surprise :-). The detection accuracy can become better in such cases if you exclude certain languages from the decision process or just explicitly include relevant languages:

use lingua::{LanguageDetectorBuilder, Language, IsoCode639_1, IsoCode639_3};

// Including all languages available in the library
// consumes approximately 2GB of memory and might
// lead to slow runtime performance.
LanguageDetectorBuilder::from_all_languages();

// Include only languages that are not yet extinct (= currently excludes Latin).
LanguageDetectorBuilder::from_all_spoken_languages();

// Include only languages written with Cyrillic script.
LanguageDetectorBuilder::from_all_languages_with_cyrillic_script();

// Exclude only the Spanish language from the decision algorithm.
LanguageDetectorBuilder::from_all_languages_without(&[Language::Spanish]);

// Only decide between English and German.
LanguageDetectorBuilder::from_languages(&[Language::English, Language::German]);

// Select languages by ISO 639-1 code.
LanguageDetectorBuilder::from_iso_codes_639_1(&[IsoCode639_1::EN, IsoCode639_1::DE]);

// Select languages by ISO 639-3 code.
LanguageDetectorBuilder::from_iso_codes_639_3(&[IsoCode639_3::ENG, IsoCode639_3::DEU]);

10. What's next for version 1.2.0?

Take a look at the planned issues.

11. Contributions

Josh Rotenberg has written a wrapper for using Lingua with the Elixir programming language.

Comments

Add WASM support

This isn't quite specific to lingua-rs but I've been looking into WebAssembly lately and it would be great to be able to use Lingua-rs in a wasm project. I did an initial test but it failed on a problem with bzip2-sys. I'll have to keep looking into it and let you know what I find but I thought you might be interested.
enhancement

opened by zacharywhitley 12
Bloat binary size for fantastic cold start performance
As the slightly ironic title hopefully conveys, I'm not very convinced of this idea, I'm just putting it here for the sake of discussion.

Motivation

Lingua stores language models (which really are just n-grams and their associated probabilities) as zipped JSON files in the binary. Depending on user preference either upfront at startup or lazily on demand, it will unzip, parse and then transform this data into the in-memory representation HashMap<Ngram, f64> used by part of the detection pipeline. That carries a certain computational cost, which is negligible for most use cases where the detector is reused and this initial cost can be offset over its lifetime as all following detections can use the already "warmed up" detector.

In certain environments, however, for example serverless computing or when using WebAssembly in plugins for other software, the detector cannot be used in such a way. In these cases our code (containing a call to Lingua) is very short-lived and no reuse over subsequent invocations is possible, meaning each time it has to be constructed from scratch. In these permanent "cold start" situations, the described startup cost ends up dominating the overall runtime needed for detection. In my case, where I invoke Lingua from C# by means of wasmtime, that results in taking 540 ms for detecting the language of a short text, most of which is spent in startup. The following flamegraph illustrates this nicely. You can see some time being spent deflating the zipped data, some more deserializing the JSON and then some constructing the hash map.

I would like to point out that this is not a weakness of Lingua. It's a problem that comes with the domain of language detection and I'm not aware of a library that has solved this. And as I mentioned, it's really only an issue in specialized use cases.

Solution

Idea

We could save a lot of time by embedding the language models into the binary in a way that is significantly cheaper to read. With rkyv's zero-copy deserialization it's possible to encode a data structure in a format that is the same as its in-memory representation. This means there's no additional work to be done, as language models can be read directly from the binary's data segment where they are stored.

Implementation

I hacked the demo together by adding a build script for language crates. It reads the JSON models for a language and writes them to binary files in their rkyv encoding. These bytes are then simply statically included and accessed at runtime. Instead of exposing a Dir<'static>, every language crate now has a [Option<&'static [u8]>; 5], an array of bytes for all 5 possible n-gram types. The bytes are just the archived representation of HashMap<Ngram, f64>.

To read the JSON models from the build script I had to move the Fraction and Ngram types out into a separate common crate used by both the build script and Lingua.

Results

I measured the C# scenario I described earlier.

Benchmark implemented with BenchmarkDotNet on .NET 6.0

Uses wasmtime to call a WebAssembly module built with lingua to detect the language of a single short text

Instance of WebAssembly module (its memory space) is destroyed after each iteration, while the JIT-compiled module (bytecode) itself can be reused

Only 4 languages were included in the binary

The effect of zero-copy deserialization of the language models in this cold start scenario is as follows:

| | before | after | change |
|-------------|--------|--------|--------------|
| Binary size | 8.8 MB | 33 MB | x3.75 bigger |
| Time | 543 ms | 8.9 ms | x61 faster |

The new flamegraph looks much less informative, mainly because there's just not a whole lot going on anymore. Some time is spent on what I think is language runtime stuff and the rest in the actual detection, there is nothing obvious to optimize away anymore.

These results are impressive both in the positive and negative sense. The build takes much longer now because it's serializing a bunch of pretty sizable files to disk for all languages. And the actual binary grows to a size that is no longer convenient to deliver in all environments, for example if you imagine it having to be downloaded over a mobile connection to a phone. On the other hand it's pretty amazing to see that you can initialize and run a sophisticated language detection suite in just under 9 ms.
opened by martindisch 7

Long runtime of language detection

Hi,

I'm evaluating the lingua-rs library and I discovered a long running time of the following program:

#[macro_use]
extern crate lazy_static;

use std::env;
use std::fs;

use lingua::{Language, LanguageDetector, LanguageDetectorBuilder};

lazy_static! {
    static ref DETECTOR: LanguageDetector = {
        LanguageDetectorBuilder::from_languages(&[
            Language::Dutch,
            Language::English,
            Language::French,
            Language::German,
            Language::Hungarian,
            Language::Italian,
            Language::Portuguese,
            Language::Russian,
            Language::Spanish,
            Language::Finnish,
            Language::Swedish,
        ])
        .build()
    };
}

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() != 2 {
        eprintln!("missing argument");
        std::process::exit(1);
    }

    let filename = args.get(1).unwrap();
    match fs::read_to_string(filename) {
        Ok(content) => match DETECTOR.detect_language_of(&content) {
            Some(lang) => println!("{},{}", filename, lang.iso_code_639_3()),
            _ => println!("{},", filename),
        },
        _ => println!("{},", filename),
    };

    std::process::exit(0);
}

First I thought the reason for the long running time could be the language detector. But even after moving this part into a lazy_static block, the runtime is very slow. Is a running time of 9.82 seconds to be expected with lingua for an article of 26,712 words? Are there ways to speed up the program?

I would welcome your response, Nico

question

opened by niko2342 7

Significant startup time

Hi,

I am trying to embed this into my program but I am seeing a very long startup time when setting up the detector. Can I ask (on a high level) why that is, other than the files being larger than in other libraries due to the use of 5-grams? Is there anything we can do to speed up the initialization time?

Happy to help contribute if needed.

opened by lhr0909 7
feature flag to disable wasm

Oddly I had to add a feature flag to lingua-rs to disable wasm so that I could use it in a WebAssembly project. It wasn't actually disabling wasm so much as disabling wasm-bindgen. I'm running it in a non-browser setting and the code generated from bindgen causes problems. Let me know if you'd like me to put in a PR. It was only a couple of minor changes.

opened by zacharywhitley 3
Bug in iso_code_639_3() debug print
println!("{:?} -- {} -- {:?}", i, i.iso_code_639_3(), i.iso_code_639_3());
Italian -- ita -- ITA

Debug print shouldn't be giving a different value than a normal print.

Also it seems impossible to do anything with the value returned:

30 | if i == "Italian" { println!("Match"); }
   |      ^^ no implementation for `Language == str`
30 | if i.to_owned() == "Italian" { println!("Match"); }
   |                    ^^^^^^^^^ expected enum `Language`, found `&str`
30 | if i.display() == "Italian" { println!("Match"); }
   |      ^^^^^^^ method not found in `&Language`

It appears that we're missing display()... Do I just need to use the strum_macros crate? Should I attempt to make a PR to add the display() trait?

The accuracy is looking really good; I've tested so far with about 50 subtitles (grabbing 10 lines evenly distributed throughout the file) and it's 100% accurate so far. Thanks for such an awesome library, and your help using it. :)
opened by mabushey 3
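The compiler errors quoted in this issue come from comparing an enum value with a string. The sketch below shows the usual Rust idioms with a stand-in enum: compare against a variant, or convert to a string through the Display trait (which is a trait, not a display() method). The enum Language defined here is a hypothetical two-variant stand-in, not Lingua's own type, and the Display implementation is an assumption made for the example.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    English,
    Italian,
}

impl fmt::Display for Language {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Reuse the variant name produced by the derived Debug implementation.
        write!(f, "{:?}", self)
    }
}

/// Compares against an enum variant instead of a string.
fn is_italian(language: Language) -> bool {
    language == Language::Italian
}
```

Any type implementing Display also gets to_string() for free, so Language::Italian.to_string() == "Italian" compiles and holds for this stand-in.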

error[E0308]: mismatched types

Hello,

This crate looks awesome! I plan on using it to detect languages of subtitles. Thanks for writing this crate! I added it to cargo.toml, but I get this on cargo build:

error[E0308]: mismatched types
   --> /home/michael/.cargo/registry/src/github.com-1ecc6299db9ec823/lingua-1.3.2/src/json.rs:265:32
    |
262 | fn get_language_models_directory(language: Language) -> Dir<'static> {
    |                                                         ------------ expected `include_dir::Dir<'static>` because of return type
...
265 |         Language::Afrikaans => AFRIKAANS_MODELS_DIRECTORY,
    |                                ^^^^^^^^^^^^^^^^^^^^^^^^^^ expected struct `include_dir::Dir`, found struct `include_dir::dir::Dir`
    |
    = note: perhaps two different versions of crate `include_dir` are being used?

For more information about this error, try `rustc --explain E0308`.

Here is my full cargo.toml dependency list:

clap = { version = "3.0", features = ["derive"] }
regex = "1"
tmdb = "3.0.0"
log = "0.4"
pretty_env_logger = "0.4.0"
walkdir = "2"
dirs = "3.0"
serde = { version = "1.0", features = ["derive"]   }
serde_yaml = "0.8"
md-5 = "0.9"
unicode-truncate = "0.2.0"
lazy_static = "1.4.0"
common-path = "1.0.0"
lingua = "1.3.2"

$ rustc --version
rustc 1.58.1 (db9d1b20b 2022-01-20)

I'm kind of a Rust noob so I'm not sure how to fix this...

opened by mabushey 3

added string to iso code using strum

I am trying to add a feature to dynamically add languages in lingua-node, and needed the ability to add language codes from the outside of Rust context.

I saw there is strum so I thought to add the macro into the isocode file and it works like a charm (used a regex find/replace to add this many annotations lol)! Added tests to make sure the features work.

Let me know if you have any questions on the PR.

Cheers!

opened by lhr0909 3
Let LanguageDetectorBuilder::build() return a Result

One of the things I've been working on is an HTTP service wrapper around lingua. My implementation allows the caller to define the various pieces of a detector or just use the default for the service instance. If they specify invalid options (say, an unrecognized language or out of range minimum relative distance) I have to either validate the options manually before attempting to create the builder or catch the panic. If validation was done in LanguageDetectorBuilder::build and that returned a Result<LanguageDetector, Error> or the like I could rely on the library itself for validation.
enhancement

opened by joshrotenberg 3
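What the request above could look like is sketched here with stand-in types. DetectorBuilder, Detector and BuildError are hypothetical names, not Lingua's API; only the valid range of 0.0 to 0.99 for the minimum relative distance is taken from section 9.2.

```rust
#[derive(Debug, PartialEq)]
enum BuildError {
    NoLanguages,
    InvalidMinimumRelativeDistance(f64),
}

#[derive(Debug, PartialEq)]
struct Detector {
    languages: Vec<String>,
    minimum_relative_distance: f64,
}

struct DetectorBuilder {
    languages: Vec<String>,
    minimum_relative_distance: f64,
}

impl DetectorBuilder {
    fn from_languages(languages: &[&str]) -> Self {
        DetectorBuilder {
            languages: languages.iter().map(|name| name.to_string()).collect(),
            minimum_relative_distance: 0.0,
        }
    }

    /// Only stores the value; validation is deferred to `build`.
    fn with_minimum_relative_distance(mut self, distance: f64) -> Self {
        self.minimum_relative_distance = distance;
        self
    }

    /// Validates all options in one place and reports problems as values
    /// instead of panicking.
    fn build(self) -> Result<Detector, BuildError> {
        if self.languages.is_empty() {
            return Err(BuildError::NoLanguages);
        }
        let distance = self.minimum_relative_distance;
        // NaN is not contained in the range and is rejected as well.
        if !(0.0..=0.99).contains(&distance) {
            return Err(BuildError::InvalidMinimumRelativeDistance(distance));
        }
        Ok(Detector {
            languages: self.languages,
            minimum_relative_distance: distance,
        })
    }
}
```

A service wrapper could then forward user-supplied options directly and map Err values to an HTTP error response, without validating them a second time.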

Failing to compile lingua 1.0.2

When trying to run a file that uses lingua 1.0.2 as a dependency, compiling fails with the following error. This is a new rust project, with no other dependencies and just a println, as I just wanted to check out the library.

Compiling lingua v1.0.2
error[E0603]: module `export` is private
   --> C:\Users\luana\.cargo\registry\src\github.com-1ecc6299db9ec823\lingua-1.0.2\src\ngram.rs:19:12
    |
19  | use serde::export::Formatter;
    |            ^^^^^^ private module
    |
note: the module `export` is defined here
   --> C:\Users\luana\.cargo\registry\src\github.com-1ecc6299db9ec823\serde-1.0.119\src\lib.rs:275:5
    |
275 | use self::__private as export;
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^

error: aborting due to previous error

For more information about this error, try `rustc --explain E0603`.
error: could not compile `lingua`.

bug

opened by luananama 3

Memory leak

First, thanks for your work on Lingua.

I found a memory leak when using lingua with the tokio runtime. I think the problem comes from the fact that the language detector might be Send or Sync to multiple threads.

I tried using with_preloaded_language_models() on the builder with no success. Even when I build a new LanguageDetector in each method call I still have a memory leak.

opened by nathanielsimard 2
Add builder pattern to WASM API

The current WASM API does not strictly follow the builder pattern used in the Rust API. That's because I did not know back then how to accomplish it with wasm-bindgen. While working on the WASM API for grex, I found out how to implement the builder pattern. Before creating the JavaScript distribution for Lingua, these changes have to be applied first.
enhancement

opened by pemistahl 0
Reduce resources to load language models

Currently, the language models are parsed from json files and loaded into simple maps at runtime. Even though accessing the maps is pretty fast, they consume a significant amount of memory. The goal is to investigate whether there are more suitable data structures available that require less storage space in memory, something like NumPy for Python. Perhaps it is even possible to store those data structures in some kind of binary format on disk which can be loaded faster than the current json files.

One promising candidate could be ndarray.
enhancement

opened by pemistahl 2
Add absolute confidence metric

In addition to the current relative confidence metric, an absolute confidence metric shall be implemented which is able to say how likely it is that a given text is written in a specific language, independently from all the other languages.
new feature

opened by pemistahl 0
Add low accuracy mode
Lingua's high detection accuracy comes at the cost of being noticeably slower than other language detectors. The large language models also consume significant amounts of memory. These requirements might not be feasible for systems running low on resources.

For users who want to classify mostly long texts or need to save resources, a so-called low accuracy mode will be implemented that loads only a small subset of the language models into memory. The API will be as follows:

LanguageDetectorBuilder::from_all_languages().with_low_accuracy_mode().build();

The downside of this approach is that detection accuracy for short texts consisting of fewer than 120 characters will drop significantly. However, detection accuracy for texts which are longer than 120 characters will remain mostly unaffected.
new feature
opened by pemistahl 0
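The trade-off described in this issue can be sketched as a choice of n-gram sizes. The function ngram_sizes is hypothetical, and so are two assumptions baked into it: that the "small subset" of models means trigram models only, and that a text of exactly 120 characters already counts as long.

```rust
use std::ops::RangeInclusive;

/// Texts at or above this length are treated as long (boundary assumed).
const LONG_TEXT_THRESHOLD: usize = 120;

/// Chooses which n-gram sizes to evaluate for a given input text.
fn ngram_sizes(text: &str, low_accuracy_mode: bool) -> RangeInclusive<usize> {
    let is_long = text.chars().count() >= LONG_TEXT_THRESHOLD;
    if low_accuracy_mode || is_long {
        3..=3 // trigrams only: fewer models to load
    } else {
        1..=5 // all sizes: slower, but more reliable on short text
    }
}
```

Under these assumptions a long text is handled identically in both modes, which would explain why its accuracy stays mostly unaffected, while a short text loses the unigram, bigram, quadrigram and fivegram evidence in low accuracy mode.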
Add confidence metric for single language

Hello. I am doing a cryptology library. I need to detect if the text is English or not. Could you allow confidence value for a single language please?
new feature

opened by gogo2464 2
Load models slightly more eagerly and reuse for all ngrams during detection.
The library puts emphasis on lazy loading in order to be efficient in certain serverless environments. My use case requires maximum performance, and I can accept slower startup in exchange for better runtime performance. The LanguageDetectorBuilder has a flag with_preloaded_language_models that forces eager loading of models. However, during detection the LanguageDetector loops through every ngram and calls the function load_language_models, which has to take a read lock to check whether the model is loaded. Read locks are cheap, but not free. As an experiment I completely eliminated lazy loading and stored models in the LanguageDetector. The single-threaded benchmark improved by 1.2x-2x and the multi-threaded one by 2x-4x, depending on the system.

Patch

This patch works with the lazy loading design. Instead of lazy-loading the model for every ngram, I load models slightly more eagerly in compute_sum_of_ngram_probabilities and in count_unigrams for the specified language, and reuse the models for the ngrams in the loop. For this to work I had to use Arc instead of Box in the BoxedLanguageModel.
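The difference between the two locking strategies can be sketched with the standard library alone. The type alias Model and both functions are hypothetical simplifications and not the code of the patch; they only show why cloning an Arc out of the lock once removes the per-ngram lock acquisition.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// A toy language model: n-gram -> probability.
type Model = HashMap<String, f64>;

/// Takes the read lock once per n-gram, as in the original design.
fn sum_locking_per_ngram(shared: &RwLock<Arc<Model>>, ngrams: &[&str]) -> f64 {
    ngrams
        .iter()
        .filter_map(|ngram| {
            let guard = shared.read().unwrap(); // one lock per iteration
            let probability = guard.get(*ngram).copied();
            probability
        })
        .map(f64::ln)
        .sum()
}

/// Takes the read lock once, clones the Arc and then iterates lock-free.
fn sum_locking_once(shared: &RwLock<Arc<Model>>, ngrams: &[&str]) -> f64 {
    // The guard is released at the end of this statement; the cloned Arc
    // keeps the model alive for the whole loop.
    let model: Arc<Model> = Arc::clone(&shared.read().unwrap());
    ngrams
        .iter()
        .filter_map(|ngram| model.get(*ngram).copied())
        .map(f64::ln)
        .sum()
}
```

Both functions return the same sum of log-probabilities. The second one needs shared ownership of the model, which a Box cannot provide and an Arc can.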

Benchmark

For the benchmark I used the accuracy reports test data. The benchmark code is here. I tested both single-threaded and multi-threaded/parallel mode.

Results

I tested the patch on two machines

Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz on Linux

M1 Max on Mac

The numbers in the columns Before and After are throughput as detections per second.

Single threaded benchmark

cargo run --release --bin bench -- --max-examples 30000

| Machine | Before | After | Change |
| ------------- |:-------------:| -----: | -----: |
| i7-6800K | 1162 | 1294 | 1.11x |
| M1 Max | 1592 | 1844 | 1.15x |

Multi threaded benchmark

cargo run --release --bin bench -- --max-examples 50000 --parallel

| Machine | Before | After | Change |
| ------------- |:-------------:| -----: | -----: |
| i7-6800K | 4226 | 8033 | 1.9x |
| M1 Max | 1383 | 5858 | 4.23x |

I am really surprised by the numbers on the M1 chip. Before the patch, the single-threaded benchmark ran faster than the multi-threaded one, which suggests that RwLocks on M1 Macs are slow.

I ran the accuracy reports and checked them against the current main branch:

diff -r accuracy-reports/lingua/ ../lingua-rs-main/accuracy-reports/lingua/

The command found no differences.
opened by serega 1

Releases(v1.4.0)

v1.4.0(Apr 8, 2022)
Features

The library can now be compiled to WebAssembly and be used in any JavaScript project. Big thanks to @martindisch for bringing this forward. (#14)

Improvements

Some minor performance tweaks have been applied to the rule engine.

v1.3.3(Feb 22, 2022)
Bug Fixes

This release updates outdated dependencies and fixes an incompatibility between different versions of the include_dir crate which are used in the main lingua crate and the language model crates.

v1.3.2(Oct 19, 2021)
Bug Fixes

Another compilation error has been fixed which occurred when the Latin language was left out as Cargo feature.

v1.3.1(Oct 19, 2021)
Bug Fixes

When Chinese, Japanese or Korean were left out as Cargo features, there were compilation errors. This has been fixed.

v1.3.0(Oct 19, 2021)
Features

The language model dependencies are separate Cargo features now. Users can decide which languages shall be downloaded and used in the library. (#12)

Improvements

The code that does the lazy-loading of the language models has been refactored significantly, making the code more stable and less error-prone.

Bug Fixes

In very rare cases, the language returned by the detector was non-deterministic. This has been fixed. Big thanks to @asg0451 for identifying this problem. (#17)

v1.2.2(Jun 2, 2021)
Features

The enums Language, IsoCode639_1 and IsoCode639_3 now implement std::str::FromStr in order to instantiate enum variants by string values. This comes in handy for JavaScript bindings and the like. (#15)

Improvements

The performance of preloading the language models has been improved.

Bug Fixes

Language detection for sentences with more than 120 characters was supposed to be done by iterating through trigrams only but this was never the case. This has been corrected.

v1.2.1(May 8, 2021)
Improvements

Language detection for sentences with more than 120 characters now performs more quickly by iterating through trigrams only which is enough to achieve high detection accuracy.

Textual input that includes logograms from Chinese, Japanese or Korean is now split at each logogram and not only at whitespace. This provides for more reliable language detection for sentences that include multi-language content.

Bug Fixes

Errors in the rule engine for the Latvian language have been resolved.

Corrupted characters in the Latvian test data have been corrected.

v1.2.0(Apr 8, 2021)
Features

A LanguageDetector can now be built with lazy-loading required language models on demand (default) or with preloading all language models at once by calling LanguageDetectorBuilder.with_preloaded_language_models(). (#10)

v1.1.0(Jan 31, 2021)
Languages

The Maori language is now supported. Thanks to @eekkaiia for the contribution. (#5)

Performance

Loading and searching the language models has been quite slow so far. Using parallel iterators from the Rayon library, this process is now at least 50% faster, depending on how many CPU cores are available. (#8)

Accuracy Reports

Accuracy reports are now also generated for the CLD2 library and included in the language detector comparison plots. (#6)

v1.0.3(Jan 15, 2021)
Bug Fixes

Lingua could not be used within other projects because it accidentally tried to expose a private serde module. Thanks to @luananama for reporting this bug. (#9)

v1.0.2(Nov 22, 2020)
Bug Fixes

Accidentally, bug #3 was only partially fixed. This has been corrected.

v1.0.1(Nov 22, 2020)
Bug Fixes

When trying to create new language models, the LanguageModelFilesWriter panicked when it recognized characters in a text corpus that consist of multiple bytes. Thanks to @eekkaiia for reporting this bug. (#3)

v1.0.0(Nov 21, 2020)

This is the very first release of Lingua for Rust. Took me 5 months of hard work in my free time. Hope you find it useful. :)

Owner

Peter M. Stahl
