Truly universal encoding detector in pure Rust - port of Python version

Overview

Charset Normalizer

charset-normalizer-rs on docs.rs charset-normalizer-rs on crates.io

A library that helps you read text from an unknown charset encoding.
Motivated by the original Python version of charset-normalizer, I'm trying to resolve the issue by taking a new approach. All IANA character set names for which the Rust encoding library provides codecs are supported.

This project is a port of the original Python version of Charset Normalizer. The biggest difference between the Python and Rust versions is the number of supported encodings, since each language has its own encoding/decoding library. The Rust version supports only encodings from the WHATWG standard; the Python version supports more encodings, but many of them are old and almost unused.

⚡ Performance

This package offers better performance than the Python version (4 times faster than the MYPYC version of charset-normalizer, 8 times faster than the plain Python version). Compared with the chardet and chardetng packages it has approximately the same speed, but is more accurate. Here are some numbers.

Package                                       Accuracy   Mean per file (ms)   Files per sec (est.)
chardet                                       82.6 %     3 ms                 333 files/sec
chardetng                                     90.7 %     1.6 ms               625 files/sec
charset-normalizer-rs                         97.1 %     1.5 ms               666 files/sec
charset-normalizer (Python + MYPYC version)   98 %       8 ms                 125 files/sec

Package                                       99th percentile   95th percentile   50th percentile
chardet                                       8 ms              2 ms              0.2 ms
chardetng                                     14 ms             5 ms              0.5 ms
charset-normalizer-rs                         12 ms             5 ms              0.7 ms
charset-normalizer (Python + MYPYC version)   94 ms             37 ms             3 ms

Stats are generated using 400+ files with default parameters. These results might change at any time, and the dataset can be updated to include more files. The actual delays depend heavily on your CPU capabilities, but the relative factors should remain the same. The Rust version's dataset has been reduced because the number of supported encodings is lower than in the Python version.

There is still room to speed the library up, so I'll appreciate any contributions.
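If you want to sanity-check these timings on your own hardware, a minimal sketch using std::time::Instant could look like this (the sample path below is only a placeholder):

use std::fs;
use std::time::Instant;
use charset_normalizer_rs::from_bytes;

fn main() {
    // Placeholder path: point this at any text file you want to measure.
    let payload = fs::read("data/sample.1.fr.srt").expect("cannot read sample file");

    let start = Instant::now();
    let result = from_bytes(&payload, None);
    let elapsed = start.elapsed();

    if let Some(best) = result.get_best() {
        println!("{} detected in {:?}", best.encoding(), elapsed);
    }
}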

✨ Installation

Library installation:

cargo add charset-normalizer-rs

Binary CLI tool installation:

cargo install charset-normalizer-rs

🚀 Basic Usage

CLI

This package comes with a CLI, which is intended to be compatible with the Python version's CLI tool.

normalizer -h
Usage: normalizer [OPTIONS] <FILES>...

Arguments:
  <FILES>...  File(s) to be analysed

Options:
  -v, --verbose                Display complementary information about file if any. Stdout will contain logs about the detection process
  -a, --with-alternative       Output complementary possibilities if any. Top-level JSON WILL be a list
  -n, --normalize              Permit to normalize input file. If not set, program does not write anything
  -m, --minimal                Only output the charset detected to STDOUT. Disabling JSON output
  -r, --replace                Replace file when trying to normalize it instead of creating a new one
  -f, --force                  Replace file without asking if you are sure, use this flag with caution
  -t, --threshold <THRESHOLD>  Define a custom maximum amount of chaos allowed in decoded content. 0. <= chaos <= 1 [default: 0.2]
  -h, --help                   Print help
  -V, --version                Print version
normalizer ./data/sample.1.fr.srt

🎉 The CLI produces an easily usable stdout result in JSON format (it should be the same as in the Python version).

{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}

Rust

The library offers two main functions. The first one is from_bytes, which processes text passed in as bytes:

use charset_normalizer_rs::from_bytes;

fn main() {
    // Raw GB18030-encoded bytes; from_bytes returns a set of candidate matches.
    let result = from_bytes(&vec![0x84, 0x31, 0x95, 0x33], None);
    let best_guess = result.get_best();
    assert_eq!(best_guess.unwrap().encoding(), "gb18030");
}

The second, from_path, processes text read from a file, taking the path as its input parameter:

use std::path::PathBuf;
use charset_normalizer_rs::from_path;

fn main() {
    // Detect the encoding of a file on disk.
    let result = from_path(
        &PathBuf::from("src/tests/data/samples/sample-chinese.txt"),
        None,
    )
    .unwrap();
    let best_guess = result.get_best();
    assert_eq!(best_guess.unwrap().encoding(), "big5");
}
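Because only WHATWG encodings are supported, the detected label can usually be passed to the encoding_rs crate to actually decode the raw bytes. This is a sketch only, assuming encoding_rs is added as a dependency and that the returned label is one encoding_rs recognizes:

use std::fs;
use std::path::PathBuf;
use charset_normalizer_rs::from_path;
use encoding_rs::Encoding;

fn main() {
    let path = PathBuf::from("src/tests/data/samples/sample-chinese.txt");
    let result = from_path(&path, None).unwrap();

    if let Some(best) = result.get_best() {
        // Look the detected label up in encoding_rs and decode the original bytes.
        if let Some(encoding) = Encoding::for_label(best.encoding().as_bytes()) {
            let raw = fs::read(&path).unwrap();
            let (text, _, had_errors) = encoding.decode(&raw);
            println!("decoded {} chars, errors: {}", text.chars().count(), had_errors);
        }
    }
}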

😇 Why

When I started using Chardet (the Python version), I noticed that it was not suited to my expectations, and I wanted to propose a reliable alternative using a completely different method. Also, I never back down from a good challenge!

I don't care about the originating charset encoding, because two different tables can produce two identical rendered strings. What I want is to get readable text, the best I can.

In a way, I'm brute-forcing text decoding. How cool is that? 😎

🍰 How

  • Discard all charset encoding tables that could not fit the binary content.
  • Measure the noise, or mess, once opened (by chunks) with a corresponding charset encoding.
  • Extract the matches with the lowest mess detected.
  • Additionally, we measure coherence / probe for a language.

Wait a minute, what is noise/mess and coherence according to YOU?

Noise: I opened hundreds of text files, written by humans, with the wrong encoding table. I observed, then established some ground rules about what is obvious when it seems like a mess. I know my interpretation of what is noise is probably incomplete; feel free to contribute in order to improve or rewrite it.

Coherence: For each language there is on Earth, we have computed ranked letter-appearance occurrences (the best we can). So I thought that intel is worth something here, and I use those records against the decoded text to check whether I can detect intelligent design.
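As an illustration only (this is not the crate's actual internals, and the Candidate type below is made up), ranking candidates by lowest mess first, with coherence as a tie-breaker, could look roughly like this:

// Illustrative only: a toy candidate type, not the library's real CharsetMatch.
#[derive(Debug)]
struct Candidate {
    encoding: &'static str,
    mess: f32,      // 0.0 (clean) ..= 1.0 (garbage)
    coherence: f32, // 0.0 ..= 100.0, language-frequency fit
}

fn best_candidate(mut candidates: Vec<Candidate>) -> Option<Candidate> {
    // Lowest mess wins; on an exact mess tie, prefer higher coherence.
    candidates.sort_by(|a, b| {
        a.mess
            .partial_cmp(&b.mess)
            .unwrap()
            .then(b.coherence.partial_cmp(&a.coherence).unwrap())
    });
    candidates.into_iter().next()
}

fn main() {
    let winner = best_candidate(vec![
        Candidate { encoding: "utf-8", mess: 0.02, coherence: 92.0 },
        Candidate { encoding: "windows-1252", mess: 0.15, coherence: 97.0 },
    ]);
    println!("{:?}", winner); // the utf-8 candidate wins, since it has less mess
}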

⚡ Known limitations

  • Language detection is unreliable when the text contains two or more languages sharing identical letters (e.g. HTML (English tags) + Turkish content (sharing Latin characters)).
  • Every charset detector heavily depends on having sufficient content. In common cases, do not bother running detection on very tiny content.

👀 Contributing

Contributions, issues and feature requests are very much welcome.
Feel free to check the issues page if you want to contribute.

πŸ“ License

Copyright © Nikolay Yarovoy @nickspring - porting to Rust.
Copyright © Ahmed TAHRI @Ousret - original Python version and some parts of this document.
This project is MIT licensed.

Character frequencies used in this project © 2012 Denny Vrandečić

Comments
  • Idiomatic 3

    Of these commits 3d4d4c0 is the most interesting.

    As is, only the bare minimum necessary to change the for loop into an iterator was done. It isn't faster as is, but the following could be done to optimize:

    1. Since there are no outside side effects or dependencies between elements (I think), rayon could be used.
    2. filter_map is used in the bare minimum way. This could be combined with further filter_maps to replace unwrap()s or to replace if/else.
    3. General optimization of the code could be done with regard to the current map(). For instance, I wonder if encoding[encoding.len() - 2] or if encoding.len() == 1 && encoding.first().unwrap() == "largesets" couldn't be changed? I think a closer look is necessary, and I didn't want to change something without knowing and break it.
    opened by chris-ha458 11
  • Idiomatic fixes 2

    c918f37 and 0b3ff3b need special attention.

    Since the length + 1 was not necessary, I removed it and changed some code (length - 1) downstream.

    This does affect the match range, so I changed the range to match the original behavior, but it looks strange now (0~510 instead of 512?).

    Other than that these are mostly side-effect-free changes, except 7d65129, which changed exit behavior.

    If the original exit code 1 is strictly necessary, there are other ways to recover that behavior as well.

    opened by chris-ha458 10
  • Fixes

    Being more strict with cargo clippy and cargo test, and looking more into cargo clippy -- -Wclippy::pedantic.

    As said before, pedantic is just that: pedantic. There are many false positives and we don't need to fix all of them, but it would still be valuable to understand why they are false positives and to document them if necessary.

    opened by chris-ha458 8
  • Recover accuracy

    If mess_difference < 0.01, we check whether coherence_difference > 0.02 and return a PartialOrd based on that. If not, we try to use the multibyte usage difference if it is big enough.

    Comparing with multibyte_usage_a.abs() > f32::EPSILON is idiomatic and covers the case where the value is 0.0 or very close to it.

    However, it does not change the final accuracy at all.
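    A rough sketch of that comparison order (with hypothetical field names, not the crate's actual types) could look like this:

    use std::cmp::Ordering;

    // Hypothetical struct: field names are for illustration only.
    struct Guess {
        mess: f32,
        coherence: f32,
        multibyte_usage: f32,
    }

    fn compare(a: &Guess, b: &Guess) -> Ordering {
        let mess_difference = (a.mess - b.mess).abs();
        let coherence_difference = (a.coherence - b.coherence).abs();

        if mess_difference < 0.01 {
            // Mess is effectively tied: fall back to coherence if it differs enough...
            if coherence_difference > 0.02 {
                return b.coherence.partial_cmp(&a.coherence).unwrap_or(Ordering::Equal);
            }
            // ...otherwise use multibyte usage when the gap is meaningful.
            if (a.multibyte_usage - b.multibyte_usage).abs() > f32::EPSILON {
                return b.multibyte_usage.partial_cmp(&a.multibyte_usage).unwrap_or(Ordering::Equal);
            }
        }
        a.mess.partial_cmp(&b.mess).unwrap_or(Ordering::Equal)
    }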

    opened by chris-ha458 7
  • [BUG] unreachable code?

    Describe the bug: This part of the code seems to be unreachable.

    The outer if ensures that coherence_difference is not 0.0, so that particular if statement should never be reached.

    To Reproduce: I don't know the intended case for these statements, so I cannot reproduce the "negative case".

    Expected behavior: Maybe the outer if needs to be different. In the original Python code, the comparison chaos_difference == 0.0 and self.coherence == other.coherence is used.

    bug help wanted 
    opened by chris-ha458 4
  • md fixes

    Strips some black-magic boolean logic from should_strip_sig_or_bom (4899bf3).

    I double-checked this with pen and paper, and all tests pass, but it would still be a good idea to get another set of eyes here.

    opened by chris-ha458 2
  • Idiomatic

    a85c5fe1dd249ef52c873e58d061fbe73c776a5c is worth looking at.

    This would only be a problem when the file size is around usize::MAX which is around 2^64 bytes.

    opened by chris-ha458 1
  • Lint fixes

    Using cargo clippy -- -Wclippy::pedantic to find some further low-hanging fruit. The false positive rate is quite high, so I'm tempted to include some lint allows like #![allow(clippy::range_plus_one)].

    However, I'm not sure you would agree that this is a good idea, since the default cargo clippy does not complain. One compromise would be to enable it in a separate branch like pedantic or dev, which we could use to merge recent fixes and look through the code in a strict (pedantic), yet context-aware (we know some are false positives) way.

    opened by chris-ha458 1
  • Lib.rs

    I'll be very honest here: most of these changes are superficial.

    I hoped to isolate at least some parts of the code into smaller functions, but it seems that once we include the debug and trace calls, which have long-reaching dependencies, it is not easy.

    One thing I'm curious about is all the fallbacks: what are they used for and how are they different? They seem like they might be an artifact of a previous implementation.

    If they are necessary, I'm wondering if some kind of enum couldn't be used instead?

    opened by chris-ha458 1
  • Idiomatic fixes

    Many of these changes revolve around unwrap.

    There are more changes that could be done to improve performance, readability or correctness, but most would need at least some consideration as to the correct implementation.

    If fixes like these are welcome and merged, I will open issues to discuss potential changes and their considerations.

    opened by chris-ha458 1
  • Fix bench

    Hi. I'm taking a look at your code and it seems the bench does not work at the moment.

    I am not yet familiar with the ins and outs of the code, so I did what I felt was the minimum I could do.

    opened by chris-ha458 1
  • [BUG] correct behavior for "Ё" (U+0401)

    Describe the bug: In test_is_accentuated https://github.com/nickspring/charset-normalizer-rs/blob/cbe086f3df38a16815033309ae80455eef64cef4/src/tests/utils.rs#L28 this case is tested to see if it is false: "Ё" (U+0401) Cyrillic Capital Letter Io.

    The code being tested is here: https://github.com/nickspring/charset-normalizer-rs/blob/cbe086f3df38a16815033309ae80455eef64cef4/src/utils.rs#L118

    The problem here is that it is considered to have a diaeresis under current correct Unicode decomposition rules (both NFKD and NFD): https://www.compart.com/en/unicode/U+0401 https://graphemica.com/%D0%81

    (BTW this is different from the almost identical-looking Unicode character "Ë" (U+00CB) Latin Capital Letter E with Diaeresis.)

    To Reproduce: The icu4x crates can be used to decompose in Rust (cargo add icu_normalizer). I am actually trying to reimplement some parts of the code, and that is how I discovered it.

    pub(crate) fn is_accentuated(ch: char) -> bool {
        let nfd = icu_normalizer::DecomposingNormalizer::new_nfkd();
        let denormalized_string: String = nfd.normalize(ch.to_string().as_str());
        denormalized_string.chars().any(|decomposed| match decomposed {
            '\u{0300}'   // "WITH GRAVE"
            | '\u{0301}' // "WITH ACUTE"
            | '\u{0302}' // "WITH CIRCUMFLEX"
            | '\u{0303}' // "WITH TILDE"
            | '\u{0308}' // "WITH DIAERESIS"
            | '\u{0327}' // "WITH CEDILLA"
            => true,
            _ => false,
        })
    }
    

    This new implementation directly decomposes the input character and checks whether Unicode characters that indicate accents are present. Since "Ё" (U+0401) Cyrillic Capital Letter Io decomposes into Е (Cyrillic Capital Letter Ie) + diaeresis '\u{0308}', the new code returns true, while the old code returns false (since "diaeresis" is not in the character's name).

    Expected behavior: "Ё" (U+0401) should return true.

    Additional context: The Unicode standard is fast-moving. There is a new version every year, and especially for CJK new codepoints are added constantly. I think it is valuable to have an implementation that is up to date.

    Btw, I have almost finished my implementation using various components from https://github.com/unicode-org/icu4x. It is a pure Rust codebase worked on by both standards bodies and industry supporters such as Google, so I feel it would be a good library to rely upon.

    bug help wanted 
    opened by chris-ha458 5
  • Improvements : Idiomatic code

    Re : #3

    This codebase has been ported from Python, and a lot of the design patterns could be improved to be more idiomatic Rust code. Such a move will make it easier to improve speed and maintainability, and ensure correct operation from a Rust point of view.

    Some examples would be avoiding for loops, using match instead of if chains, etc.

    Many require deeper consideration.

    For example, this codebase makes extensive use of f32. Unless intrinsics are involved, f64 is as fast as or faster than f32 in Rust. Moreover, casting back and forth between f32 and f64 can harm performance and make it difficult to ensure correct code. For instance, there are exact comparisons between f32 and f64 variables, and these are very unlikely to behave in the intended way; if they are intended, it would be valuable to document that and suppress the relevant lints. However, if there is a need to maintain ABI compatibility or follow a specification, it might be inevitable, and on-disk size could also be a consideration. In summary, f32 vs f64 handling could serve both idiomatic code and speed, but only if done right.
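    As a small illustration of the comparison pitfall (generic example, not code from this crate), widening an f32 and comparing it exactly against an f64 rarely behaves as intended:

    fn main() {
        let a: f32 = 0.1;
        let b: f64 = 0.1;

        // Widening the f32 does not recover the f64 value, so the exact comparison is false.
        println!("{}", f64::from(a) == b); // false
        println!("{:.20} vs {:.20}", f64::from(a), b);
    }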

    I will try to prepare some PRs that change some things. Despite my best efforts, I am sure that many of my changes or views may be based on a flawed understanding of the code, so feel free to explain why things were done the way they were. In such cases I will help with documentation.

    opened by chris-ha458 10
  • Improvements : Speed

    As per our discussion in #2, the following has been suggested for speed improvements:

    • calc coherence & mess in threads
    • or calc mess for plugins in threads (or some async?)
    • or something other...

    The paths I had in mind were these:

    • Related to the threads idea: use Rayon
      • Replace HashMap with the concurrent DashMap (the current std HashMap works with rayon, so this is not strictly necessary, but might be useful to look into regardless)
    • Replace the hashing algorithm used in HashMap with FxHash, AHash or HighwayHash
      • aHash implemented #14
    • ~Replace sort() with sort_unstable()~ #6
    • Identify preallocation opportunities. For instance, replace Vec::new() with Vec::with_capacity()

    Many of these are low-hanging fruit and related to refactoring the code into idiomatic Rust. For example, there are many for loops in this code. Iterator-based code is more idiomatic, easier to improve with rayon, and interacts better with allocation (pushing items from within a for loop can cause multiple allocations and copies, while collecting an iterator can allow fewer allocations), as the sketch below illustrates.
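    As a generic illustration of that last point (not code from this crate), collecting an iterator or preallocating avoids the repeated reallocation that growing an empty Vec in a loop can cause:

    fn squares_loop(n: usize) -> Vec<usize> {
        // Repeated push into an empty Vec may reallocate several times as it grows.
        let mut out = Vec::new();
        for i in 0..n {
            out.push(i * i);
        }
        out
    }

    fn squares_iter(n: usize) -> Vec<usize> {
        // The iterator's size hint lets collect() allocate the right capacity up front.
        (0..n).map(|i| i * i).collect()
    }

    fn main() {
        assert_eq!(squares_loop(5), squares_iter(5));
        // Explicit preallocation is another option when the final length is known.
        let mut prealloc = Vec::with_capacity(5);
        prealloc.extend((0..5).map(|i| i * i));
        assert_eq!(prealloc, squares_iter(5));
    }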

    opened by chris-ha458 14
Releases: 1.0.6