Truly universal encoding detector in pure Rust - port of Python version

Overview

Charset Normalizer


A library that helps you read text from an unknown charset encoding.
Motivated by original Python version of charset-normalizer, I'm trying to resolve the issue by taking a new approach. All IANA character set names for which the Rust encoding library provides codecs are supported.

This project is a port of the original Python version of Charset Normalizer. The biggest difference between the Python and Rust versions is the number of supported encodings, as each language has its own encoding/decoding library. The Rust version supports only the encodings from the WHATWG standard. The Python version supports more encodings, but many of them are old and almost unused.

⚑ Performance

This package offers better performance than the Python version (about 4 times faster than the MYPYC-compiled version of charset-normalizer and 8 times faster than the plain Python version). Compared with the chardet and chardetng packages, it has approximately the same speed but is more accurate. Here are some numbers.

| Package | Accuracy | Mean per file (ms) | Files per sec (est.) |
| --- | --- | --- | --- |
| chardet | 82.6 % | 3 ms | 333 files/sec |
| chardetng | 90.7 % | 1.6 ms | 625 files/sec |
| charset-normalizer-rs | 97.1 % | 1.5 ms | 666 files/sec |
| charset-normalizer (Python + MYPYC version) | 98 % | 8 ms | 125 files/sec |

| Package | 99th percentile | 95th percentile | 50th percentile |
| --- | --- | --- | --- |
| chardet | 8 ms | 2 ms | 0.2 ms |
| chardetng | 14 ms | 5 ms | 0.5 ms |
| charset-normalizer-rs | 12 ms | 5 ms | 0.7 ms |
| charset-normalizer (Python + MYPYC version) | 94 ms | 37 ms | 3 ms |

Stats are generated using 400+ files with default parameters. These results might change at any time, and the dataset can be updated to include more files. The actual delays depend heavily on your CPU capabilities, but the relative factors should remain the same. The dataset for the Rust version has been reduced because the number of supported encodings is lower than in the Python version.

There is still room to speed up the library, so I'll appreciate any contributions.

✨ Installation

Library installation:

cargo add charset-normalizer-rs

Binary CLI tool installation:

cargo install charset-normalizer-rs

πŸš€ Basic Usage

CLI

This package comes with a CLI that is meant to be compatible with the Python version's CLI tool.

normalizer -h
Usage: normalizer [OPTIONS] <FILES>...

Arguments:
  <FILES>...  File(s) to be analysed

Options:
  -v, --verbose                Display complementary information about file if any. Stdout will contain logs about the detection process
  -a, --with-alternative       Output complementary possibilities if any. Top-level JSON WILL be a list
  -n, --normalize              Permit to normalize input file. If not set, program does not write anything
  -m, --minimal                Only output the charset detected to STDOUT. Disabling JSON output
  -r, --replace                Replace file when trying to normalize it instead of creating a new one
  -f, --force                  Replace file without asking if you are sure, use this flag with caution
  -t, --threshold <THRESHOLD>  Define a custom maximum amount of chaos allowed in decoded content. 0. <= chaos <= 1 [default: 0.2]
  -h, --help                   Print help
  -V, --version                Print version
normalizer ./data/sample.1.fr.srt

πŸŽ‰ The CLI produces an easily usable stdout result in JSON format (it should be the same as in the Python version).

{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}
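
If you want to consume the CLI's report from another program, the JSON above can be parsed with any JSON library. Below is a minimal sketch, assuming the serde_json crate is added as a dependency; the binary name and field names are taken from the sample output above, nothing else is implied about the CLI.

// Minimal sketch: run the CLI and read a few fields from its JSON report.
// Assumes serde_json is available; field names come from the sample above.
use std::process::Command;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Run the CLI on a file and capture its JSON report from stdout.
    let output = Command::new("normalizer")
        .arg("./data/sample.1.fr.srt")
        .output()?;
    let report: serde_json::Value = serde_json::from_slice(&output.stdout)?;

    // Pick out a few of the fields shown in the sample report above.
    let encoding = report["encoding"].as_str().unwrap_or("unknown");
    let chaos = report["chaos"].as_f64().unwrap_or(0.0);
    println!("detected {} (chaos: {})", encoding, chaos);
    Ok(())
}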

Rust

The library offers two main functions. The first one is from_bytes, which processes text passed as bytes:

use charset_normalizer_rs::from_bytes;

fn test_from_bytes() {
    let result = from_bytes(&vec![0x84, 0x31, 0x95, 0x33], None);
    let best_guess = result.get_best();
    assert_eq!(best_guess.unwrap().encoding(), "gb18030");
}

fn main() {
    test_from_bytes();
}
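
Detection only yields the name of the most likely charset; to obtain an actual string you still need to decode the bytes. Below is a minimal sketch of combining the two, assuming the encoding_rs crate (an implementation of the WHATWG encodings) is added as a dependency and that encoding() returns the WHATWG label as shown in the example above.

// Sketch: detect the encoding with charset-normalizer-rs, then decode with encoding_rs.
use charset_normalizer_rs::from_bytes;
use encoding_rs::Encoding;

fn main() {
    // Raw bytes in an unknown encoding (GB18030-encoded text in this example).
    let raw = vec![0x84, 0x31, 0x95, 0x33];

    // Detect the most plausible encoding.
    let result = from_bytes(&raw, None);
    if let Some(best_guess) = result.get_best() {
        // Feed the detected WHATWG label to encoding_rs to decode the bytes.
        if let Some(encoding) = Encoding::for_label(best_guess.encoding().as_bytes()) {
            let (text, _, had_errors) = encoding.decode(&raw);
            println!("{} -> {} (errors: {})", best_guess.encoding(), text, had_errors);
        }
    }
}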

from_path processes text read from the file at the given path:

use std::path::PathBuf;
use charset_normalizer_rs::from_path;

fn test_from_path() {
    let result = from_path(&PathBuf::from("src/tests/data/samples/sample-chinese.txt"), None).unwrap();
    let best_guess = result.get_best();
    assert_eq!(best_guess.unwrap().encoding(), "big5");
}

fn main() {
    test_from_path();
}
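
A common pattern is running detection over a whole directory. The following sketch is illustrative only (the ./data directory name is an assumption); it relies only on from_path and get_best as shown above.

// Sketch: report the best encoding guess for every file in a directory.
use std::path::PathBuf;
use charset_normalizer_rs::from_path;

fn main() -> std::io::Result<()> {
    for entry in std::fs::read_dir("./data")? {
        let path: PathBuf = entry?.path();
        match from_path(&path, None) {
            Ok(result) => {
                // Fall back to "unknown" when no match survived the analysis.
                let encoding = result
                    .get_best()
                    .map(|m| m.encoding().to_string())
                    .unwrap_or_else(|| "unknown".to_string());
                println!("{}: {}", path.display(), encoding);
            }
            Err(_) => eprintln!("{}: detection failed", path.display()),
        }
    }
    Ok(())
}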

πŸ˜‡ Why

When I started using Chardet (the Python version), I noticed that it did not meet my expectations, so I wanted to propose a reliable alternative using a completely different method. Also! I never back down from a good challenge!

I don't care about the originating charset encoding, because two different tables can produce two identical rendered strings. What I want is to get readable text, the best I can.

In a way, I'm brute forcing text decoding. How cool is that? 😎

🍰 How

  • Discard all charset encoding tables that could not fit the binary content.
  • Measure the noise, or the mess, once the content is opened (in chunks) with a corresponding charset encoding.
  • Extract the matches with the lowest mess detected.
  • Additionally, we measure coherence / probe for a language.

Wait a minute, what are noise/mess and coherence according to YOU?

Noise: I opened hundreds of text files, written by humans, with the wrong encoding table. I observed them, then established some ground rules about what is obviously a mess. I know that my interpretation of what counts as noise is probably incomplete; feel free to contribute in order to improve or rewrite it.
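
To make the idea concrete, here is a deliberately oversimplified, hypothetical sketch of a mess score: count obviously suspicious characters in the decoded text. The real detector uses a set of much finer-grained rules; the function below is only an illustration.

// Hypothetical, oversimplified "mess" score: share of suspicious characters.
fn mess_ratio(decoded: &str) -> f32 {
    let total = decoded.chars().count();
    if total == 0 {
        return 0.0;
    }
    let suspicious = decoded
        .chars()
        .filter(|c| {
            // Replacement characters and non-whitespace control codes are strong
            // hints that the wrong decoding table was used.
            *c == '\u{FFFD}' || (c.is_control() && !c.is_whitespace())
        })
        .count();
    suspicious as f32 / total as f32
}

fn main() {
    // Well-formed text scores low, garbage scores high.
    assert!(mess_ratio("Bonjour tout le monde") < 0.1);
    assert!(mess_ratio("Bon\u{FFFD}jo\u{0001}ur") > 0.2);
}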

Coherence: For each language on earth, we have computed ranked letter appearance frequencies (as well as we can). So I thought that intel is worth something here. I use those records against the decoded text to check whether I can detect intelligent design.
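
Again purely as an illustration, a heavily simplified, hypothetical coherence check could compare the most frequent letters of the decoded text with a reference ranking for a language. The reference letters below are an assumption made for the example, not the crate's actual frequency data.

// Hypothetical coherence check: overlap between a language's most frequent
// letters and the most frequent letters of the decoded text.
use std::collections::HashMap;

fn coherence(decoded: &str, reference_top_letters: &[char]) -> f32 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in decoded.chars().filter(|c| c.is_alphabetic()) {
        let lower = c.to_lowercase().next().unwrap_or(c);
        *counts.entry(lower).or_insert(0) += 1;
    }

    // Rank the observed letters by frequency and keep the top of the list.
    let mut observed: Vec<(char, usize)> = counts.into_iter().collect();
    observed.sort_by(|a, b| b.1.cmp(&a.1));
    let top: Vec<char> = observed
        .iter()
        .take(reference_top_letters.len())
        .map(|(c, _)| *c)
        .collect();

    // Share of reference letters found among the text's most frequent letters.
    let hits = reference_top_letters.iter().filter(|c| top.contains(*c)).count();
    hits as f32 / reference_top_letters.len() as f32
}

fn main() {
    // Illustrative reference ranking for English (not the crate's real data).
    let english = ['e', 't', 'a', 'o', 'i', 'n', 's', 'r', 'h', 'l'];
    println!("{:.2}", coherence("the quick brown fox jumps over the lazy dog", &english));
}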

⚑ Known limitations

  • Language detection is unreliable when the text contains two or more languages sharing identical letters (e.g. HTML (English tags) + Turkish content, which share Latin characters).
  • Every charset detector heavily depends on sufficient content. In common cases, do not bother running detection on very tiny content.

πŸ‘€ Contributing

Contributions, issues and feature requests are very welcome.
Feel free to check the issues page if you want to contribute.

πŸ“ License

Copyright Β© Nikolay Yarovoy @nickspring - porting to Rust.
Copyright Β© Ahmed TAHRI @Ousret - original Python version and some parts of this document.
This project is MIT licensed.

Character frequencies used in this project Β© 2012 Denny VrandečiΔ‡

Comments
  • Idiomatic 3

    Idiomatic 3

    Of these commits 3d4d4c0 is the most interesting.

    As is, only the bare minimum necessary to change for loops into iterators was done. It isn't faster as is, but the following could be done to optimize:

    1. Since there are no outside side effects or dependencies between elements (I think), rayon could be used.
    2. filter_map is used in the bare minimum way. This could be combined with further filter_maps to replace unwraps() or replace if/else.
    3. General optimization of code could be done with regard to the current map(). For instance, I wonder if encoding[encoding.len() - 2] or if encoding.len() == 1 && encoding.first().unwrap() == "largesets" couldn't be changed? I think a closer look is necessary, and I didn't want to change something without knowing and break it.
    opened by chris-ha458 11
  • Idiomatic fixes 2

    Idiomatic fixes 2

    c918f37 and 0b3ff3b need special attention.

    Since the length + 1 was not necessary, I removed it and changed some code (length - 1) downstream.

    This does affect the match range, so I changed the range to match the original behavior, but it looks strange now (0~510 instead of 512?).

    Other than that these are mostly side-effect-free changes, except 7d65129, which changed exit behavior.

    If the original exit code 1 is strictly necessary, there are other ways to recover the behavior as well.

    opened by chris-ha458 10
  • Fixes

    Fixes

    Being more strict with cargo clippy and cargo test, and looking more into cargo clippy -- -Wclippy::pedantic.

    As said before, pedantic is just that. Pedantic. There are many false positives and we don't need to fix all of them. But it still would be valuable to understand why they are false positives and document them if necessary.

    opened by chris-ha458 8
  • Recover accuracy

    Recover accuracy

    If mess_difference < 0.01, we check whether coherence_difference > 0.02 and return a PartialOrd based on that. If not, we try to use the multibyte usage difference if it is big enough.

    Comparing with multibyte_usage_a.abs() > f32::EPSILON is idiomatic and covers the case when the value is 0.0 or very close to it.

    However, it does not change the final accuracy at all.

    opened by chris-ha458 7
  • [BUG] unreachable code?

    [BUG] unreachable code?

    Describe the bug This part of the code seems to be unreachable

    The outer if ensures that coherence_difference is not 0.0, so that particular if statement should never be reached.

    To Reproduce I don't know the intended case for these statements so I cannot reproduce the "negative case"

    Expected behavior Maybe the outer if needs to be different. In the original Python code, the comparison chaos_difference == 0.0 and self.coherence == other.coherence is used.

    bug help wanted 
    opened by chris-ha458 4
  • md fixes

    md fixes

    Strips some black-magic boolean logic from should_strip_sig_or_bom (4899bf3).

    I double-checked this with pen and paper, and all tests pass, but it would still be a good idea to get another set of eyes here.

    opened by chris-ha458 2
  • Idiomatic

    Idiomatic

    a85c5fe1dd249ef52c873e58d061fbe73c776a5c is worth looking at.

    This would only be a problem when the file size is around usize::MAX, which is around 2^64 bytes.

    opened by chris-ha458 1
  • Lint fixes

    Lint fixes

    Using cargo clippy -- -Wclippy::pedantic to find some further low-hanging fruit. The false positive rate is quite high, so I'm tempted to include some lint allows like #![allow(clippy::range_plus_one)].

    However, I'm not sure you would agree that it is a good idea, since the default cargo clippy does not complain. One compromise would be to have it enabled in a separate branch like pedantic or dev, which we could use to merge recent fixes and look through the code in a strict (pedantic), yet context-aware (we know some are false positives) way.

    opened by chris-ha458 1
  • Lib.rs

    Lib.rs

    I'll be very honest here: most of these changes are superficial.

    I hoped to isolate at least some parts of the code into smaller functions, but it seems that once we include the debug and trace calls, which have long-reaching dependencies, it is not easy.

    One thing I'm curious about is all the fallbacks: what are they used for and how are they different? They seem like they might be an artifact of a previous implementation.

    If they are necessary, I'm wondering if some kind of enum couldn't be used instead?

    opened by chris-ha458 1
  • Idiomatic fixes

    Idiomatic fixes

    Many of these changes revolve around unwrap.

    There are more changes that could be done to improve performance, readability or correctness, but most would need at least some kind of consideration as to the correct implementation.

    If fixes like this are welcome and merged, I will open issues to discuss potential changes and their considerations.

    opened by chris-ha458 1
  • Fix bench

    Fix bench

    Hi. I'm taking a look into your code and it seems the bench does not work at the moment.

    I am not yet familiar with the ins and outs of the code, so I did what I felt was the minimum I could do.

    opened by chris-ha458 1
  • [BUG] correct behavior for β€œΠβ€ (U+0401)

    [BUG] correct behavior for β€œΠβ€ (U+0401)

    Describe the bug In test_is_accentuated (https://github.com/nickspring/charset-normalizer-rs/blob/cbe086f3df38a16815033309ae80455eef64cef4/src/tests/utils.rs#L28) this case is tested and expected to be false: β€œΠβ€ (U+0401) Cyrillic Capital Letter Io.

    The code being tested is here: https://github.com/nickspring/charset-normalizer-rs/blob/cbe086f3df38a16815033309ae80455eef64cef4/src/utils.rs#L118

    The problem here is that it is considered to have a diaeresis under current correct Unicode decomposition rules (both NFKD and NFD): https://www.compart.com/en/unicode/U+0401 https://graphemica.com/%D0%81

    (BTW this is different from the almost identical-looking Unicode character β€œΓ‹β€ (U+00CB) Latin Capital Letter E with Diaeresis.)

    To Reproduce The icu4x crates can be used to decompose in Rust: cargo add icu_normalizer. I am actually trying to reimplement some parts of the code and that is how I discovered it.

    pub(crate) fn is_accentuated(ch: char) -> bool {
        let nfd = icu_normalizer::DecomposingNormalizer::new_nfkd();
        let denormalized_string: String = nfd.normalize(ch.to_string().as_str());
        denormalized_string
            .chars()
            .any(|decomposed| match decomposed {
                '\u{0300}' // "WITH GRAVE"
                | '\u{0301}' // "WITH ACUTE"
                | '\u{0302}' // "WITH CIRCUMFLEX"
                | '\u{0303}' // "WITH TILDE"
                | '\u{0308}' // "WITH DIAERESIS"
                | '\u{0327}' // "WITH CEDILLA"
                => true,
                _ => false,
            })
    }
    

    This new implementation tries to directly decompose the input character and check whether Unicode characters that indicate accents exist. Since β€œΠβ€ (U+0401) Cyrillic Capital Letter Io decomposes into Π• Cyrillic Capital Letter Ie + Diaeresis '\u{0308}', the new code returns true, while the old code returns false (since "diaeresis" is not in the character's name).

    Expected behavior β€œΠβ€ (U+0401) should return true.

    Additional context The Unicode standard is fast-moving: there is a new version every year, and especially for CJK new codepoints are added constantly. I think it is valuable to have an implementation that is up to date.

    Btw I have almost finished my implementation using various components from https://github.com/unicode-org/icu4x. It is a pure Rust codebase worked on by both standards bodies and industry supporters such as Google, so I feel like it would be a good library to rely upon.

    bug help wanted 
    opened by chris-ha458 5
  • Improvements : Idiomatic code

    Improvements : Idiomatic code

    Re : #3

    This codebase has been ported from Python, and a lot of the design patterns could be improved to be more idiomatic Rust code. Such a move would make it easier to improve speed and maintainability, and to ensure correct operation from a Rust point of view.

    Some examples would be avoiding for loops, using match instead of if chains, etc.

    Many require deeper consideration.

    For example, this codebase makes extensive use of f32. Unless using intrinsics, f64 is as fast as or faster than f32 in Rust. Moreover, casting back and forth between f32 and f64 can harm performance and make it difficult to ensure correct code. For instance, there are exact comparisons between f32 and f64 variables, which are very unlikely to operate in the intended way. If that is intended, it would be valuable to document it and suppress the relevant lints. However, if there is a need to maintain ABI compatibility or follow a specification, it might be inevitable. Also, on-disk size could be a consideration. In summary, f32 vs f64 handling could serve both idiomatic code and speed, but only if done right.

    I will try to prepare some PRs that change some things. Despite my best efforts, I am sure that many of my changes or views might be based on a flawed understanding of the code, so feel free to explain why things were done the way they were. In such cases I will help with documentation.

    opened by chris-ha458 10
  • Improvements : Speed

    Improvements : Speed

    As per our discussion in #2, the following has been suggested for speed improvements:

    β€’ calculate coherence & mess in threads
    β€’ or calculate mess for the plugins in threads (or some async?)
    β€’ or something else...

    The paths I had in mind were these:

    β€’ Related to the threads idea: use Rayon
      β€’ Replace HashMap with the concurrent DashMap (the current std HashMap works with rayon, so this is not strictly necessary, but might be useful to look into regardless)
    β€’ Replace the hashing algorithm used in HashMap with FxHash, AHash or HighwayHash
      β€’ aHash implemented in #14
    β€’ ~Replace sort() with sort_unstable()~ #6
    β€’ Identify preallocation opportunities. For instance, replace Vec::new() with Vec::with_capacity()

    Many of these are low-hanging fruit and related to refactoring the code into idiomatic Rust. For example, there are many for loops in this code. Iterator-based code is more idiomatic, easier to improve with rayon, and interacts better with allocation (pushing items from within a for loop can cause multiple allocations and copies, while collecting an iterator can allow fewer allocations).

    opened by chris-ha458 14