SIMD-accelerated UTF-8 validation for Rust.

Overview

simdutf8 – High-speed UTF-8 validation for Rust

Blazingly fast API-compatible UTF-8 validation for Rust using SIMD extensions, based on the implementation from simdjson. Originally ported to Rust by the developers of simd-json.rs.

Disclaimer

This software should not (yet) be used in production, though it has been tested with sample data as well as fuzzing and there are no known bugs.

Features

  • basic API for the fastest validation, optimized for valid UTF-8
  • compat API as a fully compatible replacement for std::str::from_utf8()
  • Up to 22 times faster than the std library on non-ASCII, up to three times faster on ASCII
  • As fast as or faster than the original simdjson implementation
  • Supports AVX 2 and SSE 4.2 implementations on x86 and x86-64. ARMv7 and ARMv8 NEON support is planned.
  • Selects the fastest implementation at runtime based on CPU support
  • Written in pure Rust
  • No dependencies
  • No-std support
  • Falls back to the excellent std implementation if SIMD extensions are not supported

Quick start

Add the dependency to your Cargo.toml file:

[dependencies]
simdutf8 = { version = "0.1.1" }

Use simdutf8::basic::from_utf8 as a drop-in replacement for std::str::from_utf8().

use simdutf8::basic::from_utf8;

println!("{}", from_utf8(b"I \xE2\x9D\xA4\xEF\xB8\x8F UTF-8!").unwrap());

If you need detailed information on validation failures, use simdutf8::compat::from_utf8 instead.

use simdutf8::compat::from_utf8;

let err = from_utf8(b"I \xE2\x9D\xA4\xEF\xB8 UTF-8!").unwrap_err();
assert_eq!(err.valid_up_to(), 5);
assert_eq!(err.error_len(), Some(2));

APIs

Basic flavor

Use the basic API flavor for maximum speed. It is fastest on valid UTF-8, but only checks for errors after processing the whole byte sequence and does not provide detailed information if the data is not valid UTF-8. simdutf8::basic::Utf8Error is a zero-sized error struct.
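One way to combine the two flavors is to run a fast yes/no check first and only re-validate for details when it fails. The sketch below illustrates that pattern with the standard library alone: `BasicError`, `validate_basic` and `validate_with_details` are hypothetical names, and `std::str::from_utf8` stands in for both validators, so this is not the crate's code.

```rust
use std::str;

/// Hypothetical stand-in for a zero-sized "basic" error: it only records
/// that the input was invalid, not where or why.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BasicError;

/// Stand-in for the fast path. A real program would call a SIMD validator
/// here; this sketch uses the standard library and discards the details.
pub fn validate_basic(input: &[u8]) -> Result<&str, BasicError> {
    str::from_utf8(input).map_err(|_| BasicError)
}

/// Runs the fast path first and re-validates only on failure, so the cost
/// of detailed error reporting is paid for invalid input only.
pub fn validate_with_details(input: &[u8]) -> Result<&str, str::Utf8Error> {
    validate_basic(input).or_else(|_| str::from_utf8(input))
}
```

Whether the second pass pays off depends on how often invalid input is expected; for mostly valid data the slow path should rarely run.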

Compat flavor

The compat flavor is fully API-compatible with std::str::from_utf8. In particular, simdutf8::compat::from_utf8() returns a simdutf8::compat::Utf8Error, which has valid_up_to() and error_len() methods. The first is useful for verification of streamed data. The second is useful e.g. for replacing invalid byte sequences with a replacement character.

It also fails early: errors are checked on-the-fly as the string is processed and once an invalid UTF-8 sequence is encountered, it returns without processing the rest of the data. This comes at a performance penalty compared to the basic API even if the input is valid UTF-8.
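As an illustration of how valid_up_to() and error_len() can drive replacement of invalid sequences, here is a minimal sketch. It uses `std::str::from_utf8` so that it is self-contained; since the compat flavor is described as API-compatible, swapping in the compat function is assumed to work the same way. `decode_lossy` is a hypothetical helper name.

```rust
use std::str;

/// Decodes `input`, replacing every invalid sequence with U+FFFD.
pub fn decode_lossy(mut input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    loop {
        match str::from_utf8(input) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                // Everything before `valid_up_to()` is known to be valid.
                let (valid, rest) = input.split_at(err.valid_up_to());
                out.push_str(str::from_utf8(valid).expect("prefix is valid UTF-8"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep decoding.
                    Some(len) => input = &rest[len..],
                    // The input ended in the middle of a character.
                    None => return out,
                }
            }
        }
    }
}
```

The standard library already offers `String::from_utf8_lossy` for this exact task; the loop is spelled out here only to show what the two accessor methods report.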

Implementation selection

The fastest implementation is selected at runtime using the std::is_x86_feature_detected! macro unless the CPU targeted by the compiler supports the fastest available implementation. So if you compile with RUSTFLAGS="-C target-cpu=native" on a recent x86-64 machine, the AVX 2 implementation is selected at compile time and runtime selection is disabled.
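The selection logic described above can be sketched roughly as follows. The `Implementation` enum and `select_implementation` function are hypothetical names for illustration; only `cfg!` and `std::is_x86_feature_detected!` are real, and the crate's actual dispatch code may be organized differently.

```rust
/// Hypothetical label for the implementation a dispatcher would choose.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Implementation {
    Avx2,
    Sse42,
    StdFallback,
}

pub fn select_implementation() -> Implementation {
    // If the compiler already targets AVX 2 (for example via
    // -C target-cpu=native), no runtime check is needed.
    if cfg!(target_feature = "avx2") {
        return Implementation::Avx2;
    }
    // Otherwise ask the CPU at runtime; this macro exists only on x86 targets.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            return Implementation::Avx2;
        }
        if std::is_x86_feature_detected!("sse4.2") {
            return Implementation::Sse42;
        }
    }
    Implementation::StdFallback
}
```

A real dispatcher would typically cache the result (for instance in a function pointer) so that the detection cost is paid once rather than on every call.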

For no-std support (compiled with --no-default-features) the implementation is always selected at compile time based on the targeted CPU. Use RUSTFLAGS="-C target-feature=+avx2" for the AVX 2 implementation or RUSTFLAGS="-C target-feature=+sse4.2" for the SSE 4.2 implementation.

If you want to be able to call a SIMD implementation directly, use the public_imp feature flag. The validation implementations are then accessible via simdutf8::(basic|compat)::imp::x86::(avx2|sse42)::validate_utf8().

When not to use

This library uses unsafe code which has not been battle-tested and should not (yet) be used in production.

Minimum Supported Rust Version (MSRV)

This crate's minimum supported Rust version is 1.38.0.

Benchmarks

The benchmarks were run with criterion; the tables were created with critcmp. Source code and data are in the bench directory.

The naming schema is id-charset/size. 0-empty is the empty byte slice, x-error/66536 is a 64KiB slice where the very first character is invalid UTF-8. All benchmarks were run on a laptop with an Intel Core i7-10750H CPU (Comet Lake) on Windows with Rust 1.51.0 unless otherwise stated. Library versions are simdutf8 v0.1.1 and simdjson v0.9.2. When comparing with simdjson, simdutf8 is compiled with #[inline(never)].

simdutf8 basic vs std library UTF-8 validation

[Figure: critcmp simdutf8 v0.1.1 basic vs std lib]

simdutf8 performs as well as or better than the std library.

simdutf8 basic vs simdjson UTF-8 validation on Intel Comet Lake

[Figure: critcmp simdutf8 v0.1.1 basic vs simdjson WSL]

simdutf8 beats simdjson on almost all inputs on this CPU. This benchmark is run on WSL since I could not get simdjson to reach maximum performance on Windows with any C++ toolchain (see also simdjson issues 847 and 848).

simdutf8 basic vs simdjson UTF-8 validation on AMD Zen 2

[Figure: critcmp simdutf8 v0.1.1 basic vs simdjson AMD Zen 2]

On AMD Zen 2 aligning reads apparently does not matter at all. The extra step for aligning even hurts performance a bit around an input size of 4096.

simdutf8 basic vs simdutf8 compat UTF-8 validation

There is a small performance penalty to continuously checking the error status while processing data, but detecting errors early provides a huge benefit for the x-error/66536 benchmark.

Technical details

On x86, validation of inputs shorter than 64 bytes is delegated to core::str::from_utf8().
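The delegation for short inputs amounts to a single length check. A minimal sketch, assuming hypothetical names (`Path`, `choose_path`, `validate`) and using the standard library as a stand-in for the SIMD kernel:

```rust
use std::str;

/// Inputs shorter than this go to the scalar standard-library validator.
const SIMD_THRESHOLD: usize = 64;

/// Hypothetical label for the code path taken.
#[derive(Debug, PartialEq, Eq)]
pub enum Path {
    Scalar,
    Simd,
}

/// One predictable branch on the input length.
pub fn choose_path(len: usize) -> Path {
    if len < SIMD_THRESHOLD {
        Path::Scalar
    } else {
        Path::Simd
    }
}

pub fn validate(input: &[u8]) -> bool {
    match choose_path(input.len()) {
        Path::Scalar => str::from_utf8(input).is_ok(),
        // Stand-in only: a real implementation would run its SIMD kernel here.
        Path::Simd => str::from_utf8(input).is_ok(),
    }
}
```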

The SIMD implementation is similar to the one in simdjson except that it aligns reads to the block size of the SIMD extension, which leads to better peak performance compared to the implementation in simdjson on some CPUs. This alignment means that an incomplete block needs to be processed before the aligned data is read, which leads to worse performance on byte sequences shorter than 2048 bytes. Thus, aligned reads are only used with 2048 bytes of data or more. Incomplete reads for the first unaligned and the last incomplete block are done in two aligned 64-byte buffers.
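To make the head/body/tail split concrete, the following sketch partitions a slice into an unaligned head, a run of whole aligned blocks, and an incomplete tail. The 32-byte block size (one AVX 2 register) and the names `BLOCK`, `ALIGN_THRESHOLD` and `split_for_reads` are assumptions for illustration; the sketch only computes the split and does not perform any SIMD reads.

```rust
/// Assumed SIMD block size in bytes (the width of one AVX 2 register).
const BLOCK: usize = 32;
/// Below this length the sketch does not bother aligning.
const ALIGN_THRESHOLD: usize = 2048;

/// Splits `input` into (unaligned head, whole blocks, incomplete tail).
pub fn split_for_reads(input: &[u8]) -> (&[u8], &[u8], &[u8]) {
    let head_len = if input.len() >= ALIGN_THRESHOLD {
        // Number of bytes until the next BLOCK-aligned address.
        input.as_ptr().align_offset(BLOCK).min(input.len())
    } else {
        0
    };
    let (head, rest) = input.split_at(head_len);
    // Only whole blocks go into the body; the remainder is the tail.
    let body_len = rest.len() - rest.len() % BLOCK;
    let (body, tail) = rest.split_at(body_len);
    (head, body, tail)
}
```

In the scheme described above, the head and tail would each be copied into a zero-padded aligned buffer so that every read the kernel performs is a full, aligned block.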

For the compat API we need to check the error buffer on each 64-byte block instead of just aggregating it. If an error is found, the last bytes of the previous block are checked for a cross-block continuation and then std::str::from_utf8() is run to find the exact location of the error.
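The error-location step can be approximated in scalar code. The sketch below assumes that some block-wise check has flagged the block starting at `block_start` and that all complete characters before that block are valid; it then backs up to the start of a character that may straddle the block boundary and lets the standard library pinpoint the error. `char_start_before` and `locate_error` are hypothetical helpers, not the crate's functions.

```rust
use std::str;

/// Returns where re-validation must start: `block_start`, or the lead byte of
/// a character that begins in the previous block and continues into this one.
fn char_start_before(input: &[u8], block_start: usize) -> usize {
    for back in 1..=block_start.min(3) {
        let byte = input[block_start - back];
        if (byte & 0xC0) != 0x80 {
            // First non-continuation byte: how long does its sequence claim to be?
            let needed = match byte {
                0xC0..=0xDF => 2,
                0xE0..=0xEF => 3,
                0xF0..=0xFF => 4,
                _ => 1,
            };
            return if needed > back { block_start - back } else { block_start };
        }
    }
    block_start
}

/// Returns `(valid_up_to, error_len)` for the first error at or after the
/// flagged block, or `None` if the remaining input turns out to be valid.
pub fn locate_error(input: &[u8], block_start: usize) -> Option<(usize, Option<usize>)> {
    let from = char_start_before(input, block_start.min(input.len()));
    match str::from_utf8(&input[from..]) {
        Ok(_) => None,
        Err(err) => Some((from + err.valid_up_to(), err.error_len())),
    }
}
```

For example, with 63 ASCII bytes followed by a truncated three-byte sequence that crosses the 64-byte boundary, the reported position is the lead byte at index 63 rather than the start of the second block.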

Care is taken that all functions are properly inlined up to the public interface.

Thanks

  • to the authors of simdjson for coming up with the high-performance SIMD implementation.
  • to the authors of the simdjson Rust port who did most of the heavy lifting of porting the C++ code to Rust.

License

This code is dual-licensed under the Apache License 2.0 and the MIT License.

It is based on code distributed with simd-json.rs, the Rust port of simdjson, which is dual-licensed under the MIT license and Apache 2.0 license as well.

simdjson itself is distributed under the Apache License 2.0.

References

John Keiser, Daniel Lemire, Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice and Experience 51 (5), 2021

Comments
  • Research on aarch64 support

    I did some initial research on adding aarch64 support. Many neon intrinsics seem missing from core::arch::aarch64 and will need to be implemented locally until upstreamed.

    opened by chadbrewbaker 12
  • Adds support for WASM 128-bit SIMD

    There are some TODOs (see below), but the implementation and test suite integration has been done.

    • Adds feature wasm32_simd128 consistent with aarch64_neon as there is no auto-detection for SIMD in WASM.
    • Updates README.md with WASM build instructions.
    • Integrates with tests, though the should_panic test case will be ignored due to limitations of the test driver with wasm32-wasi.
    • Adds ignore for IntelliJ based IDEs (e.g., CLion).

    TODO

    • [x] Integrate with benchmarks. Naively targeting wasm32-wasi for the benchmarks has issues with both wasmer and wasmtime. I suspect this is partially because criterion expects wasm32 targets to have wasm-bindgen in order to run. One option would be to compile a shim of the library as a staticlib crate targeting wasm32-unknown-unknown and to use wasmer or wasmtime to compile/link the function and benchmark it in the current framework.
    • [x] Validate inlining.
    • [x] Integrate with CI workflow scripts.
    • [ ] Run the fuzzer.

    Implementation for #36

    opened by almann 11
  • Performance on short strings

    If you are only processing short byte sequences (less than 64 bytes), the excellent scalar algorithm in the standard library is likely faster. If there is no native implementation for your platform (yet), use the standard library instead.

    To my knowledge, there is no hard engineering reason why you'd ever be slower irrespective of the string length. In the worst case, you can always do...

    if(short string) {
      do that
    } else {
      do this
    }
    

    This adds one predictable branch.

    enhancement 
    opened by lemire 11
  • aarch64 support with nightly Rust (prototype)

    So it turns out all the necessary intrinsics are there and after the big refactoring today adding aarch64 support was quite easy.

    The unit tests on my Raspberry Pi 4 are the only testing it has got. As Linus famously said: "If it compiles, it is good; if it boots up, it is perfect."

    opened by hkratz 8
  • AArch64 SIMD intrinsics are now stable

    https://github.com/rust-lang/stdarch/pull/1266 has landed and shipped in v1.59

    Note that documentation still displays the intrinsics as unstable due to a rustdoc bug: https://github.com/rust-lang/stdarch/issues/1268

    It should be possible to remove the feature gate now and enable SIMD on ARM by default.

    opened by Shnatsel 3
  • The functions `validate_utf8_basic` should not be labeled unsafe

    We already assume that calling validate_utf8_basic is safe, since it's called within from_utf8 and from_utf8_mut which are labelled safe. Thus, it doesn't make any sense for validate_utf8_basic to be labelled an unsafe API.

    opened by ArniDagur 3
  • Integrating into simd-json

    It would make a lot of sense to use this in simd-json. It should be straightforward to do but might require making some of the internals public. Let me know what you think and I can help kick this off.

    enhancement 
    opened by CJP10 3
  • Deserialising unicode escape gives non-UTF8 String

    For tracking, this is a forward of https://github.com/simd-lite/simd-json/issues/228 that @5225225 opened. The following test, if added to simdutf8 fails:

        test_invalid_after_prefixes(br#""\uDE71""#, 0, None, Some(5));
    

    It is an invalid UTF8 sequence but passes as valid.

    opened by Licenser 2
  • Remove internal memcpy implementation

    LLVM is unrolling it for the 128+ bytes case which never happens. The speedup with this patch is negligible but code size is reduced quite a bit.

    It is a shame that all attempts to do this the Rust way (using copy_from_slice() or read_unaligned()/write_unaligned()) have resulted in non-inlined calls to memcpy with noticeable slowdown.

    This implementation is optimally auto-vectorized and inlined. The compiler can even prove that the len is < 32 at the first call site and optimizes that check away.

    opened by hkratz 2
  • Validated ring buffer iterator

    It would be nice to have a validated ring buffer iterator. Not sure if the consumer would want (pointer, byteLength) or (pointer, ignoreNPrefixBytes, byteLength) to keep them L1 cache aligned.

    enhancement question 
    opened by chadbrewbaker 2
  • UTF-8 reordering and deletion detector

    See https://trojansource.codes/trojan-source.pdf

    A SIMD version of the UTF-8 reordering/deletion detector is now critical infrastructure. https://imperceptible.ml/detector

    opened by chadbrewbaker 1
  • Heads-up: const_err lint is going away

    This crate carries an allow(const_err). That lint is going away since it becomes a hard error, which causes a warning due to the removed lint being used, which then triggers deny(warnings).

    The crate does not actually seem to trigger const_err (according to crater), so the hard error itself should not be a problem. The allow(const_err) can likely just be removed.

    opened by RalfJung 3
  • Run Fuzzer on wasm32 Targeted Code

    As part of #56, there is a remaining TODO to integrate with the fuzzer. Based on the README for rust-fuzz, x86-64 is required, so we cannot run the fuzzer natively on something like wasm32-wasi.

    https://github.com/rust-fuzz/cargo-fuzz/blob/63730da7f95cfb21f6f5a9b0a74532f98d3983a4/README.md?plain=1#L13-L16

    In order to integrate with the fuzzer, we may want to take an approach similar to the benchmarking (shim to the WASM and use a WASM runtime to embed the functionality).

    opened by almann 3
  • Mislink on Windows with lld and thinlto

    It appears that validate_utf8_basic and similar functions trigger a mislink on Windows with lld and thinlto. This is not a terribly uncommon combination, so it may be worth exploring alternatives that do not cause this mislink.

    This is an issue @Kixiron ran into while using bytecheck, which uses simdutf8 for fast string validation. The issue was traced back to simdutf8 using a release build with debug symbols and WinDbg, then the memory backing the AtomicPtr was rewound to the beginning of the application and verified to be invalid. This indicates that the function pointer placed in it was not relocated to the correct address.

    opened by djkoloski 2
Releases(v0.1.4)
  • v0.1.4(Apr 2, 2022)

    New features

    • WASM (wasm32) support thanks to @almann!

    Improvements

    • Make aarch64 SIMD implementation work on Rust 1.59/1.60 with crate feature aarch64_neon
    • For Rust Nightly the aarch64 SIMD implementation is enabled out of the box.
    • Starting with Rust 1.61 the aarch64 SIMD implementation is expected to be enabled out of the box as well.

    Performance

    • Prefetch was disabled for aarch64 since the requisite intrinsics have not been stabilized.
  • v0.1.3(May 14, 2021)

  • v0.1.2(May 9, 2021)

    New features

    • Aarch64 support (e.g. Apple Silicon, Raspberry Pi 4, ...) with nightly Rust and crate feature aarch64_neon

    Performance

    • Another speedup on pure ASCII data
    • Aligned reads have been removed as the performance was worse overall.
    • Prefetch is used selectively on AVX 2, where it provides a slight benefit on some Intel CPUs.

    Comparison vs v0.1.1 on x86-64

    Other

    • Refactored SIMD integration to allow easy implementation for new architectures
    • Full test coverage
    • Thoroughly fuzz-tested
  • v0.1.1(Apr 26, 2021)

    Performance

    • Large speedup on small inputs from delegation to std lib
    • Up to 50% better peak throughput on ASCII
    • #[inline] main entry points for a small general speedup.

    Benchmark against v0.1.0

    Other

    • Make both Utf8Error variants implement std::error::Error
    • Make basic::Utf8Error implement core::fmt::Display
    • Document Minimum Supported Rust Version (1.38.0).
    • Reduce package size.
    • Documentation updates.
  • v0.1.0(Apr 21, 2021)

  • v0.0.3(Apr 21, 2021)

  • v0.0.2(Apr 20, 2021)

  • v0.0.1(Apr 20, 2021)
