SIMD-accelerated UTF-8 validation for Rust.

Last update: Jan 8, 2023

Overview

simdutf8 – High-speed UTF-8 validation for Rust

Blazingly fast API-compatible UTF-8 validation for Rust using SIMD extensions, based on the implementation from simdjson. Originally ported to Rust by the developers of simd-json.rs.

Disclaimer

This software should not (yet) be used in production, though it has been tested with sample data as well as fuzzing and there are no known bugs.

Features

basic API for the fastest validation, optimized for valid UTF-8
compat API as a fully compatible replacement for std::str::from_utf8()
Up to 22 times faster than the std library on non-ASCII, up to three times faster on ASCII
As fast as or faster than the original simdjson implementation
Supports AVX 2 and SSE 4.2 implementations on x86 and x86-64. ARMv7 and ARMv8 NEON support is planned
Selects the fastest implementation at runtime based on CPU support
Written in pure Rust
No dependencies
No-std support
Falls back to the excellent std implementation if SIMD extensions are not supported

Quick start

Add the dependency to your Cargo.toml file:

[dependencies]
simdutf8 = { version = "0.1.1" }

Use simdutf8::basic::from_utf8 as a drop-in replacement for std::str::from_utf8().

use simdutf8::basic::from_utf8;

println!("{}", from_utf8(b"I \xE2\x9D\xA4\xEF\xB8\x8F UTF-8!").unwrap());

If you need detailed information on validation failures, use simdutf8::compat::from_utf8 instead.

use simdutf8::compat::from_utf8;

let err = from_utf8(b"I \xE2\x9D\xA4\xEF\xB8 UTF-8!").unwrap_err();
assert_eq!(err.valid_up_to(), 5);
assert_eq!(err.error_len(), Some(2));

APIs

Basic flavor

Use the basic API flavor for maximum speed. It is fastest on valid UTF-8, but only checks for errors after processing the whole byte sequence and does not provide detailed information if the data is not valid UTF-8. simdutf8::basic::Utf8Error is a zero-sized error struct.
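
Because the basic error carries no details, one possible pattern is to validate on the fast path and, only on failure, run a second pass with a validator that reports where the input went wrong. The following is a minimal sketch of that pattern and not part of the crate: FastError and validate_fast are hypothetical stand-ins for simdutf8::basic::Utf8Error and simdutf8::basic::from_utf8 (built on std here so that the snippet compiles without dependencies), and std::str::from_utf8 plays the role of the detailed validator.

```rust
use std::str::Utf8Error;

/// Hypothetical stand-in for a zero-sized error type: it only says "invalid".
#[derive(Debug, PartialEq)]
struct FastError;

/// Hypothetical stand-in for the fast validator. It is implemented with std
/// here so that the sketch compiles without extra dependencies.
fn validate_fast(input: &[u8]) -> Result<&str, FastError> {
    std::str::from_utf8(input).map_err(|_| FastError)
}

/// Validate on the fast path; on failure, run the detailed validator over the
/// same bytes to find out where the input went wrong.
fn validate_with_details(input: &[u8]) -> Result<&str, Utf8Error> {
    match validate_fast(input) {
        Ok(s) => Ok(s),
        // Second pass, only paid for when the input is invalid.
        Err(FastError) => std::str::from_utf8(input),
    }
}
```

The second pass costs time only for invalid input, so the common valid case stays on the fast path.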

Compat flavor

The compat flavor is fully API-compatible with std::str::from_utf8. In particular, simdutf8::compat::from_utf8() returns a simdutf8::compat::Utf8Error, which has valid_up_to() and error_len() methods. The first is useful for verification of streamed data. The second is useful e.g. for replacing invalid byte sequences with a replacement character.

It also fails early: errors are checked on the fly as the string is processed, and once an invalid UTF-8 sequence is encountered, it returns without processing the rest of the data. This comes at a performance penalty compared to the basic API even if the input is valid UTF-8.
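
As an illustration of how valid_up_to() and error_len() are typically combined, here is a minimal sketch of lossy decoding. It is written against std::str::from_utf8 so that it compiles without dependencies; since the compat flavor is API-compatible, simdutf8::compat::from_utf8 should be usable in its place. The function name decode_lossy is made up for this example.

```rust
/// Decode `input`, replacing each invalid byte sequence with U+FFFD.
fn decode_lossy(mut input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    loop {
        match std::str::from_utf8(input) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                // Everything before `valid_up_to()` is known to be valid.
                let (valid, rest) = input.split_at(err.valid_up_to());
                out.push_str(std::str::from_utf8(valid).expect("validated prefix"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and continue with what follows.
                    Some(len) => input = &rest[len..],
                    // `None` means the input ended inside a multi-byte
                    // sequence. A streaming caller would keep `rest` and
                    // retry once more bytes arrive; here we simply stop.
                    None => return out,
                }
            }
        }
    }
}
```

For one-off conversions the standard library's String::from_utf8_lossy already does this; the sketch is only meant to show what the two error methods are for.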

Implementation selection

The fastest implementation is selected at runtime using the std::is_x86_feature_detected! macro unless the CPU targeted by the compiler supports the fastest available implementation. So if you compile with RUSTFLAGS="-C target-cpu=native" on a recent x86-64 machine, the AVX 2 implementation is selected at compile time and runtime selection is disabled.
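
Runtime selection of this kind can be sketched as follows. This only illustrates the std::is_x86_feature_detected! mechanism and is not the crate's actual dispatch code; the function name and the returned labels are made up for the example.

```rust
/// Report which implementation a runtime dispatcher would pick on this CPU.
/// The returned names are labels for this sketch only.
fn select_implementation() -> &'static str {
    // The detection macro only exists on x86 and x86-64 targets.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if std::is_x86_feature_detected!("sse4.2") {
            return "sse4.2";
        }
    }
    // No supported SIMD extension found, or a different architecture.
    "std fallback"
}
```

A real dispatcher would presumably store a function pointer rather than a label, so that detection runs only once.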

For no-std support (compiled with --no-default-features) the implementation is always selected at compile time based on the targeted CPU. Use RUSTFLAGS="-C target-feature=+avx2" for the AVX 2 implementation or RUSTFLAGS="-C target-feature=+sse4.2" for the SSE 4.2 implementation.
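
Compile-time selection can be sketched with the built-in cfg! macro, which reflects the target features passed through RUSTFLAGS. The function name and labels below are hypothetical and only illustrate the mechanism; the crate's own selection logic may be structured differently.

```rust
/// Pick an implementation label purely from the features the compiler was
/// told to target; no CPU detection happens at runtime.
fn compile_time_implementation() -> &'static str {
    if cfg!(target_feature = "avx2") {
        "avx2"
    } else if cfg!(target_feature = "sse4.2") {
        "sse4.2"
    } else {
        "fallback"
    }
}
```

Since cfg! expands to a constant true or false, the compiler can discard the branches that do not apply.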

If you want to be able to call a SIMD implementation directly, use the public_imp feature flag. The validation implementations are then accessible via simdutf8::(basic|compat)::imp::x86::(avx2|sse42)::validate_utf8().

When not to use

This library uses unsafe code which has not been battle-tested and should not (yet) be used in production.

Minimum Supported Rust Version (MSRV)

This crate's minimum supported Rust version is 1.38.0.

Benchmarks

The benchmarks were run with criterion, and the tables were created with critcmp. Source code and data are in the bench directory.

The name schema is id-charset/size. 0-empty is the empty byte slice, x-error/66536 is a 64 KiB slice where the very first character is invalid UTF-8. All benchmarks were run on a laptop with an Intel Core i7-10750H CPU (Comet Lake) on Windows with Rust 1.51.0 if not otherwise stated. Library versions are simdutf8 v0.1.1 and simdjson v0.9.2. When comparing with simdjson, simdutf8 is compiled with #[inline(never)].

simdutf8 basic vs std library UTF-8 validation

simdutf8 performs as well as or better than the std library.

simdutf8 basic vs simdjson UTF-8 validation on Intel Comet Lake

simdutf8 beats simdjson on almost all inputs on this CPU. This benchmark is run on WSL since I could not get simdjson to reach maximum performance on Windows with any C++ toolchain (see also simdjson issues 847 and 848).

simdutf8 basic vs simdjson UTF-8 validation on AMD Zen 2

On AMD Zen 2 aligning reads apparently does not matter at all. The extra step for aligning even hurts performance a bit around an input size of 4096.

simdutf8 basic vs simdutf8 compat UTF-8 validation

There is a small performance penalty to continuously checking the error status while processing data, but detecting errors early provides a huge benefit for the x-error/66536 benchmark.

Technical details

On x86, for inputs shorter than 64 bytes, validation is delegated to core::str::from_utf8().
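
The delegation amounts to a single length check in front of the validator. A minimal sketch, assuming a 64-byte threshold as stated above; all names are hypothetical, and the SIMD path is a placeholder that calls the scalar validator so that the snippet compiles on its own:

```rust
use core::str::Utf8Error;

/// Inputs shorter than this are handed to the scalar validator.
const SIMD_THRESHOLD: usize = 64;

#[derive(Debug, PartialEq)]
enum Path {
    Scalar,
    Simd,
}

/// Decide which code path an input of `len` bytes takes.
fn choose_path(len: usize) -> Path {
    if len < SIMD_THRESHOLD {
        Path::Scalar
    } else {
        Path::Simd
    }
}

/// Placeholder for the SIMD validator; it just calls the scalar one here.
fn validate_simd_placeholder(input: &[u8]) -> Result<&str, Utf8Error> {
    core::str::from_utf8(input)
}

/// Validate `input`, delegating short inputs to the scalar implementation.
fn validate(input: &[u8]) -> Result<&str, Utf8Error> {
    match choose_path(input.len()) {
        Path::Scalar => core::str::from_utf8(input),
        Path::Simd => validate_simd_placeholder(input),
    }
}
```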

The SIMD implementation is similar to the one in simdjson except that it aligns reads to the block size of the SIMD extension, which leads to better peak performance compared to the implementation in simdjson on some CPUs. This alignment means that an incomplete block needs to be processed before the aligned data is read, which leads to worse performance on byte sequences shorter than 2048 bytes. Thus, aligned reads are only used with 2048 bytes of data or more. Incomplete reads for the first unaligned and the last incomplete block are done in two aligned 64-byte buffers.
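
How a byte slice breaks down into an unaligned head, aligned 64-byte blocks, and an incomplete tail can be sketched with the standard slice::align_to method. This illustrates the idea only; the Block type and the function name are made up, and the crate's own code may compute the split differently.

```rust
/// A 64-byte block with 64-byte alignment, standing in for one aligned read.
#[repr(C, align(64))]
struct Block([u8; 64]);

/// Split `input` into an unaligned head, aligned 64-byte blocks, and a tail.
/// Head and tail are the parts that need separate handling.
fn split_aligned(input: &[u8]) -> (&[u8], &[Block], &[u8]) {
    // SAFETY: `Block` only wraps plain bytes, so every bit pattern is a valid
    // `Block`, and `align_to` itself guarantees the alignment of the middle.
    unsafe { input.align_to::<Block>() }
}
```

Note that align_to does not promise the longest possible middle slice, so code relying on it must handle a head or tail of any length.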

For the compat API we need to check the error buffer on each 64-byte block instead of just aggregating it. If an error is found, the last bytes of the previous block are checked for a cross-block continuation and then std::str::from_utf8() is run to find the exact location of the error.
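
The difference between aggregating the error state and checking it per block can be sketched with a deliberately simplified check. The example below only tests whether every byte is ASCII, as a stand-in for the real UTF-8 validation, and all names are hypothetical. It also rescans just the failing block, whereas the real implementation additionally has to look at the end of the previous block.

```rust
const BLOCK_SIZE: usize = 64;

/// Stand-in for the per-block check: non-zero if any byte has its high bit set.
fn block_error(block: &[u8]) -> u8 {
    block.iter().fold(0u8, |acc, &b| acc | b) & 0x80
}

/// "basic" style: accumulate the error state and inspect it once at the end.
fn is_ascii_aggregate(input: &[u8]) -> bool {
    let mut error = 0u8;
    for block in input.chunks(BLOCK_SIZE) {
        error |= block_error(block);
    }
    error == 0
}

/// "compat" style: inspect the error state after every block, stop at the
/// first failing block, and rescan only that block for the exact position.
fn first_non_ascii_early_exit(input: &[u8]) -> Option<usize> {
    for (index, block) in input.chunks(BLOCK_SIZE).enumerate() {
        if block_error(block) != 0 {
            return block
                .iter()
                .position(|&b| b & 0x80 != 0)
                .map(|offset| index * BLOCK_SIZE + offset);
        }
    }
    None
}
```

The extra branch per block is the small penalty mentioned in the benchmarks, and the early return is what helps when the error sits near the start of a large input.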

Care is taken that all functions are properly inlined up to the public interface.

Thanks

to the authors of simdjson for coming up with the high-performance SIMD implementation.
to the authors of the simdjson Rust port who did most of the heavy lifting of porting the C++ code to Rust.

License

This code is dual-licensed under the Apache License 2.0 and the MIT License.

It is based on code distributed with simd-json.rs, the Rust port of simdjson, which is dual-licensed under the MIT license and Apache 2.0 license as well.

simdjson itself is distributed under the Apache License 2.0.

References

John Keiser, Daniel Lemire, Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice and Experience 51 (5), 2021

Comments

Research on aarch64 support

I did some initial research on adding aarch64 support. Many NEON intrinsics seem to be missing from core::arch::aarch64 and will need to be implemented locally until upstreamed.

opened by chadbrewbaker 12
Adds support for WASM 128-bit SIMD
There are some TODOs (see below), but the implementation and test suite integration has been done.

Adds feature wasm32_simd128 consistent with aarch64_neon as there is no auto-detection for SIMD in WASM.

Updates README.md with WASM build instructions.

Integrates with tests, though the should_panic test case will be ignored due to limitations of the test driver with wasm32-wasi.

Adds ignore for IntelliJ based IDEs (e.g., CLion).

TODO

[x] Integrate with benchmarks. Naively targeting wasm32-wasi for the benchmarks has issues with both wasmer and wasmtime. I suspect this is partially because criterion expects wasm32 targets to have wasm-bindgen in order to run. One option would be to compile a shim of the library as a staticlib crate targeting wasm32-unknown-unknown and use wasmer or wasmtime to compile/link the function and benchmark it in the current framework.

[x] Validate inlining.

[x] Integrate with CI workflow scripts.

[ ] Run the fuzzer.

Implementation for #36
opened by almann 11
Performance on short strings
If you are only processing short byte sequences (less than 64 bytes), the excellent scalar algorithm in the standard library is likely faster. If there is no native implementation for your platform (yet), use the standard library instead.

To my knowledge, there is no hard engineering reason why you'd ever be slower irrespective of the string length. In the worst case, you can always do...

if(short string) { do that } else { do this }

This adds one predictable branch.
enhancement
opened by lemire 11
aarch64 support with nightly Rust (prototype)

So it turns out all the necessary intrinsics are there and after the big refactoring today adding aarch64 support was quite easy.

The unit tests on my Raspberry Pi 4 are the only testing it has got. As Linus famously said: "If it compiles, it is good; if it boots up, it is perfect."

opened by hkratz 8
AArch64 SIMD intrinsics are now stable

https://github.com/rust-lang/stdarch/pull/1266 has landed and shipped in v1.59

Note that documentation still displays the intrinsics as unstable due to a rustdoc bug: https://github.com/rust-lang/stdarch/issues/1268

It should be possible to remove the feature gate now and enable SIMD on ARM by default.

opened by Shnatsel 3
The functions `validate_utf8_basic` should not be labeled unsafe

We already assume that calling validate_utf8_basic is safe, since it's called within from_utf8 and from_utf8_mut which are labelled safe. Thus, it doesn't make any sense for validate_utf8_basic to be labelled an unsafe API.

opened by ArniDagur 3
Integrating into simd-json

It would make a lot of sense to use this in simd-json. It should be straightforward to do, but some of the internals might have to be made public. Let me know what you think and I can help kick this off.
enhancement

opened by CJP10 3
Deserialising unicode escape gives non-UTF8 String
For tracking, this is a forward of https://github.com/simd-lite/simd-json/issues/228 that @5225225 opened. The following test, if added to simdutf8, fails:

test_invalid_after_prefixes(br#""\uDE71""#, 0, None, Some(5));

It is an invalid UTF8 sequence but passes as valid.
opened by Licenser 2
Remove internal memcpy implementation

LLVM is unrolling it for the 128+ bytes case which never happens. The speedup with this patch is negligible but code size is reduced quite a bit.

It is a shame that all attempts to do this the Rust way (using copy_from_slice() or read_unaligned()/write_unaligned()) have resulted in non-inlined calls to memcpy with noticeable slowdown.

This implementation is optimally auto-vectorized and inlined. The compiler can even prove that the len is less than 32 at the first call site and optimizes that check away.

opened by hkratz 2
Validated ring buffer iterator

It would be nice to have a validated ring buffer iterator. Not sure if the consumer would want (pointer, byteLength) or (pointer, ignoreNPrefixBytes, byteLength) to keep them L1 cache aligned.
enhancement question

opened by chadbrewbaker 2
UTF-8 reordering and deletion detector

See https://trojansource.codes/trojan-source.pdf

A SIMD version of the UTF-8 reordering/deletion detector is now critical infrastructure. https://imperceptible.ml/detector

opened by chadbrewbaker 1
Heads-up: const_err lint is going away

This crate carries an allow(const_err). That lint is going away since it becomes a hard error, which causes a warning due to the removed lint being used, which then triggers deny(warnings).

The crate does not actually seem to trigger const_err (according to crater), so the hard error itself should not be a problem. The allow(const_err) can likely just be removed.

opened by RalfJung 3
Run Fuzzer on wasm32 Targeted Code

As part of #56, there is a remaining TODO to integrate with the fuzzer. Based on the README for rust-fuzz, x86-64 is required, so we cannot run the fuzzer natively on something like wasm32-wasi.

https://github.com/rust-fuzz/cargo-fuzz/blob/63730da7f95cfb21f6f5a9b0a74532f98d3983a4/README.md?plain=1#L13-L16

In order to integrate with the fuzzer, we may want to take an approach similar to the benchmarking (shim to the WASM and use a WASM runtime to embed the functionality).

opened by almann 3
Mislink on Windows with lld and thinlto

It appears that validate_utf8_basic and similar functions trigger a mislink on Windows with lld and thinlto. This is not a terribly uncommon combination, so it may be worth exploring alternatives that do not cause this mislink.

This is an issue @Kixiron ran into while using bytecheck, which uses simdutf8 for fast string validation. The issue was traced back to simdutf8 using a release build with debug symbols and WinDbg, then the memory backing the AtomicPtr was rewound to the beginning of the application and verified to be invalid. This indicates that the function pointer placed in it was not relocated to the correct address.

opened by djkoloski 2

Releases(v0.1.4)

v0.1.4(Apr 2, 2022)
New features

WASM (wasm32) support thanks to @almann!

Improvements

Make aarch64 SIMD implementation work on Rust 1.59/1.60 with crate feature aarch64_neon

For Rust Nightly the aarch64 SIMD implementation is enabled out of the box.

Starting with Rust 1.61 the aarch64 SIMD implementation is expected to be enabled out of the box as well.

Performance

Prefetch was disabled for aarch64 since the requisite intrinsics have not been stabilized.

v0.1.3(May 14, 2021)
New features

Low-level streaming validation API in simdutf8::basic::imp

v0.1.2(May 9, 2021)
New features

Aarch64 support (e.g. Apple Silicon, Raspberry Pi 4, ...) with nightly Rust and crate feature aarch64_neon

Performance

Another speedup on pure ASCII data

Aligned reads have been removed as the performance was worse overall.

Prefetch is used selectively on AVX 2, where it provides a slight benefit on some Intel CPUs.

Comparison vs v0.1.1 on x86-64

Other

Refactored SIMD integration to allow easy implementation for new architectures

Full test coverage

Thoroughly fuzz-tested

v0.1.1(Apr 26, 2021)
Performance

Large speedup on small inputs from delegation to std lib

Up to 50% better peak throughput on ASCII

#[inline] main entry points for a small general speedup.

Benchmark against v0.1.0

Other

Make both Utf8Error variants implement std::error::Error

Make basic::Utf8Error implement core::fmt::Display

Document Minimum Supported Rust Version (1.38.0).

Reduce package size.

Documentation updates.

v0.1.0(Apr 21, 2021)

Documentation updates only.

0.1.x releases will have API compatibility.
v0.0.3(Apr 21, 2021)

Documentation update only.
v0.0.2(Apr 20, 2021)

Documentation update only.
v0.0.1(Apr 20, 2021)

Initial release.

bottom encodes UTF-8 text into a sequence comprised of bottom emoji

bottom encodes UTF-8 text into a sequence comprised of bottom emoji (with , sprinkled in for good measure) followed by ????. It can encode any valid UTF-8 - being a bottom transcends language, after all - and decode back into UTF-8.

345 Dec 30, 2022

Sorta Text Format in UTF-8

STFU-8: Sorta Text Format in UTF-8 STFU-8 is a hacky text encoding/decoding protocol for data that might be not quite UTF-8 but is still mostly UTF-8.

18 Sep 4, 2022

Viterbi-based accelerated tokenizer (Python wrapper)

python-vibrato Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm. This is a Python wra

20 Dec 29, 2022

A lightweight platform-accelerated library for biological motif scanning using position weight matrices.

lightmotif A lightweight platform-accelerated library for biological motif scanning using position weight matrices. Overview Motif scanning

16 May 4, 2023

Rust-nlp is a library to use Natural Language Processing algorithm with RUST

nlp Rust-nlp Implemented algorithm Distance Levenshtein (Explanation) Jaro / Jaro-Winkler (Explanation) Phonetics Soundex (Explanation) Metaphone (Exp

34 Dec 20, 2022

Fast suffix arrays for Rust (with Unicode support).

suffix Fast linear time & space suffix arrays for Rust. Supports Unicode! Dual-licensed under MIT or the UNLICENSE. Documentation https://docs.rs/suff

207 Dec 26, 2022

Elastic tabstops for Rust.

tabwriter is a crate that implements elastic tabstops. It provides both a library for wrapping Rust Writers and a small program that exposes the same

212 Dec 16, 2022

An efficient and powerful Rust library for word wrapping text.

Textwrap Textwrap is a library for wrapping and indenting text. It is most often used by command-line programs to format dynamic output nicely so it l

322 Dec 26, 2022

⏮ ⏯ ⏭ A Rust library to easily read forwards, backwards or randomly through the lines of huge files.

EasyReader The main goal of this library is to allow long navigations through the lines of large files, freely moving forwards and backwards or gettin

81 Dec 6, 2022

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

regex A Rust library for parsing, compiling, and executing regular expressions. Its syntax is similar to Perl-style regular expressions, but lacks a f

2.6k Jan 8, 2023
