Count and convert between different indexing schemes on utf8 string slices

Nathan Vegdahl

Last update: Dec 25, 2022


Overview

Str Indices

Count and convert between different indexing schemes on utf8 string slices.

The following schemes are currently supported:

Chars (or "Unicode scalar values").
UTF16 code units.
Lines, with three options for recognized line break characters:
- Line feed only.
- Line feed and carriage return.
- All Unicode line break characters, as specified in Unicode Annex #14.
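As a rough illustration of how these schemes differ, the sketch below counts one string under several of them using only the Rust standard library. It does not use this crate's API; the helper names and the sample string are made up for the example, and only the line-feed-only line scheme is shown.

```rust
/// Number of Unicode scalar values ("chars") in `text`.
fn count_chars(text: &str) -> usize {
    text.chars().count()
}

/// Number of UTF-16 code units needed to encode `text`.
fn count_utf16_units(text: &str) -> usize {
    text.encode_utf16().count()
}

/// Number of line breaks in `text` when only line feed is recognized.
fn count_lf_breaks(text: &str) -> usize {
    text.bytes().filter(|&b| b == b'\n').count()
}

fn main() {
    // 'é' takes 2 bytes in UTF-8; '🦀' takes 4 bytes in UTF-8 and
    // 2 code units in UTF-16, so the three counts all differ.
    let text = "héllo\n🦀\n";
    println!("bytes: {}", text.len());
    println!("chars: {}", count_chars(text));
    println!("utf16: {}", count_utf16_units(text));
    println!("line breaks: {}", count_lf_breaks(text));
}
```

The same position in a string therefore has a different index in each scheme, which is why conversion functions between them are useful.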

License

This project is licensed under either of

Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contributing

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in Str Indices by you will be licensed as above, without any additional terms or conditions.

This crate is no-std, doesn't allocate, and has zero dependencies, and aims to remain that way. Please adhere to this in any submitted contributions.

Comments

Improve performance when used in rope libraries

Hey! I'm using the str_utils code in another project, and I was reading the SSE code to understand it better.

In ByteChunk for sse2::__m128i the code currently does this:

    #[inline(always)]
    fn sum_bytes(&self) -> usize {
        const ONES: u64 = std::u64::MAX / 0xFF;
        let tmp = unsafe { std::mem::transmute::<Self, (u64, u64)>(*self) };
        let a = tmp.0.wrapping_mul(ONES) >> (7 * 8);
        let b = tmp.1.wrapping_mul(ONES) >> (7 * 8);
        (a + b) as usize
    }

...Which is a neat trick, but it means the "vertical" loop has to sum up the accumulator every 31 iterations so that it doesn't overflow. I'm no expert at this stuff, but some reading recommended using PSADBW(x, 0) ("Compute sum of absolute differences") instead to accumulate into the array.
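To see where a limit of 31 would come from, here is a small scalar sketch of the multiply-and-shift reduction on a single u64. It is an illustration only, not the crate's code, and the function name is hypothetical. The multiplication leaves the sum of all eight bytes in the top byte, so the result is only right while that sum fits in 8 bits: 8 × 31 = 248 fits, 8 × 32 = 256 does not.

```rust
/// Sums the eight bytes of `x` with one multiply and one shift.
/// Only correct while the true sum is at most 255.
fn sum_bytes_swar(x: u64) -> u64 {
    const ONES: u64 = u64::MAX / 0xFF; // 0x0101_0101_0101_0101
    x.wrapping_mul(ONES) >> (7 * 8)
}

fn main() {
    // Eight per-byte counters holding 31 each: the sum (248) fits in a byte.
    let ok = u64::from_le_bytes([31; 8]);
    assert_eq!(sum_bytes_swar(ok), 248);

    // Eight counters holding 32 each: the true sum is 256, which wraps to 0.
    let overflow = u64::from_le_bytes([32; 8]);
    assert_eq!(sum_bytes_swar(overflow), 0);

    println!("{} {}", sum_bytes_swar(ok), sum_bytes_swar(overflow));
}
```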

So changing the code to this:

    #[inline(always)]
    fn max_acc() -> usize {
        255
    }

    #[inline(always)]
    fn sum_bytes(&self) -> usize {
        unsafe {
            let zero = sse2::_mm_setzero_si128();
            let diff = sse2::_mm_sad_epu8(*self, zero);
            let (low, high) = std::mem::transmute::<Self, (u64, u64)>(diff);
            (low + high) as usize
        }
    }

This yields a (modest) performance improvement on my Ryzen 5800:

ropey:master $ taskset 0x1 nice -10 RUSTFLAGS=-C target-cpu=native cargo criterion -- --measurement-time=10 index_convert
   Compiling ropey v1.3.2 (/home/seph/3rdparty/ropey)
    Finished bench [optimized] target(s) in 37.01s
index_convert/byte_to_char                                                                             
                        time:   [41.762 ns 41.799 ns 41.837 ns]
                        change: [-1.0722% -0.8577% -0.6697%] (p = 0.00 < 0.05)
                        Change within noise threshold.
index_convert/byte_to_line                                                                            
                        time:   [103.24 ns 103.25 ns 103.27 ns]
                        change: [+1.1863% +1.2842% +1.3631%] (p = 0.00 < 0.05)
                        Performance has regressed.
index_convert/char_to_byte                                                                             
                        time:   [87.674 ns 87.701 ns 87.730 ns]
                        change: [-1.6249% -1.5190% -1.4211%] (p = 0.00 < 0.05)
                        Performance has improved.
index_convert/char_to_line                                                                            
                        time:   [153.53 ns 153.55 ns 153.57 ns]
                        change: [-1.4996% -1.3924% -1.2970%] (p = 0.00 < 0.05)
                        Performance has improved.
index_convert/line_to_byte                                                                            
                        time:   [143.57 ns 143.65 ns 143.77 ns]
                        change: [-7.6773% -7.5422% -7.3956%] (p = 0.00 < 0.05)
                        Performance has improved.
index_convert/line_to_char                                                                            
                        time:   [143.31 ns 143.34 ns 143.39 ns]
                        change: [-7.9232% -7.8228% -7.7185%] (p = 0.00 < 0.05)
                        Performance has improved.

Is this code change correct?
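One way to build confidence in a change like this is to model both reductions in scalar code and compare them against a plain byte sum. The sketch below is such a model. It is not the crate's code, it is no substitute for testing the real intrinsic, and all function names are hypothetical. It mimics what `_mm_sad_epu8(x, zero)` computes: for each 8-byte half, the sum of the bytes, stored in the low bits of that 64-bit lane.

```rust
use std::convert::TryInto;

/// Scalar model of `_mm_sad_epu8(x, zero)`: each 64-bit lane receives the
/// sum of its eight bytes (at most 8 * 255 = 2040, so it cannot overflow).
fn sad_against_zero(bytes: [u8; 16]) -> (u64, u64) {
    let low: u64 = bytes[..8].iter().map(|&b| u64::from(b)).sum();
    let high: u64 = bytes[8..].iter().map(|&b| u64::from(b)).sum();
    (low, high)
}

/// Model of the proposed `sum_bytes`: add the two lane sums.
fn sum_bytes_sad(bytes: [u8; 16]) -> usize {
    let (low, high) = sad_against_zero(bytes);
    (low + high) as usize
}

/// Model of the original multiply-and-shift `sum_bytes`.
fn sum_bytes_mul(bytes: [u8; 16]) -> usize {
    const ONES: u64 = u64::MAX / 0xFF;
    let lo = u64::from_le_bytes(bytes[..8].try_into().unwrap());
    let hi = u64::from_le_bytes(bytes[8..].try_into().unwrap());
    let a = lo.wrapping_mul(ONES) >> (7 * 8);
    let b = hi.wrapping_mul(ONES) >> (7 * 8);
    (a + b) as usize
}

fn main() {
    // With every counter at 31, both reductions match the true sum.
    assert_eq!(sum_bytes_sad([31; 16]), 16 * 31);
    assert_eq!(sum_bytes_mul([31; 16]), 16 * 31);

    // With every counter at 255, only the SAD model is still correct.
    assert_eq!(sum_bytes_sad([255; 16]), 16 * 255);
    assert_ne!(sum_bytes_mul([255; 16]), 16 * 255);
}
```

If the real intrinsic behaves as modelled here, letting each byte counter run up to 255 is safe for the SAD version, because each lane sum stays far below the 16-bit limit.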

opened by josephg 35

chars::count() seems to be slower than std (Apple M1 Pro)

I did some basic benchmarks comparing str_indices to the iterator-based equivalents in std (see #7), and chars::count() seems to take about twice as long on average.

This is on an Apple M1 Pro CPU, for which there is no SIMD optimization implemented. Unfortunately, the ARM64 equivalents of the SSE2 integer vector types are not available in stable Rust yet, but it's possible that using [u64; 2] as the chunk type instead of usize could hint the optimizer to use vector instructions?
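For context, here is a minimal sketch of the kind of chunk-at-a-time counting this question is about, using plain u64 chunks rather than SIMD types. It is written for illustration and is not the crate's implementation; the function name is hypothetical. It relies on the fact that the number of chars in valid UTF-8 equals the number of bytes that are not continuation bytes (`0b10xxxxxx`).

```rust
use std::convert::TryInto;

/// Counts chars by counting UTF-8 continuation bytes eight at a time.
fn count_chars_chunked(text: &str) -> usize {
    const ONES: u64 = u64::MAX / 0xFF; // lowest bit of every byte set
    let bytes = text.as_bytes();
    let mut chunks = bytes.chunks_exact(8);
    let mut continuation = 0usize;

    for chunk in &mut chunks {
        let x = u64::from_le_bytes(chunk.try_into().unwrap());
        // A continuation byte has bit 7 set and bit 6 clear. Move both bits
        // to the lowest bit of their byte and combine them.
        let flags = (x >> 7) & !(x >> 6) & ONES;
        continuation += flags.count_ones() as usize;
    }

    // Handle the remaining 0..=7 bytes one at a time.
    for &b in chunks.remainder() {
        if b & 0xC0 == 0x80 {
            continuation += 1;
        }
    }

    bytes.len() - continuation
}

fn main() {
    let samples = ["", "hello", "héllo wörld, 日本語 🦀🦀🦀", "0123456789abcdef"];
    for text in samples.iter() {
        assert_eq!(count_chars_chunked(text), text.chars().count());
    }
    println!("all samples agree with chars().count()");
}
```

Whether a wider chunk such as [u64; 2] would actually be compiled to vector instructions depends on the optimizer and target, so it would need to be measured.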

Anyway - thanks for the work on this library, I've found it very useful!

opened by simonask 7

Add std benchmarks for reference

This PR adds benchmarks of std functionality equivalent to chars::count(), chars::to_byte_idx(), and chars::from_byte_idx(). This can be used to measure if this library actually does better than the std approach using iterators.
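The PR's benchmark code is not reproduced here. As a rough idea of what iterator-based counterparts in std can look like, consider the sketch below. The function names are hypothetical, and the behaviour for byte indices that fall in the middle of a char is an assumption of this sketch that may differ from the crate's.

```rust
/// Iterator-based char count.
fn std_count(text: &str) -> usize {
    text.chars().count()
}

/// Byte index at which the char with index `char_idx` starts.
/// Char indices past the end map to `text.len()`.
fn std_char_to_byte(text: &str, char_idx: usize) -> usize {
    text.char_indices()
        .nth(char_idx)
        .map(|(byte_idx, _)| byte_idx)
        .unwrap_or(text.len())
}

/// Number of chars that start before `byte_idx`. When `byte_idx` lies on a
/// char boundary, this is the index of the char starting there.
fn std_byte_to_char(text: &str, byte_idx: usize) -> usize {
    text.char_indices()
        .take_while(|&(start, _)| start < byte_idx)
        .count()
}

fn main() {
    // Byte layout: 'a' at 0, 'é' at 1..3, '🦀' at 3..7, 'b' at 7.
    let text = "aé🦀b";
    println!("count: {}", std_count(text));
    println!("char 2 starts at byte {}", std_char_to_byte(text, 2));
    println!("byte 7 is char {}", std_byte_to_char(text, 7));
}
```

All three walk the string one char at a time, which is the baseline that a chunked or SIMD implementation is trying to beat.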

opened by simonask 2

Add benchmarks.

To help measure and improve performance, we should add benchmarks.

Note: since this library is mainly targeting shorter strings (around 100-1000 bytes) rather than full documents, the benchmarks should reflect that so we can optimize for those use-cases.

opened by cessen 1

Add property tests.

There are basic unit tests included already, but given some of the corner-cases that can potentially come up it would be a good idea to add a property testing suite as well.
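As a sketch of what one such property could look like, the example below generates pseudo-random strings and checks that converting a char index to a byte index and back is the identity. Everything in it is an assumption made for illustration: the conversions are std-based stand-ins rather than the crate's functions, the generator is a hand-rolled xorshift so the example needs no dependencies, and a real suite would more likely use a dedicated property-testing crate.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Builds a pseudo-random string of `len` chars, mixing 1- to 4-byte chars
/// with line break characters.
fn random_text(rng: &mut XorShift, len: usize) -> String {
    const ALPHABET: [char; 8] = ['a', 'Z', '\n', '\r', 'é', 'ß', '語', '🦀'];
    (0..len)
        .map(|_| ALPHABET[(rng.next_u64() % ALPHABET.len() as u64) as usize])
        .collect()
}

/// Stand-in conversion: byte index where char `char_idx` starts.
fn char_to_byte(text: &str, char_idx: usize) -> usize {
    text.char_indices()
        .nth(char_idx)
        .map(|(byte_idx, _)| byte_idx)
        .unwrap_or(text.len())
}

/// Stand-in conversion: number of chars starting before `byte_idx`.
fn byte_to_char(text: &str, byte_idx: usize) -> usize {
    text.char_indices()
        .take_while(|&(start, _)| start < byte_idx)
        .count()
}

/// Property: char index -> byte index -> char index is the identity.
fn roundtrip_holds(text: &str) -> bool {
    let count = text.chars().count();
    (0..=count).all(|i| byte_to_char(text, char_to_byte(text, i)) == i)
}

fn main() {
    let mut rng = XorShift(0x9E37_79B9_7F4A_7C15);
    for _ in 0..1_000 {
        let len = (rng.next_u64() % 64) as usize;
        let text = random_text(&mut rng, len);
        assert!(roundtrip_holds(&text), "round trip failed for {:?}", text);
    }
    println!("round-trip property held for 1000 random strings");
}
```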

opened by cessen 1

Add additional line modules.

Currently only the full Unicode definition of line breaks is supported. But it is often useful to have a more limited definition, for both compatibility and performance reasons.

Add two new modules:

[x] lines_lf: only recognizes line feed characters. This will also coincidentally support CRLF graphemes, by virtue of simply ignoring the carriage return. (Added in 4ec6e60ff674792859c0639a7831fe92d2b79e7c.)

[x] lines_crlf: recognizes both carriage return and line feed characters individually, as well as properly handling CRLF graphemes. (Added in d78afc3701c5693678d0c0a0aa5bfade68117a38.)

This should(?) cover all the common use cases.
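The difference between the two definitions can be sketched with two std-only counting functions. These are illustrations of the behaviour described above, not the modules' actual code, and the function names are hypothetical.

```rust
/// Counts breaks when only line feed is recognized. A CRLF pair counts once
/// because the carriage return is simply ignored.
fn count_breaks_lf(text: &str) -> usize {
    text.bytes().filter(|&b| b == b'\n').count()
}

/// Counts breaks when carriage return and line feed are both recognized,
/// treating a CRLF pair as a single break.
fn count_breaks_crlf(text: &str) -> usize {
    let bytes = text.as_bytes();
    let mut count = 0;
    for (i, &b) in bytes.iter().enumerate() {
        match b {
            b'\n' => count += 1,
            // A CR directly followed by LF is counted when the LF is seen.
            b'\r' if bytes.get(i + 1) != Some(&b'\n') => count += 1,
            _ => {}
        }
    }
    count
}

fn main() {
    // Contains one CRLF pair, one lone CR, and one lone LF.
    let text = "a\r\nb\rc\nd";
    println!("lf only: {}", count_breaks_lf(text));
    println!("cr and lf: {}", count_breaks_crlf(text));
}
```

The two functions only disagree on text that contains a carriage return without a following line feed.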

opened by cessen 1

Add explicit SIMD for aarch64 platforms.

AArch64 SIMD intrinsics are stable as of Rust 1.59 (although, due to a documentation bug, they are still falsely marked as unstable: https://github.com/rust-lang/stdarch/issues/1268). This means we can now add proper SIMD acceleration for those platforms.

I don't currently have an AArch64 system to test on, so I won't attempt to implement it myself. But if anyone would like to take a crack at it, it mostly just consists of making a new impl of the ByteChunk trait in src/byte_chunk.rs for the appropriate core::arch::aarch64 type, although there may be some complexities with CPU features, etc.

opened by cessen 0
