Count and convert between different indexing schemes on utf8 string slices

Overview

Str Indices

Count and convert between different indexing schemes on utf8 string slices.

The following schemes are currently supported:

  • Chars (or "Unicode scalar values").
  • UTF16 code units.
  • Lines, with three options for recognized line break characters:
    • Line feed only.
    • Line feed and carriage return.
    • All Unicode line break characters, as specified in Unicode Annex #14.
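To make the difference between these schemes concrete, here is a minimal std-only sketch that measures the same text in bytes, chars, and UTF-16 code units. It does not use this crate's API; `index_sizes` is a hypothetical helper written only for illustration.

```rust
/// Returns (byte length, char count, UTF-16 code unit count) for `text`.
/// Hypothetical helper built on std iterators, not part of this crate.
pub fn index_sizes(text: &str) -> (usize, usize, usize) {
    (
        text.len(),                  // UTF-8 bytes
        text.chars().count(),        // Unicode scalar values
        text.encode_utf16().count(), // UTF-16 code units
    )
}
```

For the text "aé😀" this yields (7, 3, 4): the emoji takes four bytes in UTF-8 and two code units in UTF-16, but is a single char.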

License

This project is licensed under either of

at your option.

Contributing

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in Str Indices by you will be licensed as above, without any additional terms or conditions.

This crate is no-std, doesn't allocate, and has zero dependencies, and aims to remain that way. Please adhere to this in any submitted contributions.

Comments
  • Improve performance when used in rope libraries

    Hey! I'm using the str_utils code in another project and I was reading the sse code to understand it better.

    In ByteChunk for sse2::__m128i the code currently does this:

        #[inline(always)]
        fn sum_bytes(&self) -> usize {
            const ONES: u64 = std::u64::MAX / 0xFF;
            let tmp = unsafe { std::mem::transmute::<Self, (u64, u64)>(*self) };
            let a = tmp.0.wrapping_mul(ONES) >> (7 * 8);
            let b = tmp.1.wrapping_mul(ONES) >> (7 * 8);
            (a + b) as usize
        }
    

    ...which is a neat trick, but it forces the "vertical" loop to flush the accumulator every 31 iterations so that it doesn't overflow. I'm no expert at this stuff, but some reading recommended using PSADBW(x, 0) ("compute sum of absolute differences") instead to accumulate into the array.

    So changing the code to this:

        #[inline(always)]
        fn max_acc() -> usize {
            255
        }
    
        #[inline(always)]
        fn sum_bytes(&self) -> usize {
            unsafe {
                let zero = sse2::_mm_setzero_si128();
                let diff = sse2::_mm_sad_epu8(*self, zero);
                let (low, high) = std::mem::transmute::<Self, (u64, u64)>(diff);
                (low + high) as usize
            }
        }
    

    This yields a (modest) performance improvement on my Ryzen 5800:

    ropey:master $ taskset 0x1 nice -10 RUSTFLAGS=-C target-cpu=native cargo criterion -- --measurement-time=10 index_convert
       Compiling ropey v1.3.2 (/home/seph/3rdparty/ropey)
        Finished bench [optimized] target(s) in 37.01s
    index_convert/byte_to_char                                                                             
                            time:   [41.762 ns 41.799 ns 41.837 ns]
                            change: [-1.0722% -0.8577% -0.6697%] (p = 0.00 < 0.05)
                            Change within noise threshold.
    index_convert/byte_to_line                                                                            
                            time:   [103.24 ns 103.25 ns 103.27 ns]
                            change: [+1.1863% +1.2842% +1.3631%] (p = 0.00 < 0.05)
                            Performance has regressed.
    index_convert/char_to_byte                                                                             
                            time:   [87.674 ns 87.701 ns 87.730 ns]
                            change: [-1.6249% -1.5190% -1.4211%] (p = 0.00 < 0.05)
                            Performance has improved.
    index_convert/char_to_line                                                                            
                            time:   [153.53 ns 153.55 ns 153.57 ns]
                            change: [-1.4996% -1.3924% -1.2970%] (p = 0.00 < 0.05)
                            Performance has improved.
    index_convert/line_to_byte                                                                            
                            time:   [143.57 ns 143.65 ns 143.77 ns]
                            change: [-7.6773% -7.5422% -7.3956%] (p = 0.00 < 0.05)
                            Performance has improved.
    index_convert/line_to_char                                                                            
                            time:   [143.31 ns 143.34 ns 143.39 ns]
                            change: [-7.9232% -7.8228% -7.7185%] (p = 0.00 < 0.05)
                            Performance has improved.
    

    Is this code change correct?
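One way to sanity-check the question above is to compare the two approaches on a single 16-byte chunk. The sketch below is not the crate's code: `sum_bytes_mul` and `sum_bytes_sad` are hypothetical stand-alone helpers, and the scalar fallback for non-x86_64 targets is added only so the example builds everywhere. It illustrates where the limit of 31 comes from (8 × 31 = 248 still fits in the top byte, 8 × 32 = 256 does not), while PSADBW produces a 16-bit sum per 64-bit half and therefore tolerates byte values up to 255.

```rust
/// Multiply trick: sums the 8 bytes of each half into the top byte of a u64.
/// Only exact while each half sums to at most 255.
pub fn sum_bytes_mul(chunk: &[u8; 16]) -> usize {
    const ONES: u64 = u64::MAX / 0xFF; // 0x0101_0101_0101_0101
    let mut lo = [0u8; 8];
    let mut hi = [0u8; 8];
    lo.copy_from_slice(&chunk[..8]);
    hi.copy_from_slice(&chunk[8..]);
    let a = u64::from_ne_bytes(lo).wrapping_mul(ONES) >> (7 * 8);
    let b = u64::from_ne_bytes(hi).wrapping_mul(ONES) >> (7 * 8);
    (a + b) as usize
}

/// PSADBW against zero: each 64-bit half receives the sum of its 8 bytes.
#[cfg(target_arch = "x86_64")]
pub fn sum_bytes_sad(chunk: &[u8; 16]) -> usize {
    use std::arch::x86_64::{
        __m128i, _mm_loadu_si128, _mm_sad_epu8, _mm_setzero_si128, _mm_storeu_si128,
    };
    let mut halves = [0u64; 2];
    // SSE2 is part of the x86_64 baseline, so no runtime detection is done here.
    unsafe {
        let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let sad = _mm_sad_epu8(v, _mm_setzero_si128());
        _mm_storeu_si128(halves.as_mut_ptr() as *mut __m128i, sad);
    }
    (halves[0] + halves[1]) as usize
}

/// Scalar stand-in so the example also compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn sum_bytes_sad(chunk: &[u8; 16]) -> usize {
    chunk.iter().map(|&b| b as usize).sum()
}
```

With every byte set to 31 both functions agree; with every byte set to 32 the multiply trick wraps around while the PSADBW version still returns the true sum.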

    opened by josephg 35
  • chars::count() seems to be slower than std (Apple M1 Pro)

    I did some basic benchmarks comparing str_indices to the iterator-based equivalents in std (see #7), and chars::count() seems to take about twice as long on average.

    This is on an Apple M1 Pro CPU, for which there is no SIMD optimization implemented. Unfortunately, the ARM64 equivalents of the SSE2 integer vector types are not available in stable Rust yet, but it's possible that using [u64; 2] as the chunk type instead of usize could hint the optimizer to use vector instructions?

    Anyway - thanks for the work on this library, I've found it very useful!
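For readers unfamiliar with what a usize-chunk fallback might look like, here is a minimal sketch of word-at-a-time char counting. It is an assumption about the general technique (counting UTF-8 continuation bytes and subtracting them from the byte length), not a copy of this crate's implementation, and `count_chars_swar` is a hypothetical name.

```rust
/// Counts chars by subtracting UTF-8 continuation bytes (0b10xxxxxx)
/// from the byte length, examining one machine word at a time.
pub fn count_chars_swar(text: &str) -> usize {
    const WORD: usize = std::mem::size_of::<usize>();
    const ONES: usize = usize::MAX / 0xFF; // 0x01 in every byte lane

    let bytes = text.as_bytes();
    let mut continuation = 0usize;

    let mut chunks = bytes.chunks_exact(WORD);
    for chunk in &mut chunks {
        let mut buf = [0u8; WORD];
        buf.copy_from_slice(chunk);
        let x = usize::from_ne_bytes(buf);
        // The low bit of each lane is set iff that byte has bit 7 set and bit 6 clear.
        let flags = (x >> 7) & (!x >> 6) & ONES;
        continuation += flags.count_ones() as usize;
    }
    // Handle the trailing bytes that do not fill a whole word.
    for &b in chunks.remainder() {
        continuation += ((b & 0xC0) == 0x80) as usize;
    }
    bytes.len() - continuation
}
```

Whether the optimizer turns a loop like this into vector instructions on a given CPU is exactly the open question raised above; the sketch only shows the scalar logic.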

    opened by simonask 7
  • Add std benchmarks for reference

    This PR adds benchmarks of std functionality equivalent to chars::count(), chars::to_byte_idx(), and chars::from_byte_idx(). This can be used to measure whether this library actually does better than the std approach using iterators.
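As a rough idea of what such iterator-based baselines could look like, here is a std-only sketch of the two index conversions (the count baseline is simply `text.chars().count()`). The function names are hypothetical, and the edge-case behaviour (clamping to the end of the text, and treating a byte index as "number of chars starting before it") is an assumption of this sketch that may not match the crate or the PR exactly.

```rust
/// Byte index at which the `char_idx`-th char starts,
/// or `text.len()` if `char_idx` is past the end.
pub fn std_char_to_byte_idx(text: &str, char_idx: usize) -> usize {
    text.char_indices()
        .nth(char_idx)
        .map_or(text.len(), |(byte_idx, _)| byte_idx)
}

/// Number of chars that start strictly before `byte_idx`.
/// For a byte index on a char boundary this is the char index.
pub fn std_byte_to_char_idx(text: &str, byte_idx: usize) -> usize {
    text.char_indices()
        .take_while(|&(start, _)| start < byte_idx)
        .count()
}
```

Both helpers walk the string from the start, so they are linear in the length of the text, which is what makes them a useful reference point for benchmarking.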

    opened by simonask 2
  • Add benchmarks.

    To help measure and improve performance, we should add benchmarks.

    Note: since this library is mainly targeting shorter strings (around 100-1000 bytes) rather than full documents, the benchmarks should reflect that so we can optimize for those use-cases.

    opened by cessen 1
  • Add property tests.

    There are basic unit tests included already, but given some of the corner cases that can potentially come up, it would be a good idea to add a property testing suite as well.

    opened by cessen 1
  • Add additional line modules.

    Currently only the full Unicode definition of line breaks is supported. But it is often useful to have a more limited definition, for both compatibility and performance reasons.

    Add two new modules:

    • [x] lines_lf: only recognizes line feed characters. This will also coincidentally support CRLF graphemes, by virtue of simply ignoring the carriage return. (Added in 4ec6e60ff674792859c0639a7831fe92d2b79e7c.)
    • [x] lines_crlf: recognizes both carriage return and line feed characters individually, as well as properly handling CRLF graphemes. (Added in d78afc3701c5693678d0c0a0aa5bfade68117a38.)

    This should(?) cover all the common use cases.
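The behavioural difference between the two definitions can be sketched with plain byte scanning. The helpers below are hypothetical illustrations written against std only; they are not the code of the lines_lf or lines_crlf modules.

```rust
/// Line feed only: every '\n' is a break and '\r' is ignored,
/// so "\r\n" naturally counts as one break.
pub fn count_breaks_lf(text: &str) -> usize {
    text.bytes().filter(|&b| b == b'\n').count()
}

/// CR and LF are each a break on their own, but a "\r\n" pair counts once.
pub fn count_breaks_crlf(text: &str) -> usize {
    let bytes = text.as_bytes();
    let mut count = 0;
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'\r' => {
                count += 1;
                // Skip the '\n' of a "\r\n" pair so it is not counted twice.
                if bytes.get(i + 1) == Some(&b'\n') {
                    i += 1;
                }
            }
            b'\n' => count += 1,
            _ => {}
        }
        i += 1;
    }
    count
}
```

The two only disagree on text that contains a lone carriage return: "a\r\nb\nc\rd" has two breaks under the LF-only rule and three under the CRLF rule.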

    opened by cessen 1
  • Add explicit SIMD for aarch64 platforms.

    AArch64 SIMD intrinsics are stable as of Rust 1.59 (although, due to a documentation bug, they are still falsely marked as unstable: https://github.com/rust-lang/stdarch/issues/1268). This means we can now add proper SIMD acceleration for those platforms.

    I don't currently have an aarch64 system to test on, so I won't attempt to implement it myself. But if anyone would like to take a crack at it, it mostly just consists of writing a new impl of the ByteChunk trait in src/byte_chunk.rs for the appropriate core::arch::aarch64 type, although there may be some complexities with CPU features, etc.
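As a starting point for anyone picking this up, here is a minimal sketch of how the byte-summing part of such an impl might look with NEON. It is not an implementation of the ByteChunk trait: `sum_bytes_16` is a hypothetical free function, the sketch assumes a target where NEON is enabled by default (as on the common aarch64 targets), and the scalar fallback exists only so the example builds on other architectures.

```rust
/// Sums the 16 bytes of `chunk` with a single widening horizontal add.
#[cfg(target_arch = "aarch64")]
pub fn sum_bytes_16(chunk: &[u8; 16]) -> usize {
    use std::arch::aarch64::{vaddlvq_u8, vld1q_u8};
    // Assumption: NEON is available on this target, so no runtime detection is done.
    unsafe {
        let v = vld1q_u8(chunk.as_ptr());
        // vaddlvq_u8 adds all sixteen u8 lanes into one u16, so it cannot overflow.
        vaddlvq_u8(v) as usize
    }
}

/// Scalar stand-in for non-aarch64 targets.
#[cfg(not(target_arch = "aarch64"))]
pub fn sum_bytes_16(chunk: &[u8; 16]) -> usize {
    chunk.iter().map(|&b| b as usize).sum()
}
```

The CPU-feature complexities mentioned above would show up for targets where NEON is not part of the baseline; this sketch sidesteps them by assumption.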

    opened by cessen 0
Owner
Nathan Vegdahl