Code for blog post "{n} times faster than C, where n = 128"

Overview

Code for {n} times faster than C, where n = 128

Actually, n = 290 🤯

Benchmark Setup

Rust version: rustc 1.70.0 (90c541806 2023-05-31)
Run test: cargo test
Run benchmark: cargo bench

Machine: Apple MacBook Pro 14-inch (2021)
Processor: Apple M1 Pro
Memory: 16 GB

Input size: 1,000,000 characters
Input generation: s and p are chosen with randomly 50% probability.

Benchmark Result

Function Time Throughput Relative speed
baseline_unicode 3.7511 ms 254.24 MiB/s 0.88
baseline 3.3316 ms 286.25 MiB/s 1
opt1_idiomatic 227.33 µs 4.0968 GiB/s 14.7
opt2_count_s 152.44 µs 6.1096 GiB/s 21.9
opt3_count_s_branchless 72.902 µs 12.775 GiB/s 45.7
opt4_simd 43.131 µs 21.593 GiB/s 77.2
opt5_simd_unrolled_2x 32.810 µs 28.385 GiB/s 101.5
opt5_simd_unrolled_4x 28.524 µs 32.650 GiB/s 116.8
opt5_simd_unrolled_8x 26.518 µs 35.120 GiB/s 125.64
opt5_simd_unrolled_10x 26.070 µs 35.724 GiB/s 127.8 🎉
opt5_simd_unrolled_12x 27.833 µs 33.461 GiB/s 119.7
opt5_simd_unrolled_16x 27.157 µs 34.293 GiB/s 122.7
opt6_chunk_count1 14.597 µs 63.802 GiB/s 228.2
opt6_chunk_exact_count 2 11.489 µs 81.060 GiB/s 290.0 🚀

Credit

Thanks u/DavidM603 and u/Sharlinator for contributing even faster and cleaner solutions.

Thanks @PeterFaiman for catching and fixing an overflow problem in multiple optimizations (#1).

Footnotes

  1. Credit to Reddit user u/DavidM603.

  2. Credit to Reddit user u/Sharlinator.

You might also like...
On-the-fly machine code executing from Deno

jit from js Execute raw machine code from JavaScript. I hope you know what you're doing :) const inst = new Uint8Array([0xC3]); // ret const noop = ji

a cheat-sheet for mathematical notation in Rust 🦀 code form

math-as-rust 🦀 Based on math-as-code This is a reference to ease developers into mathematical notation by showing comparisons with Rust code.

Some UwU and OwO for your Rust code

UwU Types Some UwU and OwO for your Rust code This is a Rust crate inspired by this tweet from @thingskatedid / @katef. Credits Some extra functionali

The source code that accompanies Hands-on Rust: Effective Learning through 2D Game Development and Play by Herbert Wolverson
The source code that accompanies Hands-on Rust: Effective Learning through 2D Game Development and Play by Herbert Wolverson

Hands-on Rust Source Code This repository contains the source code for the examples found in Hands-on Rust. These are also available from my publisher

Source code from Atlas, our 64k demo presented at Revision 2019 with Macau Exports

Atlas source code dump This is a dump of the source code for the engine, graphics tool and player for Atlas, our 64k demo released with Macau Exports

Complete code for the larger example programs from the book.

Code Examples for Programming Rust This repository contains complete code for the larger example programs from the book “Programming Rust”, by Jim Bla

This repository contains the Rust source code for the algorithms in the textbook Algorithms, 4th Edition

Overview This repository contains the Rust source code for the algorithms in the textbook Algorithms, 4th Edition by Robert Sedgewick and Kevin Wayne.

Source code for the book Rust in Action

Welcome to Rust in Action source code This source code repository is a companion to the Rust in Action book available from Manning Publications. Suppo

Code to follow along the "Zero To Production" book on API development in Rust.

Zero To Production / Code (Chapter 10 - Part 1) Zero To Production In Rust is an opinionated introduction to backend development using Rust. This repo

Comments
  • fix simd accumulator overflow

    fix simd accumulator overflow

    The intrinsics version (opt4_simd) has a bug:

    // Flush accumulator every 256 iterations to avoid overflow
    if block_i % (u8::MAX as usize + 1) == u8::MAX as usize {
    

    The function is logically the same as our previous algorithm, with the only change being that of flushing our vector accumulator every 256 iterations to prevent overflowing each u8 lane.

    Flushing every 256 iterations isn't enough, because you can potentially encounter 256 's' in 256 iterations, thus overflowing the max of 255. Using the function definitions from the article:

    fn main() {
        let input = "s".repeat(1024*1024);
        println!("{}", baseline(&input));
        println!("{}", opt4_simd(&input));
    }
    

    Produces:

    1048576
    -1040384
    

    The fix, performance untested:

    // Flush accumulator every 255 iterations to avoid overflow
    if block_i % u8::MAX as usize == (u8::MAX - 1) as usize {
    

    Correctly produces:

    1048576
    1048576
    

    https://lobste.rs/s/sqn7m0/n_times_faster_than_c_where_n_128#c_iooycv

    opened by PeterFaiman 6
  • Try target-cpu=native

    Try target-cpu=native

    Setting target-cpu to native might enable the compiler to apply more optimization.

    I had some significant improvements, but also some regressions on my machine.

    My local bench results from adding this Created using the compare target in the justfile.

    compare.log

    running 3 tests
    test tests::test_all_s ... ignored
    test tests::test_large ... ignored
    test tests::test_simple ... ignored
    
    test result: ok. 0 passed; 0 failed; 3 ignored; 0 measured; 0 filtered out; finished in 0.00s
    
    run_switches/baseline_unicode
                            time:   [3.5822 ms 3.5846 ms 3.5870 ms]
                            thrpt:  [265.87 MiB/s 266.05 MiB/s 266.23 MiB/s]
                     change:
                            time:   [-0.8116% -0.7452% -0.6735%] (p = 0.00 < 0.05)
                            thrpt:  [+0.6781% +0.7508% +0.8182%]
                            Change within noise threshold.
    run_switches/baseline   time:   [3.1508 ms 3.1551 ms 3.1597 ms]
                            thrpt:  [301.82 MiB/s 302.26 MiB/s 302.68 MiB/s]
                     change:
                            time:   [-0.3682% -0.1833% +0.0162%] (p = 0.06 > 0.05)
                            thrpt:  [-0.0162% +0.1837% +0.3695%]
                            No change in performance detected.
    Found 3 outliers among 100 measurements (3.00%)
      2 (2.00%) high mild
      1 (1.00%) high severe
    run_switches/opt1_idiomatic
                            time:   [152.82 µs 152.86 µs 152.91 µs]
                            thrpt:  [6.0908 GiB/s 6.0927 GiB/s 6.0944 GiB/s]
                     change:
                            time:   [-76.164% -76.123% -76.061%] (p = 0.00 < 0.05)
                            thrpt:  [+317.72% +318.81% +319.54%]
                            Performance has improved.
    Found 11 outliers among 100 measurements (11.00%)
      5 (5.00%) low severe
      3 (3.00%) low mild
      1 (1.00%) high mild
      2 (2.00%) high severe
    run_switches/opt2_count_s
                            time:   [93.749 µs 93.887 µs 94.042 µs]
                            thrpt:  [9.9032 GiB/s 9.9196 GiB/s 9.9342 GiB/s]
                     change:
                            time:   [-66.435% -66.356% -66.271%] (p = 0.00 < 0.05)
                            thrpt:  [+196.48% +197.23% +197.93%]
                            Performance has improved.
    Found 24 outliers among 100 measurements (24.00%)
      2 (2.00%) low severe
      13 (13.00%) low mild
      1 (1.00%) high mild
      8 (8.00%) high severe
    run_switches/opt3_count_s_branchless
                            time:   [74.144 µs 74.154 µs 74.165 µs]
                            thrpt:  [12.557 GiB/s 12.559 GiB/s 12.561 GiB/s]
                     change:
                            time:   [-69.834% -69.725% -69.658%] (p = 0.00 < 0.05)
                            thrpt:  [+229.58% +230.30% +231.49%]
                            Performance has improved.
    Found 16 outliers among 100 measurements (16.00%)
      6 (6.00%) high mild
      10 (10.00%) high severe
    run_switches/opt6_chunk_count
                            time:   [19.981 µs 19.993 µs 20.005 µs]
                            thrpt:  [46.554 GiB/s 46.583 GiB/s 46.610 GiB/s]
                     change:
                            time:   [+6.1832% +6.6117% +7.2029%] (p = 0.00 < 0.05)
                            thrpt:  [-6.7189% -6.2017% -5.8231%]
                            Performance has regressed.
    Found 6 outliers among 100 measurements (6.00%)
      2 (2.00%) low mild
      2 (2.00%) high mild
      2 (2.00%) high severe
    run_switches/opt6_chunk_exact_count
                            time:   [14.590 µs 14.596 µs 14.603 µs]
                            thrpt:  [63.777 GiB/s 63.807 GiB/s 63.832 GiB/s]
                     change:
                            time:   [+1.1436% +1.6182% +1.9169%] (p = 0.00 < 0.05)
                            thrpt:  [-1.8808% -1.5924% -1.1307%]
                            Performance has regressed.
    Found 16 outliers among 100 measurements (16.00%)
      1 (1.00%) low severe
      2 (2.00%) low mild
      6 (6.00%) high mild
      7 (7.00%) high severe
    
    opened by Skgland 0
A bunch of links to blog posts, articles, videos, etc for learning Rust

rust-learning A bunch of links to blog posts, articles, videos, etc for learning Rust. Feel free to submit a pull request if you have some links/resou

Camille TJHOA 9k Jan 4, 2023
Blog posts, mostly about Rust.

Sean Chen's Blog ?? Blog posts, mostly about Rust. Posts Date Title 2021-04-06 A Beginner's Guide to Handling Errors in Rust 2021-01-23 Implementing a

Sean Chen 13 Sep 4, 2022
A Matrix bot which can generate "This Week in X" like blog posts

hebbot A Matrix bot which can help to generate periodic / recurrent summary blog posts (also known as "This Week in X"). The bot was inspired by twim-

Häcker Felix 43 Dec 17, 2022
This blog provides detailed status updates and useful information about Theseus OS and its development

The Theseus OS Blog This blog provides detailed status updates and useful information about Theseus OS and its development. Attribution This blog was

Theseus OS 1 Apr 14, 2022
Zomby7e's Blog - Backend

7eblog_backend Zomby7e's Blog - Backend, is just a micro blog backend. This project is written in Rust, it depends on Actix, uses SQLite to store data

Zomby7e 2 Aug 26, 2022
An inquiry into nondogmatic software development. An experiment showing double performance of the code running on JVM comparing to equivalent native C code.

java-2-times-faster-than-c An experiment showing double performance of the code running on JVM comparing to equivalent native C code ⚠️ The title of t

xemantic 49 Aug 14, 2022
Leetcode Solutions in Rust, Advent of Code Solutions in Rust and more

RUST GYM Rust Solutions Leetcode Solutions in Rust AdventOfCode Solutions in Rust This project demostrates how to create Data Structures and to implem

Larry Fantasy 635 Jan 3, 2023
:crab: Small exercises to get you used to reading and writing Rust code!

rustlings ?? ❤️ Greetings and welcome to rustlings. This project contains small exercises to get you used to reading and writing Rust code. This inclu

The Rust Programming Language 33.1k Jan 2, 2023
Crabzilla provides a simple interface for running JavaScript modules alongside Rust code.

Crabzilla Crabzilla provides a simple interface for running JavaScript modules alongside Rust code. Example use crabzilla::*; use std::io::stdin; #[i

Andy 14 Feb 19, 2022
Shuttle is a library for testing concurrent Rust code

Shuttle Shuttle is a library for testing concurrent Rust code. It is an implementation of a number of randomized concurrency testing techniques, inclu

Amazon Web Services - Labs 373 Dec 27, 2022