Code for blog post "{n} times faster than C, where n = 128"

Thomas Ip

Last update: Jul 24, 2023

Related tags

Learning Resources n_times_faster_than_c

Overview

Code for {n} times faster than C, where n = 128

Actually, n = 290 🤯

Benchmark Setup

Rust version: rustc 1.70.0 (90c541806 2023-05-31)
Run test: cargo test
Run benchmark: cargo bench

Machine: Apple MacBook Pro 14-inch (2021)
Processor: Apple M1 Pro
Memory: 16 GB

Input size: 1,000,000 characters
Input generation: s and p are chosen with randomly 50% probability.

Benchmark Result

Function	Time	Throughput	Relative speed
`baseline_unicode`	3.7511 ms	254.24 MiB/s	0.88
`baseline`	3.3316 ms	286.25 MiB/s	1
`opt1_idiomatic`	227.33 µs	4.0968 GiB/s	14.7
`opt2_count_s`	152.44 µs	6.1096 GiB/s	21.9
`opt3_count_s_branchless`	72.902 µs	12.775 GiB/s	45.7
`opt4_simd`	43.131 µs	21.593 GiB/s	77.2
`opt5_simd_unrolled_2x`	32.810 µs	28.385 GiB/s	101.5
`opt5_simd_unrolled_4x`	28.524 µs	32.650 GiB/s	116.8
`opt5_simd_unrolled_8x`	26.518 µs	35.120 GiB/s	125.64
`opt5_simd_unrolled_10x`	26.070 µs	35.724 GiB/s	127.8 🎉
`opt5_simd_unrolled_12x`	27.833 µs	33.461 GiB/s	119.7
`opt5_simd_unrolled_16x`	27.157 µs	34.293 GiB/s	122.7
`opt6_chunk_count`¹	14.597 µs	63.802 GiB/s	228.2
`opt6_chunk_exact_count` ²	11.489 µs	81.060 GiB/s	290.0 🚀

Credit

Thanks u/DavidM603 and u/Sharlinator for contributing even faster and cleaner solutions.

Thanks @PeterFaiman for catching and fixing an overflow problem in multiple optimizations (#1).

Credit to Reddit user u/DavidM603. ↩
Credit to Reddit user u/Sharlinator. ↩

On-the-fly machine code executing from Deno

jit from js Execute raw machine code from JavaScript. I hope you know what you're doing :) const inst = new Uint8Array([0xC3]); // ret const noop = ji

5 Apr 26, 2021

a cheat-sheet for mathematical notation in Rust 🦀 code form

math-as-rust 🦀 Based on math-as-code This is a reference to ease developers into mathematical notation by showing comparisons with Rust code.

13 Jan 4, 2023

Some UwU and OwO for your Rust code

UwU Types Some UwU and OwO for your Rust code This is a Rust crate inspired by this tweet from @thingskatedid / @katef. Credits Some extra functionali

12 Feb 8, 2022

The source code that accompanies Hands-on Rust: Effective Learning through 2D Game Development and Play by Herbert Wolverson

Hands-on Rust Source Code This repository contains the source code for the examples found in Hands-on Rust. These are also available from my publisher

261 Dec 14, 2022

Source code from Atlas, our 64k demo presented at Revision 2019 with Macau Exports

Atlas source code dump This is a dump of the source code for the engine, graphics tool and player for Atlas, our 64k demo released with Macau Exports

65 Jan 2, 2023

Complete code for the larger example programs from the book.

Code Examples for Programming Rust This repository contains complete code for the larger example programs from the book “Programming Rust”, by Jim Bla

670 Jan 1, 2023

This repository contains the Rust source code for the algorithms in the textbook Algorithms, 4th Edition

Overview This repository contains the Rust source code for the algorithms in the textbook Algorithms, 4th Edition by Robert Sedgewick and Kevin Wayne.

549 Dec 26, 2022

Source code for the book Rust in Action

Welcome to Rust in Action source code This source code repository is a companion to the Rust in Action book available from Manning Publications. Suppo

1.3k Dec 30, 2022

Code to follow along the "Zero To Production" book on API development in Rust.

Zero To Production / Code (Chapter 10 - Part 1) Zero To Production In Rust is an opinionated introduction to backend development using Rust. This repo

2.8k Dec 31, 2022

Comments

fix simd accumulator overflow
The intrinsics version (opt4_simd) has a bug:

// Flush accumulator every 256 iterations to avoid overflow if block_i % (u8::MAX as usize + 1) == u8::MAX as usize {

The function is logically the same as our previous algorithm, with the only change being that of flushing our vector accumulator every 256 iterations to prevent overflowing each u8 lane.

Flushing every 256 iterations isn't enough, because you can potentially encounter 256 's' in 256 iterations, thus overflowing the max of 255. Using the function definitions from the article:

fn main() { let input = "s".repeat(1024*1024); println!("{}", baseline(&input)); println!("{}", opt4_simd(&input)); }

Produces:

1048576 -1040384

The fix, performance untested:

// Flush accumulator every 255 iterations to avoid overflow if block_i % u8::MAX as usize == (u8::MAX - 1) as usize {

Correctly produces:

1048576 1048576

https://lobste.rs/s/sqn7m0/n_times_faster_than_c_where_n_128#c_iooycv
opened by PeterFaiman 6

Try target-cpu=native

Setting target-cpu to native might enable the compiler to apply more optimization.

I had some significant improvements, but also some regressions on my machine.

My local bench results from adding this

Created using the compare target in the justfile.

compare.log

running 3 tests
test tests::test_all_s ... ignored
test tests::test_large ... ignored
test tests::test_simple ... ignored

test result: ok. 0 passed; 0 failed; 3 ignored; 0 measured; 0 filtered out; finished in 0.00s

run_switches/baseline_unicode
                        time:   [3.5822 ms 3.5846 ms 3.5870 ms]
                        thrpt:  [265.87 MiB/s 266.05 MiB/s 266.23 MiB/s]
                 change:
                        time:   [-0.8116% -0.7452% -0.6735%] (p = 0.00 < 0.05)
                        thrpt:  [+0.6781% +0.7508% +0.8182%]
                        Change within noise threshold.
run_switches/baseline   time:   [3.1508 ms 3.1551 ms 3.1597 ms]
                        thrpt:  [301.82 MiB/s 302.26 MiB/s 302.68 MiB/s]
                 change:
                        time:   [-0.3682% -0.1833% +0.0162%] (p = 0.06 > 0.05)
                        thrpt:  [-0.0162% +0.1837% +0.3695%]
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
run_switches/opt1_idiomatic
                        time:   [152.82 µs 152.86 µs 152.91 µs]
                        thrpt:  [6.0908 GiB/s 6.0927 GiB/s 6.0944 GiB/s]
                 change:
                        time:   [-76.164% -76.123% -76.061%] (p = 0.00 < 0.05)
                        thrpt:  [+317.72% +318.81% +319.54%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
run_switches/opt2_count_s
                        time:   [93.749 µs 93.887 µs 94.042 µs]
                        thrpt:  [9.9032 GiB/s 9.9196 GiB/s 9.9342 GiB/s]
                 change:
                        time:   [-66.435% -66.356% -66.271%] (p = 0.00 < 0.05)
                        thrpt:  [+196.48% +197.23% +197.93%]
                        Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
  2 (2.00%) low severe
  13 (13.00%) low mild
  1 (1.00%) high mild
  8 (8.00%) high severe
run_switches/opt3_count_s_branchless
                        time:   [74.144 µs 74.154 µs 74.165 µs]
                        thrpt:  [12.557 GiB/s 12.559 GiB/s 12.561 GiB/s]
                 change:
                        time:   [-69.834% -69.725% -69.658%] (p = 0.00 < 0.05)
                        thrpt:  [+229.58% +230.30% +231.49%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) high mild
  10 (10.00%) high severe
run_switches/opt6_chunk_count
                        time:   [19.981 µs 19.993 µs 20.005 µs]
                        thrpt:  [46.554 GiB/s 46.583 GiB/s 46.610 GiB/s]
                 change:
                        time:   [+6.1832% +6.6117% +7.2029%] (p = 0.00 < 0.05)
                        thrpt:  [-6.7189% -6.2017% -5.8231%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
run_switches/opt6_chunk_exact_count
                        time:   [14.590 µs 14.596 µs 14.603 µs]
                        thrpt:  [63.777 GiB/s 63.807 GiB/s 63.832 GiB/s]
                 change:
                        time:   [+1.1436% +1.6182% +1.9169%] (p = 0.00 < 0.05)
                        thrpt:  [-1.8808% -1.5924% -1.1307%]
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  6 (6.00%) high mild
  7 (7.00%) high severe

opened by Skgland 0

Owner

Thomas Ip

GitHub https://ipthomas.com/blog/2023/07/n-times-faster-than-c-where-n-128/

A bunch of links to blog posts, articles, videos, etc for learning Rust

rust-learning A bunch of links to blog posts, articles, videos, etc for learning Rust. Feel free to submit a pull request if you have some links/resou

9k Jan 4, 2023

Blog posts, mostly about Rust.

Sean Chen's Blog ?? Blog posts, mostly about Rust. Posts Date Title 2021-04-06 A Beginner's Guide to Handling Errors in Rust 2021-01-23 Implementing a

13 Sep 4, 2022

A Matrix bot which can generate "This Week in X" like blog posts

hebbot A Matrix bot which can help to generate periodic / recurrent summary blog posts (also known as "This Week in X"). The bot was inspired by twim-

43 Dec 17, 2022

This blog provides detailed status updates and useful information about Theseus OS and its development

The Theseus OS Blog This blog provides detailed status updates and useful information about Theseus OS and its development. Attribution This blog was

1 Apr 14, 2022

Zomby7e's Blog - Backend

7eblog_backend Zomby7e's Blog - Backend, is just a micro blog backend. This project is written in Rust, it depends on Actix, uses SQLite to store data

2 Aug 26, 2022

An inquiry into nondogmatic software development. An experiment showing double performance of the code running on JVM comparing to equivalent native C code.

java-2-times-faster-than-c An experiment showing double performance of the code running on JVM comparing to equivalent native C code ⚠️ The title of t

49 Aug 14, 2022

Code for blog post "{n} times faster than C, where n = 128"

Related tags

Overview

Code for {n} times faster than C, where n = 128

Benchmark Setup

Benchmark Result

Credit

You might also like...

On-the-fly machine code executing from Deno

a cheat-sheet for mathematical notation in Rust 🦀 code form

Some UwU and OwO for your Rust code

The source code that accompanies Hands-on Rust: Effective Learning through 2D Game Development and Play by Herbert Wolverson

Source code from Atlas, our 64k demo presented at Revision 2019 with Macau Exports

Complete code for the larger example programs from the book.

This repository contains the Rust source code for the algorithms in the textbook Algorithms, 4th Edition

Source code for the book Rust in Action

Code to follow along the "Zero To Production" book on API development in Rust.

Comments

fix simd accumulator overflow

Try target-cpu=native

Owner

Thomas Ip

A bunch of links to blog posts, articles, videos, etc for learning Rust

Blog posts, mostly about Rust.

A Matrix bot which can generate "This Week in X" like blog posts

This blog provides detailed status updates and useful information about Theseus OS and its development

Zomby7e's Blog - Backend

An inquiry into nondogmatic software development. An experiment showing double performance of the code running on JVM comparing to equivalent native C code.

Leetcode Solutions in Rust, Advent of Code Solutions in Rust and more

:crab: Small exercises to get you used to reading and writing Rust code!

Crabzilla provides a simple interface for running JavaScript modules alongside Rust code.

Shuttle is a library for testing concurrent Rust code

Code for blog post "{n} times faster than C, where n = 128"

Related tags

Overview

Code for {n} times faster than C, where n = 128

Benchmark Setup

Benchmark Result

Credit

Footnotes

You might also like...

On-the-fly machine code executing from Deno

a cheat-sheet for mathematical notation in Rust 🦀 code form

Some UwU and OwO for your Rust code

The source code that accompanies Hands-on Rust: Effective Learning through 2D Game Development and Play by Herbert Wolverson

Source code from Atlas, our 64k demo presented at Revision 2019 with Macau Exports

Complete code for the larger example programs from the book.

This repository contains the Rust source code for the algorithms in the textbook Algorithms, 4th Edition

Source code for the book Rust in Action

Code to follow along the "Zero To Production" book on API development in Rust.

Comments

fix simd accumulator overflow

Try target-cpu=native

Owner

Thomas Ip

A bunch of links to blog posts, articles, videos, etc for learning Rust

Blog posts, mostly about Rust.

A Matrix bot which can generate "This Week in X" like blog posts

This blog provides detailed status updates and useful information about Theseus OS and its development

Zomby7e's Blog - Backend

An inquiry into nondogmatic software development. An experiment showing double performance of the code running on JVM comparing to equivalent native C code.

Leetcode Solutions in Rust, Advent of Code Solutions in Rust and more

:crab: Small exercises to get you used to reading and writing Rust code!

Crabzilla provides a simple interface for running JavaScript modules alongside Rust code.

Shuttle is a library for testing concurrent Rust code