Benchmarking C, Python, and Rust on the "sp" problem

Eddie Antonio Santos

Last update: Jul 13, 2023

Related tags

Command-line python rust benchmarking numpy simd

Overview

Fast SP

Various implementations of the problem in this blog post.

Requirements

To run this, you will need Rust Nightly and Python 3.8+ with numpy.

Rust (nightly)

Use [rustup](https://www.rust-lang.org/tools/install) to install a Rust toolchain, then install a nightly toolchain:

rustup update -- nightly

Then run rustup override set nightly to use the Rust nightly in the current directory.

Python 3.8+ with NumPy

The test cases were generated using NumPy shenanigans.

You probably want to create a virtual environment, and install NumPy inside it. Here's one way to do it:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Running the benchmarks

Note: the cargo bench must be run before profiling Python (I'm sorry).

Benchmark C and Rust:

cargo bench

Benchmark Python (ensure NumPy is installed):

python3 python/benchmark-python.py

"Data analysis"

If you want to try analyzing results, you will need to install additional Python packages. In your virtual environment, run this:

python3 -m pip install -r requirements-dev.txt

This was really janky. So I just copy-pasted the output from the benchmarks straight from my terminal into a file called output.txt, but I guess you can do this:

cargo bench | tee output.txt &&\
  python3 python/benchmark-python.py | tee -a output.txt

And then run the script to parse this file:

python3 ./analyze-data-from-cargo-bench-output.py output.txt

Implementations

c_original — the original implementation from the blog post.
c_for_loop — a straightforward C implementation, with buffer size given (no need to find the null-terminator).
c_while_loop — a slight variation on the original.
rust_for_loop — a Rust implementation that uses a for loop and mutable state.
rust_iter — a Rust implementation that uses a for loop and mutable state.
rust_simd — a Rust implementation that uses Portable SIMD.
python_for_loop — Python code to analyze buffer byte-by-byte.
python_numpy — solution that uses NumPy.

Benchmarks

There were two test cases that I used:

random_printable: 12 MiB of random printable ASCII characters.
random_sp: 12 MiB of either the ASCII character s or p.

I chose 12 MiB as the size of the test data, as that is the size of the L2 cache on the M1's performance cores (allegedly).

Here's how fast various implementation strategies work on my machine (from fastest to slowest):

Language	Implementation	Test case	Throughput (GiB/s)	Time per iteration
Rust	portable_simd	random_sp	19.874	589,654 ns/iter ± 5,769
Rust	portable_simd	random_printable	19.870	589,766 ns/iter ± 5,726
Python	numpy	random_printable	5.800	2,020,549 ns/iter ± 28,156
Python	numpy	random_sp	5.711	2,052,063 ns/iter ± 113,433
Rust	iter	random_printable	3.979	2,944,916 ns/iter ± 80,495
Rust	iter	random_sp	3.978	2,946,037 ns/iter ± 33,849
Rust	emulate_numpy	random_sp	3.031	3,866,700 ns/iter ± 116,109
Rust	emulate_numpy	random_printable	3.026	3,872,550 ns/iter ± 103,355
C	while_loop	random_printable	1.487	7,878,845 ns/iter ± 105,742
C	while_loop	random_sp	1.486	7,886,922 ns/iter ± 158,662
C	for_loop	random_sp	1.482	7,905,779 ns/iter ± 40,898
Rust	for_loop	random_sp	1.482	7,906,754 ns/iter ± 635,075
Rust	for_loop	random_printable	1.481	7,912,558 ns/iter ± 499,821
C	original	random_printable	1.478	7,930,400 ns/iter ± 135,758
C	for_loop	random_printable	1.478	7,931,450 ns/iter ± 516,610
C	original	random_sp	0.278	42,119,291 ns/iter ± 501,957
Python	for_loop	random_printable	0.001	18,939,466,916 ns/iter ± 44,605,333
Python	for_loop	random_sp	0.001	19,451,341,325 ns/iter ± 291,675,532

Analysis

TODO! Briefly, Clang generates code for c_original that does all of its counting logic with branches, which is probably why it struggles with random_sp — the M1 just can't predict the branches. c_for_loop and rust_for_loop both yields assembly that uses the cinc and csel, avoiding costly unpredictable branches. Rust/LLVM is able to autovectorize rust_iter but it generates a whole mess of instructions, and I haven't put the time to understand what the instructions are actually doing. For rust_simd, Rust/LLVM generates nice SIMD code that processes 16 bytes at a time, in a tiny loop.

Details of my testing machine

Apple MacBook Pro M1, 2020.
RAM: 8 GB
Max Clock speed: 3.2 GHz
L2 cache: 12 MB
Apple clang version 14.0.0 (clang-1400.0.29.202)
rustc 1.68.0-nightly (61a415be5 2023-01-12)
LLVM version: 15.0.6

License

This is a simple command line application to convert bibtex to json written in Rust and Python

bibtex-to-json This is a simple command line application to convert bibtex to json written in Rust and Python. Why? To enable you to convert very big

3 Mar 23, 2022

A fast, simple and lightweight Bloom filter library for Python, fully implemented in Rust.

rBloom A fast, simple and lightweight Bloom filter library for Python, fully implemented in Rust. It's designed to be as pythonic as possible, mimicki

91 Feb 4, 2023

An easy-to-use SocketCAN library for Python and C++, built in Rust.

JCAN An easy-to-use SocketCAN library for Python and C++, built in Rust, using cxx-rs and pyo3. Warning: I have never used Rust before and I don't kno

4 Feb 9, 2023

TinyTodo is a Cedar Agent example, with a server in Rust and client in python

TinyTodo - OPAL and Cedar Agent Demo TinyTodo is a simple application for managing task lists. It uses OPAL and Cedar Agent to control who has access

4 Aug 9, 2023

The high-speed IAPWS-IF97 package in Rust with C and Python binding

SEUIF97 This is the Rust implementation of the high-speed IAPWS-IF97 package seuif97 with C and Python binding. It is suitable for computation-intensi

5 Aug 12, 2023

Schemars is a high-performance Python serialization library, leveraging Rust and PyO3 for efficient handling of complex objects

Schemars Introduction Schemars is a Python package, written in Rust and leveraging PyO3, designed for efficient and flexible serialization of Python c

7 Nov 21, 2023

Benchmarking C, Python, and Rust on the "sp" problem

Related tags

Overview

Fast SP

Requirements

Running the benchmarks

"Data analysis"

Implementations

Benchmarks

Analysis

License

You might also like...

This is a simple command line application to convert bibtex to json written in Rust and Python

A fast, simple and lightweight Bloom filter library for Python, fully implemented in Rust.

An easy-to-use SocketCAN library for Python and C++, built in Rust.

TinyTodo is a Cedar Agent example, with a server in Rust and client in python

The high-speed IAPWS-IF97 package in Rust with C and Python binding

Schemars is a high-performance Python serialization library, leveraging Rust and PyO3 for efficient handling of complex objects

The fastest memoizing and caching Python library written in Rust.

Python package for topological data analysis written in Rust. Not limited to just H0 and H1.

A Command-line tool to create, manage and deploy your python projects

Owner

Eddie Antonio Santos

Red-blue graph problem solver - Rust implementation

Proof-of-concept on how to solve Bitcoin's light node sync problem with zkSNARKs

A Rust on-site channel benchmarking helper. Inter-Process (async / busy) & Intra-Process (async single threaded / async multi threaded)

A command-line benchmarking tool

ObfusEval is the benchmarking tool to evaluate the reliability of the code obfuscating transformation.

Rust Imaging Library's Python binding: A performant and high-level image processing library for Python written in Rust

A lightweight and high-performance order-book designed to process level 2 and trades data. Available in Rust and Python

Python/Rust implementations and notes from Proofs Arguments and Zero Knowledge study group

My solutions for the Advent of Code 2021 in Scala, Python, Haskell and Rust.

A toy example showing how to run Rust code in Python for speed and progress.