Benchmarking C, Python, and Rust on the "sp" problem

Overview

Fast SP

Various implementations of the problem in this blog post.

Requirements

To run this, you will need Rust Nightly and Python 3.8+ with numpy.

Rust (nightly) Use [rustup](https://www.rust-lang.org/tools/install) to install a Rust toolchain, then install a nightly toolchain:
rustup update -- nightly

Then run rustup override set nightly to use the Rust nightly in the current directory.

Python 3.8+ with NumPy The test cases were generated using NumPy shenanigans.

You probably want to create a virtual environment, and install NumPy inside it. Here's one way to do it:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Running the benchmarks

Note: the cargo bench must be run before profiling Python (I'm sorry).

Benchmark C and Rust:

cargo bench

Benchmark Python (ensure NumPy is installed):

python3 python/benchmark-python.py

"Data analysis"

If you want to try analyzing results, you will need to install additional Python packages. In your virtual environment, run this:

python3 -m pip install -r requirements-dev.txt

This was really janky. So I just copy-pasted the output from the benchmarks straight from my terminal into a file called output.txt, but I guess you can do this:

cargo bench | tee output.txt &&\
  python3 python/benchmark-python.py | tee -a output.txt

And then run the script to parse this file:

python3 ./analyze-data-from-cargo-bench-output.py output.txt

Implementations

  • c_original — the original implementation from the blog post.
  • c_for_loop — a straightforward C implementation, with buffer size given (no need to find the null-terminator).
  • c_while_loop — a slight variation on the original.
  • rust_for_loop — a Rust implementation that uses a for loop and mutable state.
  • rust_iter — a Rust implementation that uses a for loop and mutable state.
  • rust_simd — a Rust implementation that uses Portable SIMD.
  • python_for_loop — Python code to analyze buffer byte-by-byte.
  • python_numpy — solution that uses NumPy.

Benchmarks

There were two test cases that I used:

  • random_printable: 12 MiB of random printable ASCII characters.
  • random_sp: 12 MiB of either the ASCII character s or p.

I chose 12 MiB as the size of the test data, as that is the size of the L2 cache on the M1's performance cores (allegedly).

Here's how fast various implementation strategies work on my machine (from fastest to slowest):

Language Implementation Test case Throughput (GiB/s) Time per iteration
Rust portable_simd random_sp 19.874 589,654 ns/iter ± 5,769
Rust portable_simd random_printable 19.870 589,766 ns/iter ± 5,726
Python numpy random_printable 5.800 2,020,549 ns/iter ± 28,156
Python numpy random_sp 5.711 2,052,063 ns/iter ± 113,433
Rust iter random_printable 3.979 2,944,916 ns/iter ± 80,495
Rust iter random_sp 3.978 2,946,037 ns/iter ± 33,849
Rust emulate_numpy random_sp 3.031 3,866,700 ns/iter ± 116,109
Rust emulate_numpy random_printable 3.026 3,872,550 ns/iter ± 103,355
C while_loop random_printable 1.487 7,878,845 ns/iter ± 105,742
C while_loop random_sp 1.486 7,886,922 ns/iter ± 158,662
C for_loop random_sp 1.482 7,905,779 ns/iter ± 40,898
Rust for_loop random_sp 1.482 7,906,754 ns/iter ± 635,075
Rust for_loop random_printable 1.481 7,912,558 ns/iter ± 499,821
C original random_printable 1.478 7,930,400 ns/iter ± 135,758
C for_loop random_printable 1.478 7,931,450 ns/iter ± 516,610
C original random_sp 0.278 42,119,291 ns/iter ± 501,957
Python for_loop random_printable 0.001 18,939,466,916 ns/iter ± 44,605,333
Python for_loop random_sp 0.001 19,451,341,325 ns/iter ± 291,675,532

Analysis

TODO! Briefly, Clang generates code for c_original that does all of its counting logic with branches, which is probably why it struggles with random_sp — the M1 just can't predict the branches. c_for_loop and rust_for_loop both yields assembly that uses the cinc and csel, avoiding costly unpredictable branches. Rust/LLVM is able to autovectorize rust_iter but it generates a whole mess of instructions, and I haven't put the time to understand what the instructions are actually doing. For rust_simd, Rust/LLVM generates nice SIMD code that processes 16 bytes at a time, in a tiny loop.

Details of my testing machine

  • Apple MacBook Pro M1, 2020.
  • RAM: 8 GB
  • Max Clock speed: 3.2 GHz
  • L2 cache: 12 MB
  • Apple clang version 14.0.0 (clang-1400.0.29.202)
  • rustc 1.68.0-nightly (61a415be5 2023-01-12)
  • LLVM version: 15.0.6

License

Copyright © 2023 Eddie Antonio Santos. AGPL-3.0 licensed.

You might also like...
This is a simple command line application to convert bibtex to json written in Rust and Python

bibtex-to-json This is a simple command line application to convert bibtex to json written in Rust and Python. Why? To enable you to convert very big

A fast, simple and lightweight Bloom filter library for Python, fully implemented in Rust.

rBloom A fast, simple and lightweight Bloom filter library for Python, fully implemented in Rust. It's designed to be as pythonic as possible, mimicki

An easy-to-use SocketCAN library for Python and C++, built in Rust.

JCAN An easy-to-use SocketCAN library for Python and C++, built in Rust, using cxx-rs and pyo3. Warning: I have never used Rust before and I don't kno

TinyTodo is a Cedar Agent example, with a server in Rust and client in python
TinyTodo is a Cedar Agent example, with a server in Rust and client in python

TinyTodo - OPAL and Cedar Agent Demo TinyTodo is a simple application for managing task lists. It uses OPAL and Cedar Agent to control who has access

The high-speed IAPWS-IF97 package in Rust with C and Python binding
The high-speed IAPWS-IF97 package in Rust with C and Python binding

SEUIF97 This is the Rust implementation of the high-speed IAPWS-IF97 package seuif97 with C and Python binding. It is suitable for computation-intensi

Schemars is a high-performance Python serialization library, leveraging Rust and PyO3 for efficient handling of complex objects

Schemars Introduction Schemars is a Python package, written in Rust and leveraging PyO3, designed for efficient and flexible serialization of Python c

The fastest memoizing and caching Python library written in Rust.

Cachebox Cachebox is a Python library (written in Rust) that provides memoizations and cache implementions with different cache replecement policies.

Python package for topological data analysis written in Rust. Not limited to just H0 and H1.

Topological Data Analysis (TDA) Contents Installation Compiling from source Roadmap TDA is a python package for topological data analysis written in R

A Command-line tool to create, manage and deploy your python projects

PPM A Command-line tool to create, manage and deploy your python projects Table of Contents PPM Main Features Create a Project project.ini file Projec

Owner
Eddie Antonio Santos
Software developer and computing educator
Eddie Antonio Santos
Red-blue graph problem solver - Rust implementation

Red-blue graph problem solver - Rust implementation The problem is the following: In a directed graph, each node is colored either red or blue. Furthe

Thomas Prévost 2 Jan 17, 2022
Proof-of-concept on how to solve Bitcoin's light node sync problem with zkSNARKs

BTC Warp Prove and verify the longest Bitcoin PoW chain BTC Warp is a proof-of-concept system that aims to solve the client-syncing problem for Bitcoi

Succinct 45 May 31, 2023
A Rust on-site channel benchmarking helper. Inter-Process (async / busy) & Intra-Process (async single threaded / async multi threaded)

On-Site Rust Channel Benchmarking Helper Deploy on server to determine which public crates are the fastest for communicating in different architecture

null 23 Jul 9, 2024
A command-line benchmarking tool

hyperfine 中文 A command-line benchmarking tool. Demo: Benchmarking fd and find: Features Statistical analysis across multiple runs. Support for arbitra

David Peter 14.1k Jan 4, 2023
ObfusEval is the benchmarking tool to evaluate the reliability of the code obfuscating transformation.

ObfusEval ObfusEval is the benchmarking tool to evaluate the reliability of the code obfuscating transformation. The following two metrics related the

Software Engineering Lab @ NAIST 4 Dec 15, 2022
Rust Imaging Library's Python binding: A performant and high-level image processing library for Python written in Rust

ril-py Rust Imaging Library for Python: Python bindings for ril, a performant and high-level image processing library written in Rust. What's this? Th

Cryptex 13 Dec 6, 2022
A lightweight and high-performance order-book designed to process level 2 and trades data. Available in Rust and Python

ninjabook A lightweight and high-performance order-book implemented in Rust, designed to process level 2 and trades data. Available in Python and Rust

Ninja Quant 134 Jul 22, 2024
Python/Rust implementations and notes from Proofs Arguments and Zero Knowledge study group

What is this? This is where I'll be collecting resources related to the Study Group on Dr. Justin Thaler's Proofs Arguments And Zero Knowledge Book. T

Thor 65 Dec 16, 2022
My solutions for the Advent of Code 2021 in Scala, Python, Haskell and Rust.

Advent of Code 2021 These are my Advent of Code 2021 solutions written in Scala 3, Haskell, Python and Rust. Day Title L1 L2 L3 L4 01 Sonar Sweep Scal

Axel Suárez 2 Oct 14, 2022
A toy example showing how to run Rust code in Python for speed and progress.

PoC: Integrating Rust in Python A toy example showing how to run Rust code in Python for speed and progress. Requirements Python 3.6+ Rust 1.44+ Cargo

Emil Thorenfeldt 2 Feb 7, 2022