Efficient argmin & argmax (in 1 function)

Overview

ArgMinMax

Efficient argmin & argmax (in 1 function) with SIMD (SSE, AVX(2), AVX512, NEON) for f16, f32, f64, i8, i16, i32, i64, u8, u16, u32, u64.

🚀 The function is generic over the type of the array, so it can be used on &[T] or Vec<T> where T can be f16¹, f32, f64, i8, i16, i32, i64, u8, u16, u32, u64.

🤝 The trait is implemented for slice, Vec, 1D ndarray::ArrayBase², and Apache arrow::PrimitiveArray³.

Runtime CPU feature detection is used to select the most efficient implementation for the current CPU. This means that the same binary can be used on different CPUs without recompilation.
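
For illustration, runtime dispatch on x86_64 could look like the following minimal sketch (using std's is_x86_feature_detected! macro; this is not the crate's actual dispatch code):

#[cfg(target_arch = "x86_64")]
fn best_simd_level() -> &'static str {
    // Query CPU features at runtime, preferring the widest SIMD available.
    if is_x86_feature_detected!("avx512f") {
        "AVX512"
    } else if is_x86_feature_detected!("avx2") {
        "AVX2"
    } else if is_x86_feature_detected!("sse4.2") {
        "SSE"
    } else {
        "scalar"
    }
}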

👀 The SIMD implementation contains no if checks, ensuring that the runtime of the function is independent of the input data and its ordering (best case = worst case = average case).

🪄 Efficient support for f16 and uints: through (bijective aka symmetric) bitwise operations, f16 (optional¹) and uints are converted to ordered integers, allowing the use of integer SIMD instructions.

¹ for f16 you should enable the "half" feature.
² for ndarray::ArrayBase you should enable the "ndarray" feature.
³ for arrow::PrimitiveArray you should enable the "arrow" feature.
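
As a sketch of the ordered-integer trick above (illustrative only, not the crate's internal code), an f16 bit pattern can be mapped to an order-preserving i16 as follows:

fn f16_bits_to_ordered_i16(bits: u16) -> i16 {
    // Non-negative floats keep their bit pattern; for negative floats the
    // value bits are flipped, yielding i16s with the same ordering as the
    // floats (NaNs aside).
    let v = bits as i16;
    v ^ ((v >> 15) & 0x7FFF)
}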

Installing

Add the following to your Cargo.toml:

[dependencies]
argminmax = "0.4"

Example usage

use argminmax::ArgMinMax;  // import trait

let arr: Vec<i32> = (0..200_000).collect();  // create a vector

let (min, max) = arr.argminmax();  // apply extension

println!("min: {}, max: {}", min, max);
println!("arr[min]: {}, arr[max]: {}", arr[min], arr[max]);

Features

  • "half": support f16 argminmax (through using the half crate).
  • "ndarray": add ArgMinMax trait to ndarray its Array1 & ArrayView1.

Benchmarks

Benchmarks on my laptop (AMD Ryzen 7 4800U, 1.8 GHz, 16 GB RAM) using criterion show that the function is 3-20x faster than the scalar implementation (depending on the data type).

See /benches/results.

Run the benchmarks yourself with the following command:

cargo bench --quiet --message-format=short --features half | grep "time:"

Tests

To run the tests use the following command:

cargo test --message-format=short --all-features

Limitations

Does not support NaNs.


Acknowledgements

Some parts of this library are inspired by the great work of minimalrust's argmm project.

Comments
  • :recycle: major refactoring

    I'll document the changes tomorrow - a bit too tired now :upside_down_face:

    Long story short -> used more traits & structs to provide more flexibility (& less duplicate code) yay! :tada:

    -- did not add support for aarch64/arm (as I currently do not have access to my ARM testing device)

    opened by jvdd 4
  • Add NEON SIMD types to new NaN v3 design

    Apparently we can indeed create pulls into pull requests.

    Here we go. This should add some functions and types for the NEON/arm64 SIMD specialisations.

    opened by varon 2
  • :sparkles: implement ArgMinMax trait for other types

    Currently the ArgMinMax trait is only implemented for ndarray::ArrayView1.
    It might be interesting to implement the trait for

    • Vec
    • array?

    Credits @minimalrust, see https://github.com/minimalrust/argmm/issues/8

    enhancement 
    opened by jvdd 2
  • :pie: update NEON SIMD

    Fixed the merge issues of #24

    With a lot of patience I debugged & updated the NEON code on my raspberry pi 3 :upside_down_face:

    Thx for the headstart @varon :)


    This PR also fixes a bug in the tests where we checked for the wrong target feature (and thus performed illegal instructions)

    opened by jvdd 1
  • QOL improvements

    • Add rust toolchain file to default to nightly
    • .gitignore improvements

    This is a super-tiny PR - just some minor touchups during the initial onboarding of working with this code.

    Adding the toolchain file solved a common gotcha, and the .gitignore adds support for working in CLion.

    opened by varon 1
  • :rabbit: add codspeed benchmarking to ci/cd

    To make codspeed work I had to add "half" as a default feature. I will revert this once https://github.com/CodSpeedHQ/codspeed-rust/issues/1 is resolved.

    opened by jvdd 1
  • :tea: use slice internally + implement for various types

    • [x] use slice (&[T]) under the hood instead of ArrayView1
    • [x] ndarray -> optional dependency
    • extend trait implementation, resolves #9
      • [x] slice
      • [x] Vec
        • [x] test this
      • [x] ndarray
        • [x] test this
      • [x] arrow
        • [x] test this
    • [x] update Readme

    • [x] final rerun of benchmarks
      • [x] no regressions (i.e., same performance or even better :rocket:) for scalar implementation
      • [ ] no regressions (i.e., same performance or even better :rocket:) for the SIMD implementation => I noticed a few (minor) regressions for certain dtype / SIMD combinations & highlighted those in the benches/results file. But since the CodSpeed continuous monitoring only shows ~2% regressions for AVX2 u8 & i8 (most likely not the most common data types for end users), I am in favor of merging this into main, given the significantly enhanced flexibility that this PR adds :)
    opened by jvdd 1
  • :zap: optimize iteration over array

    This pull request updates the way we iterate over arrays by replacing the [exact_chunks](https://docs.rs/ndarray/latest/ndarray/struct.ArrayBase.html#method.exact_chunks) method with a pointer that we offset ourselves (see the sketch below). Our profiling showed that this change resulted in significant speed improvements, with up to a 50% speedup for 8-bit data types and noticeable speedups for other data types. This should result in a notable improvement in the overall performance of our code.

    :sparkles: Contributions of this PR (focusing on performance) :rocket:

    • [x] optimized iteration over array using (mutable) pointer that we offset ourselves
    • [x] reorder SIMD inner loop
    • [x] for 16-bit dtypes: _horiz_min and _horiz_max with SIMD
    • [x] use NEON SIMD implementation on arm for 32-bit dtypes (as this is now 2-3x faster than scalar)
    • [x] explore prefetching - as currently the bottleneck is loading the data => included in the code but commented out, as this has minor / no impact (only measurable positive impact on large arrays - 500M+ data points)

    This PR was mainly driven by analyzing stacktraces of perf.
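
    A minimal scalar sketch of the pointer-offset idea (hypothetical code, not this PR's actual SIMD loop):

    // Iterate in fixed-size chunks by offsetting a raw pointer ourselves,
    // instead of using ndarray's exact_chunks. Assumes a non-empty slice;
    // the real code handles the tail remainder separately.
    unsafe fn argmin_chunked(data: &[f32]) -> usize {
        const LANES: usize = 8; // e.g., 8 f32 lanes in an AVX2 register
        let mut ptr = data.as_ptr();
        let mut best = 0;
        for chunk in 0..data.len() / LANES {
            // in the real code, a SIMD vector is loaded from `ptr` here
            for lane in 0..LANES {
                if *ptr.add(lane) < data[best] {
                    best = chunk * LANES + lane;
                }
            }
            ptr = ptr.add(LANES); // offset the pointer ourselves
        }
        best
    }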


    :eyes: short (& incomplete) summary of observed improvements

    x86(_64)

    • 8-bit dtypes: 80% faster :rocket: :rocket:
    • 16-bit dtypes: 5-12% faster (12% for floats, 10% for ints, 5% for uints)
    • 32-bit dtypes: 0-5% faster (5% for floats, 2% for ints, 0% for uints)
    • 64-bit dtypes: 3-17% faster (3% for floats, 17% for ints & uints)

    AArch64

    • 8-bit dtypes: 34-42% faster :rocket: :rocket: (34% for ints and 42% for floats)
    • 16-bit dtypes: 7% slower :cry: (=> bc of horiz SIMD? :thinking:)
    • 32-bit dtypes: 0-20% faster (20% for floats, 0% for ints & uints)

    ARM

    • 8-bit: 400% faster :rocket: :rocket:
    • 16-bit: 350-460% faster :rocket: :rocket: (350% for floats, 460% for ints & uints)
    • 32-bit: 420-450% faster :rocket: :rocket: (420% for floats, 450% for ints & uints)
    enhancement 
    opened by jvdd 1
  • :sparkles: support int8 and uint8

    :sparkles: add support for int8

    • [x] SIMD + SCALAR implementation
      • [x] alleviate bottleneck (horizontal argmin in final vector) -> algorithm: https://stackoverflow.com/a/9798369
        • [x] SSE
        • [x] AVX2
        • [x] AVX512(bw)
        • [x] ARM/AARCH64

    :thinking: add support for uint8?
    -> would probably require the XOR conversion (see the sketch below) ~~but would exit the overflow loop 2x less often than int8 (guess this would be a net 1.5x speedup over int8 for longer arrays)~~ :sweat_smile: nvm, still limited by i8::MAX for overflow. TODO: bench + horizontal SIMD implementation for NEON

    • [x] SSE
    • [x] AVX2
    • [x] AVX512(bw)
    • [x] ARM/AARCH64
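
    The XOR conversion mentioned above, as a minimal scalar sketch (the SIMD code would apply the same idea lane-wise):

    fn u8_to_ordered_i8(x: u8) -> i8 {
        // Flipping the top bit maps 0..=255 onto -128..=127 while
        // preserving order, so signed SIMD comparisons can be used.
        (x ^ 0x80) as i8
    }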

    :snowflake: minor improvements

    • [x] faster implementation of overflow loop
    • [ ] ~~remove _get_min_max_index_value~~ => is for another refactoring :recycle: PR

    Other stuff

    • [x] :robot: rerun benchmarks!
    opened by jvdd 0
  • :tada: NEON support for aarch64

    • [x] :tada: add support for SIMD (NEON) on aarch64 cpu architectures
    • [x] :recycle: benchmark code on raspberry pi 4 (which is aarch64)
    • [x] :broom: fix some clippy warnings

    Thx @matdemeue for providing me with a raspberry pi 4!! :hugs: (this allowed me to develop & bench the code!)

    opened by jvdd 0
  • :broom: minor improvements

    This PR will be squashed when merged - it includes #4

    Contributions of this PR (in addition to #4):

    • [x] minor refactoring of naming
    • [ ] ~~lazy_static~~ => once_cell => will skip this for now
    opened by jvdd 0
  • Make datatypes opt-in

    Hi @jvdd, I am considering adding this to polars, but given that we define our own NaN handling, I only want to expose it for integer-backed data.

    Could the floating points be feature gated? Polars compile times are really strained so I'd rather not include code we don't need. Let me know what you think!

    opened by ritchie46 2
  • :rocket: float NaN handling

    Handle NaNs, closes #16. Behavior changes in this PR:

    • .argminmax returns the index of the first NaN value (instead of ignoring it). To realize this functionality, we use the transformation as detailed in #16 & explored in #18.
    • .nanargminmax (new function :tada:) ignores NaNs (while being even faster for floats). => Only "downside": if the data contains ONLY NaNs and/or +inf/-inf, this will return 0 (I believe we can accept this unexpected behavior for now - it seems like a very uncommon use case)

    Changing the "architecture":

    • [x] SIMD ops in stand-alone trait (better w.r.t. coupling & cohesion)
      • [x] create SIMDOps trait & add additional traits (see sketch below)
      • [x] add auto-implement for SIMDCore traits
      • [x] implement the SIMDOps trait for the various data types
        • [x] x86 / x86_64
        • [x] arm / Aarch64
    • [x] implement IgnoreNaN and ReturnNaN variants of the SIMDInstructionSet structs for floats
      • [x] IgnoreNaN
      • [x] ReturnNaN

    Changing the default behavior

    • [ ] switch from IgnoreNan to ReturnNan for argminmax
      • [x] simd_f16.rs should implement SIMDInstructionSet and not the IgnoreNan structs
      • [x] document header of IgnoreNan & ReturnNan implementations
      • [x] add tests for IgnoreNan
      • [x] add tests for ReturnNan
      • [ ] update benches (currently we only bench FloatIgnoreNaN) -> first assure that we currently have no regressions (still the case!); if so, change the FloatIgnoreNan benches to FloatReturnNan (will most likely result in some regressions) and move the FloatIgnoreNan benches to a dedicated bench file.
      • [x] update the ArgMinMax trait
    • [x] add scalar support
      • [x] update tests with correct scalar implementation
      • [x] implement SCALARIgnoreNaN
      • [x] double check correct scalar support for f16

    Minor TODOs during this (waaay too large) PR:

    • [x] try to get rid of overhead for signed & unsigned integers (added to ignore NaNs in the first SIMD Vec, see commit edd083c5343e009c338b7a9f35cf68827adf90ab)
    • [x] add tests for infinities, resolves #20
    • [ ] :thinking: decide what we will default to: argminmax / nanargminmax?

    Overview of the new "architecture"

    (architecture diagram omitted)

    • the default SIMDInstructionSet structs (e.g., AVX2) have a "return NaN" argminmax (e.g. simd_f32.rs)
    • "ignore NaN" is served by a distinct struct (e.g. AVX2FloatIgnoreNaN)
    opened by jvdd 7
  • Add tests for NaN handling

    This PR adds some (failing) tests for NaN handling.

    This does work and is standalone, but obviously the tests fail, so do be aware of that prior to merging.

    This can either be used as the base branch for other PRs in this series, such as #18 or can be extended here to a complete implementation.

    opened by varon 1
  • :construction: support NaNs for SSE & AVX2 f32

    This PR aims at handling NaNs (#16).

    For the scalar implementation I check whether the scalar value is not equal to itself. This check is only true if the value is NaN, as the following holds in Rust:

    let v = f32::NAN;
    assert!(v != v);
    

    For the SIMD implementation I used a transformation similar to the one used in #1 - this transformation projects the NaNs to integer values that are either higher or lower than the "real" floating point values. The transformation leverages two's complement: https://observablehq.com/@rreusser/half-precision-floating-point-visualized

    Some remarks:

    • + inf & - inf will get projected as well
      • [x] validate if these are inbetween the NaNs & the "real" floating point values (pretty sure this is the case)
        => this is indeed the case - see plot :arrow_down: image
    • there are a plethora of ways to represent NaNs in floating point representation: exponent should be 11111 and the fraction should be non-zero. Thus the sign bit may be 1 or 0 -> resulting in half of the NaNs getting projected above and the other half below the "real" floating point values. Thus, only 1 of the 2 checks (either > or <) will fire & thus detect the NaN. This might be a problem when we want to implement argmin and argmax as separate functions..
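
    A small demo of this projection for f32 (hypothetical helper, not this PR's SIMD code):

    fn ord_transform(x: f32) -> i32 {
        // Order-preserving float-to-int transform: sign-0 NaNs land above
        // +inf, sign-1 NaNs land below -inf.
        let v = x.to_bits() as i32;
        v ^ ((v >> 31) & 0x7FFF_FFFF)
    }

    let pos_nan = f32::from_bits(0x7FC0_0000); // NaN with sign bit 0
    let neg_nan = f32::from_bits(0xFFC0_0000); // NaN with sign bit 1
    assert!(ord_transform(pos_nan) > ord_transform(f32::INFINITY));
    assert!(ord_transform(neg_nan) < ord_transform(f32::NEG_INFINITY));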

    Paths I looked into but did not seem worthwhile:

    opened by jvdd 3
  • :muscle: handle NaNs

    We want to properly handle NaNs.

    What are NaNs?

    NaNs are only present in float datatypes - and thus do not exist in (unsigned) integer datatypes.
    For simplicity I will illustrate things for 16-bit numbers.

    Some background :arrow_down:

    Int16 representation https://steffanynaranjo.medium.com/how-integers-are-stored-in-memory-using-twos-complement-8b4237f7426d

    Float16 representation

    !! We are dealing with fp16 here, NOT bfloat16

    numpy.float16: 16-bit-precision floating-point number type: sign bit, 5 bits exponent, 10 bits mantissa.

    So what do NaNs look like?

    NaNs occur when the exponent is all 1s and the mantissa is not all 0s (the sign bit does not matter).

    What do infinities look like? +inf occurs when the sign is 0, the exponent is all 1s, and the mantissa is all 0s.

    -inf occurs when the sign is 1, the exponent is all 1s, and the mantissa is all 0s.
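
    These bit layouts can be verified with the half crate (a small sketch; the constants are standard IEEE 754 half-precision values):

    use half::f16;

    assert_eq!(f16::INFINITY.to_bits(), 0x7C00);     // 0 11111 0000000000
    assert_eq!(f16::NEG_INFINITY.to_bits(), 0xFC00); // 1 11111 0000000000
    let nan = f16::from_bits(0x7C01); // exponent all 1s, mantissa non-zero
    assert!(nan.is_nan());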

    How does the current implementation cope with NaNs?

    Necessary background info: every comparison (gt, lt, eq) with NaNs results in false :arrow_down:

    let v = f32::NAN;
    assert_eq!(v > 5.0, false);
    assert_eq!(v < 5.0, false);
    assert_eq!(v == 5.0, false);
    

    => Current implementation deals as follows with NaNs in the data:

    • if there are NO NaNs in the first SIMD lane: current implementation will ignore the NaNs (as comparisons with NaNs result in false and thus no NaNs will be added in the accumulating SIMD vector)
    • if there are ONE or MULTIPLE NaNs in first SIMD lane: current implementation will never update the NaNs in the accumulating SIMD vector (as comparisons will result in false and thus the NaNs will never be updated).

    TODOs

    • [ ] fix this issue with NaN values in the first SIMD vector -> if we can create/ensure a "NaN-free" initial SIMD vec, then no NaNs will be added in the inner SIMD loop => we have an implementation of np.nanargmin / np.nanargmax (see the scalar sketch below).
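
    For reference, a scalar sketch of the "NaN-free initialization" idea (hypothetical; a SIMD version would work analogously):

    // Seed with the first non-NaN element; since comparisons with NaN are
    // false, NaNs can never replace the running minimum afterwards.
    fn nanargmin(data: &[f32]) -> Option<usize> {
        let first = data.iter().position(|v| !v.is_nan())?;
        let mut best = first;
        for (i, v) in data.iter().enumerate().skip(first + 1) {
            if *v < data[best] {
                best = i;
            }
        }
        Some(best)
    }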

    How should we handle NaNs

    Ideally we support, just like numpy, two variants of the argmin / argmax algorithm:

    1. ignoring NaNs: as the current version does (if we fix the TODO above) -> corresponds to np.nanargmin / np.nanargmax
    2. returning first NaN index once there is one present in the data -> corresponds to np.argmin / np.argmax

    We can serve both functionalities by adding a new function to the ArgMinMax trait and letting it default to the current implementation for non-float datatypes (see the sketch below).
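
    A hypothetical shape for such a trait (function names as discussed in this thread, not necessarily the final API):

    trait ArgMinMax {
        // returns the index of the first NaN if present (cf. np.argmin / np.argmax)
        fn argminmax(&self) -> (usize, usize);
        // ignores NaNs (cf. np.nanargmin / np.nanargmax)
        fn nanargminmax(&self) -> (usize, usize);
    }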

    opened by jvdd 2
Releases(v0.4.0)
  • v0.4.0(Feb 13, 2023)

    This release improves the underlying implementation by replacing ndarray::ArrayView1 with the more flexible (and zero-dependency) &[T] slice. This new implementation enhances the library's interoperability with other array-like data types and crates.

    As of now, importing the ArgMinMax trait adds the .argminmax function to:

    • slice (&[T])
    • Vec

    Using optional features, the .argminmax function can be added to:

    • ndarray's Array1 & ArrayView1 (through the "ndarray" feature)
    • arrow's PrimitiveArray (through the "arrow" feature)

    P.S.: To include the amazing @CodSpeedHQ continuous performance monitoring in our CI, we had to add the "half" feature as a default feature (see https://github.com/CodSpeedHQ/codspeed-rust/issues/1). As soon as this issue is resolved, we plan to remove "half" from the default features again.

    What's Changed

    • :rabbit: add codspeed benchmarking to ci/cd by @jvdd in https://github.com/jvdd/argminmax/pull/15
    • :tea: use slice internally + implement for various types by @jvdd in https://github.com/jvdd/argminmax/pull/13
    • QOL improvements by @varon in https://github.com/jvdd/argminmax/pull/17

    New Contributors

    • @varon made their first contribution in https://github.com/jvdd/argminmax/pull/17

    Full Changelog: https://github.com/jvdd/argminmax/compare/v0.3.1...v0.4.0

  • v0.3.1(Feb 5, 2023)

    This is the latest release that uses ndarray::ArrayView1 as the underlying and mandatory interface. In future releases, we will switch to using slices internally and offer optional compatibility with other data types, including ndarray::ArrayView1.

    P.S.: This release and tag were created after we had already implemented slices as the internal interface. The tag references the latest commit that includes the ndarray::ArrayView1 implementation, which corresponds to v0.3.1 of the Rust crate.

Owner
Jeroen Van Der Donckt
Research Engineer
A Rust attribute macro to limit a function's number of runs over a specified period of time

throttle_my_fn: A Rust attribute macro to throttle the execution of functions throttle_my_fn is a Rust attribute macro to limit a function's number of

Fred Morcos 8 Dec 3, 2022
A fast and cross-platform Signed Distance Function (SDF) viewer, easily integrated with your SDF library.

SDF Viewer (demo below) A fast and cross-platform Signed Distance Function (SDF) viewer, easily integrated with your SDF library. A Signed Distance Fu

null 39 Dec 21, 2022
Cleora AI is a general-purpose model for efficient, scalable learning of stable and inductive entity embeddings for heterogeneous relational data.

Cleora Cleora is a genus of moths in the family Geometridae. Their scientific name derives from the Ancient Greek geo γῆ or γαῖα "the earth", and metr

Synerise 405 Dec 20, 2022
Narwhal and Tusk A DAG-based Mempool and Efficient BFT Consensus.

This repo contains a prototype of Narwhal and Tusk. It supplements the paper Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus.

Facebook Research 134 Dec 8, 2022
HNSW ANN from the paper "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"

hnsw Hierarchical Navigable Small World Graph for fast ANN search Enable the serde feature to serialize and deserialize HNSW. Tips A good default for

Rust Computer Vision 93 Dec 30, 2022
A Rust🦀 implementation of CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning

craftml-rs A Rust implementation of CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning (Siblini et al., 2018). Perf

Tom Dong 15 Nov 6, 2022
🚀 efficient approximate nearest neighbor search algorithm collections library written in Rust 🦀 .

🚀 efficient approximate nearest neighbor search algorithm collections library written in Rust 🦀 .

Hora-Search 2.3k Jan 3, 2023
Efficient ML solutions for long-tailed demands.

MegFlow MegFlow is a streaming computing framework for vision applications; its goals are simplicity, high performance, and helping machine learning applications reach production quickly. Features: an efficient asynchronous runtime scheduler based on async-std[features=tokio1]; a concise toml-based graph-description format; support for static, dynamic, and shared subgraphs; support for Rust/P

旷视天元 MegEngine 371 Dec 21, 2022
An efficient implementation of Partitioned Label Trees & its variations for extreme multi-label classification

Omikuji An efficient implementation of Partitioned Label Trees (Prabhu et al., 2018) and its variations for extreme multi-label classification, writte

Tom Dong 73 Nov 7, 2022
1️⃣ el lisp number uno - one lisp to rule them all 🏆

luno el lisp number uno luno is the one lisp to rule them all. Still experimental, do not use it in production yet. goals embeddable small size simple

Eva Pace 3 Apr 25, 2022
argmax is a library that allows Rust applications to avoid Argument list too long errors (E2BIG) by providing a std::process::Command wrapper with a

argmax argmax is a library that allows Rust applications to avoid Argument list too long errors (E2BIG) by providing a std::process::Command wrapper w

David Peter 22 Nov 20, 2022
Demonstration of flexible function calls in Rust with function overloading and optional arguments

Table of Contents Table of Contents flexible-fn-rs What is this trying to demo? How is the code structured? Named/Unnamed and Optional arguments Mecha

Tien Duc (TiDu) Nguyen 81 Nov 3, 2022
A neural network model that can approximate any non-linear function by using the random search algorithm for the optimization of the loss function.

random_search A neural network model that can approximate any non-linear function by using the random search algorithm for the optimization of the los

ph04 2 Apr 1, 2022
A priority queue for Rust with efficient change function.

PriorityQueue This crate implements a Priority Queue with a function to change the priority of an object. Priority and items are stored in an IndexMap

null 139 Dec 30, 2022
Rust cache structures and easy function memoization

cached Caching structures and simplified function memoization cached provides implementations of several caching structures as well as a handy macro f

James Kominick 996 Jan 7, 2023
An OpenGL function pointer loader for Rust

gl-rs Overview This repository contains the necessary building blocks for OpenGL wrapper libraries. For more information on each crate, see their resp

Brendan Zabarauskas 621 Dec 17, 2022
A Quest to Find a Highly Compressed Emoji :shortcode: Lookup Function

Highly Compressed Emoji Shortcode Mapping An experiment to try and find a highly compressed representation of the entire unicode shortcodes-to-emoji m

Daniel Prilik 13 Nov 16, 2021
Deno Foreign Function Interface.

deno_plugin_ffi (WIP & Need Help) Deno Foreign Function Interface. deno_ffi is a Deno plugin for loading and calling dynamic libraries using pure Java

Deno Foreign Function Interface 37 Aug 18, 2022
Easy-to-use optional function arguments for Rust

OptArgs uses const generics to ensure compile-time correctness. I've taken the liberty of expanding and humanizing the macros in the reference examples.

Jonathan Kelley 37 Nov 18, 2022
Fast dense evaluation of Green's function kernels in Rust

Fast evaluation of Greens functions in Rust This library allows the fast evaluation of Greens functions and potential sums for Laplace, Helmholtz, and

null 8 Nov 10, 2022