Efficient argmin & argmax (in 1 function)

Overview

ArgMinMax

Efficient argmin & argmax (in 1 function) with SIMD (SSE, AVX(2), AVX512, NEON) for f16, f32, f64, i8, i16, i32, i64, u8, u16, u32, u64.

🚀 The function is generic over the type of the array, so it can be used on &[T] or Vec<T> where T can be f161, f32, f64, i8, i16, i32, i64, u8, u16, u32, u64.

🤝 The trait is implemented for slice, Vec, 1D ndarray::ArrayBase2, and apache arrow::PrimitiveArray3.

Runtime CPU feature detection is used to select the most efficient implementation for the current CPU. This means that the same binary can be used on different CPUs without recompilation.

👀 The SIMD implementation contains no if checks, ensuring that the runtime of the function is independent of the input data its order (best-case = worst-case = average-case).

🪄 Efficient support for f16 and uints: through (bijective aka symmetric) bitwise operations, f16 (optional1) and uints are converted to ordered integers, allowing to use integer SIMD instructions.

1 for f16 you should enable the "half" feature.
2 for ndarray::ArrayBase you should enable the "ndarray" feature.
3 for arrow::PrimitiveArray you should enable the "arrow" feature.

Installing

Add the following to your Cargo.toml:

[dependencies]
argminmax = "0.4"

Example usage

use argminmax::ArgMinMax;  // import trait

let arr: Vec<i32> = (0..200_000).collect();  // create a vector

let (min, max) = arr.argminmax();  // apply extension

println!("min: {}, max: {}", min, max);
println!("arr[min]: {}, arr[max]: {}", arr[min], arr[max]);

Features

  • "half": support f16 argminmax (through using the half crate).
  • "ndarray": add ArgMinMax trait to ndarray its Array1 & ArrayView1.

Benchmarks

Benchmarks on my laptop (AMD Ryzen 7 4800U, 1.8 GHz, 16GB RAM) using criterion show that the function is 3-20x faster than the scalar implementation (depending of data type).

See /benches/results.

Run the benchmarks yourself with the following command:

cargo bench --quiet --message-format=short --features half | grep "time:"

Tests

To run the tests use the following command:

cargo test --message-format=short --all-features

Limitations

Does not support NaNs.


Acknowledgements

Some parts of this library are inspired by the great work of minimalrust's argmm project.

Comments
  • :recycle: major refactoring

    :recycle: major refactoring

    I'll document the changes tomorrow - a bit too tired now :upside_down_face:

    Long story short -> used more traits & structs to provide more flexibility (& less duplicate code) yay! :tada:

    -- did not add support for aarch64/arm (as I currently do not have access to my ARM testing device)

    opened by jvdd 4
  • Add NEON SIMD types to new NaN v3 design

    Add NEON SIMD types to new NaN v3 design

    Apparently we can indeed create pulls into pull requests.

    Here we go. This should add some functions and types for the NEON/arm64 SIMD specialisations.

    opened by varon 2
  • :sparkles: implement ArgMinMax trait for other types

    :sparkles: implement ArgMinMax trait for other types

    Currently the ArgMinMax trait is only implemented for ndarray::ArrayView1.
    It might be interesting to implement the trait for

    • Vec
    • array?

    Credits @minimalrust, see https://github.com/minimalrust/argmm/issues/8

    enhancement 
    opened by jvdd 2
  • :pie: update NEON SIMD

    :pie: update NEON SIMD

    Fixed the merge issues of #24

    With a lot of patience I debugged & updated the NEON code on my raspberry pi 3 :upside_down_face: image

    Thx for the headstart @varon :)


    This PR also fixes a bug in the tests in where we checked for the wrong target feature (and thus performed illegal instructions)

    opened by jvdd 1
  • QOL improvements

    QOL improvements

    • Add rust toolchain file to default to nightly
    • .gitignore improvements

    This is a super-tiny PR - just some minor touchups during the initial onboarding of working with this code.

    Adding the toolchain file solved a common gotcha, and the .gitignore adds support for working in CLion.

    opened by varon 1
  • :rabbit: add codspeed benchmarking to ci/cd

    :rabbit: add codspeed benchmarking to ci/cd

    To make codspeed work I had to add "half" as default feature. I will revert this once https://github.com/CodSpeedHQ/codspeed-rust/issues/1 is resolved

    opened by jvdd 1
  • :tea: use slice internally + implement for various types

    :tea: use slice internally + implement for various types

    • [x] use slice (&[T]) under the hood instead of ArrayView1
    • [x] ndarray -> optional dependency
    • extend trait implementation, resolves #9
      • [x] slice
      • [x] Vec
        • [x] test this
      • [x] ndarray
        • [x] test this
      • [x] arrow
        • [x] test this
    • [x] update Readme

    • [x] final rerun of benchmarks
      • [x] no regressions (i.e., same performance or even better :rocket:) for scalar implementation
      • [ ] no regressions (i.e., same performance or even better :rocket:) for SIMD implementation => I notices very few (minor) regressions for certain dtype / SIMD combinations & highlighted those in the benches/results file.. But, since the codspeed continuous monitoring only shows 2% regressions for the AVX2 u8 & i8 - (which is most likely not the most common datatypes for end users) - I am pro for merging this into main, given the significant enhanced flexibility that this PR adds :)
    opened by jvdd 1
  • :zap: optimize iteration over array

    :zap: optimize iteration over array

    This pull request updates the way we iterate over arrays in our code by replacing the use of the [exact_chunks] (https://docs.rs/ndarray/latest/ndarray/struct.ArrayBase.html#method.exact_chunks) method with a pointer that we offset ourselves. Our profiling showed that this change resulted in significant speed improvements, with up to a 50% speedup for 8-bit data types and noticeable speedups for other data types. This should result in a notable improvement in the overall performance of our code.

    :sparkles: Contributions of this PR (focussing on performance) :rocket:

    • [x] optimized iteration over array using (mutable) pointer that we offset ourselves
    • [x] reorder SIMD inner loop
    • [x] for 16-bit dtypes: _horiz_min and _horiz_max with SIMD
    • [x] use NEON SIMD implementation on arm for 32-bit dtypes (as this is now 2-3x faster than scalar)
    • [x] explore prefetching - as currently the bottleneck is loading the data => included in the code but commented out, as this has minor / no impact (only measurable positive impact on large arrays - 500M+ data points)

    This PR was mainly driven by analyzing stacktraces of perf.


    :eyes: short (& incomplete) summary of observed improvements

    x86(_64)

    • 8-bit dtypes: 80% faster :rocket: :rocket:
    • 16-bit dtypes: 5-12% faster (12% for floats, 10% for ints, 5% for uints)
    • 32-bit dtypes: 0-5% faster (5% for floats, 2% for ints, 0% for uints)
    • 64-bit dtypes: 3-17% faster (3% for floats, 17% for ints & uints)

    AArch64

    • 8-bit dtypes: 34-42% faster :rocket: :rocket: (34% for ints and 42% for floats)
    • 16-bit dtypes: 7% slower :cry: (=> bc of horiz SIMD? :thinking:)
    • 32-bit dtypes: 0-20% faster (20% for floats, 0% for ints & uints)

    ARM

    • 8-bit: 400% faster :rocket: :rocket:
    • 16-bit: 350-460% faster :rocket: :rocket: (350% for floats, 460% for ints & uints)
    • 32-bit: 420-450% faster :rocket: :rocket: (420% for floats, 450% for ints & uints)
    enhancement 
    opened by jvdd 1
  • :sparkles: support int8 and uint8

    :sparkles: support int8 and uint8

    :sparkles: add support for int8

    • [x] SIMD + SCALAR implementation
      • [x] alleviate bottleneck (horizontal argmin in final vector) -> algorithm: https://stackoverflow.com/a/9798369
        • [x] SSE
        • [x] AVX2
        • [x] AVX512(bw)
        • [x] ARM/AARCH64

    :thinking: add support for uint8?
    -> would probably require the XOR conversion ~~but would exit the overflow loop 2x fewer than int8 (guess this will be net 1.5x faster than int8 for longer arrays)~~ :sweat_smile: nvm still limited by the i8::MAX for overflow TODO: bench + horizontal SIMD implementation for neon

    • [x] SSE
    • [x] AVX2
    • [x] AVX512(bw)
    • [x] ARM/AARCH64

    :snowflake: minor improvements

    • [x] faster implementation of overflow loop
    • [ ] ~~remove _get_min_max_index_value~~ => is for another refactoring :recycle: PR

    Other stuff

    • [x] :robot: rerun benchmarks!
    opened by jvdd 0
  • :tada: NEON support for aarch64

    :tada: NEON support for aarch64

    • [x] :tada: add support for SIMD (NEON) on aarch64 cpu architectures
    • [x] :recycle: benchmark code on raspberry pi 4 (which is aarch64)
    • [x] :broom: fix some clippy warnings

    Thx @matdemeue for providing me with a raspberry pi 4!! :hugs: (this allowed me to develop & bench the code!)

    opened by jvdd 0
  • :broom: minor improvements

    :broom: minor improvements

    This PR will be squashed when merged - it includes #4

    Contributions of this PR (in addition to #4):

    • [x] minor refactoring of naming
    • [ ] ~~lazy_static~~ => once_cell => will skip this for now
    opened by jvdd 0
  • Make datatypes opt-in

    Make datatypes opt-in

    Hi @jvdd I am considering adding these to polars, but given that we define our own NaN handling, I only want to expose it for integer backed data.

    Could the floating points be feature gated? Polars compile times are really strained so I'd rather not include code we don't need. Let me know what you think!

    opened by ritchie46 2
  • :rocket: float NaN handling

    :rocket: float NaN handling

    Handle NaNs, closes #16 Behavior changes in this PR:

    • .argminmax returns the index of the first NaN value (instead of ignoring it) To realize this functionality, we use the transformation as detailed in #16 & explored in #18
    • .nanargminmax (new function :tada:) ignores NaNs (while being even faster for floats) => Only "downside" - if data contains ONLY NaNs and/or +inf/-inf this will return 0 (I believe we can accept this unexpected for now - seems like a very uncommon use case)

    Changing the "architecture":

    • [x] SIMD ops in stand-alone trait (better w.r.t. coupling & cohesion)
      • [x] create SIMDOps trait & add additional traits (see sketch below)
      • [x] add auto-implement for SIMDCore traits
      • [x] implement the SIMDOps trait for the various data types
        • [x] x86 / x86_64
        • [x] arm / Aarch64
    • [x] implement IgnoreNaN and ReturnNan variant for floats the SIMDInstructionSet structs
      • [x] IgnoreNaN
      • [x] ReturnNaN

    Changing the default behavior

    • [ ] switch from IgnoreNan to ReturnNan for argminmax
      • [x] simd_f16.rs should implement SIMDInstructionSet and not the IgnoreNan structs
      • [x] document header of IgnoreNan & ReturnNan implementations
      • [x] add tests for IgnoreNan
      • [x] add tests for ReturnNan
      • [ ] update benches (currently we only bench FloatIgnoreNaN) -> first assure that we currently have no regressions - is still the case! - if so, change the FloatIgnoreNan benches to FloatReturnNan (will most likely result in some regressions) and move the FloatIgnoreNan benches to dedicated bench file.
      • [x] update the ArgMinMax trait
    • [x] add scalar support
      • [x] update tests with correct scalar implementation
      • [x] implement SCALARIgnoreNaN
      • [x] double check correct scalar support for f16

    Minor TODOs during this (waaay too large) PR:

    • [x] try to get rid of overhead for signed & unsigned integers (added to ignore NaNs in the first SIMD Vec, see commit edd083c5343e009c338b7a9f35cf68827adf90ab)
    • [x] add tests for infinities, resolves #20
    • [ ] :thinking: decide what we will default to: argminmax / nanargminmax?

    Overview of the new "architecture"

    image

    • default SIMDInstructionSet struct (e.g., AVX2) its argminmax is "return NaN" (e.g. simd_f32.rs)
    • "ignore NaN" is served in a distinct struct (e.g. AVX2FloatIgnoreNaN)
    opened by jvdd 7
  • Add tests for NaN handling

    Add tests for NaN handling

    This PR adds some (failing) tests for NaN handling.

    This does work and is standalone, but obviously the tests fail, so do be aware of that prior to merging.

    This can either be used as the base branch for other PRs in this series, such as #18 or can be extended here to a complete implementation.

    opened by varon 1
  • :construction: support NaNs for SSE & AVX2 f32

    :construction: support NaNs for SSE & AVX2 f32

    This PR aims at handling Nans, #16.

    For the scalar implementation I check whether the scalar value is not equal to itself. This check will only be true if the scalar value is NAN, as the following is correct in Rust:

    let v = f32::NAN;
    assert!(v != v);
    

    For the SIMD implementation I used a transformation similar to the one used in #1 - this transformation projects the NANs to integer values that are either higher / lower than the "real" floating point values. The transformation leverages the 2-complement https://observablehq.com/@rreusser/half-precision-floating-point-visualized

    Some remarks:

    • + inf & - inf will get projected as well
      • [x] validate if these are inbetween the NaNs & the "real" floating point values (pretty sure this is the case)
        => this is indeed the case - see plot :arrow_down: image
    • there are a plethora of ways to represent NaNs in floating point representation: exponent should be 11111 and the fraction should be non-zero. Thus the sign bit may be 1 or 0 -> resulting in half of the NaNs getting projected above and the other half below the "real" floating point values. Thus, only 1 of the 2 checks (either > or <) will fire & thus detect the NaN. This might be a problem when we want to implement argmin and argmax as separate functions..

    Paths I looked into but did not seem worthwhile:

    opened by jvdd 3
  • :muscle: handle NaNs

    :muscle: handle NaNs

    We want to properly handle NaNs.

    What are NaNs?

    NaNs are only present in float datatypes - and thus do not exist in (unsigend) integer datatypes.
    For simplicity I will illustrate things for 16-bit numbers.

    Some background :arrow_down:

    Int16 representation https://steffanynaranjo.medium.com/how-integers-are-stored-in-memory-using-twos-complement-8b4237f7426d

    Float16 representation

    !! We are dealing with fp16 here, NOT bfloat16

    numpy.float16: 16-bit-precision floating-point number type: sign bit, 5 bits exponent, 10 bits mantissa.

    So how do NaNs look like?

    nans occur when exponent is all 1s and mantissa is not all 0s. (sign does not matter)

    How do infinities look like? +inf occurs when sign is 0, exp is all 1s, and mantissa is all 0s.

    -inf occurs when sign is 1, exp is all 1s, and mantissa is all 0s.

    How does the current implementation cope with NaNs?

    Necessary background info: every comparison (gt, lt, eq) with NaNs results in false :arrow_down:

    let v = f32::NAN;
    assert_eq!(v > 5.0, false);
    assert_eq!(v < 5.0, false);
    assert_eq!(v == 5.0, false);
    

    => Current implementation deals as follows with NaNs in the data:

    • if there are NO NaNs in the first SIMD lane: current implementation will ignore the NaNs (as comparisons with NaNs result in false and thus no NaNs will be added in the accumulating SIMD vector)
    • if there are ONE or MULTIPLE NaNs in first SIMD lane: current implementation will never update the NaNs in the accumulating SIMD vector (as comparisons will result in false and thus the NaNs will never be updated).

    TODOs

    • [ ] fix this issue with NaN values in the first SIMD vector -> if we can create/ensure a "NaN-free" initial SIMD vec, than no NaNs will be added in the inner SIMD loop => we have an implementation of np.nanargmin / np.nanargmax.

    How should we handle NaNs

    Ideally we support, just like numpy, two variants of the argmim / argmax algorithm:

    1. ignoring NaNs: as the current version does (if we fix the TODO above) -> corresponds to np.nanargmin / np.nanargmax
    2. returning first NaN index once there is one present in the data -> corresponds to np.argmin / np.argmax

    we can serve both functionalities by adding a new function to the ArgMinMax trait and let it default for non-float datatypes to the current implementation.

    opened by jvdd 2
Releases(v0.4.0)
  • v0.4.0(Feb 13, 2023)

    This release improves the underlying implementation by replacing ndarray::ArrayView1 with the more flexible (and zero-dependency) &[T] slice. This new implementation enhances the library's interoperability with other array-like data types and crates.

    As of now importing the ArgMinMax trait adds the .argminmax function to:

    • slice (&[T])
    • Vec

    Using optional features, the .argminmax function can be added to:

    P.S.: To include the amazing @CodSpeedHQ continuous performance monitoring in our CI we had to add the "half" feature as default-feature (see https://github.com/CodSpeedHQ/codspeed-rust/issues/1). As soon as this issue is resolved we plan to remove again "half" from the default-features.

    What's Changed

    • :rabbit: add codspeed benchmarking to ci/cd by @jvdd in https://github.com/jvdd/argminmax/pull/15
    • :tea: use slice internally + implement for various types by @jvdd in https://github.com/jvdd/argminmax/pull/13
    • QOL improvements by @varon in https://github.com/jvdd/argminmax/pull/17

    New Contributors

    • @varon made their first contribution in https://github.com/jvdd/argminmax/pull/17

    Full Changelog: https://github.com/jvdd/argminmax/compare/v0.3.1...v0.4.0

    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Feb 5, 2023)

    This is the latest release that uses ndarray::ArrayView1 as underlying and mandatory interface. In future releases, we will switch to using slices internally and offer optional compatibility with other data types, including ndarray::ArrayView1

    P.S.: This release and tag were created after we had already implemented slices as the internal interface. The tag references the latest commit that includes the ndarray::ArrayView1 implementation, which corresponds to v0.3.1 of the Rust crate.

    Source code(tar.gz)
    Source code(zip)
Owner
Jeroen Van Der Donckt
Research Engineer
Jeroen Van Der Donckt
A Rust attribute macro to limit a function's number of runs over a specified period of time

throttle_my_fn: A Rust attribute macro to throttle the execution of functions throttle_my_fn is a Rust attribute macro to limit a function's number of

Fred Morcos 8 Dec 3, 2022
A fast and cross-platform Signed Distance Function (SDF) viewer, easily integrated with your SDF library.

SDF Viewer (demo below) A fast and cross-platform Signed Distance Function (SDF) viewer, easily integrated with your SDF library. A Signed Distance Fu

null 39 Dec 21, 2022
Cleora AI is a general-purpose model for efficient, scalable learning of stable and inductive entity embeddings for heterogeneous relational data.

Cleora Cleora is a genus of moths in the family Geometridae. Their scientific name derives from the Ancient Greek geo γῆ or γαῖα "the earth", and metr

Synerise 405 Dec 20, 2022
Narwhal and Tusk A DAG-based Mempool and Efficient BFT Consensus.

This repo contains a prototype of Narwhal and Tusk. It supplements the paper Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus.

Facebook Research 134 Dec 8, 2022
HNSW ANN from the paper "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"

hnsw Hierarchical Navigable Small World Graph for fast ANN search Enable the serde feature to serialize and deserialize HNSW. Tips A good default for

Rust Computer Vision 93 Dec 30, 2022
A Rust🦀 implementation of CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning

craftml-rs A Rust implementation of CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning (Siblini et al., 2018). Perf

Tom Dong 15 Nov 6, 2022
🚀 efficient approximate nearest neighbor search algorithm collections library written in Rust 🦀 .

?? efficient approximate nearest neighbor search algorithm collections library written in Rust ?? .

Hora-Search 2.3k Jan 3, 2023
Efficient ML solutions for long-tailed demands.

MegFlow MegFlow 是一个面向视觉应用的流式计算框架, 目标是简单、高性能、帮助机器学习应用快速落地。 Features 基于 async-std[features=tokio1] 的高效异步运行时调度器 简洁的基于 toml 的建图描述格式 支持静态、动态、共享子图 支持 Rust/P

旷视天元 MegEngine 371 Dec 21, 2022
An efficient implementation of Partitioned Label Trees & its variations for extreme multi-label classification

Omikuji An efficient implementation of Partitioned Label Trees (Prabhu et al., 2018) and its variations for extreme multi-label classification, writte

Tom Dong 73 Nov 7, 2022
1️⃣ el lisp number uno - one lisp to rule them all 🏆

luno el lisp number uno luno is the one lisp to rule them all. Still experimental, do not use it in production yet. goals embeddable small size simple

Eva Pace 3 Apr 25, 2022
argmax is a library that allows Rust applications to avoid Argument list too long errors (E2BIG) by providing a std::process::Command wrapper with a

argmax argmax is a library that allows Rust applications to avoid Argument list too long errors (E2BIG) by providing a std::process::Command wrapper w

David Peter 22 Nov 20, 2022
Demonstration of flexible function calls in Rust with function overloading and optional arguments

Table of Contents Table of Contents flexible-fn-rs What is this trying to demo? How is the code structured? Named/Unnamed and Optional arguments Mecha

Tien Duc (TiDu) Nguyen 81 Nov 3, 2022
A neural network model that can approximate any non-linear function by using the random search algorithm for the optimization of the loss function.

random_search A neural network model that can approximate any non-linear function by using the random search algorithm for the optimization of the los

ph04 2 Apr 1, 2022
Fast Function Dispatch: Improving the performance of Rust's dynamic function calls

Fast Function Dispatch: Improving the performance of Rust's dynamic function calls A safe, pragmatic toolkit for high-performance virtual function cal

Joshua Barretto 30 Apr 12, 2024
A priority queue for Rust with efficient change function.

PriorityQueue This crate implements a Priority Queue with a function to change the priority of an object. Priority and items are stored in an IndexMap

null 139 Dec 30, 2022
Rust cache structures and easy function memoization

cached Caching structures and simplified function memoization cached provides implementations of several caching structures as well as a handy macro f

James Kominick 996 Jan 7, 2023
An OpenGL function pointer loader for Rust

gl-rs Overview This repository contains the necessary building blocks for OpenGL wrapper libraries. For more information on each crate, see their resp

Brendan Zabarauskas 621 Dec 17, 2022
A Quest to Find a Highly Compressed Emoji :shortcode: Lookup Function

Highly Compressed Emoji Shortcode Mapping An experiment to try and find a highly compressed representation of the entire unicode shortcodes-to-emoji m

Daniel Prilik 13 Nov 16, 2021
Deno Foreign Function Interface.

deno_plugin_ffi (WIP & Need Help) Deno Foreign Function Interface. deno_ffi is a Deno plugin for loading and calling dynamic libraries using pure Java

Deno Foreign Function Interface 37 Aug 18, 2022
Easy-to-use optional function arguments for Rust

OptArgs uses const generics to ensure compile-time correctness. I've taken the liberty of expanding and humanizing the macros in the reference examples.

Jonathan Kelley 37 Nov 18, 2022