Generic k-means implementation written in Rust

Overview

RKM - Rust k-means

A simple Rust implementation of the k-means clustering algorithm based on a C++ implementation, dkm.

This implementation is generic and will accept any type that satisfies the Value trait requirements. At a minimum, Rust's built-in floating point types (f32 and f64) are supported. Rayon is used for parallelism, improving scalability at the cost of some performance on small data sets.

The parallel feature enables parallelism to speed up the algorithm in complex cases. The parallel algorithm may be slower than the serial algorithm for small data sets, but is much faster for data sets with high dimensionality. Make sure you benchmark your use case with both configurations before deciding which to use.
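To opt in to the parallel feature, enable it in your Cargo manifest. A minimal sketch (the version number shown is illustrative; check crates.io for the current release):

```toml
[dependencies]
# Enable the optional `parallel` feature of rkm.
# Pin to whichever version you actually depend on.
rkm = { version = "0.8", features = ["parallel"] }
```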

Known to compile against Rust stable 1.41.0.

Usage

Calculate the k-means clusters for a set of data by calling rkm::kmeans_lloyd with your dataset in a 2D ndarray array and the number of clusters you would like to segment the data into. The return value will be a tuple containing the cluster means/centroids (as a 2D ndarray) and a Vec of indices that map each of the input data points to an element of the means array.

e.g.

// load an `ndarray::Array2` containing the points
let data = read_test_data();
// split the data into 3 clusters, iterating until the algorithm converges
let (means, clusters) = rkm::kmeans_lloyd(&data.view(), 3);
println!(
    "data:\n{:?}\nmeans:\n{:?}\nclusters:\n{:?}",
    data, means, clusters
);

Another variation of the algorithm is available via the rkm::kmeans_lloyd_with_config function, which takes a reference to an additional Config struct. This config struct can be used to change the termination behavior of the algorithm, or to change the random seed used for initialization.

let config = Config::from(Some(7u128.to_le_bytes()), None, None);
let (means, clusters) = kmeans_lloyd_with_config(&data.view(), 3, &config);

You can change the termination behavior of the algorithm by specifying a maximum number of iterations to terminate at.

let config = Config::from(None, Some(10), None);
let (means, clusters) = kmeans_lloyd_with_config(&data.view(), 3, &config);

The other termination condition that can be set is the minimum delta in mean positions observed between iterations; if none of the means change by more than this amount between successive iterations, the algorithm terminates.

let config = Config::from(None, None, Some(12.7f32));
let (means, clusters) = kmeans_lloyd_with_config(&data.view(), 3, &config);

You may also specify both limits; the first one to be triggered will stop the algorithm. The default termination criterion for rkm::kmeans_lloyd, or when an empty Config is used with rkm::kmeans_lloyd_with_config, is to terminate when the means don't change at all between iterations. It may be prudent to set an arbitrarily high iteration limit to prevent infinite oscillation between a small number of states in some non-convergent edge cases.
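The two criteria combine as a guarded iteration loop: stop as soon as either the iteration cap is hit or no mean moves by more than the delta threshold. As an illustration only (a self-contained sketch of the termination logic, not rkm's actual internals), with 1-D means for brevity:

```rust
/// Run an update step repeatedly until either `max_iter` iterations elapse
/// or no mean moves by more than `min_delta` between successive iterations.
/// Returns the final means and the number of iterations performed.
fn run_until_converged<F>(
    mut means: Vec<f64>,
    max_iter: usize,
    min_delta: f64,
    mut step: F,
) -> (Vec<f64>, usize)
where
    F: FnMut(&[f64]) -> Vec<f64>,
{
    for iter in 0..max_iter {
        let next = step(&means);
        // Largest movement of any mean this iteration.
        let delta = means
            .iter()
            .zip(&next)
            .map(|(a, b)| (a - b).abs())
            .fold(0.0_f64, f64::max);
        means = next;
        if delta <= min_delta {
            // Delta criterion fired first.
            return (means, iter + 1);
        }
    }
    // Iteration cap fired first.
    (means, max_iter)
}

fn main() {
    // Each step halves the distance to 1.0, so the per-iteration delta
    // shrinks geometrically and the delta criterion triggers well before
    // the iteration cap of 100.
    let (means, iters) = run_until_converged(vec![0.0], 100, 1e-6, |m| {
        m.iter().map(|x| x + (1.0 - x) / 2.0).collect()
    });
    assert!(iters < 100);
    assert!((means[0] - 1.0).abs() < 1e-5);
    println!("converged after {} iterations", iters);
}
```

Setting both limits in rkm's Config gives the same behavior: whichever condition is met first ends the run.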

See examples/example.rs for a simple usage example.

Data

A small set of benchmarks for this library is included in benches/bench.rs. The data sets are as follows:

iris.data.csv natural data taken from measurements of different iris plants. 150 points, 2 dimensions, 3 clusters. Source: UCI machine learning repository.

s1.data.csv synthetic data. 5000 points, 2 dimensions, 15 clusters. Source: P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.

birch3.data.csv synthetic data, large set. 100000 points, 2 dimensions, 100 clusters. Source: Zhang et al., "BIRCH: A new data clustering algorithm and its applications", Data Mining and Knowledge Discovery, 1 (2), 141-182, 1997.

dim128.data.csv synthetic data with high dimensionality. 1024 points, 128 dimensions, 16 clusters. Source: P. Fränti, O. Virmajoki and V. Hautamäki, "Fast agglomerative clustering using a k-nearest neighbor graph", IEEE Trans. on Pattern Analysis and Machine Intelligence, 28 (11), 1875-1881, November 2006.

Compared to dkm this implementation is slower for the small iris and s1 data sets, but faster for the dim128 and birch3 data sets.

Licensing

This code is licensed under the MIT license.

You might also like...
A naive DBSCAN implementation in Rust

DBSCAN Density-Based Spatial Clustering of Applications with Noise Wikipedia link DBSCAN is a density-based clustering algorithm: given a set of point

A random forest implementation in Rust

randomforest A random forest implementation in Rust. Examples use randomforest::criterion::Mse; use randomforest::RandomForestRegressorOptions; use ra

Rust implementation of multi-index hashing for neighbor searches on binary codes in the Hamming space

mih-rs Rust implementation of multi-index hashing (MIH) for neighbor searches on binary codes in the Hamming space, described in the paper Norouzi, Pu

kdtree implementation for rust.

kdtree-rust kdtree implementation for rust. Implementation uses sliding midpoint variation of the tree. More Info here Implementation uses single Vec

Rust implementation of user-based collaborative filtering

Rucommender Recommendation system written in Rust Overview An implementation in Rust of a collaborative filtering recommendations algorithm with a use

This repository features a simple Kalman filter and RTS smoother (KFS) implementation in Rust by using the ndarray library.

Kalman filter and RTS smoother in Rust (ndarray) This repository features a simple Kalman filter and RTS smoother (KFS) implementation in Rust by usin

Python+Rust implementation of the Probabilistic Principal Component Analysis model

Probabilistic Principal Component Analysis (PPCA) model This project implements a PPCA model implemented in Rust for Python using pyO3 and maturin. In

TopK algorithm implementation in Rust (Filtered Space-Saving)

TopK TopK algorithm implementation in Rust. This crate currently provides the Filtered Space-Saving algorithm. Version numbers follow the semver conve

Rust+OpenCL+AVX2 implementation of LLaMA inference code

RLLaMA RLLaMA is a pure Rust implementation of LLaMA large language model inference.. Supported features Uses either f16 and f32 weights. LLaMA-7B, LL

Comments
  • Failure to converge and panic under `initialize_plusplus`

I'm currently running k-means on a dataset generated from radiation data. From time to time the algorithm fails to converge (k-means typically takes several seconds for the ~2 million points; I have let it run overnight).

    On the other hand I have a second issue with the same data set, it will sometimes fail with:

    stack backtrace:
       0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
                 at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
       1: std::sys_common::backtrace::print
                 at libstd/sys_common/backtrace.rs:71
                 at libstd/sys_common/backtrace.rs:59
       2: std::panicking::default_hook::{{closure}}
                 at libstd/panicking.rs:211
       3: std::panicking::default_hook
                 at libstd/panicking.rs:227
       4: std::panicking::rust_panic_with_hook
                 at libstd/panicking.rs:475
       5: std::panicking::begin_panic
                 at /checkout/src/libstd/panicking.rs:409
       6: <rand::distributions::WeightedChoice<'a, T>>::new
                 at /home/omnipotententity/work/psd-dsp/<panic macros>:3
       7: rkm::initialize_plusplus
                 at /home/omnipotententity/.cargo/registry/src/github.com-1ecc6299db9ec823/rkm-0.3.0/src/lib.rs:77
       8: rkm::kmeans_lloyd
                 at /home/omnipotententity/.cargo/registry/src/github.com-1ecc6299db9ec823/rkm-0.3.0/src/lib.rs:143
       9: psd_dsp::main
                 at src/main.rs:206
      10: std::rt::lang_start::{{closure}}
                 at /checkout/src/libstd/rt.rs:74
      11: std::panicking::try::do_call
                 at libstd/rt.rs:59
                 at libstd/panicking.rs:310
      12: __rust_maybe_catch_panic
                 at libpanic_unwind/lib.rs:105
      13: std::rt::lang_start_internal
                 at libstd/panicking.rs:289
                 at libstd/panic.rs:392
                 at libstd/rt.rs:58
      14: std::rt::lang_start
                 at /checkout/src/libstd/rt.rs:74
      15: main
      16: __libc_start_main
      17: _start
    

    I'm currently using the version referenced on crates.io, 0.3.0.

    opened by OmnipotentEntity 7
  • Added CI..

    Added a .travis.yml file, formatted the code, removed some unnecessary return statements and changed x..(x+1) to x..=x. Upon merging please do change the README.md to show the build status.

    opened by emmanuelantony2000 2
  • make initialization deterministic given seed

    Also fixed a couple warnings along the way.

    I needed a way to seed the initialization deterministically, this does so here.

I'll probably also be implementing max_iter or history summarization to prevent e.g. getting trapped between bistable solutions.

    opened by nlhepler 1
Owner
Nick Sarten
Generic decision trees for rust

Stamm Stamm is a rust library for creating decision trees and random forests in a very general way. Decision trees are used in machine learning for cl

null 10 Jul 28, 2022
Barnes-Hut t-SNE implementation written in Rust.

bhtsne Barnes-Hut implementation of t-SNE written in Rust. The algorithm is described with fine detail in this paper by Laurens van der Maaten. Instal

Francesco Iannelli 48 Dec 27, 2022
Rust implementation of real-coded GA for solving optimization problems and training of neural networks

revonet Rust implementation of real-coded genetic algorithm for solving optimization problems and training of neural networks. The latter is also know

Yury Tsoy 19 Aug 11, 2022
Instance Distance is a fast pure-Rust implementation of the Hierarchical Navigable Small Worlds paper

Fast approximate nearest neighbor searching in Rust, based on HNSW index

Instant Domain Search, Inc. 135 Dec 24, 2022
A real-time implementation of "Ray Tracing in One Weekend" using nannou and rust-gpu.

Real-time Ray Tracing with nannou & rust-gpu An attempt at a real-time implementation of "Ray Tracing in One Weekend" by Peter Shirley. This was a per

null 89 Dec 23, 2022
A neural network, and tensor dynamic automatic differentiation implementation for Rust.

Corgi A neural network, and tensor dynamic automatic differentiation implementation for Rust. BLAS The BLAS feature can be enabled, and requires CBLAS

Patrick Song 20 Nov 7, 2022
Flexible, reusable reinforcement learning (Q learning) implementation in Rust

Rurel Rurel is a flexible, reusable reinforcement learning (Q learning) implementation in Rust. Release documentation In Cargo.toml: rurel = "0.2.0"

Milan Boers 60 Dec 29, 2022
A Rust🦀 implementation of CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning

craftml-rs A Rust implementation of CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning (Siblini et al., 2018). Perf

Tom Dong 15 Nov 6, 2022
An implementation of the Pair Adjacent Violators algorithm for isotonic regression in Rust

Pair Adjacent Violators for Rust Overview An implementation of the Pair Adjacent Violators algorithm for isotonic regression. Note this algorithm is a

Ian Clarke 3 Dec 25, 2021
Rust implementation for DBSCANSD, a trajectory clustering algorithm.

DBSCANSD Rust implementation for DBSCANSD, a trajectory clustering algorithm. Brief Introduction DBSCANSD (Density-Based Spatial Clustering of Applica

Nick Gu 2 Mar 14, 2021