k-Medoids clustering in Rust with the FasterPAM algorithm

Overview

k-Medoids Clustering in Rust with FasterPAM

This Rust crate implements k-medoids clustering with PAM. It can be used with arbitrary dissimilarites, as it requires a dissimilarity matrix as input.

For further details on the implemented algorithm FasterPAM, see:

Erich Schubert, Peter J. Rousseeuw
Fast and Eager k-Medoids Clustering:
O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms
Information Systems (101), 2021, 101804
https://doi.org/10.1016/j.is.2021.101804 (open access)

an earlier (slower, and now obsolete) version was published as:

Erich Schubert, Peter J. Rousseeuw:
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms
In: 12th International Conference on Similarity Search and Applications (SISAP 2019), 171-187.
https://doi.org/10.1007/978-3-030-32047-8_16
Preprint: https://arxiv.org/abs/1810.05691

This is a port of the original Java code from ELKI to Rust.

If you use this code in scientific work, please cite above papers. Thank you.

Example

let dissim = ndarray::arr2(&[[0,1,2,3],[1,0,4,5],[2,4,0,6],[3,5,6,0]]);
let mut meds = kmedoids::random_initialization(4, 2, &mut rand::thread_rng());
let (loss, assingment, n_iter, n_swap): (f64, _, _, _) = kmedoids::fasterpam(&dissim, &mut meds, 100);
println!("Loss is: {}", loss);

Note that:

  • you need to specify the "output" data type of loss -- chose a signed type with sufficient precision. For example for unsigned distances using u32, it may be better to use i64 to compute the loss.
  • the input distance type needs to be convertible into the output data type via Into

Implemented Algorithms

  • FasterPAM (Schubert and Rousseeuw, 2020, 2021)
  • FasterPAM with an integrated additional shuffling step
  • FastPAM1 (Schubert and Rousseeuw, 2019, 2021)
  • PAM (Kaufman and Rousseeuw, 1987) with BUILD and SWAP
  • Alternating optimization (k-means-style algorithm)
  • Silhouette index for evaluation (Rousseeuw, 1987)

Note that the k-means-like algorithm for k-medoids tends to find much worse solutions.

Rust Dependencies

  • num-traits for supporting different numeric types
  • ndarray for arrays (optional)
  • rand for random initialization (optional)

License: GPL-3 or later

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

FAQ: Why GPL and not Apache/MIT/BSD?

Because copyleft software like Linux is what built the open-source community.

Tit for tat: you get to use my code, I get to use your code.

You might also like...
Execute genetic algorithm (GA) simulations in a customizable and extensible way.

genevo genevo provides building blocks to run simulations of optimization and search problems using genetic algorithms (GA). The vision for genevo is

An Implementation of the Context Tree Weighting (CTW) Sequence Prediction Algorithm

Context Tree Weighting (CTW) CTW is a lightweight, practical and well performing sequence prediction algorithm discovered by Frans Willems, Yuri Shtar

A neural network model that can approximate any non-linear function by using the random search algorithm for the optimization of the loss function.

random_search A neural network model that can approximate any non-linear function by using the random search algorithm for the optimization of the los

Msgpack serialization/deserialization library for Python, written in Rust using PyO3, and rust-msgpack. Reboot of orjson. msgpack.org[Python]

ormsgpack ormsgpack is a fast msgpack library for Python. It is a fork/reboot of orjson It serializes faster than msgpack-python and deserializes a bi

Practice repo for learning Rust. Currently going through "Rust for JavaScript Developers" course.

rust-practice 🦀 Practice repo for learning Rust. Directories /rust-for-js-dev Files directed towards "Rust for JavaScript Developers" course. Thank y

A Rust library with homemade machine learning models to classify the MNIST dataset. Built in an attempt to get familiar with advanced Rust concepts.

mnist-classifier Ideas UPDATED: Finish CLI Flags Parallelize conputationally intensive functions Class-based naive bayes README Image parsing Confusio

🦀Rust Turkiye - Rust Dersleri

Rust Turkiye - Rust Dersleri CURIOSITY - Featuring Richard Feynman Bu repo Rust Turkiye tarafindan duzenlenen Rust Dersleri egitiminin alistirma ve ko

A Rust machine learning framework.

Linfa linfa (Italian) / sap (English): The vital circulating fluid of a plant. linfa aims to provide a comprehensive toolkit to build Machine Learning

Machine Learning library for Rust

rusty-machine This library is no longer actively maintained. The crate is currently on version 0.5.4. Read the API Documentation to learn more. And he

Comments
  • a Even faster implementation

    a Even faster implementation

    Dear rust-k-mediods team,

    This is amazing implementation and many thanks for this. I noticed an even faster one, which was based on fastPAM but 4 times faster. It is not an exact solution but approximate solution. However, it is as accuracy as the exact solution in practice. Check here: https://proceedings.neurips.cc/paper/2020/file/73b817090081cef1bca77232f4532c5d-Paper.pdf

    Any thoughts and potential to also implement this? It would be even useful for very large dataset.

    Thanks,

    Jianshu

    opened by jianshu93 1
Releases(v0.4.1)
  • v0.4.1(Sep 24, 2022)

    Added FastMSC, FasterMSC, PAMMEDSIL and PAMSIL clustering, to optimize the (Medoid) Silhouette directly.

    For details on these methods, please see the upcoming publication Lars Lenssen, Erich Schubert: Clustering by Direct Optimization of the Medoid Silhouette In: 15th International Conference on Similarity Search and Applications (SISAP 2022) https://doi.org/10.1007/978-3-031-17849-8_15

    Source code(tar.gz)
    Source code(zip)
Owner
Erich Schubert
Professor of Data Mining. Working on: Outlier Detection; Cluster Analysis; Text Mining; Intrinsic Dimensionality; High-Dimensional Data
Erich Schubert
A rust library inspired by kDDBSCAN clustering algorithm

kddbscan-rs Rust implementation of the kddbscan clustering algorithm. From the authors of kDDBSCAN algorithm. Due to the adoption of global parameters

WhizSid 2 Apr 28, 2021
Rust implementation for DBSCANSD, a trajectory clustering algorithm.

DBSCANSD Rust implementation for DBSCANSD, a trajectory clustering algorithm. Brief Introduction DBSCANSD (Density-Based Spatial Clustering of Applica

Nick Gu 2 Mar 14, 2021
Fast hierarchical agglomerative clustering in Rust.

kodama This crate provides a fast implementation of agglomerative hierarchical clustering. This library is released under the MIT license. The ideas a

Diffeo 61 Oct 7, 2022
A Rust🦀 implementation of CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning

craftml-rs A Rust implementation of CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning (Siblini et al., 2018). Perf

Tom Dong 15 Nov 6, 2022
DBSCAN and OPTICS clustering algorithms.

petal-clustering A collection of clustering algorithms. Currently this crate provides DBSCAN and OPTICS. Examples The following example shows how to c

Petabi 15 Dec 15, 2022
🚀 efficient approximate nearest neighbor search algorithm collections library written in Rust 🦀 .

?? efficient approximate nearest neighbor search algorithm collections library written in Rust ?? .

Hora-Search 2.3k Jan 3, 2023
Label Propagation Algorithm by Rust. Label propagation (LP) is graph-based semi-supervised learning (SSL). LGC and CAMLP have been implemented.

label-propagation-rs Label Propagation Algorithm by Rust. Label propagation (LP) is graph-based semi-supervised learning (SSL). A simple LGC and a mor

vaaaaanquish 4 Sep 15, 2021
An implementation of the Pair Adjacent Violators algorithm for isotonic regression in Rust

Pair Adjacent Violators for Rust Overview An implementation of the Pair Adjacent Violators algorithm for isotonic regression. Note this algorithm is a

Ian Clarke 3 Dec 25, 2021
Rust port of the extended isolation forest algorithm for anomaly detection

Extended Isolation Forest This is a rust port of the anomaly detection algorithm described in Extended Isolation Forest and implemented in https://git

Nico Mandery 6 Oct 21, 2022
TopK algorithm implementation in Rust (Filtered Space-Saving)

TopK TopK algorithm implementation in Rust. This crate currently provides the Filtered Space-Saving algorithm. Version numbers follow the semver conve

null 6 Feb 24, 2023