Rust library for Self Organising Maps (SOM).

Avinash Shenoy

Last update: Oct 17, 2022


Overview

RusticSOM


Using this Crate

Add rusticsom as a dependency in Cargo.toml

[dependencies]
rusticsom = "1.1.0"

Include the crate

use rusticsom::SOM;

API

Use SOM::create to create an SOM object. The call below creates an SOM with length x breadth cells, where each cell holds a weight vector with inputs elements (the length of each input sample).

pub fn create(length: usize, breadth: usize, inputs: usize, randomize: bool, learning_rate: Option<f32>, sigma: Option<f32>, decay_function: Option<fn(f32, u32, u32) -> f64>, neighbourhood_function: Option<fn((usize, usize), (usize, usize), f32) -> Array2<f64>>) -> SOM { ... }

randomize is a flag, which, if true, initializes the weights of each cell to random, small, floating-point values.

learning_rate, optional, is the learning_rate of the SOM; by default it will be 0.5.

sigma, optional, is the spread of the neighbourhood function; by default it will be 1.0.

decay_function, optional, is a pointer to a function that takes 3 parameters of types f32, u32, u32 and returns an f64. This function is used to "decay" both the learning_rate and sigma. By default it is

new_value = old_value / (1 + current_iteration/total_iterations)
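As an illustration of what can be passed as decay_function, here is a minimal sketch of a function with the required signature. The name inverse_decay and the parameter names are hypothetical, and the choice to do the division in floating point is an assumption of this sketch, not a statement about the crate's built-in default.

```rust
/// Hypothetical custom decay with the signature `fn(f32, u32, u32) -> f64`.
/// It follows the shape of the default formula above, with the ratio
/// computed in floating point (an assumption of this sketch).
fn inverse_decay(value: f32, current_iteration: u32, total_iterations: u32) -> f64 {
    // Guard against a zero iteration count so the ratio is always defined.
    let progress = f64::from(current_iteration) / f64::from(total_iterations.max(1));
    f64::from(value) / (1.0 + progress)
}

fn main() {
    // The function coerces to the pointer type the API expects.
    let decay: fn(f32, u32, u32) -> f64 = inverse_decay;
    println!("start: {}", decay(0.5, 0, 100));
    println!("end:   {}", decay(0.5, 100, 100));
}
```

Under these assumptions a value of 0.5 is halved by the final iteration. Such a function would be handed to SOM::create as Some(inverse_decay).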

neighbourhood_function, optional, is also a pointer to a function that takes 3 parameters: a tuple of type (usize, usize) representing the size of the SOM, another tuple of type (usize, usize) representing the position of the winner neuron, and an f32 representing sigma. It returns a 2D Array containing the weights of the neighbours of the winning neuron, i.e., centered at the winner. By default, the Gaussian function is used, which returns a "Gaussian centered at the winner neuron".
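To make the idea of a "Gaussian centered at the winner neuron" concrete, the sketch below computes such a grid using only the standard library. It is not the crate's implementation: the function name is hypothetical, it returns a nested Vec instead of the ndarray Array2<f64> the real signature requires, and it assumes sigma is greater than zero.

```rust
/// Hypothetical stand-in for a neighbourhood function. The crate's signature
/// returns an ndarray `Array2<f64>`; a nested Vec is used here so the sketch
/// compiles with the standard library alone. Assumes `sigma > 0`.
fn gaussian_neighbourhood(
    size: (usize, usize),
    winner: (usize, usize),
    sigma: f32,
) -> Vec<Vec<f64>> {
    let two_sigma_sq = 2.0 * f64::from(sigma) * f64::from(sigma);
    let mut weights = vec![vec![0.0; size.1]; size.0];
    for (i, row) in weights.iter_mut().enumerate() {
        for (j, w) in row.iter_mut().enumerate() {
            // Squared grid distance from this cell to the winner.
            let dx = i as f64 - winner.0 as f64;
            let dy = j as f64 - winner.1 as f64;
            *w = (-(dx * dx + dy * dy) / two_sigma_sq).exp();
        }
    }
    weights
}

fn main() {
    let g = gaussian_neighbourhood((5, 5), (2, 2), 1.0);
    for row in &g {
        println!("{:.3?}", row);
    }
}
```

The winner cell gets weight 1 and the weight falls off with grid distance, so cells close to the winner are pulled towards the sample more strongly than distant ones.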

pub fn from_json(serialized: &str, decay_function: Option<fn(f32, u32, u32) -> f64>, neighbourhood_function: Option<fn((usize, usize), (usize, usize), f32) -> Array2<f64>>) -> serde_json::Result<SOM> { ... }

This function creates an SOM from JSON data previously exported with SOM::to_json().

Use SOM_Object.train_random() to train the SOM with the input dataset, where samples from the input dataset are picked in a random order.

pub fn train_random(&mut self, data: Array2<f64>, iterations: u32) { ... }

Samples (rows) from the 2D Array data are picked randomly, and the SOM is trained for the number of iterations given by iterations.

Use SOM_Object.train_batch() to train the SOM with the input dataset, where samples from the input dataset are picked in a sequential order.

pub fn train_batch(&mut self, data: Array2<f64>, iterations: u32) { ... }

Samples (rows) from the 2D Array data are picked sequentially, and the SOM is trained for the number of iterations given by iterations.
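The difference between the two training modes is only the order in which rows are visited. As a rough sketch, assuming one sample is presented per iteration and that a sequential pass wraps around when iterations exceeds the number of rows, the visiting order could be computed as below. The helper name is hypothetical, and whether the crate wraps in exactly this way is not stated here.

```rust
/// Hypothetical helper: which row a sequential pass would use at `iteration`,
/// wrapping around once every row has been visited.
fn sequential_row(iteration: u32, rows: usize) -> usize {
    assert!(rows > 0, "dataset must contain at least one sample");
    iteration as usize % rows
}

fn main() {
    // Visiting order for 3 samples over 7 iterations.
    let order: Vec<usize> = (0..7).map(|it| sequential_row(it, 3)).collect();
    println!("{:?}", order);
}
```

In the random mode, the index would instead be drawn from a random number generator on every iteration.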

Use SOM_Object.winner() to find the winning neuron for a given sample.

pub fn winner(&mut self, elem: Array1<f64>) -> (usize, usize) { ... }

This function must be called with an SOM object.

Requires one parameter, a 1D Array of f64s representing the input sample.

Returns a tuple (usize, usize) representing the x and y coordinates of the winning neuron in the SOM.

Use SOM_Object.winner_dist() to find the winning neuron for a given sample, and its distance from this winner neuron.

pub fn winner_dist(&mut self, elem: Array1<f64>) -> ((usize, usize), f64) { ... }

This function must be called with an SOM object.

Requires one parameter, a 1D Array of f64s representing the input sample.

Returns a tuple (usize, usize) representing the x and y coordinates of the winning neuron in the SOM.

Also returns an f64 representing the distance of the input sample from this winner neuron.
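Conceptually, finding the winner means locating the cell whose weight vector is closest to the sample. The following sketch shows that search with Euclidean distance over a nested Vec. The function name and data layout are hypothetical, and the choice of Euclidean distance is an assumption of this sketch. It is meant to illustrate the idea behind winner() and winner_dist(), not to reproduce the crate's code.

```rust
/// Hypothetical best-matching-unit search over a grid of weight vectors.
/// `weights[i][j]` is the weight vector of cell (i, j). Returns the position
/// of the closest cell and its Euclidean distance to `sample`.
fn find_winner(weights: &[Vec<Vec<f64>>], sample: &[f64]) -> ((usize, usize), f64) {
    let mut best = ((0, 0), f64::INFINITY);
    for (i, row) in weights.iter().enumerate() {
        for (j, cell) in row.iter().enumerate() {
            let dist = cell
                .iter()
                .zip(sample)
                .map(|(w, x)| (w - x).powi(2))
                .sum::<f64>()
                .sqrt();
            // Keep the first cell with the smallest distance seen so far.
            if dist < best.1 {
                best = ((i, j), dist);
            }
        }
    }
    best
}

fn main() {
    // A 2 x 2 grid with 2 inputs per cell.
    let weights = vec![
        vec![vec![0.0, 0.0], vec![0.0, 1.0]],
        vec![vec![1.0, 0.0], vec![1.0, 1.0]],
    ];
    let (pos, dist) = find_winner(&weights, &[0.9, 0.1]);
    println!("winner = {:?}, distance = {:.3}", pos, dist);
}
```

Here the sample [0.9, 0.1] is closest to the cell at (1, 0), whose weights are [1.0, 0.0].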

pub fn activation_response(&self) -> ArrayView2<usize> { ... }

This function returns the activation map of the SOM. The activation map is a 2D Array where each cell at (i, j) represents the number of times the (i, j) cell of the SOM was picked to be the winner neuron.
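The bookkeeping behind an activation map can be sketched as a grid of counters that is incremented each time a cell wins. The function below is a hypothetical, standard-library-only illustration of that idea. It takes a list of winner positions as input, which is an assumption of this sketch rather than how the crate tracks them.

```rust
/// Hypothetical bookkeeping: count how often each cell was the winner.
/// Every position in `winners` must lie inside a grid of the given `size`.
fn activation_counts(size: (usize, usize), winners: &[(usize, usize)]) -> Vec<Vec<usize>> {
    let mut counts = vec![vec![0usize; size.1]; size.0];
    for &(i, j) in winners {
        counts[i][j] += 1;
    }
    counts
}

fn main() {
    let winners = [(0, 0), (1, 1), (0, 0)];
    let map = activation_counts((2, 2), &winners);
    println!("{:?}", map);
}
```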

pub fn get_size(&self) -> (usize, usize)

This function returns a tuple representing the size of the SOM. Format is (length, breadth).

pub fn distance_map(self) -> Array2<f64> { ... }

Returns the distance map of the SOM, i.e., the normalized distance of every neuron to every other neuron.
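One possible reading of this description is: for each neuron, sum the Euclidean distances from its weight vector to every other neuron's weight vector, then divide all sums by the largest one. The sketch below implements that reading with nested Vecs. The names and the exact normalization are assumptions, so treat it as an illustration rather than the crate's definition.

```rust
/// Euclidean distance between two weight vectors of equal length.
fn euclid(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Hypothetical distance map: for each cell, the sum of distances to all
/// cells (its own contributes zero), scaled so that the largest entry is 1.
fn distance_map(weights: &[Vec<Vec<f64>>]) -> Vec<Vec<f64>> {
    let mut map: Vec<Vec<f64>> = weights
        .iter()
        .map(|row| {
            row.iter()
                .map(|cell| {
                    weights
                        .iter()
                        .flatten()
                        .map(|other| euclid(cell, other))
                        .sum::<f64>()
                })
                .collect::<Vec<f64>>()
        })
        .collect();
    let max = map.iter().flatten().cloned().fold(0.0, f64::max);
    // Skip normalization when all neurons are identical (max is zero).
    if max > 0.0 {
        for v in map.iter_mut().flatten() {
            *v /= max;
        }
    }
    map
}

fn main() {
    // A 1 x 3 grid with a single input per cell.
    let weights = vec![vec![vec![0.0], vec![1.0], vec![3.0]]];
    println!("{:?}", distance_map(&weights));
}
```

Cells with large values are far from the rest of the map in weight space, which is what makes such a map useful for spotting cluster boundaries.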

pub fn to_json(&self) -> serde_json::Result<String> { ... }

Returns the internal SOM data as pretty-printed JSON (using serde_json).

Primary Contributors


	Aditi Srinivas
	Avinash Shenoy

Example

We've tested this crate on the famous iris dataset (present in csv format in the extras folder).

The t_full_test function in /tests/test.rs was used to produce the required output. The following plots were obtained using matplotlib for Python.

Using a 5 x 5 SOM, trained for 250 iterations :

Using a 10 x 10 SOM, trained for 1000 iterations :

Symbol	Represents
Circle	setosa
Square	versicolor
Diamond	virginica

Comments

Add lint and formatting checks to CI

This PR adds two new CI steps: one to check formatting (rustfmt), and another to check for code lints (clippy).

I reformatted the code and fixed all lints. However, I purposely introduced one lint and formatting mistake to ensure the CI steps execute as expected. If these changes are desirable, I will remove the last commit so this PR can merge.

opened by JayKickliter 4
implement serde_json export/import

To reuse trained SOMs later, I needed import/export. Using serde, I have implemented JSON as the import/export format. One simple test is added under tests/.

If you like it, you can include it in mainstream.

While I was at it, I added edition 2018 to Cargo.toml and fixed a few compiler warnings.

opened by gin66 3
[docs] Reformat docstrings to be auto-generated

Hey!

Dunno if you're accepting PRs, but this is a documentation-only change that basically just uses /// instead of // for docstrings, so that the auto-generated documentation on docs.rs will actually display all the existing documentation.

opened by beyarkay 2
Simplify code where possible
This PR is a non-comprehensive¹ effort to:

reduce temporary heap allocations

delete unnecessary code

improve readability

1: I have another branch with more aggressive, but API breaking, optimizations that remove a lot of heap allocations. I can open a follow-up PR for that along with a major version bump.
opened by JayKickliter 1
update ndarray to 0.13

This updates the ndarray dependency to 0.13 for compatibility with minor changes to the public types. I also bumped the version number to 1.1.1 and fixed some unused warnings.

Side note: I noticed the 1.1.0 version did not seem to be published to crates.io, but using the repository URL in the Cargo file is an easy workaround.

opened by masonblier 1
Potential NaNs due to div by zero when normalising
The following lines in the update method:

    for i in 0..self.data.x {
        for j in 0..self.data.y {
            for k in 0..self.data.z {
                self.data.map[[i, j, k]] += (elem[[k]] - self.data.map[[i, j, k]]) * g[[i, j]];
            }
            let norm = norm(self.data.map.index_axis(Axis(0), i).index_axis(Axis(0), j));
            for k in 0..self.data.z {
                self.data.map[[i, j, k]] /= norm;
            }
        }
    }

were causing me issues: norm was ending up as zero, so the division set self.data.map[[i, j, k]] to f64::NAN, which gave funky results later down the line (I've got NaN values in my input features).

I'm not sure what the purpose of the normalization is? I understand that it would ensure each neuron's weights sum to 1, but I can't find where this is recommended.

On my own fork I've wrapped the normalisation in a check that norm > 0, and that seems to have solved the issues, although I'm not sure how valid it is.
opened by beyarkay 0
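The kind of guard described in the issue above can be sketched on a plain slice as follows. This is not the fork's actual code: the function name is hypothetical, and the extra is_finite check goes slightly beyond the norm > 0 test mentioned in the issue.

```rust
/// Scale `weights` to unit Euclidean norm, leaving them untouched when the
/// norm is zero, NaN, or infinite (the cases where dividing would inject NaN).
fn normalize_guarded(weights: &mut [f64]) {
    let norm = weights.iter().map(|w| w * w).sum::<f64>().sqrt();
    // `norm > 0.0` is already false for NaN; `is_finite` also rejects infinity.
    if norm > 0.0 && norm.is_finite() {
        for w in weights.iter_mut() {
            *w /= norm;
        }
    }
}

fn main() {
    let mut ok = [3.0, 4.0];
    normalize_guarded(&mut ok);
    println!("{:?}", ok);

    let mut zeros = [0.0, 0.0];
    normalize_guarded(&mut zeros);
    println!("{:?}", zeros); // still all zeros, no NaN introduced
}
```

Dividing by the Euclidean norm gives each weight vector unit length. Skipping the division for a zero norm leaves that neuron's weights at zero instead of turning them into NaN.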

Breaking changes: increase performance

This PR significantly increases training and winner-lookup speed. The speedup is primarily achieved by reducing heap allocations. However, the cost is that it introduces API-breaking changes.

The methodology was to first add benchmarks to the current code base and then measure the performance delta after every little tweak to the code. In some cases, removing intermediate heap allocations led to performance regressions. In those cases, I left comments explaining why the allocations are necessary.

Because this PR already introduces breaking changes, I added one more: the non-default feature serde-1. This helps with build times for people not interested in serialization. Building with serde-1 enables this crate's old [to, from]_json support. I believe that those functions are out of this crate's scope, but as long as they are disabled by default, I see no harm.

Benchmarks

I first ran cargo bench without any library modifications, and the output below is after rerunning it on the tip of this branch.

Training/Random/10      time:   [68.191 us 68.893 us 69.746 us]
                        thrpt:  [143.38 Kelem/s 145.15 Kelem/s 146.65 Kelem/s]
                 change:
                        time:   [-4.4626% -3.2623% -2.0623%] (p = 0.00 < 0.05)
                        thrpt:  [+2.1057% +3.3724% +4.6710%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
Training/Batch/10       time:   [59.171 us 59.415 us 59.682 us]
                        thrpt:  [167.55 Kelem/s 168.31 Kelem/s 169.00 Kelem/s]
                 change:
                        time:   [-19.837% -18.782% -17.789%] (p = 0.00 < 0.05)
                        thrpt:  [+21.639% +23.126% +24.745%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
Training/Random/100     time:   [666.92 us 671.67 us 678.39 us]
                        thrpt:  [147.41 Kelem/s 148.88 Kelem/s 149.94 Kelem/s]
                 change:
                        time:   [-6.2380% -4.6824% -2.8067%] (p = 0.00 < 0.05)
                        thrpt:  [+2.8878% +4.9124% +6.6531%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe
Training/Batch/100      time:   [605.23 us 607.84 us 611.17 us]
                        thrpt:  [163.62 Kelem/s 164.52 Kelem/s 165.23 Kelem/s]
                 change:
                        time:   [-17.402% -16.414% -15.498%] (p = 0.00 < 0.05)
                        thrpt:  [+18.340% +19.637% +21.068%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
Training/Random/1000    time:   [6.7615 ms 6.7930 ms 6.8285 ms]
                        thrpt:  [146.45 Kelem/s 147.21 Kelem/s 147.90 Kelem/s]
                 change:
                        time:   [-3.2269% -2.6307% -2.0150%] (p = 0.00 < 0.05)
                        thrpt:  [+2.0564% +2.7017% +3.3345%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
Training/Batch/1000     time:   [6.0276 ms 6.0493 ms 6.0753 ms]
                        thrpt:  [164.60 Kelem/s 165.31 Kelem/s 165.90 Kelem/s]
                 change:
                        time:   [-14.930% -14.487% -14.008%] (p = 0.00 < 0.05)
                        thrpt:  [+16.290% +16.942% +17.550%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

Winner/Plain/4          time:   [1.0489 us 1.0518 us 1.0548 us]
                        change: [-45.352% -44.837% -44.342%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
Winner/Distance/4       time:   [1.0665 us 1.0722 us 1.0785 us]
                        change: [-48.275% -47.959% -47.654%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
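For readers comparing the time and thrpt lines in the output above, the arithmetic connecting them can be sketched as follows. It assumes the number at the end of each Training benchmark name (10, 100, 1000) is the element count used for throughput, which is consistent with the figures shown. The helper names are hypothetical.

```rust
/// Throughput in thousands of elements per second for `elements` items
/// processed in `micros` microseconds.
fn kelem_per_sec(elements: f64, micros: f64) -> f64 {
    elements / (micros * 1e-6) / 1e3
}

/// Relative throughput change implied by a relative time change,
/// e.g. -0.18782 (time down 18.782%) gives roughly +0.2313.
fn thrpt_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}

fn main() {
    // Training/Random/10: median time 68.893 us for 10 elements.
    println!("{:.2} Kelem/s", kelem_per_sec(10.0, 68.893));
    // Training/Batch/10: median time change of -18.782%.
    println!("{:+.2}%", thrpt_change(-0.18782) * 100.0);
}
```

This also shows why a time reduction of about 19% is reported as a throughput gain of about 23%: the two percentages are measured against different baselines.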

Notes

I'm fairly certain that the lackluster speedup of random training is explained in this comment.

opened by JayKickliter 1

Owner

Avinash Shenoy

A venti cup of coffee in a land of tea
