Dynamically get the suggested clusters in the data for unsupervised learning.

Miles Granger

Last update: Dec 9, 2022

Related tags

Machine learning python unsupervised clustering cluster scikit-learn kmeans unsupervised-learning cluster-count

Overview

Python implementation of the Gap Statistic

Purpose

Dynamically identify the suggested number of clusters in a data-set using the gap statistic.

Full example available in a notebook HERE

Install:

Bleeding edge:

pip install git+https://github.com/milesgranger/gap_statistic.git

PyPi:

pip install --upgrade gap-stat

With Rust extension:

pip install --upgrade gap-stat[rust]

Uninstall:

pip uninstall gap-stat

Methodology:

This package provides several methods to assist in choosing the optimal number of clusters for a given dataset, based on the Gap method presented in "Estimating the number of clusters in a data set via the gap statistic" (Tibshirani et al.).
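As a rough illustration of the quantity defined in the paper, the sketch below computes Gap(k) and its standard error s(k) from one dispersion value for the observed data and the dispersions of B reference datasets. The function name gap_and_sk and the toy numbers are assumptions made for illustration; they are not part of this package's API.

```python
import math
from statistics import mean, pstdev


def gap_and_sk(obs_dispersion, ref_dispersions):
    """Return (Gap(k), s(k)) for a single k, following Tibshirani et al.

    obs_dispersion  -- within-cluster dispersion W_k of the observed data
    ref_dispersions -- dispersions W*_kb of the B reference datasets
    """
    log_refs = [math.log(w) for w in ref_dispersions]
    # Gap(k) = mean over references of log(W*_kb), minus log(W_k)
    gap = mean(log_refs) - math.log(obs_dispersion)
    # sd(k) is the population standard deviation of the log dispersions;
    # s(k) inflates it by sqrt(1 + 1/B) to account for simulation error.
    sk = pstdev(log_refs) * math.sqrt(1.0 + 1.0 / len(log_refs))
    return gap, sk


# Made-up dispersions for one value of k
gap, sk = gap_and_sk(50.0, [80.0, 100.0, 120.0])
print(gap, sk)
```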

The methods implemented can cluster a given dataset using a range of provided k values, and provide you with statistics that can help in choosing the right number of clusters for your dataset. Three possible methods are:

Taking the k maximizing the Gap value, which is calculated for each k. This, however, might not always be possible, as for many datasets this value is monotonically increasing or decreasing.
Taking the smallest k such that Gap(k) >= Gap(k+1) - s(k+1). This is the method suggested in Tibshirani et al. (consult the paper for details). The measure diff = Gap(k) - Gap(k+1) + s(k+1) is calculated for each k; the parallel here, then, is to take the smallest k for which diff is positive. Note that in some cases this can be true for the entire range of k.
Taking the k maximizing the Gap* value, an alternative measure suggested in "A comparison of Gap statistic definitions with and with-out logarithm function" by Mohajer, Englmeier and Schmid. The authors claim this measure avoids the over-estimation of the number of clusters from which the original Gap statistic suffers, and can also suggest an optimal value for k in cases where Gap cannot. They do warn, however, that the original Gap statistic performs better than Gap* in the case of overlapping clusters, due to its tendency to overestimate the number of clusters.

Note that none of the above methods is guaranteed to find an optimal value for k, and that they often contradict one another. Rather, they can provide more information on which to base your choice of k, which should take numerous other factors into account.
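The first two selection rules above can be sketched on plain lists. This is only an illustration under stated assumptions: the helper names k_by_max_gap and k_by_tibshirani are hypothetical, the k values are assumed to be consecutive and sorted, and the numbers are invented. The third rule is the same as the first, applied to Gap* values.

```python
def k_by_max_gap(ks, gaps):
    """Method 1: the k whose Gap value is largest."""
    return max(zip(ks, gaps), key=lambda pair: pair[1])[0]


def k_by_tibshirani(ks, gaps, sks):
    """Method 2: smallest k with Gap(k) >= Gap(k+1) - s(k+1).

    Returns None when no k in the range satisfies the rule.
    """
    for i in range(len(ks) - 1):
        # diff = Gap(k) - Gap(k+1) + s(k+1); the rule holds when diff >= 0
        diff = gaps[i] - gaps[i + 1] + sks[i + 1]
        if diff >= 0:
            return ks[i]
    return None


# Invented values for k = 1..5
ks = [1, 2, 3, 4, 5]
gaps = [0.10, 0.35, 0.80, 0.78, 0.82]
sks = [0.02, 0.03, 0.03, 0.04, 0.03]

print(k_by_max_gap(ks, gaps))          # 5
print(k_by_tibshirani(ks, gaps, sks))  # 3
```

On these numbers the two rules disagree, which is the kind of contradiction the note above warns about.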

Use:

First, construct an OptimalK object. Optional initialization parameters are:

n_jobs - Splits computation into this number of parallel jobs. Requires choosing a parallel backend.
parallel_backend - Possible values are joblib, rust or multiprocessing for the built-in Python backend. If parallel_backend == 'rust' it will use all cores.
clusterer - Takes a custom clusterer function to be used when clustering. See the example notebook for more details.
clusterer_kwargs - Any keyword arguments to be forwarded to the custom clusterer function on each call.

An example initialization:

optimalK = OptimalK(n_jobs=4, parallel_backend='joblib')

After the object is created, it can be called like a function, and provided with a dataset for which the optimal K is found and returned. Parameters are:

X - A pandas dataframe or numpy array of data points of shape (n_samples, n_features).
n_refs - The number of random reference data sets to use as inertia reference to actual data. Optional.
cluster_array - A 1-dimensional iterable of integers; each representing n_clusters to try on the data. Optional.

For example:

import numpy as np
n_clusters = optimalK(X, cluster_array=np.arange(1, 15))

After performing the search procedure, a DataFrame of gap values and other useful statistics for each passed cluster count is available as the gap_df attribute of the OptimalK object:

optimalK.gap_df.head()

The columns of the dataframe are:

n_clusters - The number of clusters for which the statistics in this row were calculated.
gap_value - The Gap value for this n.
gap* - The Gap* value for this n.
ref_dispersion_std - The standard deviation of the reference distributions for this n.
sk - The standard error of the Gap statistic for this n.
sk* - The standard error of the Gap* statistic for this n.
diff - The diff value for this n (see the methodology section for details).
diff* - The diff* value for this n (corresponding to the diff value for Gap*).
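For orientation, the difference between the gap_value and gap* columns can be sketched in a few lines: following Mohajer et al., Gap* compares raw dispersions, while Gap compares their logarithms. The helper names gap_log and gap_star are hypothetical and the dispersions are made-up numbers; the package's own calculation may differ in detail.

```python
import math
from statistics import mean


def gap_log(obs_dispersion, ref_dispersions):
    """Gap: mean of the log reference dispersions minus the log observed one."""
    return mean(math.log(w) for w in ref_dispersions) - math.log(obs_dispersion)


def gap_star(obs_dispersion, ref_dispersions):
    """Gap*: the same comparison without the logarithm."""
    return mean(ref_dispersions) - obs_dispersion


obs = 50.0
refs = [80.0, 100.0, 120.0]
print(gap_log(obs, refs))
print(gap_star(obs, refs))  # 50.0
```

Because Gap* is on the scale of the dispersion itself, its values are not directly comparable with Gap values for the same n.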

Additionally, the relation between the above measures and the number of clusters can be plotted by calling the OptimalK.plot_results() method (meant to be used inside a Jupyter Notebook or a similar IPython-based notebook), which prints four plots:

A plot of the Gap value versus n, the number of clusters.
A plot of diff versus n.
A plot of the Gap* value versus n, the number of clusters.
A plot of the diff* value versus n.

Comments

Advanced method to find k + Gap* alternative measure + documentation
This pull request is meant to address four issues (three of which I've opened, just for order's sake):

Issue #14, namely implementing step 3 from the original Gap statistic paper, which is the authors' recommended method to choose k. This part is partly based on code found in a fork of this repo by a user named druogury, which implements this step (among other enhancements).

Issue #25, in which I suggested the possible enhancement of also providing the alternative Gap* measure (link) in the resulting gap_df attribute. This is done without affecting the results of the calculation.

Issue #26, in which I suggested to add documentation to the readme file, presenting concisely both the package's API and the method on which it is based.

Issue #27, in which I suggested to add a plotting function implementing the plotting logic found in the example notebook.

To this end, this pull request changes exactly three files:

.gitignore - Just adds ignoring of .swp files (for people working with vim, like myself) and of .DS_Store files (for people working on mac, like myself).

optimalK.py - Adds:

The calculation of measures used for the aforementioned step 3 and Gap* in the OptimalK._calculate_gap() method.

Their additions to the gap_df attribute - and additional calculations which cannot be done in a k-specific function - in the OptimalK.__call__() method.

The plotting of four relevant plots (including the Gap-vs-n one from the example notebook) in the newly added OptimalK.plot_results() method.

README.md - Adds documentation for the original suggested method (very basic; more of a mention), the package's API and all new features. This is partial, but a good start, I believe.

Example images are below.

An important note is that while this works great on my machine, I was unable to test it, as it seems the azure-pipelines-based CI solution you recently switched to is not working. I think this should be addressed, but you can also just pull this - this code is really safe, doing nothing not already done by the package.

If I can somehow help with getting testing to work again I'd love to lend a hand. I would suggest adding travis back as an additional platform to run tests, regardless of what happens with Azure Pipelines; it's free, and you don't have to put the badge, but it can help with development in the meantime.

An example of the enhanced dataframe can be seen here:

And of the new plots here:
opened by shaypal5 18
negative gap values

Can someone please explain why gap values in this implementation are negative? Based on the formulas in the paper, a negative gap value means that distances in random clusterings are smaller than the original clustering => random ones are better than the original one. I have verified that in another book as well.

opened by shahabh 5

Is it normal to get different optimal ks for identically-started runs?

My data are below. Here are the OptimalK calls,

ok = OptimalK(parallel_backend='rust')
results = {i:0 for i in range(1,50)}
for i in range(100):
    k = ok(d, cluster_array=np.arange(1,50))
    results[k] += 1
results = {k:results[k] for k in results if results[k] != 0}
print('k\t%answers')
for k in results:
    print('{}\t{}%'.format(k, results[k]))

Returns,

k	%answers
3	26%
4	10%
5	4%
6	5%
7	9%
8	13%
9	15%
10	9%
11	2%
12	2%
13	2%
14	1%
15	1%
17	1%

d = array([[4.93510425e-01, 2.90838450e-01],
       [3.09295535e-01, 6.62941515e-01],
       [5.54909945e-01, 7.95982182e-01],
       [8.83511543e-01, 6.47582054e-01],
       [8.89093220e-01, 5.11645794e-01],
       [9.26957607e-01, 5.39648056e-01],
       [8.82565558e-01, 6.00233793e-01],
       [5.19726157e-01, 2.54190862e-02],
       [5.55906892e-01, 7.95450807e-01],
       [3.07076156e-01, 6.56616330e-01],
       [4.48115617e-01, 3.06021482e-01],
       [4.91613209e-01, 2.76198626e-01],
       [3.27725083e-01, 6.29869163e-01],
       [9.03380156e-01, 5.39514124e-01],
       [9.32406545e-01, 5.74251175e-01],
       [3.17812830e-01, 6.30333602e-01],
       [4.04152036e-01, 2.92286336e-01],
       [4.57990140e-01, 2.75646180e-01],
       [4.78725642e-01, 3.34153652e-01],
       [3.46405119e-01, 5.96605897e-01],
       [4.76704448e-01, 3.63030165e-01],
       [9.75593328e-01, 6.08678877e-01],
       [4.66915905e-01, 3.40948761e-01],
       [4.19108897e-01, 3.72122675e-01],
       [9.85820293e-01, 5.98729610e-01],
       [4.91563708e-01, 3.82882714e-01],
       [3.52195084e-01, 5.93102217e-01],
       [3.78976882e-01, 2.73120731e-01],
       [4.81789589e-01, 3.11298639e-01],
       [3.41597885e-01, 6.11805677e-01],
       [9.50300395e-01, 5.93189597e-01],
       [7.80480206e-01, 8.06312382e-01],
       [7.77700722e-01, 8.01734090e-01],
       [9.48634505e-01, 5.99592268e-01],
       [3.34010780e-01, 6.20543480e-01],
       [4.37550068e-01, 3.32868695e-01],
       [1.02654584e-01, 9.02040780e-01],
       [2.18036711e-01, 9.20055687e-01],
       [4.04806346e-01, 3.51537496e-01],
       [7.06579804e-01, 5.95337689e-01],
       [7.69683421e-02, 9.34572697e-01],
       [2.06806496e-01, 6.27684355e-01],
       [1.69860516e-02, 7.60613024e-01],
       [1.63733829e-02, 7.69193411e-01],
       [9.20301318e-01, 6.70372963e-01],
       [2.22508654e-01, 6.46966994e-01],
       [8.88094530e-02, 9.48851287e-01],
       [6.90402329e-01, 6.03659868e-01],
       [3.76565397e-01, 3.72656286e-01],
       [2.20571712e-01, 9.15373325e-01],
       [2.14627996e-01, 8.94647002e-01],
       [4.96129423e-01, 3.75080891e-02],
       [4.03500944e-01, 3.34637493e-01],
       [7.40296364e-01, 5.82079351e-01],
       [6.12510256e-02, 9.35847938e-01],
       [2.03431174e-01, 6.23758912e-01],
       [1.46360639e-02, 7.94035971e-01],
       [1.57320481e-02, 7.74557173e-01],
       [2.08367273e-01, 6.28546059e-01],
       [7.47190416e-02, 9.35449421e-01],
       [7.58668780e-01, 5.85885167e-01],
       [3.75279933e-01, 3.45732212e-01],
       [8.01291764e-01, 5.17260671e-01],
       [5.06415963e-01, 1.49608795e-02],
       [2.20077679e-01, 9.02958751e-01],
       [4.38759625e-01, 1.00000000e+00],
       [2.26562068e-01, 9.43768561e-01],
       [3.24914008e-01, 3.38468045e-01],
       [6.85906827e-01, 6.15463316e-01],
       [8.58106837e-02, 9.53653753e-01],
       [0.00000000e+00, 8.75699461e-01],
       [2.20739305e-01, 6.58935964e-01],
       [2.14074403e-01, 6.68919981e-01],
       [2.22968418e-04, 8.81857634e-01],
       [7.26443902e-02, 9.61623967e-01],
       [6.82860017e-01, 6.21624827e-01],
       [4.00524646e-01, 3.22538495e-01],
       [2.18578905e-01, 9.32940543e-01],
       [4.45220828e-01, 9.98077750e-01],
       [2.42618069e-01, 9.45194304e-01],
       [9.11884427e-01, 6.26113832e-01],
       [3.61020207e-01, 3.47451836e-01],
       [6.46184504e-01, 6.65111303e-01],
       [5.78398667e-02, 9.70919967e-01],
       [2.47765612e-03, 8.70541453e-01],
       [2.11473212e-01, 6.98399484e-01],
       [2.17672527e-01, 6.82360351e-01],
       [1.08121373e-02, 8.27167630e-01],
       [7.14609325e-02, 9.63070035e-01],
       [8.71961057e-01, 6.62579477e-01],
       [6.51352942e-01, 6.60499811e-01],
       [3.96378666e-01, 3.81715685e-01],
       [2.71128565e-01, 9.47126746e-01],
       [4.46094841e-01, 9.95863795e-01],
       [4.65775967e-01, 1.06437868e-02],
       [8.47087502e-01, 6.18577540e-01],
       [4.84454006e-01, 3.31616513e-02],
       [4.38885242e-01, 9.83732283e-01],
       [2.80731469e-01, 9.46760356e-01],
       [4.54750657e-01, 9.74769175e-01],
       [9.05241847e-01, 6.15000069e-01],
       [4.79105085e-01, 0.00000000e+00],
       [6.39352620e-01, 6.78436995e-01],
       [4.27057110e-02, 9.67314363e-01],
       [2.12270334e-01, 7.05709577e-01],
       [2.05992073e-01, 7.15562344e-01],
       [4.27672118e-02, 9.67822433e-01],
       [6.30898833e-01, 6.86514854e-01],
       [4.74503189e-01, 1.25078056e-02],
       [4.52066749e-01, 9.77194011e-01],
       [2.75237828e-01, 9.46144998e-01],
       [4.24723804e-01, 9.34700549e-01],
       [4.85090286e-01, 2.99814064e-02],
       [3.73024613e-01, 3.23480099e-01],
       [1.79844219e-02, 7.53372788e-01],
       [1.96820032e-02, 7.42539406e-01],
       [4.13297772e-01, 3.92302841e-01],
       [4.98634547e-01, 3.71525176e-02],
       [4.21102315e-01, 9.21213269e-01],
       [4.30848062e-01, 9.43217456e-01],
       [5.06067097e-01, 1.60759371e-02],
       [5.00103951e-01, 2.11710716e-03],
       [4.41640884e-01, 9.58876848e-01],
       [5.94580173e-01, 7.25803792e-01],
       [8.79334331e-01, 5.65088570e-01],
       [5.32339394e-01, 3.67977649e-01],
       [9.82017875e-01, 6.22487664e-01],
       [7.70088553e-01, 7.69430220e-01],
       [7.70804822e-01, 7.65877843e-01],
       [1.00000000e+00, 6.21931195e-01],
       [5.16272545e-01, 3.78439784e-01],
       [8.65725100e-01, 5.29811561e-01],
       [5.76321423e-01, 7.38793075e-01],
       [5.71399033e-01, 7.55231142e-01],
       [8.59548867e-01, 5.97527027e-01],
       [5.21215796e-01, 3.39219332e-01],
       [7.69934297e-01, 7.82899201e-01],
       [7.74726272e-01, 7.99651206e-01],
       [5.11989355e-01, 3.21145296e-01],
       [8.63555312e-01, 5.74779034e-01],
       [5.79283595e-01, 7.43836045e-01],
       [4.46372151e-01, 4.20742869e-01],
       [4.42506313e-01, 2.78049320e-01],
       [3.22920680e-01, 6.73968077e-01],
       [5.71109712e-01, 7.87904024e-01],
       [8.85180473e-01, 6.13964677e-01],
       [5.05266011e-01, 3.78317326e-01],
       [7.91613698e-01, 7.16017783e-01],
       [4.50733423e-01, 3.75497431e-01],
       [7.89266765e-01, 7.15016484e-01],
       [3.56960475e-01, 2.98306078e-01],
       [8.63006055e-01, 6.26766682e-01],
       [8.55437160e-01, 5.49330831e-01],
       [5.62943280e-01, 7.81707823e-01],
       [3.15390468e-01, 6.78839445e-01],
       [4.51380849e-01, 3.04520398e-01],
       [5.78324974e-01, 7.77113616e-01],
       [8.90495539e-01, 6.34768724e-01],
       [5.15006125e-01, 3.49437058e-01],
       [8.17594051e-01, 6.46463871e-01],
       [9.06470120e-01, 5.79871655e-01],
       [8.00006151e-01, 6.40044153e-01],
       [5.46061635e-01, 3.39911550e-01],
       [8.67089629e-01, 6.13316178e-01],
       [5.74792325e-01, 7.57430434e-01]], dtype=float32)

opened by mcsimenc 5

gap statistic calculation incorrect?

Look at this notebook, I noticed that you are calculating gap statistic as: gap = np.log(np.mean(refDisps)) - np.log(origDisp)

Shouldn't it be gap = np.mean(np.log(refDisps)) - np.log(origDisp) instead?
enhancement help wanted

opened by rakshita95 5
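On the question above: the two expressions are not interchangeable. By Jensen's inequality the log of a mean is never smaller than the mean of the logs, so the first form gives a gap value at least as large as the second. The paper's definition uses the expectation of log(W_k), which corresponds to the second form. A small sketch with made-up reference dispersions (the variable names loosely mirror the snippet in the question; the numbers are assumptions):

```python
import math
from statistics import mean

ref_disps = [50.0, 100.0, 200.0]  # invented reference dispersions
orig_disp = 60.0                  # invented observed dispersion

# Form used in the notebook, as quoted in the question
gap_log_of_mean = math.log(mean(ref_disps)) - math.log(orig_disp)

# Form matching the paper's definition
gap_mean_of_log = mean(math.log(w) for w in ref_disps) - math.log(orig_disp)

print(gap_log_of_mean, gap_mean_of_log)
```

The gap between the two forms grows with the spread of the reference dispersions and vanishes when they are all equal.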
Range of the uniform reference distribution
In the original paper by Tibshirani et al., the authors consider two choices for the reference distribution:

generate each reference feature uniformly over the range of the observed values for that feature;

generate the reference features from a uniform distribution over a box aligned with the principal components of the data.

In the mod.rs file, it seems that random data is generated from a standard uniform U[0, 1] for all features, and not from U[X[:, i].min(), X[:, i].max()]. Is that approach intentional? Thanks!
opened by dvukolov 4
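The first choice listed in the issue above, sampling each reference feature uniformly over that feature's observed range, can be sketched as follows. This is a plain-Python illustration, not the code in mod.rs or optimalK.py; the function name uniform_reference and its seed parameter are assumptions.

```python
import random


def uniform_reference(data, seed=None):
    """Draw one reference dataset with the same shape as data.

    Each feature is sampled uniformly between the minimum and maximum
    observed for that feature. Passing a seed makes the draw repeatable.
    """
    rng = random.Random(seed)
    columns = list(zip(*data))                      # one tuple per feature
    bounds = [(min(col), max(col)) for col in columns]
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in data]


data = [[0.0, 10.0], [2.0, 30.0], [1.0, 20.0]]
ref = uniform_reference(data, seed=0)
print(ref)
```

Seeding the generator also shows why unseeded runs can return different values of k: every run draws different reference datasets.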
Provide standard deviation for reference samples?

Hi,

Thanks for providing this nice tool! It's very easy to plug into my work. :-)

I was wondering if you'd thought about, instead of just returning the gap statistic for each value of k, returning the value s_k for the reference distribution as well, so we can easily do the Tibshirani-recommended approach: choose the smallest k such that Gap(k) > Gap(k+1) - s(k+1).

https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/

Thanks again for writing this package!

Russell

opened by drussellmrichie 4
question

Why are the results different every time the program runs on the same data with the same OptimalK object? Even with the examples you provide, the results differ.

How can I get the same, deterministic result?

Also, how do I use "clusterer_kwargs"? Could you show me an example?

opened by nry123 3
The n_refs argument ignored in Rust

Hi Miles, it seems that the n_refs argument is silently ignored when using the Rust backend. This leads to additional discrepancies with the Python version.

opened by dvukolov 3
OptimalK _calculate_gap method passes incorrect data to clusterer?

In lines 181-183 of optimalK.py, the _calculate_gap() method finds the dispersion measure of the observed data-set:

centroids, labels = self.clusterer(random_data, n_clusters, *self.clusterer_kwargs ) # type: Tuple[np.ndarray, np.ndarray]

However, it looks like this line passes the null data-set instead of the observed data-set. The centroids and labels of this call are then passed to the _calculate_dispersion() method using the observed data-set. This means the dispersion metric is using the incorrect centroids and labels, which gives a larger dispersion metric for the observed set.

I changed this line to pass the observed data instead of the null data and this is the result compared to an implementation I did (labeled Gap2) :).

opened by johnvorsten 3

RuntimeWarning: divide by zero encountered in log

I am encountering the following RuntimeWarning:

...\Anaconda3\lib\site-packages\gap_statistic\optimalK.py:292: RuntimeWarning: divide by zero encountered in log
log_dispersion = np.log(dispersion)

From what I read, the warning means dispersion=0, so np.log(dispersion) is undefined. Despite the warning, the results are not affected as far as I see. I am using version 2.0 of gap-stat, installed via pip within conda:

# Name                    Version                   Build  Channel
gap-stat                  2.0.0                    pypi_0    pypi
numpy                     1.18.1           py37h93ca92e_0
numpy-base                1.18.1           py37hc3f5095_1
numpydoc                  0.9.2                      py_0

opened by rtrad89 2
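One way a zero dispersion can arise, assuming dispersion is measured as the sum of squared distances from each point to its assigned centroid, is when every point coincides with its centroid, for example when the number of clusters reaches the number of distinct points. NumPy's np.log(0) then returns -inf and emits the warning quoted above. The sketch below uses a hypothetical dispersion helper on a toy dataset:

```python
def dispersion(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


points = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]

# Two clusters: the first two points share a centroid
two = dispersion(points, [[0.5, 0.5], [4.0, 4.0]], [0, 0, 1])

# Three clusters: each point is its own centroid, so nothing is left over
three = dispersion(points, points, [0, 1, 2])

print(two, three)  # 1.0 0.0
```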

Wrong results?

I just ran the sample notebook without changing anything. The gap-stat values are very different from the ones shown in the sample notebook. It picked k=11 (instead of k=3) as the optimal k. Can someone confirm that the code is working fine?

opened by shahabh 2
Question

Sorry! I tried again, but the results still differ from run to run. Maybe something is wrong in my procedure. Could you help me check it?

#!/usr/bin/env python
# coding: utf-8

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from gap_statistic import OptimalK

try:
    from sklearn.datasets.samples_generator import make_blobs
except ImportError:
    from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=int(1e5), n_features=2, centers=3, random_state=25)
print('Data shape: ', X.shape)

# test the first example

# The 1st run about gap statistic
optimalK = OptimalK(parallel_backend='rust')
n_clusters = optimalK(X, cluster_array=np.arange(1, 15))
print('Optimal clusters: ', n_clusters)

plt.plot(optimalK.gap_df.n_clusters, optimalK.gap_df.gap_value, linewidth=3)
plt.scatter(optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].n_clusters,
            optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

# The 2nd run about gap statistic
optimalK = OptimalK(parallel_backend='rust')
n_clusters = optimalK(X, cluster_array=np.arange(1, 15))
print('Optimal clusters: ', n_clusters)

plt.plot(optimalK.gap_df.n_clusters, optimalK.gap_df.gap_value, linewidth=3)
plt.scatter(optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].n_clusters,
            optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

# The 3rd run about gap statistic
optimalK = OptimalK(parallel_backend='rust')
n_clusters = optimalK(X, cluster_array=np.arange(1, 15))
print('Optimal clusters: ', n_clusters)

plt.plot(optimalK.gap_df.n_clusters, optimalK.gap_df.gap_value, linewidth=3)
plt.scatter(optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].n_clusters,
            optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

# test the second example: fix random_state, define the OptimalK instance
# and pass in our own clustering function

def special_clustering_func(X, k):
    m = KMeans(random_state=0)
    m.fit(X)
    return m.cluster_centers_, m.predict(X)

# The first run about gap statistic
optimalk = OptimalK(clusterer=special_clustering_func)
n_clusters = optimalk(X, n_refs=3, cluster_array=range(1, 15))

plt.plot(optimalk.gap_df.n_clusters, optimalk.gap_df.gap_value, linewidth=3)
plt.scatter(optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].n_clusters,
            optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

# The second run about gap statistic
optimalk = OptimalK(clusterer=special_clustering_func)
n_clusters = optimalk(X, n_refs=3, cluster_array=range(1, 15))

plt.plot(optimalk.gap_df.n_clusters, optimalk.gap_df.gap_value, linewidth=3)
plt.scatter(optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].n_clusters,
            optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

# The third run about gap statistic
optimalk = OptimalK(clusterer=special_clustering_func)
n_clusters = optimalk(X, n_refs=3, cluster_array=range(1, 15))

plt.plot(optimalk.gap_df.n_clusters, optimalk.gap_df.gap_value, linewidth=3)
plt.scatter(optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].n_clusters,
            optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()

opened by nry123 1
Pull out kmeans impl into own crate/package
Right now, a full kmeans implementation sits here... Basically a freebie to pull this out into its own crate and use that.

[ ] External Kmeans crate

[ ] Implement another Python package with that crate.
opened by milesgranger 0

Releases(v2.0.1)

v2.0.1(Apr 20, 2020)

Bug fix

In v2.0.0, passing a pandas.DataFrame with parallel_backend='rust' would result in a TypeError. This patch ensures any dataframe is converted to np.ndarray beforehand.
v2.0.0(Feb 19, 2020)

Disconnect rust & use github actions (#41)
Rust as optional install (#42)
Fix workflows (#43)
Return standard errors of the Gap statistic (#39)
Generate ref data uniformly over feature range (#40)
Pass n_refs and n_iter to rust impl (#46)
Fix Rust ref data generation (#47)
Add warning about optional rust feature (#47)
v2.0.0-rc2(Feb 18, 2020)
Pass n_refs and n_iter to rust impl (#46)

Fix Rust ref data generation (#47)

Add warning about optional rust feature (#47)

v2.0.0-rc1(Feb 15, 2020)
Improvements and Fixes:

Return standard errors of the Gap statistic (#39)

Generate ref data uniformly over feature range (#40)

Rust as optional install:

Disconnect rust & use github actions (#41)

Rust as optional install (#42)

Fix workflows (#43)

v1.7.1(Aug 18, 2019)
Optimizations to Rust backend. https://github.com/milesgranger/gap_statistic/pull/35

gap_stat-1.7.1-cp35-cp35m-manylinux1_x86_64.whl(796.53 KB)
gap_stat-1.7.1-cp36-cp36m-macosx_10_13_x86_64.whl(235.35 KB)
gap_stat-1.7.1-cp36-cp36m-manylinux1_x86_64.whl(1.55 MB)
gap_stat-1.7.1-cp36-cp36m-win_amd64.whl(185.75 KB)
gap_stat-1.7.1-cp37-cp37m-macosx_10_13_x86_64.whl(234.91 KB)
gap_stat-1.7.1-cp37-cp37m-manylinux1_x86_64.whl(2.32 MB)
gap_stat-1.7.1-cp37-cp37m-win_amd64.whl(185.12 KB)
v1.7.0(Jul 9, 2019)
Change license from BSD-3 to dual Unlicense / MIT (#30)

Add OSX to Azure Pipelines building (#31)

Multiple upgrade to gap calculation, n cluster calc, docs, etc (#28)

Fix passing of null data to final calculations rather than the observed dataset (#34)

gap_stat-1.7.0-cp35-cp35m-manylinux1_x86_64.whl(729.38 KB)
gap_stat-1.7.0-cp36-cp36m-macosx_10_13_x86_64.whl(185.72 KB)
gap_stat-1.7.0-cp36-cp36m-manylinux1_x86_64.whl(1.41 MB)
gap_stat-1.7.0-cp36-cp36m-win_amd64.whl(141.65 KB)
gap_stat-1.7.0-cp37-cp37m-macosx_10_13_x86_64.whl(185.66 KB)
gap_stat-1.7.0-cp37-cp37m-manylinux1_x86_64.whl(2.12 MB)
gap_stat-1.7.0-cp37-cp37m-win_amd64.whl(141.57 KB)
v1.7.0.rc1(Jul 9, 2019)
Change license from BSD-3 to dual Unlicense / MIT (#30)

Add OSX to Azure Pipelines building (#31)

Multiple upgrade to gap calculation, n cluster calc, docs, etc (#28)

Fix passing of null data to final calculations rather than the observed dataset (#34)

gap_stat-1.7.0rc1-cp35-cp35m-manylinux1_x86_64.whl(729.41 KB)
gap_stat-1.7.0rc1-cp36-cp36m-macosx_10_13_x86_64.whl(185.75 KB)
gap_stat-1.7.0rc1-cp36-cp36m-manylinux1_x86_64.whl(1.41 MB)
gap_stat-1.7.0rc1-cp36-cp36m-win_amd64.whl(141.69 KB)
gap_stat-1.7.0rc1-cp37-cp37m-macosx_10_13_x86_64.whl(185.69 KB)
gap_stat-1.7.0rc1-cp37-cp37m-manylinux1_x86_64.whl(2.12 MB)
gap_stat-1.7.0rc1-cp37-cp37m-win_amd64.whl(141.61 KB)
v1.6.1(May 25, 2019)
OptimalK object's dataframe now has an additional column with the standard deviations of the reference distributions (https://github.com/milesgranger/gap_statistic/commit/c61ee923d3040815372ed77141de34144fdf49d4)

v1.6.0(May 19, 2019)
User can pass in their own clustering function into OptimalK (https://github.com/milesgranger/gap_statistic/commit/fe1b1d31c93c592eb597474493f520b7f6b6434c)

Switch to Azure Pipelines (https://github.com/milesgranger/gap_statistic/commit/9020d9530d15e1ed79045e25e1ca5e6c67fa5c0c, https://github.com/milesgranger/gap_statistic/commit/db030eea4e362acc4a4d43029c7ddb701ff24c95)

Use black formatting for the project (https://github.com/milesgranger/gap_statistic/commit/bdb4af1467533f3d8091d05c83012ea040f0ffa9)

v1.5.2(Aug 4, 2018)

gap-stat-1.5.2.tar.gz(5.07 KB)
gap_stat-1.5.2-cp35-cp35m-macosx_10_5_x86_64.whl(236.55 KB)
gap_stat-1.5.2-cp35-cp35m-manylinux1_i686.whl(921.98 KB)
gap_stat-1.5.2-cp35-cp35m-manylinux1_x86_64.whl(862.29 KB)
gap_stat-1.5.2-cp35-cp35m-win32.whl(150.31 KB)
gap_stat-1.5.2-cp35-cp35m-win_amd64.whl(163.80 KB)
gap_stat-1.5.2-cp36-cp36m-macosx_10_7_x86_64.whl(236.57 KB)
gap_stat-1.5.2-cp36-cp36m-manylinux1_i686.whl(1.79 MB)
gap_stat-1.5.2-cp36-cp36m-manylinux1_x86_64.whl(1.67 MB)
gap_stat-1.5.2-cp36-cp36m-win32.whl(150.28 KB)
gap_stat-1.5.2-cp36-cp36m-win_amd64.whl(163.78 KB)
v1.5.1(Jun 15, 2018)

Release contains an initial stable release of the same API found in the first release, along with OptimalK taking an additional value "rust" for the backend parameter.
gap-stat-1.5.1.tar.gz(5.03 KB)
gap_stat-1.5.1-cp35-cp35m-macosx_10_5_x86_64.whl(233.22 KB)
gap_stat-1.5.1-cp35-cp35m-manylinux1_i686.whl(895.09 KB)
gap_stat-1.5.1-cp35-cp35m-manylinux1_x86_64.whl(833.97 KB)
gap_stat-1.5.1-cp35-cp35m-win32.whl(148.10 KB)
gap_stat-1.5.1-cp35-cp35m-win_amd64.whl(160.63 KB)
gap_stat-1.5.1-cp36-cp36m-macosx_10_7_x86_64.whl(233.23 KB)
gap_stat-1.5.1-cp36-cp36m-manylinux1_i686.whl(1.74 MB)
gap_stat-1.5.1-cp36-cp36m-manylinux1_x86_64.whl(1.62 MB)
gap_stat-1.5.1-cp36-cp36m-win32.whl(148.12 KB)
gap_stat-1.5.1-cp36-cp36m-win_amd64.whl(160.64 KB)
v1.5.0a2(May 27, 2018)

Initial release including a Rust backend implementation of the gap statistic & kmeans.
gap_stat-1.5.0a2-cp35-cp35m-macosx_10_5_x86_64.whl(235.25 KB)
gap_stat-1.5.0a2-cp35-cp35m-manylinux1_i686.whl(890.12 KB)
gap_stat-1.5.0a2-cp35-cp35m-manylinux1_x86_64.whl(834.34 KB)
gap_stat-1.5.0a2-cp36-cp36m-macosx_10_7_x86_64.whl(235.24 KB)
gap_stat-1.5.0a2-cp36-cp36m-manylinux1_i686.whl(1.73 MB)
gap_stat-1.5.0a2-cp36-cp36m-manylinux1_x86_64.whl(1.62 MB)
v1.0.1(May 15, 2018)

Support for Joblib, Multi-processing and single core

Owner

Miles Granger

Just a happy engineer.
