Dynamically get the suggested number of clusters in a dataset for unsupervised learning.

Overview

Python implementation of the Gap Statistic



Purpose

Dynamically identify the suggested number of clusters in a dataset using the gap statistic.


Full example available in a notebook HERE


Install:

Bleeding edge:

pip install git+git://github.com/milesgranger/gap_statistic.git

PyPi:

pip install --upgrade gap-stat

With Rust extension:

pip install --upgrade gap-stat[rust]

Uninstall:

pip uninstall gap-stat

Methodology:

This package provides several methods to assist in choosing the optimal number of clusters for a given dataset, based on the Gap method presented in "Estimating the number of clusters in a data set via the gap statistic" (Tibshirani et al.).

The implemented methods cluster a given dataset over a range of provided k values and report statistics that can help in choosing the right number of clusters. Three possible selection methods are:

  • Taking the k maximizing the Gap value, which is calculated for each k. This, however, might not always be possible, as for many datasets this value is monotonically increasing or decreasing.
  • Taking the smallest k such that Gap(k) >= Gap(k+1) - s(k+1). This is the method suggested in Tibshirani et al. (consult the paper for details). The measure diff = Gap(k) - Gap(k+1) + s(k+1) is calculated for each k; equivalently, take the smallest k for which diff is non-negative. Note that in some cases this can hold for the entire range of k.
  • Taking the k maximizing the Gap* value, an alternative measure suggested in "A comparison of Gap statistic definitions with and without logarithm function" by Mohajer, Englmeier and Schmid. The authors claim this measure avoids the over-estimation of the number of clusters from which the original Gap statistic suffers, and can also suggest an optimal value for k in cases where Gap cannot. They do warn, however, that the original Gap statistic performs better than Gap* in the case of overlapped clusters, owing to the original statistic's tendency to overestimate the number of clusters.

Note that none of the above methods is guaranteed to find an optimal value for k, and that they often contradict one another. Rather, they can provide more information on which to base your choice of k, which should take numerous other factors into account.
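
For reference, writing W(k) for the pooled within-cluster dispersion of the clustered data and W*(k, b) for the same quantity computed on the b-th of B random reference datasets, the quantities above can be sketched as follows (Gap and s(k) follow Tibshirani et al.; the Gap* line is the no-logarithm variant as defined by Mohajer et al.):

Gap(k)  = mean_b[ log W*(k, b) ] - log W(k)
s(k)    = sd_b[ log W*(k, b) ] * sqrt(1 + 1/B)
diff(k) = Gap(k) - Gap(k+1) + s(k+1)
Gap*(k) = mean_b[ W*(k, b) ] - W(k)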


Use:

First, construct an OptimalK object. Optional initialization parameters are:

  • n_jobs - Splits computation into this number of parallel jobs. Requires choosing a parallel backend.
  • parallel_backend - Possible values are joblib, rust, or multiprocessing (the built-in Python backend). If parallel_backend == 'rust', all cores will be used.
  • clusterer - Takes a custom clusterer function to be used when clustering. See the example notebook for more details.
  • clusterer_kwargs - Any keyword arguments to be forwarded to the custom clusterer function on each call.

An example initialization:

optimalK = OptimalK(n_jobs=4, parallel_backend='joblib')
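
A custom clusterer is expected to accept the data and the number of clusters and to return a (centroids, labels) tuple, with clusterer_kwargs forwarded on each call. A minimal sketch under those assumptions (my_clusterer and its max_iter keyword are illustrative, not part of the package):

from sklearn.cluster import KMeans

def my_clusterer(X, k, max_iter=300):
    # Fit k clusters and return the centers plus one label per row.
    model = KMeans(n_clusters=k, max_iter=max_iter).fit(X)
    return model.cluster_centers_, model.labels_

optimalK = OptimalK(clusterer=my_clusterer, clusterer_kwargs={'max_iter': 500})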

After the object is created, it can be called like a function, and provided with a dataset for which the optimal K is found and returned. Parameters are:

  • X - A pandas dataframe or numpy array of data points of shape (n_samples, n_features).
  • n_refs - The number of random reference datasets to use as the inertia reference for the actual data. Optional.
  • cluster_array - A 1-dimensional iterable of integers; each representing n_clusters to try on the data. Optional.

For example:

import numpy as np
n_clusters = optimalK(X, cluster_array=np.arange(1, 15))

After performing the search procedure, a DataFrame of gap values and other useful statistics for each passed cluster count is available as the gap_df attribute of the OptimalK object:

optimalK.gap_df.head()

The columns of the dataframe are:

  • n_clusters - The number of clusters for which the statistics in this row were calculated.
  • gap_value - The Gap value for this n.
  • gap* - The Gap* value for this n.
  • ref_dispersion_std - The standard deviation of the reference distributions for this n.
  • sk - The standard error of the Gap statistic for this n.
  • sk* - The standard error of the Gap* statistic for this n.
  • diff - The diff value for this n (see the methodology section for details).
  • diff* - The diff* value for this n (corresponding to the diff value for Gap*).
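
As a worked example, here is a minimal sketch of the second selection method from the methodology section, assuming the column layout listed above: take the smallest k with a non-negative diff, falling back to the k maximizing the Gap value if no such k exists.

df = optimalK.gap_df
# Rows where diff = Gap(k) - Gap(k+1) + s(k+1) is non-negative
candidates = df.loc[df['diff'] >= 0, 'n_clusters']
if len(candidates):
    k = int(candidates.min())  # smallest such k (Tibshirani et al. rule)
else:
    k = int(df.loc[df['gap_value'].idxmax(), 'n_clusters'])  # fall back to max Gap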

Additionally, the relation between the above measures and the number of clusters can be plotted by calling the OptimalK.plot_results() method (meant to be used inside a Jupyter Notebook or a similar IPython-based environment), which displays four plots:

  • A plot of the Gap value versus n, the number of clusters.
  • A plot of diff versus n.
  • A plot of the Gap* value versus n, the number of clusters.
  • A plot of the diff* value versus n.
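
For example, after the search above:

optimalK.plot_results()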

Comments
  • Advanced method to find k + Gap* alternative measure + documentation


    This pull request is meant to address four issues (three of which I've opened, just for order's sake):

    1. Issue #14, namely implementing step 3 from the original Gap statistic paper, which is the authors' recommended method to choose k. This part is partly based on code found in a fork of this repo by a user named druogury, which implements this step (among other enhancements).
    2. Issue #25, in which I suggested the possible enhancement of also providing the alternative Gap* measure (link) in the resulting gap_df attribute. This is done without affecting the results of the calculation.
    3. Issue #26, in which I suggested to add documentation to the readme file, presenting concisely both the package's API and the method on which it is based.
    4. Issue #27, in which I suggested to add a plotting function implementing the plotting logic found in the example notebook.

    To this end, this pull request changes exactly three files:

    1. .gitignore - Just adds ignoring of .swp files (for people working with vim, like myself) and of .DS_Store files (for people working on mac, like myself).
    2. optimalK.py - Adds:
      • The calculation of measures used for the aforementioned step 3 and Gap* in the OptimalK._calculate_gap() method.
      • Their additions to the gap_df attribute - and additional calculations which cannot be done in a k-specific function - in the OptimalK.__call__() method.
      • The plotting of four relevant plots (including the Gap-vs-n one from the example notebook) in the newly added OptimalK.plot_results() method.
    3. README.md - Adds documentation for the original suggested method (very basic; more of a mention), the package's API and all new features. This is partial, but a good start, I believe.

    Example images are below.

    An important note is that while this works great on my machine, I was unable to test it, as it seems the azure-pipelines-based CI solution you recently switched to is not working. I think this should be addressed, but you can also just pull this - this code is really safe, doing nothing not already done by the package.

    If I can somehow help with getting testing to work again I'd love to lend a hand. I would suggest adding Travis back as an additional platform to run tests, regardless of what happens with Azure Pipelines; it's free, and you don't have to put up the badge, but it can help with development in the meantime.

    An example of the enhanced dataframe can be seen here: image

    And of the new plots here: image image image image

    opened by shaypal5 18
  • negative gap values


    Can someone please explain why the gap values in this implementation are negative? Based on the formulas in the paper, a negative gap value means that distances in random clusterings are smaller than in the original clustering, i.e. the random ones are better than the original one. I have verified that in another book as well.

    opened by shahabh 5
  • Is it normal to get different optimal ks for identically-started runs?


    My data are below. Here are the OptimalK calls,

    ok = OptimalK(parallel_backend='rust')
    results = {i:0 for i in range(1,50)}
    for i in range(100):
        k = ok(d, cluster_array=np.arange(1,50))
        results[k] += 1
    results = {k:results[k] for k in results if results[k] != 0}
    print('k\t%answers')
    for k in results:
        print('{}\t{}%'.format(k, results[k]))
    

    Returns,

    k	%answers
    3	26%
    4	10%
    5	4%
    6	5%
    7	9%
    8	13%
    9	15%
    10	9%
    11	2%
    12	2%
    13	2%
    14	1%
    15	1%
    17	1%
    
    d = array([[4.93510425e-01, 2.90838450e-01],
           [3.09295535e-01, 6.62941515e-01],
           [5.54909945e-01, 7.95982182e-01],
           [8.83511543e-01, 6.47582054e-01],
           [8.89093220e-01, 5.11645794e-01],
           [9.26957607e-01, 5.39648056e-01],
           [8.82565558e-01, 6.00233793e-01],
           [5.19726157e-01, 2.54190862e-02],
           [5.55906892e-01, 7.95450807e-01],
           [3.07076156e-01, 6.56616330e-01],
           [4.48115617e-01, 3.06021482e-01],
           [4.91613209e-01, 2.76198626e-01],
           [3.27725083e-01, 6.29869163e-01],
           [9.03380156e-01, 5.39514124e-01],
           [9.32406545e-01, 5.74251175e-01],
           [3.17812830e-01, 6.30333602e-01],
           [4.04152036e-01, 2.92286336e-01],
           [4.57990140e-01, 2.75646180e-01],
           [4.78725642e-01, 3.34153652e-01],
           [3.46405119e-01, 5.96605897e-01],
           [4.76704448e-01, 3.63030165e-01],
           [9.75593328e-01, 6.08678877e-01],
           [4.66915905e-01, 3.40948761e-01],
           [4.19108897e-01, 3.72122675e-01],
           [9.85820293e-01, 5.98729610e-01],
           [4.91563708e-01, 3.82882714e-01],
           [3.52195084e-01, 5.93102217e-01],
           [3.78976882e-01, 2.73120731e-01],
           [4.81789589e-01, 3.11298639e-01],
           [3.41597885e-01, 6.11805677e-01],
           [9.50300395e-01, 5.93189597e-01],
           [7.80480206e-01, 8.06312382e-01],
           [7.77700722e-01, 8.01734090e-01],
           [9.48634505e-01, 5.99592268e-01],
           [3.34010780e-01, 6.20543480e-01],
           [4.37550068e-01, 3.32868695e-01],
           [1.02654584e-01, 9.02040780e-01],
           [2.18036711e-01, 9.20055687e-01],
           [4.04806346e-01, 3.51537496e-01],
           [7.06579804e-01, 5.95337689e-01],
           [7.69683421e-02, 9.34572697e-01],
           [2.06806496e-01, 6.27684355e-01],
           [1.69860516e-02, 7.60613024e-01],
           [1.63733829e-02, 7.69193411e-01],
           [9.20301318e-01, 6.70372963e-01],
           [2.22508654e-01, 6.46966994e-01],
           [8.88094530e-02, 9.48851287e-01],
           [6.90402329e-01, 6.03659868e-01],
           [3.76565397e-01, 3.72656286e-01],
           [2.20571712e-01, 9.15373325e-01],
           [2.14627996e-01, 8.94647002e-01],
           [4.96129423e-01, 3.75080891e-02],
           [4.03500944e-01, 3.34637493e-01],
           [7.40296364e-01, 5.82079351e-01],
           [6.12510256e-02, 9.35847938e-01],
           [2.03431174e-01, 6.23758912e-01],
           [1.46360639e-02, 7.94035971e-01],
           [1.57320481e-02, 7.74557173e-01],
           [2.08367273e-01, 6.28546059e-01],
           [7.47190416e-02, 9.35449421e-01],
           [7.58668780e-01, 5.85885167e-01],
           [3.75279933e-01, 3.45732212e-01],
           [8.01291764e-01, 5.17260671e-01],
           [5.06415963e-01, 1.49608795e-02],
           [2.20077679e-01, 9.02958751e-01],
           [4.38759625e-01, 1.00000000e+00],
           [2.26562068e-01, 9.43768561e-01],
           [3.24914008e-01, 3.38468045e-01],
           [6.85906827e-01, 6.15463316e-01],
           [8.58106837e-02, 9.53653753e-01],
           [0.00000000e+00, 8.75699461e-01],
           [2.20739305e-01, 6.58935964e-01],
           [2.14074403e-01, 6.68919981e-01],
           [2.22968418e-04, 8.81857634e-01],
           [7.26443902e-02, 9.61623967e-01],
           [6.82860017e-01, 6.21624827e-01],
           [4.00524646e-01, 3.22538495e-01],
           [2.18578905e-01, 9.32940543e-01],
           [4.45220828e-01, 9.98077750e-01],
           [2.42618069e-01, 9.45194304e-01],
           [9.11884427e-01, 6.26113832e-01],
           [3.61020207e-01, 3.47451836e-01],
           [6.46184504e-01, 6.65111303e-01],
           [5.78398667e-02, 9.70919967e-01],
           [2.47765612e-03, 8.70541453e-01],
           [2.11473212e-01, 6.98399484e-01],
           [2.17672527e-01, 6.82360351e-01],
           [1.08121373e-02, 8.27167630e-01],
           [7.14609325e-02, 9.63070035e-01],
           [8.71961057e-01, 6.62579477e-01],
           [6.51352942e-01, 6.60499811e-01],
           [3.96378666e-01, 3.81715685e-01],
           [2.71128565e-01, 9.47126746e-01],
           [4.46094841e-01, 9.95863795e-01],
           [4.65775967e-01, 1.06437868e-02],
           [8.47087502e-01, 6.18577540e-01],
           [4.84454006e-01, 3.31616513e-02],
           [4.38885242e-01, 9.83732283e-01],
           [2.80731469e-01, 9.46760356e-01],
           [4.54750657e-01, 9.74769175e-01],
           [9.05241847e-01, 6.15000069e-01],
           [4.79105085e-01, 0.00000000e+00],
           [6.39352620e-01, 6.78436995e-01],
           [4.27057110e-02, 9.67314363e-01],
           [2.12270334e-01, 7.05709577e-01],
           [2.05992073e-01, 7.15562344e-01],
           [4.27672118e-02, 9.67822433e-01],
           [6.30898833e-01, 6.86514854e-01],
           [4.74503189e-01, 1.25078056e-02],
           [4.52066749e-01, 9.77194011e-01],
           [2.75237828e-01, 9.46144998e-01],
           [4.24723804e-01, 9.34700549e-01],
           [4.85090286e-01, 2.99814064e-02],
           [3.73024613e-01, 3.23480099e-01],
           [1.79844219e-02, 7.53372788e-01],
           [1.96820032e-02, 7.42539406e-01],
           [4.13297772e-01, 3.92302841e-01],
           [4.98634547e-01, 3.71525176e-02],
           [4.21102315e-01, 9.21213269e-01],
           [4.30848062e-01, 9.43217456e-01],
           [5.06067097e-01, 1.60759371e-02],
           [5.00103951e-01, 2.11710716e-03],
           [4.41640884e-01, 9.58876848e-01],
           [5.94580173e-01, 7.25803792e-01],
           [8.79334331e-01, 5.65088570e-01],
           [5.32339394e-01, 3.67977649e-01],
           [9.82017875e-01, 6.22487664e-01],
           [7.70088553e-01, 7.69430220e-01],
           [7.70804822e-01, 7.65877843e-01],
           [1.00000000e+00, 6.21931195e-01],
           [5.16272545e-01, 3.78439784e-01],
           [8.65725100e-01, 5.29811561e-01],
           [5.76321423e-01, 7.38793075e-01],
           [5.71399033e-01, 7.55231142e-01],
           [8.59548867e-01, 5.97527027e-01],
           [5.21215796e-01, 3.39219332e-01],
           [7.69934297e-01, 7.82899201e-01],
           [7.74726272e-01, 7.99651206e-01],
           [5.11989355e-01, 3.21145296e-01],
           [8.63555312e-01, 5.74779034e-01],
           [5.79283595e-01, 7.43836045e-01],
           [4.46372151e-01, 4.20742869e-01],
           [4.42506313e-01, 2.78049320e-01],
           [3.22920680e-01, 6.73968077e-01],
           [5.71109712e-01, 7.87904024e-01],
           [8.85180473e-01, 6.13964677e-01],
           [5.05266011e-01, 3.78317326e-01],
           [7.91613698e-01, 7.16017783e-01],
           [4.50733423e-01, 3.75497431e-01],
           [7.89266765e-01, 7.15016484e-01],
           [3.56960475e-01, 2.98306078e-01],
           [8.63006055e-01, 6.26766682e-01],
           [8.55437160e-01, 5.49330831e-01],
           [5.62943280e-01, 7.81707823e-01],
           [3.15390468e-01, 6.78839445e-01],
           [4.51380849e-01, 3.04520398e-01],
           [5.78324974e-01, 7.77113616e-01],
           [8.90495539e-01, 6.34768724e-01],
           [5.15006125e-01, 3.49437058e-01],
           [8.17594051e-01, 6.46463871e-01],
           [9.06470120e-01, 5.79871655e-01],
           [8.00006151e-01, 6.40044153e-01],
           [5.46061635e-01, 3.39911550e-01],
           [8.67089629e-01, 6.13316178e-01],
           [5.74792325e-01, 7.57430434e-01]], dtype=float32)
    
    opened by mcsimenc 5
  • gap statistic calculation incorrect?


    Looking at this notebook, I noticed that you are calculating the gap statistic as:

    gap = np.log(np.mean(refDisps)) - np.log(origDisp)

    Shouldn't it be

    gap = np.mean(np.log(refDisps)) - np.log(origDisp)

    instead?

    enhancement help wanted 
    opened by rakshita95 5
  • Range of the uniform reference distribution


    In the original paper by Tibshirani et al., the authors consider two choices for the reference distribution:

    • generate each reference feature uniformly over the range of the observed values for that feature;
    • generate the reference features from a uniform distribution over a box aligned with the principal components of the data.

    In the mod.rs file, it seems that random data is generated from a standard uniform U[0, 1] for all features, and not from U[X[:, i].min(), X[:, i].max()]. Is that approach intentional? Thanks!
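
    For reference, the first choice would correspond to something like this numpy sketch (my illustration, not the crate's code):

    import numpy as np

    # Sample each reference feature uniformly over its observed range
    low, high = X.min(axis=0), X.max(axis=0)
    reference = np.random.uniform(low, high, size=X.shape)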

    opened by dvukolov 4
  • Provide standard deviation for reference samples?


    Hi,

    Thanks for providing this nice tool! It's very easy to plug into my work. :-)

    I was wondering if you'd thought about, instead of just returning the gap statistic for each value of k, returning the value s_k for the reference distribution as well, so we can easily do the Tibshirani-recommended approach: choose the smallest k such that Gap(k) > Gap(k+1) - s(k+1).

    https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/

    Thanks again for writing this package!

    Russell

    opened by drussellmrichie 4
  • question


    Why are the results different every time the program runs, given the same data and the same OptimalK object? Even using the examples you provide, the results are different.

    How can I get the same/definite result?

    Moreover, how do I use "clusterer_kwargs"? Could you show me an example?

    opened by nry123 3
  • The n_refs argument ignored in Rust


    Hi Miles, it seems that the n_refs argument is silently ignored when using the Rust backend. This leads to additional discrepancies with the Python version.

    opened by dvukolov 3
  • OptimalK _calculate_gap method passes incorrect data to clusterer?


    In lines 181-183 of optimalK.py, the _calculate_gap() method finds the dispersion measure of the observed data-set:

    centroids, labels = self.clusterer(random_data, n_clusters, *self.clusterer_kwargs)  # type: Tuple[np.ndarray, np.ndarray]

    However, it looks like this line passes the null data-set instead of the observed data-set. The centroids and labels of this call are then passed to the _calculate_dispersion() method using the observed data-set. This means the dispersion metric is using the incorrect centroids and labels, which gives a larger dispersion metric for the observed set.

    I changed this line to pass the observed data instead of the null data and this is the result compared to an implementation I did (labeled Gap2) :).

    Reference_Gap

    opened by johnvorsten 3
  • RuntimeWarning: divide by zero encountered in log


    I am encountering the following RuntimeWarning:

    ...\Anaconda3\lib\site-packages\gap_statistic\optimalK.py:292: RuntimeWarning: divide by zero encountered in log
    log_dispersion = np.log(dispersion)
    

    From what I read, the warning means dispersion=0, so np.log(dispersion) is undefined. Despite the warning, the results are not affected as far as I can see. I am using version 2.0.0 of gap-stat, installed via pip within conda:

    # Name                    Version                   Build  Channel
    gap-stat                  2.0.0                    pypi_0    pypi
    numpy                     1.18.1           py37h93ca92e_0
    numpy-base                1.18.1           py37hc3f5095_1
    numpydoc                  0.9.2                      py_0
    
    opened by rtrad89 2
  • Wrong results?


    I just ran the sample notebook without changing anything. The gap-stat values are very different from the ones shown in the sample notebook. It picked k=11 (instead of k=3) as the optimal k. Can someone confirm that the code is working fine?

    opened by shahabh 2
  • Question


    Sorry! I tried again, but the results are still different across runs. Maybe something is wrong in my procedure. Could you help me check it?

    #!/usr/bin/env python
    # coding: utf-8

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from gap_statistic import OptimalK
    try:
        from sklearn.datasets.samples_generator import make_blobs
    except ImportError:
        from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans

    X, y = make_blobs(n_samples=int(1e5), n_features=2, centers=3, random_state=25)
    print('Data shape: ', X.shape)

    Test the first example.

    The 1st run of the gap statistic:

    optimalK = OptimalK(parallel_backend='rust')
    n_clusters = optimalK(X, cluster_array=np.arange(1, 15))
    print('Optimal clusters: ', n_clusters)

    plt.plot(optimalK.gap_df.n_clusters, optimalK.gap_df.gap_value, linewidth=3)
    plt.scatter(optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].n_clusters,
                optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].gap_value,
                s=250, c='r')
    plt.grid(True)
    plt.xlabel('Cluster Count')
    plt.ylabel('Gap Value')
    plt.title('Gap Values by Cluster Count')
    plt.show()
    image

    The 2nd run of the gap statistic (same code, including the plotting):

    optimalK = OptimalK(parallel_backend='rust')
    n_clusters = optimalK(X, cluster_array=np.arange(1, 15))
    print('Optimal clusters: ', n_clusters)
    image

    The 3rd run of the gap statistic (same code again):

    optimalK = OptimalK(parallel_backend='rust')
    n_clusters = optimalK(X, cluster_array=np.arange(1, 15))
    print('Optimal clusters: ', n_clusters)
    image

    Test the second example: I fix random_state, define the OptimalK instance, and pass in our own clustering function.

    def special_clustering_func(X, k):
        m = KMeans(random_state=0)
        m.fit(X)
        return m.cluster_centers_, m.predict(X)

    The first run of the gap statistic:

    optimalk = OptimalK(clusterer=special_clustering_func)
    n_clusters = optimalk(X, n_refs=3, cluster_array=range(1, 15))

    plt.plot(optimalk.gap_df.n_clusters, optimalk.gap_df.gap_value, linewidth=3)
    plt.scatter(optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].n_clusters,
                optimalk.gap_df[optimalk.gap_df.n_clusters == n_clusters].gap_value,
                s=250, c='r')
    plt.grid(True)
    plt.xlabel('Cluster Count')
    plt.ylabel('Gap Value')
    plt.title('Gap Values by Cluster Count')
    plt.show()
    image

    The second run of the gap statistic (same code):

    optimalk = OptimalK(clusterer=special_clustering_func)
    n_clusters = optimalk(X, n_refs=3, cluster_array=range(1, 15))
    image

    The third run of the gap statistic (same code):

    optimalk = OptimalK(clusterer=special_clustering_func)
    n_clusters = optimalk(X, n_refs=3, cluster_array=range(1, 15))
    image

    opened by nry123 1
  • Pull out kmeans impl into own crate/package


    Right now, a full kmeans implementation sits here... Basically a freebie to pull this out into its own crate and use that.

    • [ ] External Kmeans crate
    • [ ] Implement another Python package with that crate.
    opened by milesgranger 0