A Rust machine learning framework.

Overview

Linfa

linfa (Italian) / sap (English):

The vital circulating fluid of a plant.

linfa aims to provide a comprehensive toolkit to build Machine Learning applications with Rust.

Kin in spirit to Python's scikit-learn, it focuses on common preprocessing tasks and classical ML algorithms for your everyday ML tasks.

Documentation: latest Community chat: Zulip

Current state

Where does linfa stand right now? Are we learning yet?

linfa currently provides sub-packages with the following algorithms:

| Name | Purpose | Status | Category | Notes |
| --- | --- | --- | --- | --- |
| clustering | Data clustering | Tested / Benchmarked | Unsupervised learning | Clustering of unlabeled data; contains K-Means, Gaussian-Mixture-Model and DBSCAN |
| kernel | Kernel methods for data transformation | Tested | Pre-processing | Maps feature vector into higher-dimensional space |
| linear | Linear regression | Tested | Partial fit | Contains Ordinary Least Squares (OLS), Generalized Linear Models (GLM) |
| elasticnet | Elastic Net | Tested | Supervised learning | Linear regression with elastic net constraints |
| logistic | Logistic regression | Tested | Partial fit | Builds two-class logistic regression models |
| reduction | Dimensionality reduction | Tested | Pre-processing | Diffusion mapping and Principal Component Analysis (PCA) |
| trees | Decision trees | Experimental | Supervised learning | Linear decision trees |
| svm | Support Vector Machines | Tested | Supervised learning | Classification or regression analysis of labeled datasets |
| hierarchical | Agglomerative hierarchical clustering | Tested | Unsupervised learning | Cluster and build hierarchy of clusters |
| bayes | Naive Bayes | Tested | Supervised learning | Contains Gaussian Naive Bayes |
| ica | Independent component analysis | Tested | Unsupervised learning | Contains FastICA implementation |

We believe that only a significant community effort can nurture, build, and sustain a machine learning ecosystem in Rust - there is no other way forward.

If this strikes a chord with you, please take a look at the roadmap and get involved!

BLAS/LAPACK backend

At the moment you can choose between the following BLAS/LAPACK backends: openblas, netlib or intel-mkl.

| Backend | Linux | Windows | macOS |
| --- | --- | --- | --- |
| OpenBLAS | ✔️ | - | - |
| Netlib | ✔️ | - | - |
| Intel MKL | ✔️ | ✔️ | ✔️ |

For example, if you want to use the system Intel MKL library for the PCA example, pass the corresponding feature:

cd linfa-reduction && cargo run --release --example pca --features linfa/intel-mkl-system

This selects the intel-mkl system library as the BLAS/LAPACK backend. If you would rather compile the library and link against the generated artifacts, pass intel-mkl-static instead.

License

Dual-licensed to be compatible with the Rust project.

Licensed under the Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0 or the MIT license http://opensource.org/licenses/MIT, at your option. This file may not be copied, modified, or distributed except according to those terms.

Comments
  • Optimize K-means

    Optimized the averaging step of K-means to use non-moving averages and arrays instead of hashmaps. The code also updates the new centroids in place. Also fixed the deprecation warnings on all benchmarks in linfa-clustering (other benchmarks may have the same warnings). I'm curious about the effect of my changes on benchmark speeds, since my machine isn't suited to running benchmarks.
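The averaging step described above can be sketched without any dependencies. This is an illustration only, not the code from the PR: the function name `update_centroids` and the use of `Vec<Vec<f64>>` instead of ndarray types are assumptions made to keep the example self-contained.

```rust
/// Recompute centroids from cluster assignments using flat sum and count
/// arrays indexed by cluster id (no hash map, no running average).
fn update_centroids(points: &[Vec<f64>], assignments: &[usize], centroids: &mut [Vec<f64>]) {
    let dim = centroids.first().map_or(0, |c| c.len());
    let mut sums = vec![vec![0.0; dim]; centroids.len()];
    let mut counts = vec![0usize; centroids.len()];

    // Accumulate per-cluster sums and membership counts in one pass.
    for (point, &cluster) in points.iter().zip(assignments) {
        counts[cluster] += 1;
        for (s, x) in sums[cluster].iter_mut().zip(point) {
            *s += x;
        }
    }

    // Overwrite each centroid in place; empty clusters keep their old value.
    for ((centroid, sum), &count) in centroids.iter_mut().zip(&sums).zip(&counts) {
        if count > 0 {
            for (c, s) in centroid.iter_mut().zip(sum) {
                *c = s / count as f64;
            }
        }
    }
}

fn main() {
    let points = vec![vec![0.0, 0.0], vec![2.0, 2.0], vec![10.0, 10.0]];
    let assignments: [usize; 3] = [0, 0, 1];
    let mut centroids = vec![vec![0.0, 0.0], vec![0.0, 0.0]];
    update_centroids(&points, &assignments, &mut centroids);
    println!("{:?}", centroids);
}
```

Summing first and dividing once per cluster avoids the per-sample division of a running average, and indexing by cluster id replaces the hash lookups.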

    opened by YuhanLiin 29
  • add PredictInto trait and impl for KMeans

    Work towards https://github.com/rust-ml/linfa/issues/130

    It would be possible to make a default implementation for any type that implements PredictRef. But we do not have specialization yet.

    Does only implementing this, for types that can do this more efficiently, make sense?

    opened by pYtoner 20
  • Fit trait modification and cross validation proposal

    Changes in the Fit trait

    from this:

    pub trait Fit<'a, R: Records, T> {
        type Object: 'a;
    
        fn fit(&self, dataset: &DatasetBase<R, T>) -> Self::Object;
    }
    

    to this:

    pub trait Fit<R: Records, T, E: std::error::Error + std::convert::From<linfa::Error>> {
        type Object;
    
        fn fit(&self, dataset: &DatasetBase<R, T>) -> Result<Self::Object, E>;
    }
    

    by:

    • removing 'a lifetime (left from previous svm implementation, not actually used by any algorithm anymore)
    • forcing every implementation to return a result with an error struct (every implementation except PCA already returned an error, some implementations returned a String as the error but the transition keeps the same error messages)

    Edit 1:

    Added conversion from linfa error bound on fit error type. Every sub-crate should be able to handle the errors caused by using the base crate
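The shape of the proposed trait can be illustrated with a dependency-free sketch. Everything here is a stand-in: `BaseError` plays the role of `linfa::Error`, a plain `Vec<f64>` replaces `DatasetBase<R, T>`, and `MeanParams`, `MeanModel` and `MeanError` are hypothetical names for an imaginary sub-crate.

```rust
use std::error::Error;
use std::fmt;

/// Stand-in for the base crate's error type (hypothetical, not `linfa::Error`).
#[derive(Debug)]
enum BaseError {
    NotEnoughSamples,
}

impl fmt::Display for BaseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "not enough samples")
    }
}

impl Error for BaseError {}

/// Simplified shape of the proposed trait: no lifetime parameter, fallible `fit`,
/// and an error type that every base-crate error can be converted into.
trait Fit<D, E: Error + From<BaseError>> {
    type Object;

    fn fit(&self, dataset: &D) -> Result<Self::Object, E>;
}

/// Error type of an imaginary sub-crate.
#[derive(Debug)]
enum MeanError {
    Base(BaseError),
}

impl fmt::Display for MeanError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MeanError::Base(e) => write!(f, "base crate error: {}", e),
        }
    }
}

impl Error for MeanError {}

impl From<BaseError> for MeanError {
    fn from(e: BaseError) -> Self {
        MeanError::Base(e)
    }
}

struct MeanParams;

struct MeanModel {
    mean: f64,
}

impl Fit<Vec<f64>, MeanError> for MeanParams {
    type Object = MeanModel;

    fn fit(&self, dataset: &Vec<f64>) -> Result<MeanModel, MeanError> {
        if dataset.is_empty() {
            // The `From` bound lets a base-crate error become a sub-crate error.
            return Err(BaseError::NotEnoughSamples.into());
        }
        let mean = dataset.iter().sum::<f64>() / dataset.len() as f64;
        Ok(MeanModel { mean })
    }
}

fn main() {
    let model = MeanParams.fit(&vec![1.0, 2.0, 3.0]).unwrap();
    println!("mean = {}", model.mean);
}
```

Because every error type must implement `From` for the base error, generic helpers written in the base crate can use `?` on their own errors and still return the sub-crate's error type.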

    Cross validation POC

    Cross-validation can be defined by exploiting the new Fit definition. This is what it looks like for regression:

        use linfa::prelude::*;
        use linfa_elasticnet::{ElasticNet, Result};
    
        // load Diabetes dataset (mutable to allow fast k-folding)
        let mut dataset = linfa_datasets::diabetes();
    
        // parameters to compare
        let ratios = vec![0.1, 0.2, 0.5, 0.7, 1.0];
    
        // create a model for each parameter
        let models = ratios
            .iter()
            .map(|ratio| ElasticNet::params().penalty(0.3).l1_ratio(*ratio))
            .collect::<Vec<_>>();
    
        // get the mean r2 validation score across all folds for each model
        let r2_values =
            dataset.cross_validate(5, &models, |prediction, truth| prediction.r2(&truth))?;
    
        for (ratio, r2) in ratios.iter().zip(r2_values.iter()) {
            println!("L1 ratio: {}, r2 score: {}", ratio, r2);
        }
    

    And this is what it looks like for classification:

        use linfa::prelude::*;
        use linfa_logistic::error::Result;
        use linfa_logistic::LogisticRegression;
    
        // Load dataset. Mutability is needed for fast cross validation
        let mut dataset =
            linfa_datasets::winequality().map_targets(|x| if *x > 6 { "good" } else { "bad" });
    
        // define a sequence of models to compare. In this case the
        // models will differ by the amount of l2 regularization
        let alphas = vec![0.1, 1., 10.];
        let models: Vec<_> = alphas
            .iter()
            .map(|alpha| {
                LogisticRegression::default()
                    .alpha(*alpha)
                    .max_iterations(150)
            })
            .collect();
    
        // use cross validation to compute the validation accuracy of each model. The
        // accuracy of each model will be averaged across the folds, 5 in this case
        let accuracies = dataset.cross_validate(5, &models, |prediction, truth| {
            Ok(prediction.confusion_matrix(truth)?.accuracy())
        })?;
    
        // display the accuracy of the models along with their regularization coefficient
        for (alpha, accuracy) in alphas.iter().zip(accuracies.iter()) {
            println!("Alpha: {}, accuracy: {} ", alpha, accuracy);
        }
    

    Possible follow up

    Redefine the Transformer trait to return a Result, like the t-sne implementation does, in order to avoid panicking as much as possible. It would be better to wait for #121 in order to avoid dealing with the unwrap() calls when performing float conversions.

    Notes

    • ~~Currently there are two versions of cross-validation: one for single target datasets and one for multi-target datasets. The main reasons for this are:~~
      • ~~Array1s of evaluation values can be constructed with collect, Array2 cannot and must be populated row by row (could be avoided by stacking, or maybe there is a dedicated ndarray method I am not aware of)~~
      • ~~Regression metrics behave differently when they are applied Array1-Array2 or Array2-Array1, making writing evaluation closures possibly more difficult than it should.~~

    ~~More than likely there is a solution to both problems, but I got stuck on it for too long and so I would consider it out of scope for this PR, unless someone has an easy solution to suggest. Forcing the user to give a single return value in the multi-target case (like a mean across the targets) could make the problem easier but the evaluation for each target would be lost in the process.~~

    • Got this error in elasticnet just once:
         Running target/release/deps/linfa_elasticnet-d4c6cf99e56de5dd
    
    running 11 tests
    test algorithm::tests::coordinate_descent_lowers_objective ... ok
    test algorithm::tests::elastic_net_2d_toy_example_works ... ok
    test algorithm::tests::elastic_net_diabetes_1_works_like_sklearn ... ok
    test algorithm::tests::diabetes_z_score ... ok
    test algorithm::tests::elastic_net_penalty_works ... ok
    test algorithm::tests::elastic_net_toy_example_works ... ok
    test algorithm::tests::lasso_toy_example_works ... ok
    test algorithm::tests::lasso_zero_works ... ok
    error: test failed, to rerun pass '-p linfa-elasticnet --lib'
    
    Caused by:
      process didn't exit successfully: `/home/ivano/Scrivania/github_projects/linfa/target/release/deps/linfa_elasticnet-d4c6cf99e56de5dd` (signal: 11, SIGSEGV: invalid memory reference)
    
    opened by Sauro98 17
  • Added Approximated dbscan to linfa-clustering

    I added an implementation of the Approximated DBSCAN to the linfa-clustering subcrate. This algorithm depends on the crate partitions for an implementation of the union-find structure. This can be removed by using adjacency lists, at the cost of performance. Another big thing: I noticed that the existing vanilla DBSCAN implementation used an axis convention that was the opposite to the one used by the generate_blobs function, which it uses in the benches. I modified the DBSCAN implementation to invert the axes in order to be able to compare benches of the two implementations, and I also modified the tests to pass this modification. I modified the vanilla implementation only to be able to compare the performances, I did not want to judge the previous implementation. This was also my first time using ndarray so any suggestion is very welcome.
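For readers unfamiliar with the union-find structure mentioned above, here is a minimal standalone sketch. It is not the API of the partitions crate; the type `UnionFind` and its methods are written from scratch for illustration. In a DBSCAN-style algorithm, such a structure can be used to merge groups of points that turn out to belong to the same cluster.

```rust
/// Minimal disjoint-set (union-find) with path compression and union by size.
struct UnionFind {
    parent: Vec<usize>,
    size: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        UnionFind {
            parent: (0..n).collect(),
            size: vec![1; n],
        }
    }

    /// Representative of the set containing `x`.
    fn find(&mut self, x: usize) -> usize {
        let mut root = x;
        while self.parent[root] != root {
            root = self.parent[root];
        }
        // Path compression: point every visited node directly at the root.
        let mut node = x;
        while self.parent[node] != root {
            let next = self.parent[node];
            self.parent[node] = root;
            node = next;
        }
        root
    }

    /// Merge the sets containing `a` and `b`; returns false if they were already merged.
    fn union(&mut self, a: usize, b: usize) -> bool {
        let (mut ra, mut rb) = (self.find(a), self.find(b));
        if ra == rb {
            return false;
        }
        // Attach the smaller tree below the larger one.
        if self.size[ra] < self.size[rb] {
            std::mem::swap(&mut ra, &mut rb);
        }
        self.parent[rb] = ra;
        self.size[ra] += self.size[rb];
        true
    }
}

fn main() {
    let mut uf = UnionFind::new(5);
    uf.union(0, 1);
    uf.union(3, 4);
    println!("0 and 1 connected: {}", uf.find(0) == uf.find(1));
    println!("1 and 3 connected: {}", uf.find(1) == uf.find(3));
}
```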

    opened by Sauro98 17
  • Cargo run failed: Undefined symbols for architecture x86_64

    I followed the LinearRegression example and got an error on cargo run.

    Code:

    use linfa::prelude::SingleTargetRegression;
    use linfa::traits::{Fit, Predict};
    use linfa_linear::LinearRegression;
    
    
    fn main() {
        let dataset = linfa_datasets::diabetes();
        let model = LinearRegression::default().fit(&dataset).unwrap(); // the error seems to happen on this line
        let pred = model.predict(&dataset);
        pred.r2(&dataset).unwrap();
    }
    

    Err output:

      = note: Undefined symbols for architecture x86_64:
                "_dgelsd_", referenced from:
                    lapack::dgelsd::h78796e3523b23ad8 in liblax-5505ca61a5fbc67a.rlib(lax-5505ca61a5fbc67a.lax.d19e9304-cgu.2.rcgu.o)
                "_cblas_sdot", referenced from:
                    ndarray::linalg::impl_linalg::_$LT$impl$u20$ndarray..ArrayBase$LT$S$C$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$1$u5d$$GT$$GT$$GT$::dot_impl::h7adddea61e00d52a in rust_ds-fc4af6c90afa881c.4mtkwkgjvot8wzn2.rcgu.o
                "_cblas_ddot", referenced from:
                    ndarray::linalg::impl_linalg::_$LT$impl$u20$ndarray..ArrayBase$LT$S$C$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$1$u5d$$GT$$GT$$GT$::dot_impl::h7adddea61e00d52a in rust_ds-fc4af6c90afa881c.4mtkwkgjvot8wzn2.rcgu.o
              ld: symbol(s) not found for architecture x86_64
              clang: error: linker command failed with exit code 1 (use -v to see invocation)
    

    OS: Mac Catalina v10.15.7 Rust: rustc 1.58.1

    I have tried cargo clean but still can't fix the issue.

    opened by weiztech 15
  • Add Kmeans++ and Kmeans|| initialization

    This PR adds K-means++ as an initialization function as well as K-means||, which is a version of K-means++ that scales better with higher cluster counts, described in this paper. A new hyperparameter has also been added to specify the choice of initialization algorithm. I want to have the design reviewed before going any further.
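As a rough illustration of K-means++ seeding (not the PR's implementation, and not K-means||), the sketch below picks the first centroid uniformly and every later one with probability proportional to its squared distance from the nearest centroid chosen so far. The `XorShift` generator and the function names are assumptions made to keep the example free of dependencies; real code would use a proper RNG.

```rust
/// Tiny xorshift64 generator, used only to keep the sketch dependency-free.
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Map the top 53 bits to [0, 1).
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// K-means++ seeding over the rows of `points`.
fn kmeans_plus_plus(points: &[Vec<f64>], k: usize, rng: &mut XorShift) -> Vec<Vec<f64>> {
    assert!(k >= 1 && k <= points.len());
    let first = ((rng.next_f64() * points.len() as f64) as usize).min(points.len() - 1);
    let mut centroids = vec![points[first].clone()];
    // Squared distance from every point to its nearest chosen centroid.
    let mut nearest: Vec<f64> = points.iter().map(|p| sq_dist(p, &centroids[0])).collect();

    while centroids.len() < k {
        let total: f64 = nearest.iter().sum();
        // Sample an index by walking the cumulative weights.
        let mut threshold = rng.next_f64() * total;
        let mut chosen = points.len() - 1;
        for (i, &w) in nearest.iter().enumerate() {
            if threshold < w {
                chosen = i;
                break;
            }
            threshold -= w;
        }
        centroids.push(points[chosen].clone());
        for (d, p) in nearest.iter_mut().zip(points) {
            *d = d.min(sq_dist(p, &points[chosen]));
        }
    }
    centroids
}

fn main() {
    let points = vec![vec![0.0, 0.0], vec![0.0, 1.0], vec![10.0, 10.0], vec![10.0, 11.0]];
    let mut rng = XorShift(0x9E37_79B9_7F4A_7C15);
    println!("{:?}", kmeans_plus_plus(&points, 2, &mut rng));
}
```

Points that are far from all existing centroids get a large weight, so the seeds tend to spread over the data, which is why this kind of initialization can let K-means converge in fewer iterations.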

    Todo:

    • [x] Integrate the new initialization algorithms to the benchmarks, since they should allow K-means to complete faster.
    • [x] Pick default initialization function.
    opened by YuhanLiin 15
  • Provide non-accelerated algorithms without LAPACK routines

    The people at rust-cv recently looked at some clustering algorithms in linfa, and one issue that came up is the dependency on LAPACK routines. In some situations you want to avoid these, for example in an embedded context.

    Take for example the Gaussian mixture model: the current implementation uses a Cholesky decomposition for faster inverse and determinant computation, which speeds up the log-likelihood (LL) computation. We could provide a second implementation, activated when the corresponding feature flags are disabled, that only provides models with independent univariate Gaussians.
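To make the role of the Cholesky decomposition concrete, here is a small pure-Rust sketch (no LAPACK) that factors a covariance matrix and derives its log-determinant from the factor. It is a textbook Cholesky-Banachiewicz loop written for illustration, not the routine linfa calls; the function names are assumptions.

```rust
/// Cholesky factor L (lower triangular) with A = L * L^T.
/// Returns None if `a` is not symmetric positive definite.
fn cholesky(a: &[Vec<f64>]) -> Option<Vec<Vec<f64>>> {
    let n = a.len();
    let mut l = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..=i {
            let dot: f64 = (0..j).map(|k| l[i][k] * l[j][k]).sum();
            if i == j {
                let diag = a[i][i] - dot;
                if diag <= 0.0 {
                    return None;
                }
                l[i][j] = diag.sqrt();
            } else {
                l[i][j] = (a[i][j] - dot) / l[j][j];
            }
        }
    }
    Some(l)
}

/// log det(A) = 2 * sum_i ln(L_ii), computed from the Cholesky factor.
fn log_det(l: &[Vec<f64>]) -> f64 {
    2.0 * l.iter().enumerate().map(|(i, row)| row[i].ln()).sum::<f64>()
}

fn main() {
    let cov = vec![vec![4.0, 2.0], vec![2.0, 3.0]];
    let l = cholesky(&cov).expect("covariance must be positive definite");
    println!("L = {:?}, log det = {}", l, log_det(&l));
}
```

With independent univariate Gaussians the covariance is diagonal, so the log-determinant reduces to the sum of the log-variances and no factorization is needed at all.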

    infrastructure 
    opened by bytesnake 14
  • Introduce unchecked hyperparams to DBSCAN, Appx DBSCAN, and KMeans

    Similar to what I did for KMeans, I reintroduced hyperparam builders to DBSCAN and approximate DBSCAN. This allows the verification of hyperparams to be done by returning errors rather than panicking. I believe using a builder is more idiomatic than doing a verify() check, since the builder forces the verification to be done exactly once before the hyperparams can be used to run any algorithm. This PR introduces a new trait called UncheckedHyperParams to represent the verification of hyperparameters and integrates it for DBSCAN, appx DBSCAN, and KMeans.
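The builder idea can be sketched as a pair of types: one that may hold invalid values and one that is only produced by a successful check. The names below (`UncheckedKMeansParams`, `KMeansParams`, `ParamError`, `check`) are hypothetical and do not reproduce the trait from the PR.

```rust
#[derive(Debug, PartialEq)]
enum ParamError {
    ZeroClusters,
    NonPositiveTolerance,
}

/// Builder whose fields may still hold invalid values.
#[derive(Debug, Clone, PartialEq)]
struct UncheckedKMeansParams {
    n_clusters: usize,
    tolerance: f64,
}

/// Intended to be constructed only through `check`. In a real crate the inner
/// field would be private to its module to enforce this.
#[derive(Debug, Clone, PartialEq)]
struct KMeansParams(UncheckedKMeansParams);

impl UncheckedKMeansParams {
    fn new(n_clusters: usize) -> Self {
        Self { n_clusters, tolerance: 1e-4 }
    }

    fn tolerance(mut self, tolerance: f64) -> Self {
        self.tolerance = tolerance;
        self
    }

    /// Validate once, returning an error instead of panicking.
    fn check(self) -> Result<KMeansParams, ParamError> {
        if self.n_clusters == 0 {
            return Err(ParamError::ZeroClusters);
        }
        // Written as a negation so that NaN is rejected as well.
        if !(self.tolerance > 0.0) {
            return Err(ParamError::NonPositiveTolerance);
        }
        Ok(KMeansParams(self))
    }
}

impl KMeansParams {
    fn n_clusters(&self) -> usize {
        self.0.n_clusters
    }
}

fn main() {
    let params = UncheckedKMeansParams::new(3).tolerance(1e-5).check();
    println!("{:?}", params.map(|p| p.n_clusters()));
}
```

Since an algorithm would only accept the checked type, validation cannot be forgotten and never needs to be repeated.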

    opened by YuhanLiin 14
  • Documentation maintenance and small fixes

    Here is a collection of work that needs to be done regarding documentation and minor issues with the code itself. Picking an item from this list can be a good way to start getting acquainted with the codebase and provide a useful contribution. 😄 👍🏻

    In general, I would say that the pages for the individual algorithms in the linfa-clustering and linfa-elasticnet sub-crates can be good references for the structure of a documentation page.

    It's suggested to look at the existing documentation in your local build rather than on docs.rs since some pages may have already been updated.

    • [x] Update linfa-bayes's sub-crate documentation to include a brief description of the algorithm used and add an example of the usage of the provided model along with maybe some hints regarding when to choose a naive Bayes predictor. There are already some examples in the dedicated page for the params structure but maybe it's better to add at least one to the main page for the sub-crate
    • [x] linfa-hierarchical: The documentation for this sub-crate needs improvements like the ones listed for linfa-bayes
    • [x] linfa-ica: Also needs updates similar to linfa-bayes
    • [x] linfa-kernel: Add a description of what kernel methods are useful for, similarly to what's written in the linfa-svm crate, and add descriptions to the Kernel and KernelView subtypes. There are some examples in the KernelBase struct Documentation but maybe it would be a good idea to have an example of kernel transformation directly on the crate's main page.
    • [x] linfa-linear: The crate's documentation is inside the ols module instead of being in the crate's root and it is outdated since it says it only provides an implementation of the least squares algorithm
    • [x] linfa-linear: glm::TweedieRegressor provides its own fit method instead of implementing the Fit trait, and the same for predict. Moving the methods inside the trait would be a good way to align it with the rest of the algorithms provided and ensure compatibility
    • [x] linfa-logistic could use some more documentation to explain the algorithm and some examples in its main page
    • [x] linfa-logistic implements its own fit and predict methods instead of implementing the Fit and Predict traits. Like in the case of linear regression it would be a good thing to align the interface with the other sub-crates
    • [x] linfa-reduction completely lacks any documentation. An explanation of why dimensionality reduction is useful in ML and descriptions of the single algorithms would really help with understanding the usefulness of the crate
    • [x] linfa-svm: this is another reference for what other crates' documentation should look like. Right now the predict method returns an Array1 for classification and a Vec for regression. It would be less confusing if the regression case was modified to return an Array1 too.
    • [x] linfa-trees: it needs changes similar to linfa-bayes, with the addition of documentation regarding the methods for the params structure
    • [x] linfa: The main page of the rustdocs still says that linfa only provides the K-Means algorithm, which may be very confusing to someone who sees the crate on docs.rs for the first time. Here a completely revised page, complete with project goals and links to the various sub-crates, would be useful, so that one can find the algorithm they need without manually searching for the sub-crates.
    • [x] linfa: The dataset module page could use a bit of explanation about the main differences between the four dataset types and a brief recollection of what utility methods a dataset provides
    • [x] ~~linfa: The metrics module page could use a brief list of the provided metrics so that one does not have to go looking for them in the sub-pages if they only want to know whether a metric is provided or not~~
    help wanted good first issue 
    opened by Sauro98 14
  • Introduce checked parameters to all algorithms

    Current state:

    • [x] port algorithm's parameters to new syntax
    • [ ] improve documentation (use linfa-bayes as prototype)
    • [x] fix issues with Transform and double results
    opened by bytesnake 13
  • WIP OPTICS clustering

    Still working out how best to test it but it seems mostly there. Wiki for details https://en.wikipedia.org/wiki/OPTICS_algorithm there's also a paper I used as reference at the start I'll see if I can dig that out if anyone wants. Also I think I may have to change some of the neighbour stuff to be more correct. I'll figure it out 👀

    opened by xd009642 13
  • Broken link in linfa-logistic documentation

    I was just scrolling through part of the linfa-logistic documentation and came across a broken link to https://docs.rs/linfa-logistic/latest/linfa_logistic/struct.LogisticRegression.html from the crate's index page. It might make sense to add a CI check that runs a link checker over the index.html files created by cargo doc in each of the sub-crates to help keep an eye on this.

    One option for this could be the lychee crate, which seems to have a number of installation options (although cargo install probably isn't a good option, since it has something like 400 deps to build from source).

    opened by quietlychris 0
  • Update argmin to version 0.6.0

    This PR updates argmin to the most recent version 0.6.0. In this version the serde dependency is optional, which addresses #48. It only concerns linfa-linear and linfa-logistic.

    linfa-linear

    This one was fairly easy to do. I'm unsure what the exact reasons for introducing ArgminParam were. I managed to remove ArgminParam by adding the necessary trait bounds to TweedieRegressorValidParams, which avoids having to re-implement the math traits. Let me know if I missed something and I'll introduce ArgminParam again.

    linfa-logistic

    I did not remove ArgminParam here, but I think it should also be possible.

    EDIT: I see now why you haven't upgraded to 0.5 ;). I guess it only makes sense to upgrade once the MSRV is increased. Also, I noticed that you want to avoid pulling in ndarray-linalg whenever possible. I somehow by accident removed the linalg feature from argmin-math and I'll have to see how to add that again (in the current setup it will always have ndarray-linalg as a dependency). I'm unsure how to proceed here. You can close this if you want but you can also leave this around for later ;)

    opened by stefan-k 4
  • Implement PCA for f32 and f64

    Addresses #232

    Adds PCA impl for f32 for all BLAS backends using macros. Does not involve the Float trait.

    There is currently a logic bug, as the test_explained_variance_diag and test_whitening_small tests are currently failing on f32. It might be related to the TruncateSvd code. @bytesnake any ideas?

    opened by YuhanLiin 4
  • Support for Hamming distance (l0 norm)

    Currently, Lp distance when using p=0 is broken. It tries to calculate 1/0. I have implemented Hamming Distance (l0 norm) that counts the number of positions which have different values.
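The general Lp distance raises the accumulated sum to the power 1/p, which is why p = 0 cannot go through the same formula. A count-based distance avoids that. The sketch below is written from the description above, on plain slices rather than the crate's array types, and the function name is an assumption.

```rust
/// Number of positions at which `a` and `b` hold different values.
fn hamming_distance(a: &[f64], b: &[f64]) -> usize {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [1.0, 0.0, 3.0, 5.0];
    println!("hamming distance: {}", hamming_distance(&a, &b));
}
```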

    Also, I changed the distance tests to increase code readability and norm symmetry/homogeneity checks.

    opened by jorgehermo9 2
Releases(0.6.0)
  • 0.6.0(Jun 15, 2022)

    Linfa's 0.6.0 release removes the mandatory dependency on external BLAS libraries (such as intel-mkl) by using a pure-Rust linear algebra library. It also adds the Multinomial Naive Bayes and Follow The Regularized Leader algorithms. Additionally, the AsTargets trait has been separated into AsSingleTargets and AsMultiTargets.

    No more BLAS

    With older versions of Linfa, algorithm crates that used advanced linear algebra routines needed to be linked against an external BLAS library such as Intel-MKL. This was done by adding feature flags like linfa/intel-mkl-static to the build, and it increased compile times significantly. Version 0.6.0 replaces the BLAS library with a pure-Rust implementation of all the required routines, which Linfa uses by default. This means all Linfa crates now build properly and quickly without any extra feature flags. It is still possible for the affected algorithm crates to link against an external BLAS library. Doing so requires enabling the crate's blas feature, along with the feature flag for the external BLAS library. The affected crates are as follows:

    • linfa-ica
    • linfa-reduction
    • linfa-clustering
    • linfa-preprocessing
    • linfa-pls
    • linfa-linear
    • linfa-elasticnet

    New algorithms

    Multinomial Naive Bayes is a family of Naive Bayes classifiers that assume independence between variables. The advantage is a linear fitting time with maximum-likelihood training in a closed form. The algorithm is added to linfa-bayes and an example can be found at linfa-bayes/examples/winequality_multinomial.rs.

    Follow The Regularized Leader (FTRL) is a linear model for CTR prediction in online learning settings. It is a special type of linear model with sigmoid function which uses L1 and L2 regularization. The algorithm is contained in the newly-added linfa-ftrl crate, and an example can be found at linfa-ftrl/examples/winequality.rs.

    Distinguish between single and multi-target

    Version 0.6.0 introduces a major change to the AsTargets trait, which is now split into AsSingleTargets and AsMultiTargets. Additionally, the Dataset* types are parametrized by target dimensionality, instead of always using a 2D array. Furthermore, algorithms that work on single-target data will no longer accept multi-target datasets as input. This change may cause build errors in existing code that calls the affected algorithms. The fix is as simple as adding Ix1 to the end of the type parameters for the dataset being passed in, which forces the dataset to be single-target.

    Improvements

    • Remove SeedableRng trait bound from KMeans and GaussianMixture.
    • Replace uses of Isaac RNG with Xoshiro RNG.
    • cross_validate changed to cross_validate_single, which is for single-target data; cross_validate_multi changed to cross_validate, which is for both single and multi-target datasets.
    • The probability type Pr has been constrained to 0. <= prob <= 1.. Also, the simple Pr(x) constructor has been replaced by Pr::new(x), Pr::new_unchecked(x), and Pr::try_from(x), which ensure that the invariant for Pr is met.
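The constructor split described for Pr follows a common newtype pattern. The sketch below uses a hypothetical type `Prob` rather than linfa's Pr; in particular, having `new` panic on invalid input is an assumption of this sketch, not a statement about linfa's behaviour.

```rust
use std::convert::TryFrom;

/// Hypothetical probability newtype with the invariant 0 <= p <= 1.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct Prob(f32);

/// Error carrying the rejected value.
#[derive(Debug, PartialEq)]
struct OutOfRange(f32);

impl Prob {
    /// Panics if `p` is outside [0, 1] (assumed behaviour for this sketch).
    fn new(p: f32) -> Self {
        Self::try_from(p).expect("probability must lie in [0, 1]")
    }

    /// Skips the range check; the caller must uphold the invariant.
    fn new_unchecked(p: f32) -> Self {
        Prob(p)
    }
}

impl TryFrom<f32> for Prob {
    type Error = OutOfRange;

    fn try_from(p: f32) -> Result<Self, OutOfRange> {
        // NaN is not contained in any range, so it is rejected too.
        if (0.0..=1.0).contains(&p) {
            Ok(Prob(p))
        } else {
            Err(OutOfRange(p))
        }
    }
}

fn main() {
    println!("{:?}", Prob::try_from(0.25_f32));
    println!("{:?}", Prob::try_from(1.5_f32));
    println!("{:?}", Prob::new(1.0) == Prob::new_unchecked(1.0));
}
```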
  • 0.5.1(Mar 1, 2022)

    Release 0.5.1

    Linfa's 0.5.1 release fixes errors and bugs in the previous release, as well as removing useless trait bounds on the Dataset type. Note that the commits for this release are located in the 0-5-1 branch of the GitHub repo.

    Improvements

    • remove Float trait bound from many Dataset impls, making non-float datasets usable
    • fix build errors in 0.5.0 caused by breaking minor releases from dependencies
    • fix bug in k-means where the termination condition of the algorithm was calculated incorrectly
    • fix build failure when building linfa alone, caused by incorrect feature selection for ndarray
  • 0.5.0(Oct 21, 2021)

    Linfa's 0.5.0 release adds initial support for the OPTICS algorithm, multinomial logistic regression, and the family of nearest neighbor algorithms. Furthermore, we have improved documentation and introduced hyperparameter checking to all algorithms.

    New algorithms

    OPTICS is an algorithm for finding density-based clusters. It can produce reachability plots and a hierarchical structure of clusters. Analysing data without prior assumption of any distribution is a common use case. The algorithm is added to linfa-clustering and an example can be found at linfa-clustering/examples/optics.rs.

    Extending logistic regression to the multinomial distribution generalizes it to multiclass problems. This release adds support for multinomial logistic regression to linfa-logistic; you can experiment with the example at linfa-logistic/examples/winequality_multi.rs.

    Nearest neighbor search finds the set of neighborhood points to a given sample. It appears in numerous fields of application as a distance metric provider (e.g. clustering). This release adds a family of nearest neighbor algorithms, namely Ball tree, K-d tree and naive linear search. You can find an example in the next section.
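To show what the simplest member of that family does, here is a standalone sketch of naive linear search: a range query and a k-nearest query over plain vectors with Euclidean distance. It is not the linfa-nn API; the function names and data layout are assumptions for illustration. Tree-based indices answer the same queries while skipping most distance computations.

```rust
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Indices of all points within `range` of `query` (linear scan).
fn within_range(points: &[Vec<f64>], query: &[f64], range: f64) -> Vec<usize> {
    points
        .iter()
        .enumerate()
        .filter(|(_, p)| euclidean(p, query) <= range)
        .map(|(i, _)| i)
        .collect()
}

/// Indices of the `k` points closest to `query`, nearest first.
fn k_nearest(points: &[Vec<f64>], query: &[f64], k: usize) -> Vec<usize> {
    let mut order: Vec<usize> = (0..points.len()).collect();
    order.sort_by(|&i, &j| {
        euclidean(&points[i], query).total_cmp(&euclidean(&points[j], query))
    });
    order.truncate(k);
    order
}

fn main() {
    let points = vec![vec![0.0, 0.0], vec![1.0, 0.0], vec![5.0, 5.0], vec![0.0, 2.0]];
    let query = [0.0, 0.0];
    println!("within 1.5: {:?}", within_range(&points, &query, 1.5));
    println!("3 nearest: {:?}", k_nearest(&points, &query, 3));
}
```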

    Improvements

    • use least-square solver from ndarray-linalg in linfa-linear
    • make clustering algorithms generic over distance metrics
    • bump ndarray to 0.15
    • introduce ParamGuard trait for explicit and implicit parameter checking (read more in the CONTRIBUTE.md)
    • improve documentation in various places

    Nearest Neighbors

    You can now choose from a growing list of NN implementations. The family provides efficient distance metrics to KMeans, DBSCAN etc. The example shows how to use KDTree nearest neighbor to find all the points in a set of observations that are within a certain range of a candidate point.

    You can query nearest points explicitly:

    // create a KDTree index consisting of all the points in the observations, using Euclidean distance
    let kdtree = CommonNearestNeighbour::KdTree.from_batch(observations, L2Dist)?;
    let candidate = observations.row(2);
    let points = kdtree.within_range(candidate.view(), range)?;
    

    Or use one of the distance metrics implicitly, here demonstrated for KMeans:

    use linfa_nn::distance::LInfDist;
    
    let model = KMeans::params_with(3, rng, LInfDist)
        .max_n_iterations(200)
        .tolerance(1e-5)
        .fit(&dataset)?;
    
  • 0.4.0(Apr 28, 2021)

    Linfa's 0.4.0 release introduces four new algorithms, improves documentation of the ICA and K-means implementations, adds more benchmarks to K-Means and updates to ndarray's 0.14 version.

    New algorithms

    The Partial Least Squares Regression model family is added in this release (thanks to @relf). It projects the observable, as well as predicted variables to a latent space and maximizes the correlation for them. For problems with a large number of targets or collinear predictors it gives a better performance when compared to standard regression. For more information look into the documentation of linfa-pls.

    A wrapper for Barnes-Hut t-SNE is also added in this release. The t-SNE algorithm is often used for data visualization and projects data from a high-dimensional space to a similar representation in two or three dimensions. It does so by minimizing the Kullback-Leibler divergence between the high-dimensional source distribution and the target distribution. The Barnes-Hut approximation improves the runtime drastically while retaining the performance. Kudos to github/frjnn for providing an implementation!

    A new preprocessing crate makes working with textual data and data normalization easy (thanks to @Sauro98). It implements count-vectorizer and TF-IDF normalization for text pre-processing. Normalizations for signals include linear scaling, norm scaling and whitening with PCA/ZCA/Cholesky. An example with a Naive Bayes model achieves 84% F1 score for predicting categories alt.atheism, talk.religion.misc, comp.graphics and sci.space on a news dataset.

    Platt scaling calibrates a real-valued classification model to probabilities over two classes. This is used for the SV classification when probabilities are required. Further a multi class model, combining multiple binary models (e.g. calibrated SVM models) into a single multi-class model is also added. These composing models are moved to the linfa/src/composing/ subfolder.
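In its standard formulation, Platt scaling passes the raw decision value f through a sigmoid, P(y = 1 | f) = 1 / (1 + exp(A f + B)), where A and B are fitted on held-out data. The sketch below only shows that mapping, with A and B assumed to be already fitted; the function name is hypothetical and this is not linfa's implementation.

```rust
/// Map a raw decision value `f` to a probability using fitted Platt
/// parameters `a` and `b`: P(y = 1 | f) = 1 / (1 + exp(a * f + b)).
fn platt_probability(f: f64, a: f64, b: f64) -> f64 {
    1.0 / (1.0 + (a * f + b).exp())
}

fn main() {
    // With a negative `a`, larger decision values give higher probabilities.
    let (a, b) = (-2.0, 0.0);
    for f in [-1.0, 0.0, 1.0] {
        println!("f = {:>4}: p = {:.3}", f, platt_probability(f, a, b));
    }
}
```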

    Improvements

    Numerous improvements are added to the KMeans implementation, thanks to @YuhanLiin. The implementation is optimized for offline training, an incremental training model is added and KMeans++/KMeans|| initialization gives good initial cluster means for medium and large datasets.

    We also moved to ndarray's version 0.14 and introduced F::cast for simpler floating point casting. The trait signature of linfa::Fit is changed such that it always returns a Result and error handling is added for the linfa-logistic and linfa-reduction subcrates.

    You often have to compare several model parametrizations with k-folding. For this, a new function cross_validate is added, which takes the number of folds, the model parameters and a closure for the evaluation metric. It automatically performs k-folding and averages the metric over the folds. To compare different L1 ratios of an elasticnet model, you can use it in the following way:

    // L1 ratios to compare
    let ratios = vec![0.1, 0.2, 0.5, 0.7, 1.0];
    
    // create a model for each parameter
    let models = ratios
        .iter()
        .map(|ratio| ElasticNet::params().penalty(0.3).l1_ratio(*ratio))
        .collect::<Vec<_>>();
    
    // get the mean r2 validation score across 5 folds for each model
    let r2_values =
        dataset.cross_validate(5, &models, |prediction, truth| prediction.r2(&truth))?;
    
    // show the mean r2 score for each parameter choice
    for (ratio, r2) in ratios.iter().zip(r2_values.iter()) {
        println!("L1 ratio: {}, r2 score: {}", ratio, r2);
    }
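What such a helper does internally can be sketched without linfa. The function below is a simplified stand-in, not the library's implementation: it handles a single model, splits the data into contiguous folds, and takes hypothetical `fit` and `score` closures.

```rust
/// Average a validation metric over `k` contiguous folds.
/// `fit` trains on the training part, `score` evaluates on the held-out part.
fn cross_validate<T: Clone, M>(
    data: &[T],
    k: usize,
    fit: impl Fn(&[T]) -> M,
    score: impl Fn(&M, &[T]) -> f64,
) -> f64 {
    assert!(k >= 2 && k <= data.len(), "need 2 <= k <= number of samples");
    let fold_size = data.len() / k;
    let mut total = 0.0;
    for fold in 0..k {
        let start = fold * fold_size;
        // The last fold absorbs the remainder.
        let end = if fold == k - 1 { data.len() } else { start + fold_size };
        let valid = &data[start..end];
        let train: Vec<T> = data[..start].iter().chain(&data[end..]).cloned().collect();
        let model = fit(&train);
        total += score(&model, valid);
    }
    total / k as f64
}

fn main() {
    // Toy "model": predict the mean of the training targets.
    let targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let mse = cross_validate(
        &targets,
        3,
        |train| train.iter().sum::<f64>() / train.len() as f64,
        |mean, valid| valid.iter().map(|y| (y - mean).powi(2)).sum::<f64>() / valid.len() as f64,
    );
    println!("mean validation MSE: {}", mse);
}
```

This version copies the training part for every fold, which is the simple approach; swapping folds in place instead is what makes a mutable dataset useful for fast k-folding.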
    

    Other changes

    • fix for border points in the DBSCAN implementation
    • improved documentation of the ICA subcrate
    • prevent overflowing code example in website
  • 0.3.1(Mar 11, 2021)

    In this release of Linfa the documentation is extended, new examples are added and the functionality of datasets improved. No new algorithms were added.

    Meta-issue #82 gives a good overview of the necessary documentation improvements; tests, documentation and examples were considerably extended in this release.

    New functionality was added to datasets, and multi-target datasets are introduced. Bootstrapping is now possible for features and samples, and you can cross-validate your model with k-folding. We polished various bits in the kernel machines and simplified the interface there.

    The trait structure of regression metrics is simplified, and the silhouette score is introduced for easier testing of K-Means and other algorithms.
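
    As background on the silhouette score, the sketch below computes the mean silhouette coefficient for a small labelled point set. It is a plain standard-library illustration with hypothetical function names, not the linfa metric implementation. It assumes cluster labels are contiguous integers starting at zero, and it lets singleton clusters contribute a score of zero.

```rust
/// Euclidean distance between two points of equal dimension.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Mean silhouette coefficient over all samples.
/// For each sample: a = mean distance to its own cluster,
/// b = smallest mean distance to any other cluster, s = (b - a) / max(a, b).
fn silhouette_score(points: &[Vec<f64>], labels: &[usize]) -> f64 {
    assert_eq!(points.len(), labels.len());
    let n_clusters = labels.iter().max().map_or(0, |m| m + 1);

    let mut total = 0.0;
    for (i, p) in points.iter().enumerate() {
        // per-cluster distance sums and member counts, excluding the sample itself
        let mut sums = vec![0.0_f64; n_clusters];
        let mut counts = vec![0usize; n_clusters];
        for (j, q) in points.iter().enumerate() {
            if i != j {
                sums[labels[j]] += euclidean(p, q);
                counts[labels[j]] += 1;
            }
        }

        let own = labels[i];
        if counts[own] == 0 {
            continue; // singleton cluster contributes zero
        }
        let a = sums[own] / counts[own] as f64;
        let b = (0..n_clusters)
            .filter(|&c| c != own && counts[c] > 0)
            .map(|c| sums[c] / counts[c] as f64)
            .fold(f64::INFINITY, f64::min);

        let denom = a.max(b);
        if b.is_finite() && denom > 0.0 {
            total += (b - a) / denom;
        }
    }
    total / points.len() as f64
}

fn main() {
    let points = vec![vec![0.0], vec![1.0], vec![10.0], vec![11.0]];
    let labels = [0, 0, 1, 1];
    println!("silhouette score: {:.3}", silhouette_score(&points, &labels));
}
```

    Values close to one indicate compact, well-separated clusters, which is why the score is handy for checking a clustering without ground-truth labels.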

    Changes

    • improve documentation in all algorithms, various commits
    • add a website to the infrastructure (c8acc785b)
    • add k-folding with and without copying (b0af80546f8)
    • add feature naming and Pearson's cross correlation (71989627f)
    • improve ergonomics when handling kernels (1a7982b973)
    • improve TikZ generator in linfa-trees (9d71f603bbe)
    • introduce multi-target datasets (b231118629)
    • simplify regression metrics and add cluster metrics (d0363a1fa8ef)
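
    The Pearson correlation mentioned in the list above measures the linear relationship between two features. A minimal sketch of the coefficient for a single pair of features, written against plain slices rather than the linfa dataset types (the function name is hypothetical), could look like this:

```rust
/// Pearson correlation coefficient between two equally long samples.
fn pearson_correlation(x: &[f64], y: &[f64]) -> f64 {
    assert_eq!(x.len(), y.len());
    let n = x.len() as f64;

    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;

    // un-normalized covariance and variances; the 1/n factors cancel in the ratio
    let cov: f64 = x
        .iter()
        .zip(y)
        .map(|(a, b)| (a - mean_x) * (b - mean_y))
        .sum();
    let var_x: f64 = x.iter().map(|a| (a - mean_x).powi(2)).sum();
    let var_y: f64 = y.iter().map(|b| (b - mean_y).powi(2)).sum();

    cov / (var_x.sqrt() * var_y.sqrt())
}

fn main() {
    let x = [1.0, 2.0, 3.0];
    println!("perfectly correlated:      {:.3}", pearson_correlation(&x, &[2.0, 4.0, 6.0]));
    println!("perfectly anti-correlated: {:.3}", pearson_correlation(&x, &[3.0, 2.0, 1.0]));
}
```

    The sketch does not guard against constant features, for which the variance is zero and the coefficient is undefined.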

    Example

    You can now perform cross-validation with k-folding. @Sauro98 actually implemented two versions: one which copies the dataset into k folds, and one which avoids excessive memory operations by copying only the validation dataset around. For example, to test a model with 8 folds:

    // perform cross-validation with the F1 score
    let f1_runs = dataset
        .iter_fold(8, |v| params.fit(&v).unwrap())
        .map(|(model, valid)| {
            let cm = model
                .predict(&valid)
                .mapv(|x| x > Pr::even())
                .confusion_matrix(&valid)
                .unwrap();

            cm.f1_score()
        })
        .collect::<Array1<_>>();

    // calculate mean and standard deviation
    println!(
        "F1 score: {}±{}",
        f1_runs.mean().unwrap(),
        f1_runs.std_axis(Axis(0), 0.0),
    );
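
    To make the fold bookkeeping concrete, here is a small sketch that partitions sample indices into k folds, each used once for validation while the rest is used for training. It is written against plain index vectors with a hypothetical function name and is not how linfa stores its folds. The handling of a remainder (spread over the first folds) is an assumption of this sketch.

```rust
/// Split the indices 0..n_samples into k (train, validation) pairs.
/// Assumption of this sketch: if n_samples is not divisible by k,
/// the first `n_samples % k` folds get one extra validation sample.
fn k_fold_indices(n_samples: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(k >= 2 && k <= n_samples);

    let base = n_samples / k;
    let extra = n_samples % k;

    let mut folds = Vec::with_capacity(k);
    let mut start = 0;
    for fold in 0..k {
        let end = start + base + usize::from(fold < extra);

        // the current block validates, everything else trains
        let valid: Vec<usize> = (start..end).collect();
        let train: Vec<usize> = (0..start).chain(end..n_samples).collect();

        folds.push((train, valid));
        start = end;
    }
    folds
}

fn main() {
    for (i, (train, valid)) in k_fold_indices(10, 3).iter().enumerate() {
        println!("fold {}: train {:?}, validate {:?}", i, train, valid);
    }
}
```

    Every sample appears in exactly one validation set, so the per-fold scores can be averaged into a single estimate as in the example above.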
    
  • 0.3.0(Jan 21, 2021)

    New algorithms

    • Approximated DBSCAN has been added to linfa-clustering by [@Sauro98]
    • Gaussian Naive Bayes has been added to linfa-bayes by [@VasanthakumarV]
    • Elastic Net linear regression has been added to linfa-elasticnet by [@paulkoerbitz] and [@bytesnake]

    Changes

    • Added benchmark to gaussian mixture models (a3eede55)
    • Fixed bugs in linear decision trees, added generator for TikZ trees (bfa5aebe7)
    • Implemented serde for all crates behind feature flag (4f0b63bb)
    • Implemented new backend features (7296c9ec4)
    • Introduced linfa-datasets for easier testing (3cec12b4f)
    • Rename Dataset to DatasetBase and introduce Dataset and DatasetView (21dd579cf)
    • Improve kernel tests and documentation (8e81a6d)

    Example

    The following section shows a small example of how datasets interact with the training and testing of a linear decision tree.

    You can load a dataset, shuffle it and then split it into training and validation sets:

    // initialize pseudo random number generator with seed 42
    let mut rng = Isaac64Rng::seed_from_u64(42);
    // load the Iris dataset, shuffle and split with ratio 0.8
    let (train, test) = linfa_datasets::iris()
        .shuffle(&mut rng)
        .split_with_ratio(0.8);
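
    Conceptually, splitting with a ratio just cuts the (already shuffled) samples at a fixed position. The sketch below shows this on a plain slice. The function is a hypothetical stand-in written for this note, and rounding the cut position to the nearest integer is an assumption of the sketch, not necessarily what linfa does.

```rust
/// Split a slice into a leading training part and a trailing validation part.
/// Assumption of this sketch: the cut position is `ratio * len`, rounded to the nearest integer.
fn split_with_ratio<T>(samples: &[T], ratio: f64) -> (&[T], &[T]) {
    assert!((0.0..=1.0).contains(&ratio));
    let cut = (samples.len() as f64 * ratio).round() as usize;
    samples.split_at(cut)
}

fn main() {
    let samples: Vec<u32> = (0..10).collect();
    let (train, test) = split_with_ratio(&samples, 0.8);

    println!("train: {:?}", train);
    println!("test:  {:?}", test);
}
```

    Shuffling before the split matters because many datasets, Iris included, are stored sorted by class, so an unshuffled cut would leave whole classes out of one of the two parts.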
    

    With the training dataset a linear decision tree model can be trained. Entropy is used as a metric for the optimal split here:

    let entropy_model = DecisionTree::params()
        .split_quality(SplitQuality::Entropy)
        .max_depth(Some(100))
        .min_weight_split(10.0)
        .min_weight_leaf(10.0)
        .fit(&train);
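
    To illustrate what an entropy-based split criterion measures, the following sketch computes the entropy of a class distribution and the information gain of a candidate split. It is a standalone illustration with hypothetical function names, not the linfa-trees code, and the use of base-2 logarithms is an assumption of the sketch.

```rust
/// Shannon entropy (in bits) of a class distribution given as per-class counts.
fn entropy(class_counts: &[usize]) -> f64 {
    let total: usize = class_counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    class_counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Reduction in entropy when a parent node is split into a left and a right child.
fn information_gain(parent: &[usize], left: &[usize], right: &[usize]) -> f64 {
    let n = parent.iter().sum::<usize>() as f64;
    let n_left = left.iter().sum::<usize>() as f64;
    let n_right = right.iter().sum::<usize>() as f64;

    entropy(parent) - (n_left / n) * entropy(left) - (n_right / n) * entropy(right)
}

fn main() {
    // a perfectly mixed node that is split into two pure children
    let (parent, left, right) = ([5, 5], [5, 0], [0, 5]);

    println!("parent entropy:   {}", entropy(&parent));
    println!("information gain: {}", information_gain(&parent, &left, &right));
}
```

    A split is preferred when it produces children that are purer than the parent, which shows up as a larger information gain.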
    

    The validation dataset is now used to estimate the error. For this, the labels are predicted and compared with the true labels; a confusion matrix then gives a clue about the type of error:

    let cm = entropy_model
        .predict(test.records().view())
        .confusion_matrix(&test);
    
    println!("{:?}", cm);
    
    println!(
        "Test accuracy with Entropy criterion: {:.2}%",
        100.0 * cm.accuracy()
    );
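
    For reference, a confusion matrix simply counts how often each true class was predicted as each class, and the accuracy is the share of counts on its diagonal. The standalone sketch below shows both on plain label slices. The function names are hypothetical, and the layout (rows for true classes, columns for predicted classes) is an assumption of the sketch, not a statement about linfa's representation.

```rust
/// Count predictions per (true class, predicted class) pair.
/// Assumed layout: rows are true classes, columns are predicted classes.
fn confusion_matrix(prediction: &[usize], truth: &[usize], n_classes: usize) -> Vec<Vec<usize>> {
    assert_eq!(prediction.len(), truth.len());

    let mut matrix = vec![vec![0; n_classes]; n_classes];
    for (&p, &t) in prediction.iter().zip(truth) {
        matrix[t][p] += 1;
    }
    matrix
}

/// Fraction of samples on the diagonal, i.e. predicted correctly.
fn accuracy(matrix: &[Vec<usize>]) -> f64 {
    let correct: usize = (0..matrix.len()).map(|i| matrix[i][i]).sum();
    let total: usize = matrix.iter().flatten().sum();
    correct as f64 / total as f64
}

fn main() {
    let truth = [0, 1, 2, 2];
    let prediction = [0, 1, 1, 2];

    let cm = confusion_matrix(&prediction, &truth, 3);
    for row in &cm {
        println!("{:?}", row);
    }
    println!("accuracy: {:.2}%", 100.0 * accuracy(&cm));
}
```

    Off-diagonal entries show which classes get confused with each other, which the accuracy alone cannot tell you.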
    

    Finally, you can analyze which features were used in the decision and export the whole tree to a TeX file. It will contain a TikZ tree with information on the splitting decision and impurity improvement:

    use std::fs::File;
    use std::io::Write;

    let feats = entropy_model.features();
    println!("Features trained in this tree {:?}", feats);

    // export the tree trained above as a TikZ picture
    let mut tikz = File::create("decision_tree_example.tex").unwrap();
    tikz.write_all(entropy_model.export_to_tikz().to_string().as_bytes())
        .unwrap();
    

    The whole example can be found in linfa-trees/examples/decision_tree.rs.

  • 0.2.1(Nov 29, 2020)

  • 0.2.0(Nov 26, 2020)

    New algorithms

    • Ordinary Linear Regression has been added to linfa-linear by [@Nimpruda] and [@paulkoerbitz]
    • Generalized Linear Models have been added to linfa-linear by [@VasanthakumarV]
    • Linear decision trees were added to linfa-trees by [@mossbanay]
    • Fast independent component analysis (ICA) has been added to linfa-ica by [@VasanthakumarV]
    • Principal Component Analysis and Diffusion Maps have been added to linfa-reduction by [@bytesnake]
    • Support Vector Machines have been added to linfa-svm by [@bytesnake]
    • Logistic regression has been added to linfa-logistic by [@paulkoerbitz]
    • Hierarchical agglomerative clustering has been added to linfa-hierarchical by [@bytesnake]
    • Gaussian Mixture Models have been added to linfa-clustering by [@relf]

    Changes

    • Common metrics for classification and regression have been added
    • A new dataset interface simplifies the work with targets and labels
    • New traits for Transformer, Fit and IncrementalFit standardize the interface
    • Switched to GitHub Actions for better integration
Owner
Rust-ML