A Rust machine learning framework.

Rust-ML

Last update: Jan 2, 2023

Overview

Linfa

linfa (Italian) / sap (English):

The vital circulating fluid of a plant.

linfa aims to provide a comprehensive toolkit to build Machine Learning applications with Rust.

Kin in spirit to Python's scikit-learn, it focuses on common preprocessing tasks and classical ML algorithms for your everyday ML tasks.

Documentation: latest Community chat: Zulip

Current state

Where does linfa stand right now? Are we learning yet?

linfa currently provides sub-packages with the following algorithms:

Name	Purpose	Status	Category	Notes
clustering	Data clustering	Tested / Benchmarked	Unsupervised learning	Clustering of unlabeled data; contains K-Means, Gaussian-Mixture-Model and DBSCAN
kernel	Kernel methods for data transformation	Tested	Pre-processing	Maps feature vector into higher-dimensional space
linear	Linear regression	Tested	Partial fit	Contains Ordinary Least Squares (OLS), Generalized Linear Models (GLM)
elasticnet	Elastic Net	Tested	Supervised learning	Linear regression with elastic net constraints
logistic	Logistic regression	Tested	Partial fit	Builds two-class logistic regression models
reduction	Dimensionality reduction	Tested	Pre-processing	Diffusion mapping and Principal Component Analysis (PCA)
trees	Decision trees	Experimental	Supervised learning	Linear decision trees
svm	Support Vector Machines	Tested	Supervised learning	Classification or regression analysis of labeled datasets
hierarchical	Agglomerative hierarchical clustering	Tested	Unsupervised learning	Cluster and build hierarchy of clusters
bayes	Naive Bayes	Tested	Supervised learning	Contains Gaussian Naive Bayes
ica	Independent component analysis	Tested	Unsupervised learning	Contains FastICA implementation

We believe that only a significant community effort can nurture, build, and sustain a machine learning ecosystem in Rust - there is no other way forward.

If this strikes a chord with you, please take a look at the roadmap and get involved!

BLAS/Lapack backend

At the moment you can choose between the following BLAS/LAPACK backends: openblas, netlib or intel-mkl

Backend	Linux	Windows	macOS
OpenBLAS	✔️	-	-
Netlib	✔️	-	-
Intel MKL	✔️	✔️	✔️

For example, if you want to use the system Intel MKL library for the PCA example, pass the corresponding feature:

cd linfa-reduction && cargo run --release --example pca --features linfa/intel-mkl-system

This selects the system Intel MKL library as the BLAS/LAPACK backend. If you instead want to compile the library and link against the generated artifacts, pass intel-mkl-static.

License

Dual-licensed to be compatible with the Rust project.

Licensed under the Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0 or the MIT license http://opensource.org/licenses/MIT, at your option. This file may not be copied, modified, or distributed except according to those terms.

Comments

Optimize K-means

Optimized the averaging step of K-means to use non-moving averages and arrays instead of hashmaps. The code also updates the new centroid in-place. Also fixed a deprecation warning in all benchmarks in linfa-clustering (other benchmarks may have the same warnings). I'm curious about the effect of my changes on benchmark speeds, since my machine isn't suited to running benchmarks.
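
The averaging step described above can be sketched with the standard library alone. This is not the code from the PR: `update_centroids` and its `Vec<f64>`-based signature are hypothetical, chosen only to show flat array accumulators replacing a hashmap, a plain (non-moving) average, and the in-place centroid update.

```rust
/// Recompute centroids in place from the current assignments.
/// `points[i]` belongs to cluster `assignments[i]`.
/// Clusters that received no points keep their previous centroid.
pub fn update_centroids(
    points: &[Vec<f64>],
    assignments: &[usize],
    centroids: &mut [Vec<f64>],
) {
    let k = centroids.len();
    let dim = centroids.first().map_or(0, |c| c.len());

    // Flat accumulators indexed by cluster id, instead of a HashMap.
    let mut sums = vec![0.0_f64; k * dim];
    let mut counts = vec![0_usize; k];

    for (point, &cluster) in points.iter().zip(assignments) {
        counts[cluster] += 1;
        let row = &mut sums[cluster * dim..(cluster + 1) * dim];
        for (acc, &x) in row.iter_mut().zip(point) {
            *acc += x;
        }
    }

    // Divide once per cluster (plain average) and overwrite the centroid in place.
    for (cluster, centroid) in centroids.iter_mut().enumerate() {
        if counts[cluster] == 0 {
            continue;
        }
        let n = counts[cluster] as f64;
        let row = &sums[cluster * dim..(cluster + 1) * dim];
        for (c, &s) in centroid.iter_mut().zip(row) {
            *c = s / n;
        }
    }
}
```

Calling it with three 2-D points assigned to clusters 0, 0 and 1 overwrites the first two centroids and leaves a third, empty cluster untouched.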

opened by YuhanLiin 29
add PredictInto trait and impl for KMeans

Work towards https://github.com/rust-ml/linfa/issues/130

It would be possible to make a default implementation for any type that implements PredictRef. But we do not have specialization yet.

Does only implementing this, for types that can do this more efficiently, make sense?
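
One way to picture the trade-off is the std-only sketch below. The two traits are simplified stand-ins that only borrow the names from this discussion, and `NearestCentroid` is a hypothetical toy model. Because a blanket `impl<M: PredictRef> PredictInto for M` would collide with hand-written impls without specialization, the sketch goes the other way: the buffer-reusing method is the primitive and the allocating one is written on top of it.

```rust
/// Simplified stand-in: prediction that allocates its output.
pub trait PredictRef<R, T> {
    fn predict_ref(&self, records: &R) -> T;
}

/// Simplified stand-in: prediction into a caller-provided buffer.
pub trait PredictInto<R, T> {
    fn predict_into(&self, records: &R, targets: &mut T);
}

/// Hypothetical toy model: nearest centroid over 1-D points.
pub struct NearestCentroid {
    pub centroids: Vec<f64>,
}

impl NearestCentroid {
    /// Index of the closest centroid (ties go to the lower index).
    fn closest(&self, x: f64) -> usize {
        let mut best = 0;
        for (i, &c) in self.centroids.iter().enumerate() {
            if (x - c).abs() < (x - self.centroids[best]).abs() {
                best = i;
            }
        }
        best
    }
}

impl PredictInto<Vec<f64>, Vec<usize>> for NearestCentroid {
    fn predict_into(&self, records: &Vec<f64>, targets: &mut Vec<usize>) {
        // Reuse the caller's allocation instead of creating a new Vec.
        targets.clear();
        targets.extend(records.iter().map(|&x| self.closest(x)));
    }
}

impl PredictRef<Vec<f64>, Vec<usize>> for NearestCentroid {
    fn predict_ref(&self, records: &Vec<f64>) -> Vec<usize> {
        // The allocating variant is a thin wrapper around `predict_into`.
        let mut targets = Vec::with_capacity(records.len());
        self.predict_into(records, &mut targets);
        targets
    }
}
```

A caller that predicts repeatedly can keep one `Vec<usize>` around and pass it to `predict_into` on every call.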

opened by pYtoner 20

Fit trait modification and cross validation proposal

Changes in the `Fit` trait

from this:

pub trait Fit<'a, R: Records, T> {
    type Object: 'a;

    fn fit(&self, dataset: &DatasetBase<R, T>) -> Self::Object;
}

to this:

pub trait Fit<R: Records, T, E: std::error::Error + std::convert::From<linfa::Error>> {
    type Object;

    fn fit(&self, dataset: &DatasetBase<R, T>) -> Result<Self::Object, E>;
}

by:

removing 'a lifetime (left from previous svm implementation, not actually used by any algorithm anymore)
forcing every implementation to return a result with an error struct (every implementation except PCA already returned an error, some implementations returned a String as the error but the transition keeps the same error messages)

Edit 1:

Added conversion from linfa error bound on fit error type. Every sub-crate should be able to handle the errors caused by using the base crate
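
To illustrate why the `From` bound is useful, here is a self-contained sketch with toy types. `BaseError`, `MeanError`, `MeanParams` and the simplified `Fit<D, E>` are hypothetical stand-ins, not linfa's actual definitions; the point is only that a sub-crate can use `?` on base-crate errors inside `fit`.

```rust
use std::error::Error;
use std::fmt;

/// Stand-in for the base crate's error type.
#[derive(Debug, PartialEq)]
pub enum BaseError {
    EmptyDataset,
}

impl fmt::Display for BaseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "dataset is empty")
    }
}

impl Error for BaseError {}

/// Simplified `Fit`: no lifetime parameter, always returns a `Result`.
pub trait Fit<D, E: Error + From<BaseError>> {
    type Object;
    fn fit(&self, dataset: &D) -> Result<Self::Object, E>;
}

/// Error type of a hypothetical sub-crate.
#[derive(Debug, PartialEq)]
pub enum MeanError {
    Base(BaseError),
    NonFinite,
}

impl fmt::Display for MeanError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MeanError::Base(e) => write!(f, "base error: {}", e),
            MeanError::NonFinite => write!(f, "mean is not finite"),
        }
    }
}

impl Error for MeanError {}

impl From<BaseError> for MeanError {
    fn from(e: BaseError) -> Self {
        MeanError::Base(e)
    }
}

/// Base-crate style helper that reports a base error.
fn check_non_empty(data: &[f64]) -> Result<(), BaseError> {
    if data.is_empty() {
        Err(BaseError::EmptyDataset)
    } else {
        Ok(())
    }
}

pub struct MeanParams;

pub struct MeanModel {
    pub mean: f64,
}

impl Fit<Vec<f64>, MeanError> for MeanParams {
    type Object = MeanModel;

    fn fit(&self, dataset: &Vec<f64>) -> Result<MeanModel, MeanError> {
        // `?` converts BaseError into MeanError through the From impl.
        check_non_empty(dataset)?;
        let mean = dataset.iter().sum::<f64>() / dataset.len() as f64;
        if !mean.is_finite() {
            return Err(MeanError::NonFinite);
        }
        Ok(MeanModel { mean })
    }
}
```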

Cross validation POC

Cross-validation can be defined by exploiting the new Fit definition. This is what it looks like for regression:

    use linfa::prelude::*;
    use linfa_elasticnet::{ElasticNet, Result};

    // load Diabetes dataset (mutable to allow fast k-folding)
    let mut dataset = linfa_datasets::diabetes();

    // parameters to compare
    let ratios = vec![0.1, 0.2, 0.5, 0.7, 1.0];

    // create a model for each parameter
    let models = ratios
        .iter()
        .map(|ratio| ElasticNet::params().penalty(0.3).l1_ratio(*ratio))
        .collect::<Vec<_>>();

    // get the mean r2 validation score across all folds for each model
    let r2_values =
        dataset.cross_validate(5, &models, |prediction, truth| prediction.r2(&truth))?;

    for (ratio, r2) in ratios.iter().zip(r2_values.iter()) {
        println!("L1 ratio: {}, r2 score: {}", ratio, r2);
    }

And this is what it looks like for classification:

    use linfa::prelude::*;
    use linfa_logistic::error::Result;
    use linfa_logistic::LogisticRegression;

    // Load dataset. Mutability is needed for fast cross validation
    let mut dataset =
        linfa_datasets::winequality().map_targets(|x| if *x > 6 { "good" } else { "bad" });

    // define a sequence of models to compare. In this case the
    // models will differ by the amount of l2 regularization
    let alphas = vec![0.1, 1., 10.];
    let models: Vec<_> = alphas
        .iter()
        .map(|alpha| {
            LogisticRegression::default()
                .alpha(*alpha)
                .max_iterations(150)
        })
        .collect();

    // use cross validation to compute the validation accuracy of each model. The
    // accuracy of each model will be averaged across the folds, 5 in this case
    let accuracies = dataset.cross_validate(5, &models, |prediction, truth| {
        Ok(prediction.confusion_matrix(truth)?.accuracy())
    })?;

    // display the accuracy of the models along with their regularization coefficient
    for (alpha, accuracy) in alphas.iter().zip(accuracies.iter()) {
        println!("Alpha: {}, accuracy: {} ", alpha, accuracy);
    }
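
For readers who want to see what such a helper does internally, the following std-only sketch averages a validation score over contiguous folds. It is a simplification under stated assumptions: the free function `cross_validate`, its slice-based signature and the closures are hypothetical, it evaluates a single model rather than a list, and it copies the training split for clarity, whereas the proposal above relies on a mutable dataset to avoid copies.

```rust
/// Mean validation score over `k` contiguous folds.
/// `fit` trains on the training split; `score` evaluates on the held-out fold.
/// Errors from either closure are propagated to the caller.
pub fn cross_validate<M, E>(
    xs: &[f64],
    ys: &[f64],
    k: usize,
    fit: impl Fn(&[f64], &[f64]) -> Result<M, E>,
    score: impl Fn(&M, &[f64], &[f64]) -> Result<f64, E>,
) -> Result<f64, E> {
    assert!(xs.len() == ys.len(), "records and targets must align");
    assert!(k >= 2 && k <= xs.len(), "need 2 <= k <= number of samples");

    // Samples beyond k * fold are never held out, only trained on.
    let fold = xs.len() / k;
    let mut total = 0.0;

    for i in 0..k {
        let (lo, hi) = (i * fold, (i + 1) * fold);

        // Training split: everything outside [lo, hi).
        let train_x: Vec<f64> = xs[..lo].iter().chain(&xs[hi..]).copied().collect();
        let train_y: Vec<f64> = ys[..lo].iter().chain(&ys[hi..]).copied().collect();

        let model = fit(&train_x, &train_y)?;
        total += score(&model, &xs[lo..hi], &ys[lo..hi])?;
    }

    Ok(total / k as f64)
}
```

Comparing several parametrizations then amounts to calling this once per candidate and collecting the returned means.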

Possible follow up

Redefine the Transformer trait to return a Result, like the t-sne implementation does, in order to avoid panicking as much as possible. It would be better to wait for #121 in order to avoid dealing with the unwrap() calls when performing float conversions.

Notes

~~Currently there are two versions of cross-validation: one for single target datasets and one for multi-target datasets. The main reasons for this are:~~
- ~~Array1s of evaluation values can be constructed with collect, Array2 cannot and must be populated row by row (could be avoided by stacking, or maybe there is a dedicated ndarray method I am not aware of)~~
- ~~Regression metrics behave differently when they are applied Array1-Array2 or Array2-Array1, making writing evaluation closures possibly more difficult than it should.~~

~~More than likely there is a solution to both problems, but I got stuck on it for too long and so I would consider it out of scope for this PR, unless someone has an easy solution to suggest. Forcing the user to give a single return value in the multi-target case (like a mean across the targets) could make the problem easier but the evaluation for each target would be lost in the process.~~

Got this error in elasticnet just once:

     Running target/release/deps/linfa_elasticnet-d4c6cf99e56de5dd

running 11 tests
test algorithm::tests::coordinate_descent_lowers_objective ... ok
test algorithm::tests::elastic_net_2d_toy_example_works ... ok
test algorithm::tests::elastic_net_diabetes_1_works_like_sklearn ... ok
test algorithm::tests::diabetes_z_score ... ok
test algorithm::tests::elastic_net_penalty_works ... ok
test algorithm::tests::elastic_net_toy_example_works ... ok
test algorithm::tests::lasso_toy_example_works ... ok
test algorithm::tests::lasso_zero_works ... ok
error: test failed, to rerun pass '-p linfa-elasticnet --lib'

Caused by:
  process didn't exit successfully: `/home/ivano/Scrivania/github_projects/linfa/target/release/deps/linfa_elasticnet-d4c6cf99e56de5dd` (signal: 11, SIGSEGV: invalid memory reference)

opened by Sauro98 17

Added Approximated dbscan to linfa-clustering

I added an implementation of the Approximated DBSCAN to the linfa-clustering subcrate. This algorithm depends on the crate partitions for an implementation of the union-find structure. This can be removed by using adjacency lists, at the cost of performance. Another big thing: I noticed that the existing vanilla DBSCAN implementation used an axis convention that was the opposite to the one used by the generate_blobs function, which it uses in the benches. I modified the DBSCAN implementation to invert the axes in order to be able to compare benches of the two implementations, and I also modified the tests to pass this modification. I modified the vanilla implementation only to be able to compare the performances, I did not want to judge the previous implementation. This was also my first time using ndarray so any suggestion is very welcome.

opened by Sauro98 17

Cargo run failed: Undefined symbols for architecture x86_64

I followed the LinearRegression example and got an error on cargo run.

Code:

use linfa::prelude::SingleTargetRegression;
use linfa::traits::{Fit, Predict};
use linfa_linear::LinearRegression;


fn main() {
    let dataset = linfa_datasets::diabetes();
    let model = LinearRegression::default().fit(&dataset).unwrap(); // the error seems to happen on this line
    let pred = model.predict(&dataset);
    pred.r2(&dataset).unwrap();
}

Err output:

  = note: Undefined symbols for architecture x86_64:
            "_dgelsd_", referenced from:
                lapack::dgelsd::h78796e3523b23ad8 in liblax-5505ca61a5fbc67a.rlib(lax-5505ca61a5fbc67a.lax.d19e9304-cgu.2.rcgu.o)
            "_cblas_sdot", referenced from:
                ndarray::linalg::impl_linalg::_$LT$impl$u20$ndarray..ArrayBase$LT$S$C$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$1$u5d$$GT$$GT$$GT$::dot_impl::h7adddea61e00d52a in rust_ds-fc4af6c90afa881c.4mtkwkgjvot8wzn2.rcgu.o
            "_cblas_ddot", referenced from:
                ndarray::linalg::impl_linalg::_$LT$impl$u20$ndarray..ArrayBase$LT$S$C$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$1$u5d$$GT$$GT$$GT$::dot_impl::h7adddea61e00d52a in rust_ds-fc4af6c90afa881c.4mtkwkgjvot8wzn2.rcgu.o
          ld: symbol(s) not found for architecture x86_64
          clang: error: linker command failed with exit code 1 (use -v to see invocation)

OS: Mac Catalina v10.15.7 Rust: rustc 1.58.1

I have tried cargo clean but still can't fix the issue.

opened by weiztech 15

Add Kmeans++ and Kmeans|| initialization
This PR adds K-means++ as an initialization function as well as K-means||, which is a version of K-means++ that scales better with higher cluster counts, described in this paper. A new hyperparameter has also been added to specify the choice of initialization algorithm. I want to have the design reviewed before going any further.
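
As a reference for the design discussion, K-means++ seeding can be sketched in a few lines of std-only Rust. This is not the PR's implementation and does not cover K-means||. The function name, the `Vec<f64>` point type and the caller-supplied `uniform` closure (assumed to return values in [0, 1)) are all hypothetical; a real implementation would take a proper random number generator.

```rust
/// Squared Euclidean distance between two points of equal length.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// K-means++ seeding: returns the indices of the points chosen as initial centroids.
/// `uniform` is assumed to return values in [0, 1).
pub fn kmeans_plus_plus(
    points: &[Vec<f64>],
    k: usize,
    mut uniform: impl FnMut() -> f64,
) -> Vec<usize> {
    assert!(k >= 1 && k <= points.len());

    // The first centroid is picked uniformly at random.
    let first = ((uniform() * points.len() as f64) as usize).min(points.len() - 1);
    let mut chosen = vec![first];

    // Squared distance from each point to its closest chosen centroid.
    let mut d2: Vec<f64> = points
        .iter()
        .map(|p| sq_dist(p.as_slice(), points[first].as_slice()))
        .collect();

    while chosen.len() < k {
        // Sample the next centroid with probability proportional to d2.
        let total: f64 = d2.iter().sum();
        let mut threshold = uniform() * total;
        let mut next = d2.len() - 1; // fallback for rounding at the upper end
        for (i, &w) in d2.iter().enumerate() {
            if threshold < w {
                next = i;
                break;
            }
            threshold -= w;
        }
        chosen.push(next);

        // Refresh the closest-centroid distances with the new centroid.
        for (d, p) in d2.iter_mut().zip(points) {
            *d = (*d).min(sq_dist(p.as_slice(), points[next].as_slice()));
        }
    }

    chosen
}
```

Points far from every chosen centroid carry most of the sampling weight, which is what spreads the initial centroids out.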

Todo:

[x] Integrate the new initialization algorithms to the benchmarks, since they should allow K-means to complete faster.

[x] Pick default initialization function.
opened by YuhanLiin 15
Provide non-accelerated algorithms without LAPACK routines

The people at rust-cv recently looked at some clustering algorithms in linfa, and one issue that came up is the dependency on LAPACK routines. In some situations you want to avoid these, for example in an embedded context.

Take for example the Gaussian mixture model: the current implementation uses a Cholesky decomposition for faster inverse and determinant computation, which speeds up the log-likelihood (LL) computation. We could provide a second implementation which gets activated when the corresponding feature flags are deactivated and only provides models with independent univariate Gaussians.
infrastructure

opened by bytesnake 14
Introduce unchecked hyperparams to DBSCAN, Appx DBSCAN, and KMeans

Similar to what I did for KMeans, I reintroduced hyperparam builders to DBSCAN and approximate DBSCAN. This allows the verification of hyperparams to be done via returning errors rather than panicking. I believe using a builder is more idiomatic than doing a verify() check since the builder forces the verification to be done exactly once before the hyperparams can be used to run any algorithms. Introduce a new trait called UncheckedHyperParams to represent the verification of hyperparameters. Integrate this trait for DBSCAN, appx DBSCAN, and KMeans.
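
The builder pattern described here can be sketched as follows. Only the trait name `UncheckedHyperParams` comes from the PR description; its shape, the `check` method, the error enum and both parameter structs are assumptions made for illustration.

```rust
/// Hypothetical validation errors.
#[derive(Debug, PartialEq)]
pub enum ParamError {
    ZeroClusters,
    NonPositiveTolerance,
}

/// Assumed shape of the trait: consume the unchecked builder, return checked params or an error.
pub trait UncheckedHyperParams {
    type Checked;
    type Error;
    fn check(self) -> Result<Self::Checked, Self::Error>;
}

/// Unchecked builder: any value can be set, nothing is validated yet.
#[derive(Debug, Clone, PartialEq)]
pub struct KMeansParams {
    n_clusters: usize,
    tolerance: f64,
}

/// Validated hyperparameters; in this sketch only `check` constructs them.
#[derive(Debug, Clone, PartialEq)]
pub struct KMeansValidParams {
    n_clusters: usize,
    tolerance: f64,
}

impl KMeansParams {
    pub fn new(n_clusters: usize) -> Self {
        Self { n_clusters, tolerance: 1e-4 }
    }

    pub fn tolerance(mut self, tolerance: f64) -> Self {
        self.tolerance = tolerance;
        self
    }
}

impl UncheckedHyperParams for KMeansParams {
    type Checked = KMeansValidParams;
    type Error = ParamError;

    fn check(self) -> Result<KMeansValidParams, ParamError> {
        if self.n_clusters == 0 {
            return Err(ParamError::ZeroClusters);
        }
        if self.tolerance.is_nan() || self.tolerance <= 0.0 {
            return Err(ParamError::NonPositiveTolerance);
        }
        Ok(KMeansValidParams {
            n_clusters: self.n_clusters,
            tolerance: self.tolerance,
        })
    }
}

impl KMeansValidParams {
    pub fn n_clusters(&self) -> usize {
        self.n_clusters
    }

    pub fn tolerance(&self) -> f64 {
        self.tolerance
    }
}
```

An algorithm that accepts only `KMeansValidParams` cannot be run on unverified input, and the check happens exactly once.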

opened by YuhanLiin 14
Documentation maintenance and small fixes
Here is a collection of work that needs to be done regarding documentation and minor issues with the code itself. Picking an item from this list can be a good way to start getting acquainted with the codebase and provide a useful contribution. 😄 👍🏻

In general, I would say that the pages for the individual algorithms in the linfa-clustering and linfa-elasticnet sub-crates can be good references for the structure of a documentation page.

It's suggested to look at the existing documentation in your local build rather than on doc.rs since some pages may have already been updated.

[x] Update linfa-bayes's sub-crate documentation to include a brief description of the algorithm used and add an example of the usage of the provided model along with maybe some hints regarding when to choose a naive Bayes predictor. There are already some examples in the dedicated page for the params structure but maybe it's better to add at least one to the main page for the sub-crate

[x] linfa-hierarchical: The documentation for this sub-crate needs improvements like the ones listed for linfa-bayes

[x] linfa-ica: Also needs updates similar to linfa-bayes

[x] linfa-kernel: Add a description of what kernel methods are useful for, similarly to what's written in the linfa-svm crate, and add descriptions to the Kernel and KernelView subtypes. There are some examples in the KernelBase struct Documentation but maybe it would be a good idea to have an example of kernel transformation directly on the crate's main page.

[x] linfa-linear: The crate's documentation is inside the ols module instead of being in the crate's root and it is outdated since it says it only provides an implementation of the least squares algorithm

[x] linfa-linear: glm::TweedieRegressor provides its own fit method instead of implementing the Fit trait, and the same for predict. Moving the methods inside the trait would be a good way to align it with the rest of the algorithms provided and ensure compatibility

[x] linfa-logistic could use some more documentation to explain the algorithm and some examples in its main page

[x] linfa-logistic implements its own fit and predict methods instead of implementing the Fit and Predict traits. Like in the case of linear regression it would be a good thing to align the interface with the other sub-crates

[x] linfa-reduction completely lacks any documentation. An explanation of why dimensionality reduction is useful in ML and descriptions of the single algorithms would really help with understanding the usefulness of the crate

[x] linfa-svm: this is another reference for what other crates' documentation should look like. Right now the predict method returns an Array1 for classification and a Vec for regression. It would be less confusing if the regression case was modified to return an Array1 too.

[x] linfa-trees: it needs changes similar to linfa-bayes, with the addition of documentation regarding the methods for the params structure

[x] linfa: The main page of the rustdocs still says that linfa only provides the K-Means algorithm, which may be very confusing to someone who sees the crate on doc.rs for the first time. A completely revised page, complete with project goals and links to the various sub-crates, would be useful here, so that one can find the algorithm they need without manually searching for the sub-crates.

[x] linfa: The dataset module page could use a bit of explanation about the main differences between the four dataset types and a brief recollection of what utility methods a dataset provides

[x] ~~linfa: The metrics module page could use a brief list of the provided metrics so that one does not have to go looking for them in the sub-pages if they only want to know whether a metric is provided or not~~

help wanted good first issue
opened by Sauro98 14
update criterion settings

This PR updates the criterion settings for our benched algorithms. Over the course of a day or a few days I will update this description with the timing stats from before and after the changes for each algorithm. If it is okay to do this for only a few algorithms, I'll do that instead.

linfa-ica

Old - 1min 39secs New - 3min 38secs

linfa-pls

Old - 4mins 16secs New - 9mins 42secs

linfa-linear

Old - 1min 29secs New - 3min 37secs

linfa-nn

Old - 3mins 43secs New - 8mins 57secs

linfa-clustering

k-means:

Old - 4min 6secs New - 7mins 34secs

gaussian_mixture

Old - 51secs New - 2mins 5secs

dbscan

Old - 1min 14secs New - 2mins 50secs

appx_db_scan

Old - 47secs New - 1min 27secs

linfa-ftrl

Old - 1min 43secs New - 3min 57secs

linfa-trees

Old - 55secs New - 2min 38secs

opened by oojo12 13
Introduce checked parameters to all algorithms
Current state:

[x] port algorithm's parameters to new syntax

[ ] improve documentation (use linfa-bayes as prototype)

[x] fix issues with Transform and double results
opened by bytesnake 13
Add arrow data storage support for linfa pre-processing and training module

Preprocessing and transformation with data frames are heavily used for ML model training in scikit-learn. The two most popular DataFrame libraries written in Rust (Polars, DataFusion) are based on the Apache Arrow in-memory data format rather than on ndarray, and this looks likely to be the trend for any new data frame library in Rust. It does not look like there will be a data frame that wraps ndarray under the hood, the way pandas wraps NumPy.

By adding Arrow support to linfa, any data frame based on Arrow would be supported by default, which means any Arrow-based data frame could be passed to any preprocessing or training module of linfa without dealing with ndarray, the way pandas data frames can be passed to scikit-learn. Hence I propose direct Arrow support in linfa to make it a more general framework. With that, Polars/DataFusion users could use Rust for ML training out of the box.

opened by DataPsycho 2
Investigate discrepancy between non-BLAS and BLAS versions of `linfa-pls`

According to benchmark results from this comment, linfa-pls is slightly slower without BLAS. Specifically, the Regression-Nipals benchmarks are slightly slower when the sample size is 100000, and the Regression-Svd and Canonical-Svd benchmarks are slower when the sample size is 100000.
enhancement help wanted

opened by YuhanLiin 6
Investigate discrepancy between non-BLAS and BLAS versions of `linfa-linear`

According to the benchmark results from this comment, linfa-linear is faster with BLAS than without. The OLS algorithm isn't too different with 5 features and is only slightly slower with 10 features, but GLM is significantly slower without BLAS. We should profile and investigate the difference between the BLAS and non-BLAS performance.
enhancement help wanted

opened by YuhanLiin 9
Investigate discrepancy between non-BLAS and BLAS versions of `linfa-ica`

According to these results, all ICA benchmarks are noticeably faster with BLAS than without, though this is less severe at higher sample sizes. We should profile the non-BLAS benchmark runs (using something like flamegraph) and see if we can fix the discrepancy.
enhancement help wanted

opened by YuhanLiin 1

Releases(0.6.0)

0.6.0(Jun 15, 2022)
Linfa's 0.6.0 release removes the mandatory dependency on external BLAS libraries (such as intel-mkl) by using a pure-Rust linear algebra library. It also adds the Multinomial Naive Bayes and Follow The Regularized Leader algorithms. Additionally, the AsTargets trait has been separated into AsSingleTargets and AsMultiTargets.

No more BLAS

With older versions of Linfa, algorithm crates that used advanced linear algebra routines needed to be linked against an external BLAS library such as Intel-MKL. This was done by adding feature flags like linfa/intel-mkl-static to the build, and it increased the compile times significantly. Version 0.6.0 replaces the BLAS library with a pure-Rust implementation of all the required routines, which Linfa uses by default. This means all Linfa crates now build properly and quickly without any extra feature flags. It is still possible for the affected algorithm crates to link against an external BLAS library. Doing so requires enabling the crate's blas feature, along with the feature flag for the external BLAS library. The affected crates are as follows:

linfa-ica

linfa-reduction

linfa-clustering

linfa-preprocessing

linfa-pls

linfa-linear

linfa-elasticnet

New algorithms

Multinomial Naive Bayes is a family of Naive Bayes classifiers that assume independence between variables. The advantage is a linear fitting time with maximum-likelihood training in a closed form. The algorithm is added to linfa-bayes and an example can be found at linfa-bayes/examples/winequality_multinomial.rs.

Follow The Regularized Leader (FTRL) is a linear model for CTR prediction in online learning settings. It is a special type of linear model with sigmoid function which uses L1 and L2 regularization. The algorithm is contained in the newly-added linfa-ftrl crate, and an example can be found at linfa-ftrl/examples/winequality.rs.

Distinguish between single and multi-target

Version 0.6.0 introduces a major change to the AsTargets trait, which is now split into AsSingleTargets and AsMultiTargets. Additionally, the Dataset* types are parametrized by target dimensionality, instead of always using a 2D array. Furthermore, algorithms that work on single-target data will no longer accept multi-target datasets as input. This change may cause build errors in existing code that calls the affected algorithms. The fix is as simple as adding Ix1 to the end of the type parameters for the dataset being passed in, which forces the dataset to be single-target.

Improvements

Remove SeedableRng trait bound from KMeans and GaussianMixture.

Replace uses of Isaac RNG with Xoshiro RNG.

cross_validate changed to cross_validate_single, which is for single-target data; cross_validate_multi changed to cross_validate, which is for both single and multi-target datasets.

The probability type Pr has been constrained to 0. <= prob <= 1.. Also, the simple Pr(x) constructor has been replaced by Pr::new(x), Pr::new_unchecked(x), and Pr::try_from(x), which ensure that the invariant for Pr is met.
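
The three constructors named in the last item can be pictured with a small newtype. The sketch below is an assumption-laden stand-in, not linfa's source: the `f32` payload, the panicking behaviour of `new`, the `InvalidProbability` error type and the `get` accessor are all chosen for illustration.

```rust
use std::convert::TryFrom;

/// Toy probability newtype whose invariant is 0.0 <= value <= 1.0.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct Pr(f32);

/// Hypothetical error carrying the rejected value.
#[derive(Debug, PartialEq)]
pub struct InvalidProbability(pub f32);

impl Pr {
    /// Checked constructor; in this sketch it panics on out-of-range input.
    pub fn new(value: f32) -> Self {
        Pr::try_from(value).expect("probability must be within [0, 1]")
    }

    /// Skips the check; the caller promises the value is within [0, 1].
    pub fn new_unchecked(value: f32) -> Self {
        Pr(value)
    }

    /// Returns the wrapped value.
    pub fn get(self) -> f32 {
        self.0
    }
}

impl TryFrom<f32> for Pr {
    type Error = InvalidProbability;

    fn try_from(value: f32) -> Result<Self, Self::Error> {
        // NaN fails the range test and is rejected as well.
        if (0.0..=1.0).contains(&value) {
            Ok(Pr(value))
        } else {
            Err(InvalidProbability(value))
        }
    }
}
```

With a single unchecked escape hatch, every other code path that produces a `Pr` has to pass the range test.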

0.5.1(Mar 1, 2022)
Release 0.5.1

Linfa's 0.5.1 release fixes errors and bugs in the previous release, as well as removing useless trait bounds on the Dataset type. Note that the commits for this release are located in the 0-5-1 branch of the GitHub repo.

Improvements

remove Float trait bound from many Dataset impls, making non-float datasets usable

fix build errors in 0.5.0 caused by breaking minor releases from dependencies

fix bug in k-means where the termination condition of the algorithm was calculated incorrectly

fix build failure when building linfa alone, caused by incorrect feature selection for ndarray

0.5.0(Oct 21, 2021)
Linfa's 0.5.0 release adds initial support for the OPTICS algorithm, multinomial logistic regression, and the family of nearest neighbor algorithms. Furthermore, we have improved documentation and introduced hyperparameter checking to all algorithms.

New algorithms

OPTICS is an algorithm for finding density-based clusters. It can produce reachability plots and a hierarchical structure of clusters. A common use case is analysing data without any prior assumption about its distribution. The algorithm is added to linfa-clustering and an example can be found at linfa-clustering/examples/optics.rs.

Extending logistic regression to the multinomial distribution generalizes it to multiclass problems. This release adds support for multinomial logistic regression to linfa-logistic, you can experiment with the example at linfa-logistic/examples/winequality_multi.rs.

Nearest neighbor search finds the set of points neighboring a given sample. It appears in numerous fields of application as a distance metric provider (e.g. clustering). This release adds a family of nearest neighbor algorithms, namely Ball tree, K-d tree and naive linear search. You can find an example in the next section.

Improvements

use least-square solver from ndarray-linalg in linfa-linear

make clustering algorithms generic over distance metrics

bump ndarray to 0.15

introduce ParamGuard trait for explicit and implicit parameter checking (read more in the CONTRIBUTE.md)

improve documentation in various places

Nearest Neighbors

You can now choose from a growing list of NN implementations. The family provides efficient distance metrics to KMeans, DBSCAN etc. The example shows how to use KDTree nearest neighbor to find all the points in a set of observations that are within a certain range of a candidate point.

You can query nearest points explicitly:

    // create a KDTree index consisting of all the points in the observations, using Euclidean distance
    let kdtree = CommonNearestNeighbour::KdTree.from_batch(observations, L2Dist)?;
    let candidate = observations.row(2);
    let points = kdtree.within_range(candidate.view(), range)?;

Or use one of the distance metrics implicitly, here demonstrated for KMeans:

    use linfa_nn::distance::LInfDist;

    let model = KMeans::params_with(3, rng, LInfDist)
        .max_n_iterations(200)
        .tolerance(1e-5)
        .fit(&dataset)?;
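
To make the range query and the pluggable distance concrete without any dependencies, here is a naive linear search written against plain slices. It does not use the linfa-nn API; `within_range`, `l2_dist` and `linf_dist` are hypothetical helpers, and the range is treated as inclusive by assumption.

```rust
/// Euclidean (L2) distance.
pub fn l2_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f64>()
        .sqrt()
}

/// Chebyshev (L-infinity) distance: the largest coordinate difference.
pub fn linf_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0, f64::max)
}

/// Linear scan: indices and distances of all points within `range` (inclusive)
/// of `candidate`, under a caller-chosen distance function.
pub fn within_range(
    points: &[Vec<f64>],
    candidate: &[f64],
    range: f64,
    dist: impl Fn(&[f64], &[f64]) -> f64,
) -> Vec<(usize, f64)> {
    points
        .iter()
        .enumerate()
        .map(|(i, p)| (i, dist(p.as_slice(), candidate)))
        .filter(|&(_, d)| d <= range)
        .collect()
}
```

Tree-based indices return the same kind of result but avoid visiting every point; swapping `l2_dist` for `linf_dist` changes which points count as neighbors.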
0.4.0(Apr 28, 2021)
Linfa's 0.4.0 release introduces four new algorithms, improves documentation of the ICA and K-means implementations, adds more benchmarks to K-Means and updates to ndarray's 0.14 version.

New algorithms

The Partial Least Squares Regression model family is added in this release (thanks to @relf). It projects the observable, as well as predicted variables to a latent space and maximizes the correlation for them. For problems with a large number of targets or collinear predictors it gives a better performance when compared to standard regression. For more information look into the documentation of linfa-pls.

A wrapper for Barnes-Hut t-SNE is also added in this release. The t-SNE algorithm is often used for data visualization and projects data from a high-dimensional space to a similar representation in two or three dimensions. It does so by minimizing the Kullback-Leibler divergence between the high-dimensional source distribution and the target distribution. The Barnes-Hut approximation improves the runtime drastically while retaining the performance. Kudos to github/frjnn for providing an implementation!

A new preprocessing crate makes working with textual data and data normalization easy (thanks to @Sauro98). It implements count-vectorizer and TF-IDF normalization for text pre-processing. Normalizations for signals include linear scaling, norm scaling and whitening with PCA/ZCA/Cholesky. An example with a Naive Bayes model achieves 84% F1 score for predicting categories alt.atheism, talk.religion.misc, comp.graphics and sci.space on a news dataset.

Platt scaling calibrates a real-valued classification model to probabilities over two classes. This is used for the SV classification when probabilities are required. Further a multi class model, combining multiple binary models (e.g. calibrated SVM models) into a single multi-class model is also added. These composing models are moved to the linfa/src/composing/ subfolder.
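
Platt scaling maps a real-valued decision value f to a probability through a sigmoid, p = 1 / (1 + exp(A * f + B)). The sketch below only applies the mapping for given coefficients; fitting A and B to held-out data is omitted, and the function names are hypothetical rather than taken from linfa.

```rust
/// Probability of the positive class for one decision value,
/// using the Platt sigmoid with coefficients `a` and `b`.
/// With a negative `a`, larger decision values give higher probabilities.
pub fn platt_probability(decision_value: f64, a: f64, b: f64) -> f64 {
    1.0 / (1.0 + (a * decision_value + b).exp())
}

/// Calibrate a batch of decision values with the same coefficients.
pub fn calibrate(decision_values: &[f64], a: f64, b: f64) -> Vec<f64> {
    decision_values
        .iter()
        .map(|&f| platt_probability(f, a, b))
        .collect()
}
```

A decision value on the separating boundary (f = 0 with B = 0) maps to probability 0.5.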

Improvements

Numerous improvements are added to the KMeans implementation, thanks to @YuhanLiin. The implementation is optimized for offline training, an incremental training model is added and KMeans++/KMeans|| initialization gives good initial cluster means for medium and large datasets.

We also moved to ndarray's version 0.14 and introduced F::cast for simpler floating point casting. The trait signature of linfa::Fit is changed such that it always returns a Result and error handling is added for the linfa-logistic and linfa-reduction subcrates.

You often have to compare several model parametrizations with k-folding. For this, a new function cross_validate is added which takes the number of folds, model parameters and a closure for the evaluation metric. It automatically calls k-folding and averages the metric over the folds. To compare different L1 ratios of an elasticnet model, you can use it in the following way:

    // L1 ratios to compare
    let ratios = vec![0.1, 0.2, 0.5, 0.7, 1.0];

    // create a model for each parameter
    let models = ratios
        .iter()
        .map(|ratio| ElasticNet::params().penalty(0.3).l1_ratio(*ratio))
        .collect::<Vec<_>>();

    // get the mean r2 validation score across 5 folds for each model
    let r2_values =
        dataset.cross_validate(5, &models, |prediction, truth| prediction.r2(&truth))?;

    // show the mean r2 score for each parameter choice
    for (ratio, r2) in ratios.iter().zip(r2_values.iter()) {
        println!("L1 ratio: {}, r2 score: {}", ratio, r2);
    }

Other changes

fix for border points in the DBSCAN implementation

improved documentation of the ICA subcrate

prevent overflowing code example in website

0.3.1(Mar 11, 2021)
In this release of Linfa the documentation is extended, new examples are added and the functionality of datasets improved. No new algorithms were added.

The meta-issue #82 gives a good overview of the necessary documentation improvements and testing/documentation/examples were considerably extended in this release.

Further new functionality was added to datasets and multi-target datasets are introduced. Bootstrapping is now possible for features and samples and you can cross-validate your model with k-folding. We polished various bits in the kernel machines and simplified the interface there.

The trait structure of regression metrics is simplified and the silhouette score is introduced for easier testing of K-Means and other algorithms.

Changes

improve documentation in all algorithms, various commits

add a website to the infrastructure (c8acc785b)

add k-folding with and without copying (b0af80546f8)

add feature naming and pearson's cross correlation (71989627f)

improve ergonomics when handling kernels (1a7982b973)

improve TikZ generator in linfa-trees (9d71f603bbe)

introduce multi-target datasets (b231118629)

simplify regression metrics and add cluster metrics (d0363a1fa8ef)

Example

You can now perform cross-validation with k-folding. @Sauro98 actually implemented two versions, one which copies the dataset into k folds and one which avoids excessive memory operations by copying only the validation dataset around. For example, to test a model with 8-folding:

    // perform cross-validation with the F1 score
    let f1_runs = dataset
        .iter_fold(8, |v| params.fit(&v).unwrap())
        .map(|(model, valid)| {
            let cm = model
                .predict(&valid)
                .mapv(|x| x > Pr::even())
                .confusion_matrix(&valid).unwrap();

            cm.f1_score()
        })
        .collect::<Array1<_>>();

    // calculate mean and standard deviation
    println!("F1 score: {}±{}",
        f1_runs.mean().unwrap(),
        f1_runs.std_axis(Axis(0), 0.0),
    );
0.3.0(Jan 21, 2021)
New algorithms

Approximated DBSCAN has been added to linfa-clustering by [@Sauro98]

Gaussian Naive Bayes has been added to linfa-bayes by [@VasanthakumarV]

Elastic Net linear regression has been added to linfa-elasticnet by [@paulkoerbitz] and [@bytesnake]

Changes

Added benchmark to gaussian mixture models (a3eede55)

Fixed bugs in linear decision trees, added generator for TiKZ trees (bfa5aebe7)

Implemented serde for all crates behind feature flag (4f0b63bb)

Implemented new backend features (7296c9ec4)

Introduced linfa-datasets for easier testing (3cec12b4f)

Rename Dataset to DatasetBase and introduce Dataset and DatasetView (21dd579cf)

Improve kernel tests and documentation (8e81a6d)

Example

The following section shows a small example how datasets interact with the training and testing of a Linear Decision Tree.

You can load a dataset, shuffle it and then split it into training and validation sets:

    // initialize pseudo random number generator with seed 42
    let mut rng = Isaac64Rng::seed_from_u64(42);

    // load the Iris dataset, shuffle and split with ratio 0.8
    let (train, test) = linfa_datasets::iris()
        .shuffle(&mut rng)
        .split_with_ratio(0.8);

With the training dataset a linear decision tree model can be trained. Entropy is used as a metric for the optimal split here:

    let entropy_model = DecisionTree::params()
        .split_quality(SplitQuality::Entropy)
        .max_depth(Some(100))
        .min_weight_split(10.0)
        .min_weight_leaf(10.0)
        .fit(&train);
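
For intuition about entropy as a split criterion, the std-only sketch below computes Shannon entropy from per-class counts and the information gain of a binary split. It is not taken from linfa-trees: the function names are hypothetical and the use of log base 2 is an assumption.

```rust
/// Shannon entropy (in bits) of a class distribution given as per-class counts.
pub fn entropy(counts: &[usize]) -> f64 {
    let total: usize = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Information gain of splitting a node into `left` and `right`,
/// where both slices hold per-class counts in the same class order.
pub fn information_gain(left: &[usize], right: &[usize]) -> f64 {
    let parent: Vec<usize> = left.iter().zip(right).map(|(l, r)| l + r).collect();
    let n_left = left.iter().sum::<usize>() as f64;
    let n_right = right.iter().sum::<usize>() as f64;
    let n = n_left + n_right;
    if n == 0.0 {
        return 0.0;
    }
    // Parent impurity minus the size-weighted impurity of the children.
    entropy(&parent) - (n_left / n) * entropy(left) - (n_right / n) * entropy(right)
}
```

A split that separates two balanced classes perfectly gains one full bit, while a split that leaves both children as mixed as the parent gains nothing.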

The validation dataset is now used to estimate the error. For this, the labels are predicted and a confusion matrix then gives a clue about the types of errors:

    let cm = entropy_model
        .predict(test.records().view())
        .confusion_matrix(&test);

    println!("{:?}", cm);

    println!(
        "Test accuracy with Entropy criterion: {:.2}%",
        100.0 * cm.accuracy()
    );

Finally, you can analyze which features were used in the decision and export the whole tree to a TeX file. It will contain a TiKZ tree with information on the splitting decision and impurity improvement:

    let feats = entropy_model.features();
    println!("Features trained in this tree {:?}", feats);

    let mut tikz = File::create("decision_tree_example.tex").unwrap();
    tikz.write(entropy_model.export_to_tikz().to_string().as_bytes())
        .unwrap();

The whole example can be found in linfa-trees/examples/decision_tree.rs.
0.2.1(Nov 29, 2020)
Changes

remove feature flags, blocked by https://github.com/rust-lang/cargo/issues/7915

make ready for crates.io

0.2.0(Nov 26, 2020)
New algorithms

Ordinary Linear Regression has been added to linfa-linear by [@Nimpruda] and [@paulkoerbitz]

Generalized Linear Models have been added to linfa-linear by [@VasanthakumarV]

Linear decision trees were added to linfa-trees by [@mossbanay]

Fast independent component analysis (ICA) has been added to linfa-ica by [@VasanthakumarV]

Principal Component Analysis and Diffusion Maps have been added to linfa-reduction by [@bytesnake]

Support Vector Machines have been added to linfa-svm by [@bytesnake]

Logistic regression has been added to linfa-logistic by [@paulkoerbitz]

Hierarchical agglomerative clustering has been added to linfa-hierarchical by [@bytesnake]

Gaussian Mixture Models have been added to linfa-clustering by [@relf]

Changes

Common metrics for classification and regression have been added

A new dataset interface simplifies the work with targets and labels

New traits for Transformer, Fit and IncrementalFit standardize the interface

Switched to Github Actions for better integration
