Statistical computation library for Rust

Last update: Jan 4, 2023

Overview

statrs

Current Version: v0.15.0

Should work for both nightly and stable Rust.

NOTE: While I will try to maintain backwards compatibility as much as possible, this is still a 0.x.x project, so the API is not considered stable and is subject to breaking changes until v1.0.0.

Description

Statrs provides a host of statistical utilities for Rust scientific computing. Included are a number of common distributions that can be sampled (e.g. Normal, Exponential, Student's T, Gamma, Uniform, etc.) plus common statistical functions like the gamma function, beta function, and error function.

This library is a work-in-progress port of the statistical capabilities in the C# Math.NET library. Unit tests are borrowed from Math.NET where possible and written from scratch where not.

The library is not yet complete. Planned for future releases are further distribution implementations and ports of more statistical utilities.

Please check out the documentation here

Usage

Add the most recent release to your Cargo.toml

[dependencies]
statrs = "0.15"

Examples

Statrs comes with a number of commonly used distributions, including Normal, Gamma, Student's T, Exponential, and Weibull. The common use case is to set up a distribution and sample from it; sampling depends on the Rand crate for random number generation.

use statrs::distribution::Exp;
use rand::distributions::Distribution;

let mut r = rand::rngs::OsRng;
let n = Exp::new(0.5).unwrap();
print!("{}", n.sample(&mut r));

Statrs also comes with a number of useful utility traits for more detailed introspection of distributions

use statrs::distribution::{Exp, Continuous, ContinuousCDF};
use statrs::statistics::Distribution;

let n = Exp::new(1.0).unwrap();
assert_eq!(n.mean(), Some(1.0));
assert_eq!(n.variance(), Some(1.0));
assert_eq!(n.entropy(), Some(1.0));
assert_eq!(n.skewness(), Some(2.0));
assert_eq!(n.cdf(1.0), 0.6321205588285576784045);
assert_eq!(n.pdf(1.0), 0.3678794411714423215955);

Statrs also provides utility functions including erf, gamma, ln_gamma, beta, etc. Statistics that are undefined for the given parameters are returned as None:

use statrs::statistics::Distribution;
use statrs::distribution::FisherSnedecor;

let n = FisherSnedecor::new(1.0, 1.0).unwrap();
assert!(n.variance().is_none());
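The cdf and pdf constants asserted for Exp::new(1.0) earlier in this section are 1 - e^-1 and e^-1. The following is a minimal std-only sketch of the closed-form expressions for the exponential distribution; the function names exp_cdf and exp_pdf are hypothetical and this is not necessarily how statrs computes these values internally.

```rust
/// CDF of the exponential distribution: 1 - exp(-rate * x) for x >= 0, else 0.
/// Hypothetical helper for illustration; not part of statrs.
fn exp_cdf(rate: f64, x: f64) -> f64 {
    if x < 0.0 {
        0.0
    } else {
        1.0 - (-rate * x).exp()
    }
}

/// PDF of the exponential distribution: rate * exp(-rate * x) for x >= 0, else 0.
/// Hypothetical helper for illustration; not part of statrs.
fn exp_pdf(rate: f64, x: f64) -> f64 {
    if x < 0.0 {
        0.0
    } else {
        rate * (-rate * x).exp()
    }
}
```

With rate = 1.0 and x = 1.0 these evaluate to roughly 0.632 and 0.368, matching the leading digits of the constants in the assertions.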

Contributing

Want to contribute? Check out some of the issues marked help wanted

How to contribute

Clone the repo:

git clone https://github.com/statrs-dev/statrs

Create a feature branch:

git checkout -b <feature_branch> master

After committing your code:

git push -u origin <feature_branch>

Then submit a PR, preferably referencing the relevant issue.

Style

This repo uses rustfmt with the configuration specified in rustfmt.toml. See https://github.com/rust-lang-nursery/rustfmt for installation and usage instructions. Before committing, run the formatter with rustfmt --write-mode overwrite *.rs in the src directory.

Commit messages

Please be explicit and purposeful with commit messages.

Bad

Modify test code

Good

test: Update statrs::distribution::Normal test_cdf

Comments

Review of iterator statistics trait

Currently the iterator statistics trait is treated as a special case since to act over the iterator the methods need to take a mutable reference, so all the traits from statrs::statistics are (going to be) combined in the IterStatistics trait that is implemented for all Iterators. I haven't come up with a better solution but for some reason this implementation doesn't sit too well with me and I'd love to have someone review it and provide feedback.
help wanted

opened by boxtown 20

License compliance issue

I would like to bring to your attention the fact that your licensing is incompatible with your dependencies.

Your crate is MIT licensed but depends upon nalgebra which is Apache-2.0 licensed only. While Apache-2.0 projects can use MIT licensed components, the reverse is not so.

Consider, for instance, that MIT is GPL v3 compatible, while Apache-2.0 is not. Your crate being MIT licensed is thus misleading to any potential GPL v3 projects that consider using it: they would end up constructing a non-compliant product.

opened by jnqnfe 9

Update to nalgebra 0.27.1 to avoid RUSTSEC-2021-0070

statrs's latest version is 0.14.0. (Even though the website says it is 0.13.0, in the README.md and the link to the docs in the "About" section at the top right.)

statrs 0.14.0 depends on nalgebra 0.26. nalgebra has a RUSTSEC-2021-0070 advisory against it. Among other things, this causes cargo deny to fail.

Version 0.27.1 of nalgebra fixes the advisory.

It would be very helpful if statrs could have its dependency on nalgebra updated to 0.27.1, and then a new version of statrs (0.14.1 or 0.15.0) be released. Thank you.

opened by nnethercote 8

Fix a bug in the uniform continuous distribution

Hi and thanks for the library!

Problem

I think there's a bug in the implementation of the continuous uniform distribution. According to the documentation, it should return values in the range [min, max], i.e. min ≤ random value ≤ max.

Unfortunately, the current implementation generates values in the range [min, max + 1), i.e. min ≤ random value < max + 1.

I think it might be a copy-paste bug 'inherited' from the discrete uniform distribution, where you really need to add 1 to the upper bound.

Solution

This PR is a partial fix for this bug: It changes the range of the continuous uniform distribution to [min, max), i.e. min ≤ random value < max. I don't have a quick fix to include the upper bound, but I think it's important to at least fix the + 1.0 issue. As of 0.7.0, you cannot really sample values between 0 and 1 without resorting to wild workarounds.
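The difference between the two ranges can be seen by scaling a unit sample u in [0, 1). The sketch below is a std-only illustration with hypothetical function names; it is not the statrs source, only the arithmetic the description implies.

```rust
/// Maps a unit sample `u` in [0, 1) onto [min, max), up to floating-point rounding.
/// Hypothetical helper illustrating the fixed behaviour.
fn scale_to_range(u: f64, min: f64, max: f64) -> f64 {
    min + u * (max - min)
}

/// Maps `u` onto [min, max + 1), which is the off-by-one range described above.
/// The `+ 1.0` is appropriate for an inclusive integer range, not for floats.
fn scale_to_range_off_by_one(u: f64, min: f64, max: f64) -> f64 {
    min + u * (max + 1.0 - min)
}
```

For min = 0.0, max = 1.0 and u = 0.75, the first function returns 0.75 while the off-by-one variant returns 1.5, which lies outside the documented range.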

References

MathNET implementation

discrete uniform distribution in statrs

rand::Rng trait, which says that gen_range() does not include the upper bound

opened by mp4096 8

Release request

Hello! This is a (hopefully) polite request to check if statrs is in a state ready to have a release cut? It's been 10mo according to crates.io and I for one am eager to get off rand 0.7. Thanks again for all your hard work!

opened by nlhepler 7

[RFC] Student's T inverse CDF

This branch is not fit for merging (see below), but I wanted to gauge interest in having this functionality in statrs.

Issues with the implementation in this branch:

[ ] Ignores location and scale parameters (they're assumed to be 0 and 1 respectively)

[x] Pulls in "special" and "approx" as deps

[x] No CheckedInverseCDF impl

[ ] 400 lines of unit tests is a bit much

[ ] Docs don't describe the formula

opened by asayers 7

Error handling: Panics vs Result

Currently the responsibility for guarding against exceptional cases (e.g. input outside the valid domain, mathematically invalid operations, etc.) is passed to the user. We panic when an operation does not make mathematical sense (e.g. calculating the cumulative distribution function for discrete distributions at a negative input), which forces users to double-check that their inputs are valid. While this results in technically correct and predictable behavior from the API, I'm not sure it's ergonomic or idiomatic, and I have been mulling over introducing a Result-based API, either replacing or in addition to the stricter panic-based API. This warrants some discussion, and I would love feedback from the community.
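To make the trade-off concrete, here is a minimal sketch of how a Result-returning method and a panicking wrapper could sit side by side. Everything in it is hypothetical: DomainError, checked_cdf, cdf, and the toy discrete uniform CDF on {0, ..., n} are illustrations, not statrs API. Negative input is treated as invalid only to mirror the example in the paragraph above.

```rust
use std::fmt;

/// Hypothetical error type for inputs outside a function's valid domain.
#[derive(Debug, PartialEq)]
pub enum DomainError {
    Negative(f64),
    NotANumber,
}

impl fmt::Display for DomainError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DomainError::Negative(x) => write!(f, "input {} is negative", x),
            DomainError::NotANumber => write!(f, "input is NaN"),
        }
    }
}

impl std::error::Error for DomainError {}

/// Result-based variant: toy CDF of a discrete uniform distribution on {0, ..., n}.
/// Invalid input is reported to the caller instead of panicking.
pub fn checked_cdf(n: u64, x: f64) -> Result<f64, DomainError> {
    if x.is_nan() {
        return Err(DomainError::NotANumber);
    }
    if x < 0.0 {
        return Err(DomainError::Negative(x));
    }
    // Clamp to the largest support point so the CDF saturates at 1.
    let k = x.floor().min(n as f64);
    Ok((k + 1.0) / (n as f64 + 1.0))
}

/// Panic-based variant built on top of the checked one.
pub fn cdf(n: u64, x: f64) -> f64 {
    checked_cdf(n, x).expect("x must be a non-negative number")
}
```

Layering the panicking function on the checked one keeps a single implementation of the domain checks, so offering both styles would not duplicate logic.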
help wanted discussion

opened by boxtown 7

Update to rand >= 0.8

Currently, statrs relies on rand 0.7 and nalgebra 0.23 (which itself relies on rand 0.7). The newer releases of nalgebra update their dependency to the latest rand, which is currently 0.8.3. One of the minor, but frustrating, differences between 0.7 and 0.8 is a change in the signature of the gen_range method, from taking 2 arguments to taking a single range argument.

It would be great if statrs could be updated to rely on a newer nalgebra and a newer rand. Currently, if one depends on rand >= 0.8 (or on any package that depends on it), multiple versions of rand are pulled in. Not only does this bloat the build, but it also risks confusion over definitions that appear in both versions of the package.

The actual changes to conform with the new gen_range interface are small, but I think some other changes would be needed, since the version of nalgebra should be bumped (though perhaps not to the very latest, 0.26.1, unless further changes are made, because it has deprecated some interfaces that statrs currently uses).

opened by rob-p 6

Use dev-dependencies for random number generation

I just saw this in the code:

#[ignore]
#[test]
fn test_mean_variance_stability() {
    // TODO: Implement tests. Depends on Mersenne Twister RNG implementation.
    // Currently hesitant to bring extra dependency just for test
}

You can add dependencies to the Cargo.toml that are only used when running the tests, but not when using the library as a dependency: http://doc.crates.io/specifying-dependencies.html#development-dependencies
enhancement help wanted
opened by vks 6

Allow vector of floats for mean in statrs::distribution::Normal (as per numpy.random.normal)

The code below understandably runs into a type mismatch, as the mean argument of statrs::distribution::Normal requires an f64.

Similar to np.random.normal in Python's numpy package, it would be good to add support for vector arguments for mean in statrs.

Rust code (doesn't work due to type mismatch in let n...)

let x0: Vec<f64> = thread_rng().sample_iter(Standard).take(200).collect();

let endpoint_mean: Vec<_> = x0
  .iter()
  .map(|&x| x * (-0.5).exp())
  .collect();

let endpoint_variance: f64 =
  (SIGMA.pow(2) as f64 /  (1.0 - (-1.0).exp())).sqrt();

// this should output a vector n of some arbitrary length (here, 200). the argument endpoint_mean could be a Vec<f64> (as per numpy), but currently must be f64.
let n = Normal::new(endpoint_mean, endpoint_variance).unwrap();

Python code (works with an array of floats)

x0 = np.random.normal(loc=0, scale=1, size=10000)  # initial points, a vector of 10_000
mean = x0 * np.exp(-0.5)  # vector of 10_000
variance = np.sqrt(4 / (1 - np.exp(-1)))  # scalar, approximately 2.5155
xt = np.random.normal(mean, variance)  # vector of 10_000
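One workaround that needs no API change relies on the location-scale property of the normal distribution: if z is a standard normal draw, then mean + std_dev * z is a draw from a normal distribution with that mean and standard deviation. The sketch below assumes the standard normal draws are produced elsewhere (for example by sampling a single Normal::new(0.0, 1.0) repeatedly); the function name is hypothetical and not part of statrs.

```rust
/// Shifts and scales standard normal draws `z` so that element i has mean
/// `means[i]` and standard deviation `std_dev`.
/// Hypothetical helper; output length is the shorter of `means` and `z`.
fn normal_draws_with_means(means: &[f64], std_dev: f64, z: &[f64]) -> Vec<f64> {
    means
        .iter()
        .zip(z)
        .map(|(&m, &zi)| m + std_dev * zi)
        .collect()
}
```

This builds only one distribution object regardless of how many means are supplied, which is roughly what the NumPy call does when it broadcasts over the mean array.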

opened by 0jg 5

Removes gamma special cases

Also removed the tests that instantiated gamma with infinity as a parameter. Tests are still failing, but I don't know the motivation behind those numbers. If it doesn't matter too much, I'll just update the test values.
do not merge

opened by ghost 5

Mutable/Movable parameters for multivariate normal

Hello, I'm studying the possibility of using this crate for Markov Chain Monte Carlo (MCMC) based inference. In this use case, the log-density of a distribution is evaluated repeatedly at different parameter values. To do that currently, the crate requires re-creating the distributions at each iteration. This isn't much of a problem for scalar distributions, but for the multivariate normal, I have to re-allocate the mean vector and covariance matrix at each iteration (since distributions are immutable), which impacts performance.

Allowing the user to re-set the parameters separately would work:

pub fn set_mean(&mut self, mu : &[f64]);
pub fn set_variance(&mut self, var : &[f64]);

But a solution that moves the parameters out of the struct would also work (therefore preserving the intended immutable API):

pub fn take_parameters(self) -> (DVector<f64>, DMatrix<f64>);

Are there any plans to offer something like that?
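The two proposals can be compared in a small std-only sketch. The MvNormalParams struct below is hypothetical and stores plain Vec<f64> buffers rather than nalgebra types; a real distribution would also have to refresh any derived quantities it caches whenever the covariance changes.

```rust
/// Hypothetical parameter holder; not the statrs MultivariateNormal.
pub struct MvNormalParams {
    mean: Vec<f64>,
    cov: Vec<f64>, // row-major, dim * dim entries
}

impl MvNormalParams {
    /// Returns None when the covariance buffer does not match the dimension.
    pub fn new(mean: Vec<f64>, cov: Vec<f64>) -> Option<Self> {
        if cov.len() == mean.len() * mean.len() {
            Some(Self { mean, cov })
        } else {
            None
        }
    }

    /// Overwrites the mean in place, reusing the existing allocation.
    /// Returns false and leaves the mean unchanged if the length differs.
    pub fn set_mean(&mut self, mu: &[f64]) -> bool {
        if mu.len() != self.mean.len() {
            return false;
        }
        self.mean.copy_from_slice(mu);
        true
    }

    pub fn mean(&self) -> &[f64] {
        &self.mean
    }

    /// Moves both buffers out so the caller can modify and reuse them.
    pub fn take_parameters(self) -> (Vec<f64>, Vec<f64>) {
        (self.mean, self.cov)
    }
}
```

The setter avoids reallocation but makes the value mutable; take_parameters keeps the value immutable while it exists and lets the caller rebuild it from the same buffers.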
opened by limads 0

Add CDF for multivariate normal

Multivariate Normal CDF

Implements the multivariate normal CDF

Algorithm

Uses the algorithm explained in Section 4.2.2 of Computation of Multivariate Normal and t Probabilities by Alan Genz and Frank Bretz, together with the Cholesky decomposition with dynamic changing of rows explained in Section 4.1.3. Specifically, we use a quasi-Monte Carlo method.

Additions

Trait ContinuousMultivariateCDF in mod.rs

Module MultivariateUniform in multivariate_uniform.rs (mainly for me wanting an in-house way to get uniform distribution in $[0,1]^n$). Implements mean, mode, pdf, cdf, min, max, ln_pdf.

Function chol_chrows for computing the Cholesky decomposition dynamically whilst changing rows for better integration limits

Function integrate_pdf to integrate a multivariate pdf between limits a and b

Implement ContinuousMultivariateCDF for MultivariateNormal (and MultivariateUniform), where cdf uses integrate_pdf with left limit a=[f64::NEG_INFINITY; dim] and right limit x=b

Tests cdf against scipy.stats.multivariate_normal.cdf in Python, as well as MvNormalCDF in Julia

Import crate primes for generating first $n$ primes as Richtmyer generators in the Quasi MC algorithm
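For readers unfamiliar with Richtmyer generators, the sketch below shows the basic idea: the k-th quasi-random point has coordinates equal to the fractional part of k times the square root of the i-th prime. It is a std-only illustration with hypothetical function names, it uses simple trial division instead of the primes crate, and it omits the random shift that is normally added for error estimation, so the PR's actual code may differ.

```rust
/// Returns the first `n` primes using trial division.
/// Hypothetical helper standing in for an external prime generator.
fn first_primes(n: usize) -> Vec<u64> {
    let mut primes: Vec<u64> = Vec::with_capacity(n);
    let mut candidate = 2u64;
    while primes.len() < n {
        // Only primes up to sqrt(candidate) need to be tested.
        let is_prime = primes
            .iter()
            .take_while(|&&p| p * p <= candidate)
            .all(|&p| candidate % p != 0);
        if is_prime {
            primes.push(candidate);
        }
        candidate += 1;
    }
    primes
}

/// k-th point of a Richtmyer sequence in [0, 1)^d, where d = primes.len().
/// Coordinate i is the fractional part of k * sqrt(primes[i]).
fn richtmyer_point(k: u64, primes: &[u64]) -> Vec<f64> {
    primes
        .iter()
        .map(|&p| (k as f64 * (p as f64).sqrt()).fract())
        .collect()
}
```

Calling richtmyer_point(k, &first_primes(d)) for k = 1, 2, ... yields a deterministic point set in the unit cube, which a quasi-Monte Carlo integrator can use in place of pseudo-random samples.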
opened by henryjac 0

Multivariate Student's t distribution

Implementation of the multivariate Student's t distribution, closely following the multivariate normal implementation. Includes sampling, mean, covariance, mode, pdf, and log pdf functions.

Tested against exact values from Python's scipy.stats.multivariate_t functions. Large degrees of freedom do not work yet.

opened by henryjac 0

Calculate the coefficient of variation without calculating the mean twice?

With this crate, is it possible to calculate the coefficient of variation / relative std dev (data.std_dev()/data.mean()) without calculating the mean twice?
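Independently of what the crate offers, both quantities can be obtained in one pass with Welford's online algorithm. The following std-only sketch uses a hypothetical function name and assumes the sample standard deviation (dividing by n - 1) is the one wanted.

```rust
/// Coefficient of variation (sample std dev / mean) computed in a single pass
/// using Welford's online algorithm. Hypothetical helper, not part of statrs.
/// Returns None for fewer than two values or a zero mean.
pub fn coefficient_of_variation(data: &[f64]) -> Option<f64> {
    let mut count = 0.0_f64;
    let mut mean = 0.0_f64;
    let mut m2 = 0.0_f64; // running sum of squared deviations from the mean

    for &x in data {
        count += 1.0;
        let delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    if count < 2.0 || mean == 0.0 {
        return None;
    }
    let std_dev = (m2 / (count - 1.0)).sqrt();
    Some(std_dev / mean)
}
```

Because the mean and the squared deviations are updated together, the data is traversed once and the mean is never recomputed.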

opened by Boscop 0

Mention quantile function in docstring

First of all, thanks for making and maintaining this project. I'm trying to make a wasm project and I would probably have given up without statrs.

Anyway, I'm always confused by the many synonyms used in statistics. This PR adds "quantile function" as a synonym for inverse_cdf in the docstring of the normal distribution, which should at least make it visible when searching the docs.

Adding the note to all inverse_cdf implementations seems a bit redundant, so I kept it to Normal for now.
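The equivalence of the two names can be shown numerically: the quantile at probability p is the point x where the CDF reaches p. The sketch below inverts an arbitrary monotone CDF by bisection; the function name and the bracketing interval are hypothetical, and it is not how statrs implements inverse_cdf.

```rust
/// Finds x in [lo, hi] such that cdf(x) is approximately p, by bisection.
/// Assumes `cdf` is non-decreasing and that cdf(lo) <= p <= cdf(hi).
/// Hypothetical helper for illustration; not part of statrs.
fn quantile_by_bisection<F: Fn(f64) -> f64>(cdf: F, p: f64, mut lo: f64, mut hi: f64) -> f64 {
    // 200 halvings shrink any reasonable bracket to floating-point resolution.
    for _ in 0..200 {
        let mid = 0.5 * (lo + hi);
        if cdf(mid) < p {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    0.5 * (lo + hi)
}
```

For the unit-rate exponential CDF, 1 - exp(-x), the quantile at p = 0.5 found this way agrees with the closed-form median ln 2.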

opened by rikhuijzer 0

Releases(v0.15.0)

v0.15.0(Jun 30, 2021)

Bumping nalgebra due to RUSTSEC-2021-0070 advisory.

v0.14.0(May 16, 2021)

Owner

GitHub https://docs.rs/statrs/0.13.0/statrs/
