Discrust

Supervised discretization in Rust

The discrust package provides a supervised discretization algorithm. Under the hood it implements a decision tree, using information value to find the optimal splits, and provides several different methods to constrain the final discretization scheme.
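To make the splitting criterion concrete, here is a minimal pure-Python sketch of the standard information value formula, plus a brute-force search for the single best threshold. It is an illustration only and not the crate's implementation. The function names `information_value` and `best_split` are hypothetical, and the choice of `x < t` versus `x >= t` for the two sides of a split is an assumption.

```python
import math

def information_value(bins):
    """Information value for a list of (n_pos, n_neg) counts, one pair per bin."""
    total_pos = sum(p for p, _ in bins)
    total_neg = sum(n for _, n in bins)
    iv = 0.0
    for n_pos, n_neg in bins:
        # Share of all positives / all negatives that fall in this bin.
        pos_dist = n_pos / total_pos
        neg_dist = n_neg / total_neg
        # Each bin contributes (difference in shares) * (log ratio of shares).
        iv += (pos_dist - neg_dist) * math.log(pos_dist / neg_dist)
    return iv

def best_split(x, y):
    """Return (threshold, iv) for the two-bin split with the highest IV.

    Assumption: the left bin is x < t and the right bin is x >= t.
    """
    best = (None, 0.0)
    for t in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        counts = [
            (sum(left), len(left) - sum(left)),
            (sum(right), len(right) - sum(right)),
        ]
        # IV is undefined when a bin has no positives or no negatives.
        if any(c == 0 for pair in counts for c in pair):
            continue
        iv = information_value(counts)
        if iv > best[1]:
            best = (t, iv)
    return best

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]
threshold, iv = best_split(x, y)
print(threshold, round(iv, 4))
```

A decision-tree style discretizer would repeat this search inside each resulting bin until a stopping rule such as min_iv or max_bins is hit.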

The package draws heavily from the ivpy package, both in the algorithm and the parameter controls.

Usage

The package has a single user-facing class, Discretizer, which can be instantiated with the following arguments.

min_obs (Optional[float], optional): Minimum number of observations required in a bin. Defaults to 5.
max_bins (Optional[int], optional): Maximum number of bins to split the variable into. Defaults to 10.
min_iv (Optional[float], optional): Minimum information value required to make a split. Defaults to 0.001.
min_pos (Optional[float], optional): Minimum number of records with a value of one that should be present in a split. Defaults to 5.
mono (Optional[int], optional): The monotonicity required between the binned variable and the binary performance outcome. A value of -1 will result in a negative correlation between the binned x variable and the y variable, while a value of 1 will result in a positive correlation between the binned x variable and the y variable. A value of 0 will bin x with no monotonicity constraint. If None is specified, the monotonicity will be determined by the monotonicity of the first split. Defaults to None.
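The mono setting can be pictured as a filter on candidate splits. The sketch below is one possible reading of the description above, not the package's code. The helpers `split_direction` and `split_allowed` are hypothetical, and each candidate split is reduced to the event rate on its left and right side.

```python
def split_direction(left_rate, right_rate):
    """Direction implied by one split: 1 if the event rate rises left to right, else -1."""
    return 1 if right_rate >= left_rate else -1

def split_allowed(left_rate, right_rate, mono):
    """Check one candidate split against a mono setting of -1, 0, or 1."""
    if mono == 0:
        return True  # no monotonicity constraint
    return split_direction(left_rate, right_rate) == mono

# With mono=None, the first split made fixes the direction for later splits.
mono = None
candidates = [(0.20, 0.45), (0.50, 0.30), (0.10, 0.15)]  # (left_rate, right_rate)
accepted = []
for left, right in candidates:
    if mono is None:
        mono = split_direction(left, right)
    if split_allowed(left, right, mono):
        accepted.append((left, right))

print(mono, accepted)
```

Here the first candidate is increasing, so the second (decreasing) candidate is rejected.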

The fit method can be called on data and accepts the following parameters.

x (ArrayLike): An arraylike numeric field that will be discretized based on the values of y, and the constraints the Discretizer was initialized with.
y (ArrayLike): An arraylike binary field.
sample_weight (Optional[ArrayLike], optional): Optional sample weight array to be used when calculating the optimal breaks. Defaults to None.

This method returns a list of the optimal split values for the feature, given the constraints. After fitting, the discretizer has a splits_ attribute containing this list.

import seaborn as sns

df = sns.load_dataset("titanic")

from discrust import Discretizer

ds = Discretizer(min_obs=5, max_bins=10, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["fare"], df["survived"])
# [-inf, 6.95, 7.125, 7.7292, 10.4625, 15.1, 50.4958, 52.0, 73.5, 79.65, inf]
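The returned list can be read as bin edges. As a rough illustration, the standard library's bisect module can map a value to the index of the bin it falls in. The sketch assumes bins are closed on the left (edge <= x < next edge), which the source does not state, and `bin_index` is a hypothetical helper, not part of the package.

```python
from bisect import bisect_right

# Split list copied from the example output above.
splits = [float("-inf"), 6.95, 7.125, 7.7292, 10.4625, 15.1,
          50.4958, 52.0, 73.5, 79.65, float("inf")]

def bin_index(value, splits):
    """Return i such that splits[i] <= value < splits[i + 1] (assumed convention)."""
    n_bins = len(splits) - 1
    # bisect_right counts the edges <= value; subtract one to get the bin.
    # The min() keeps a value equal to the last edge inside the final bin.
    return min(bisect_right(splits, value) - 1, n_bins - 1)

values = [7.25, 71.2833, 7.925, 53.1, 8.05]  # illustrative inputs
print([bin_index(v, splits) for v in values])
# [2, 7, 3, 7, 3]
```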

The predict method discretizes the feature and then performs weight of evidence substitution on each binned level. It takes the following argument.

x (ArrayLike): An arraylike numeric field.

ds.predict(df["fare"])[0:5]
# array([-0.84846814, 0.78344263, -0.787529, 0.78344263, -0.787529])
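Weight of evidence substitution replaces each bin with a single number derived from that bin's share of positives and negatives. The sketch below uses made-up counts and a hypothetical `woe_per_bin` helper. The sign convention (log of positive share over negative share) is an assumption, although it agrees with the example output above, where the higher values receive positive scores.

```python
import math

def woe_per_bin(pos_counts, neg_counts):
    """Weight of evidence per bin: ln(share of positives / share of negatives)."""
    total_pos = sum(pos_counts)
    total_neg = sum(neg_counts)
    return [
        math.log((p / total_pos) / (n / total_neg))
        for p, n in zip(pos_counts, neg_counts)
    ]

# Hypothetical counts for three bins (not taken from the titanic data).
pos_counts = [10, 30, 60]
neg_counts = [50, 30, 20]
woe = woe_per_bin(pos_counts, neg_counts)

# Substitution: every record is replaced by the WoE of the bin it belongs to.
bin_of_record = [0, 2, 1, 2, 0]
encoded = [woe[b] for b in bin_of_record]
print([round(v, 4) for v in encoded])
```

A bin where positives and negatives are equally represented gets a value of zero, which is why the middle bin in this example encodes to 0.0.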

Installation

From PyPI

For Windows users, the package can be installed directly from PyPI with the following command.

python -m pip install discrust

Building from Source

The package can be built from source; it uses the maturin tool as a build backend. This tool requires Python and a working Rust compiler (see here for details). If these two requirements are met, you can clone this repository and run the following command in the repository's root directory.

python -m pip install . -v

This should invoke the maturin tool, which will handle building the Rust code and installing the package. Alternatively, if you simply want to build a wheel, you can run the following command after installing maturin.

maturin build --release

I have had some problems building packages with maturin directly in a conda environment. This is actually a bug on Anaconda's side that will hopefully be resolved. If it gives you any problems, it is usually easiest to build a wheel inside a venv and then install the wheel.

Additional TODOs

Support for exception values
Support for missing values in both the dependent and independent variables

Comments

Update feature aggs

Updated how aggregations are calculated and used. The code now uses precomputed vectors of min and max values, so aggregations are performed only once for the feature.

Git actions

This pull request adds GitHub Actions to the project, to support building and testing the package on multiple operating systems and Python versions.


Releases (0.1.6)

0.1.6 (May 8, 2022)

This release adds a number of optimizations to the package. Aggregations only happen once, and the pre-computed values are always used, instead of recalculating statistics for each split. A special sort method that allows for missing values is only used if exception values are provided. A split index is determined and used throughout the crate, instead of it ever needing to be recalculated.
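
One common way to avoid recalculating statistics for every candidate split is to aggregate once per unique value and keep running totals, so the counts for any contiguous range come from two subtractions. The sketch below illustrates that general idea in Python. It is not a description of the crate's internals, and `precompute` and `range_stats` are hypothetical names.

```python
from itertools import accumulate

def precompute(x, y):
    """Aggregate once: per unique x value, count records and positives, then take running totals."""
    agg = {}
    for xi, yi in zip(x, y):
        n, p = agg.get(xi, (0, 0))
        agg[xi] = (n + 1, p + yi)
    values = sorted(agg)
    # A leading zero lets a range be expressed as cum[stop] - cum[start].
    cum_n = [0] + list(accumulate(agg[v][0] for v in values))
    cum_pos = [0] + list(accumulate(agg[v][1] for v in values))
    return values, cum_n, cum_pos

def range_stats(cum_n, cum_pos, start, stop):
    """Return (n_records, n_positives) for values[start:stop] without rescanning the data."""
    return cum_n[stop] - cum_n[start], cum_pos[stop] - cum_pos[start]

x = [3, 1, 2, 2, 3, 1]
y = [1, 0, 0, 1, 1, 0]
values, cum_n, cum_pos = precompute(x, y)
print(range_stats(cum_n, cum_pos, 0, 2))  # records with x in {1, 2}
print(range_stats(cum_n, cum_pos, 2, 3))  # records with x == 3
```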