m2cgen (Model 2 Code Generator) is a lightweight library that provides an easy way to transpile trained statistical models into native code

Bayes' Witnesses

Last update: Dec 31, 2022

Related tags

Machine learning javascript ruby python c java go rust php machine-learning haskell r lightning csharp scikit-learn statistical-learning xgboost lightgbm statsmodels dartlang

Overview

m2cgen

m2cgen (Model 2 Code Generator) is a lightweight library that provides an easy way to transpile trained statistical models into native code (Python, C, Java, Go, JavaScript, Visual Basic, C#, PowerShell, R, PHP, Dart, Haskell, Ruby, F#, Rust, Elixir).

Installation
Supported Languages
Supported Models
Classification Output
Usage
CLI
FAQ

Installation

Supported Python version is >= 3.7.

pip install m2cgen

Supported Languages

C
C#
Dart
F#
Go
Haskell
Java
JavaScript
PHP
PowerShell
Python
R
Ruby
Rust
Visual Basic (VBA-compatible)
Elixir

Supported Models

Model type	Classification	Regression
Linear	scikit-learn: LogisticRegression, LogisticRegressionCV, PassiveAggressiveClassifier, Perceptron, RidgeClassifier, RidgeClassifierCV, SGDClassifier; lightning: AdaGradClassifier, CDClassifier, FistaClassifier, SAGAClassifier, SAGClassifier, SDCAClassifier, SGDClassifier	scikit-learn: ARDRegression, BayesianRidge, ElasticNet, ElasticNetCV, GammaRegressor, HuberRegressor, Lars, LarsCV, Lasso, LassoCV, LassoLars, LassoLarsCV, LassoLarsIC, LinearRegression, OrthogonalMatchingPursuit, OrthogonalMatchingPursuitCV, PassiveAggressiveRegressor, PoissonRegressor, RANSACRegressor (only supported regression estimators can be used as a base estimator), Ridge, RidgeCV, SGDRegressor, TheilSenRegressor, TweedieRegressor; StatsModels: Generalized Least Squares (GLS), Generalized Least Squares with AR Errors (GLSAR), Generalized Linear Models (GLM), Ordinary Least Squares (OLS), [Gaussian] Process Regression Using Maximum Likelihood-based Estimation (ProcessMLE), Quantile Regression (QuantReg), Weighted Least Squares (WLS); lightning: AdaGradRegressor, CDRegressor, FistaRegressor, SAGARegressor, SAGRegressor, SDCARegressor, SGDRegressor
SVM	scikit-learn: LinearSVC, NuSVC, OneClassSVM, SVC; lightning: KernelSVC, LinearSVC	scikit-learn: LinearSVR, NuSVR, SVR; lightning: LinearSVR
Tree	DecisionTreeClassifier, ExtraTreeClassifier	DecisionTreeRegressor, ExtraTreeRegressor
Random Forest	ExtraTreesClassifier, LGBMClassifier (rf booster only), RandomForestClassifier, XGBRFClassifier	ExtraTreesRegressor, LGBMRegressor (rf booster only), RandomForestRegressor, XGBRFRegressor
Boosting	LGBMClassifier (gbdt/dart/goss booster only), XGBClassifier (gbtree (including boosted forests)/gblinear booster only)	LGBMRegressor (gbdt/dart/goss booster only), XGBRegressor (gbtree (including boosted forests)/gblinear booster only)

You can find versions of packages with which compatibility is guaranteed by CI tests here. Other versions can also be supported but they are untested.

Classification Output

Linear / Linear SVM / Kernel SVM

Binary

Scalar value; signed distance of the sample to the hyperplane for the second class.

Multiclass

Vector value; signed distance of the sample to the hyperplane per each class.

Comment

The output is consistent with the output of LinearClassifierMixin.decision_function.
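
Because the generated code returns raw decision scores rather than labels, the caller has to map them to a class. The following is a minimal sketch of that mapping, assuming the usual scikit-learn convention (a positive scalar selects the second class; for a vector, the largest score wins). The helper name predicted_class_index is hypothetical and not part of m2cgen.

```python
from typing import Sequence, Union


def predicted_class_index(score: Union[float, Sequence[float]]) -> int:
    """Map a decision-function style output to a class index.

    Hypothetical helper: assumes a positive scalar means the second class
    and that the largest entry of a vector marks the predicted class.
    """
    if isinstance(score, (int, float)):
        # Binary case: the sign of the distance selects class 0 or class 1.
        return 1 if score > 0 else 0
    # Multiclass case: pick the index of the largest signed distance.
    return max(range(len(score)), key=lambda i: score[i])
```

The returned index refers to the position in the estimator's classes_ attribute, so the original label can be recovered on the Python side if needed.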

SVM

Outlier detection

Scalar value; signed distance of the sample to the separating hyperplane: positive for an inlier and negative for an outlier.

Binary

Scalar value; signed distance of the sample to the hyperplane for the second class.

Multiclass

Vector value; one-vs-one score for each class, shape (n_samples, n_classes * (n_classes-1) / 2).

Comment

The output is consistent with the output of BaseSVC.decision_function when the decision_function_shape is set to ovo.
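
To make the one-vs-one output length concrete, here is a small sketch that enumerates the class pairs behind the n_classes * (n_classes-1) / 2 scores. The pair ordering shown (0 vs 1, 0 vs 2, ..., then 1 vs 2, ...) follows the convention described in the scikit-learn documentation; treat it as an assumption and verify it against your model. The function name ovo_pairs is hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def ovo_pairs(n_classes: int) -> List[Tuple[int, int]]:
    """List the class pairs compared by a one-vs-one classifier."""
    # combinations() yields (0, 1), (0, 2), ..., (1, 2), ... in that order.
    return list(combinations(range(n_classes), 2))


# A 4-class model yields 4 * 3 / 2 pairwise scores.
print(len(ovo_pairs(4)))
```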

Tree / Random Forest / Boosting

Binary

Vector value; class probabilities.

Multiclass

Vector value; class probabilities.

Comment

The output is consistent with the output of the predict_proba method of DecisionTreeClassifier / ExtraTreeClassifier / ExtraTreesClassifier / RandomForestClassifier / XGBRFClassifier / XGBClassifier / LGBMClassifier.

Usage

Here's a simple example of how a linear model trained in a Python environment can be represented in Java code:

from sklearn.datasets import load_diabetes
from sklearn import linear_model
import m2cgen as m2c

X, y = load_diabetes(return_X_y=True)

estimator = linear_model.LinearRegression()
estimator.fit(X, y)

code = m2c.export_to_java(estimator)

Generated Java code:

public class Model {
    public static double score(double[] input) {
        return ((((((((((152.1334841628965) + ((input[0]) * (-10.012197817470472))) + ((input[1]) * (-239.81908936565458))) + ((input[2]) * (519.8397867901342))) + ((input[3]) * (324.39042768937657))) + ((input[4]) * (-792.1841616283054))) + ((input[5]) * (476.74583782366153))) + ((input[6]) * (101.04457032134408))) + ((input[7]) * (177.06417623225025))) + ((input[8]) * (751.2793210873945))) + ((input[9]) * (67.62538639104406));
    }
}

You can find more examples of generated code for different models/languages here.
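
To see what the generated Java code above computes, it can help to write the same arithmetic by hand: an intercept plus a weighted sum of the ten inputs. The sketch below copies the constants from the Java output; the names INTERCEPT, COEFFICIENTS and the length check are illustrative additions, not something m2cgen emits.

```python
from typing import Sequence

# Constants copied from the generated Java code above.
INTERCEPT = 152.1334841628965
COEFFICIENTS = [
    -10.012197817470472,
    -239.81908936565458,
    519.8397867901342,
    324.39042768937657,
    -792.1841616283054,
    476.74583782366153,
    101.04457032134408,
    177.06417623225025,
    751.2793210873945,
    67.62538639104406,
]


def score(input_vector: Sequence[float]) -> float:
    """Evaluate the linear model: intercept + sum(x_i * w_i)."""
    if len(input_vector) != len(COEFFICIENTS):
        raise ValueError("expected %d features" % len(COEFFICIENTS))
    return INTERCEPT + sum(x * w for x, w in zip(input_vector, COEFFICIENTS))
```

Note that this version sums the products before adding the intercept, so the last few bits of the result may differ from the strictly left-to-right evaluation in the Java code.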

CLI

m2cgen can be used as a CLI tool to generate code using serialized model objects (pickle protocol):

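$ m2cgen <pickle_file> --language <language> [--indent <indent>] [--function_name <function_name>]
         [--class_name <class_name>] [--module_name <module_name>] [--package_name <package_name>]
         [--namespace <namespace>] [--recursion-limit <recursion_limit>]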

Don't forget that for unpickling serialized model objects their classes must be defined in the top level of an importable module in the unpickling environment.
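
The reason for this requirement is that a pickle stores a reference to the class (module name plus class name), not the class definition itself. The standard-library sketch below illustrates this with fractions.Fraction standing in for a model class; LocalModel and the helper names are hypothetical and exist only for the demonstration.

```python
import pickle
from fractions import Fraction

# The payload names the class by module and class name only.
payload = pickle.dumps(Fraction(1, 2))
print(b"fractions" in payload, b"Fraction" in payload)


def make_local_instance():
    """Create an instance of a class that is NOT defined at module top level."""
    class LocalModel:  # hypothetical stand-in for a model class
        pass
    return LocalModel()


def can_pickle(obj) -> bool:
    """Return True if obj can be serialized with pickle."""
    try:
        pickle.dumps(obj)
    except (AttributeError, pickle.PicklingError):
        # Local classes cannot be looked up by name, so pickling fails.
        return False
    return True
```

In practice this means the environment running the m2cgen CLI needs the same library that defined the model (for example scikit-learn) to be installed and importable.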

Piping is also supported:

$ cat <pickle_file> | m2cgen --language <language>

FAQ

Q: Generation fails with RecursionError: maximum recursion depth exceeded error.

A: If this error occurs while generating code using an ensemble model, try to reduce the number of trained estimators within that model. Alternatively you can increase the maximum recursion depth with sys.setrecursionlimit().
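
If you raise the recursion limit, it may be safer to do so only around the export call and restore the previous value afterwards. The following is one possible sketch using only the standard library; recursion_limit and nesting_depth are hypothetical names, and nesting_depth merely stands in for a deep traversal such as code generation for a large ensemble.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit: int):
    """Temporarily raise the interpreter's recursion limit."""
    previous = sys.getrecursionlimit()
    # Never lower the limit below its current value.
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


def nesting_depth(n: int) -> int:
    """Toy recursive function standing in for a deep model traversal."""
    return 0 if n == 0 else 1 + nesting_depth(n - 1)


with recursion_limit(10000):
    print(nesting_depth(3000))
```

With a real model, the export call (for instance m2c.export_to_java(estimator)) would go inside the with block in place of nesting_depth.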

Q: Generation fails with the ImportError: No module named error while transpiling a model from a serialized model object.

A: This error indicates that the pickle protocol cannot deserialize the model object. To unpickle serialized model objects, their classes must be defined at the top level of an importable module in the unpickling environment. Installing the package that provides the model's class definition should therefore solve the problem.

Q: Code generated by m2cgen gives different results for some inputs compared to the original Python model from which the code was obtained.

A: Some models force input data to be a particular type during the prediction phase in their native Python libraries. Currently, m2cgen works only with the float64 (double) data type. You can try to cast your input data to another type manually and check the results again. Also, some small differences can occur due to the specific implementation of floating-point arithmetic in the target language.

Comments

Code generated for XGBoost models returns invalid scores when tree_method is set to "hist"

I have trained XGBoost models in Python and am using the CLI to convert the serialized models to pure Python. However, when I use the pure Python code, the results differ from the predictions made with the model directly.

Python 3.7 xgboost 0.90

My model has a large number of parameters (somewhat over 500). Here are predicted class probabilities from the original model:

Here are the same predicted probabilities using the generated python code via m2cgen:

We can see that the results are similar but not the same. The result is a significant number of cases that are moved into different classes between the two sets of predictions.

I have also tested this with binary classification models and have the same issues.

opened by eafpres 21
In Java interpreter ignore subroutines and perform code split based on the AST size
After investigating possible solutions for https://github.com/BayesWitnesses/m2cgen/issues/152, I came to a conclusion that with the existing design it's extremely hard to come up with the optimal algorithm to split code into subroutines on the interpreter side (and not in assemblers). The primary reason for that is that since we always interpret one expression at a time it's hard to predict both the depth of the current subtree and the number of expressions that are left to interpret in other branches. I've achieved some progress by splitting expressions into separate subroutines based on the size of the code generated so far (i.e. code size threshold), but more often than not I'll get some stupid subroutines like this one:

public static double subroutine2(double[] input) { return 22.640634908349323; }

That's why I took a simpler approach and attempted to optimize the interpreter that caused trouble in the first place - the R one. I slightly modified its behavior: when the binary expressions count threshold is exceeded, it no longer splits them into separate variable assignments, but moves them into their own subroutines. Although it might not be the most optimal way for simpler models (like linear ones), it helps tremendously with gradient boosting and random forest models. Since those models are a summation of independent estimators, we end up putting every N (5 by default) estimators into their own subroutine, which improves the execution time. @StrikerRUS please let me know what you think.
opened by izeigerman 14
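
The grouping idea described in the pull request above (every N estimators moved into their own subroutine, then summed) can be sketched in a few lines. This is only an illustration of the concept under simplified assumptions, not m2cgen's interpreter code; all names here are hypothetical.

```python
from typing import Callable, List, Sequence

Estimator = Callable[[Sequence[float]], float]


def group_estimators(estimators: List[Estimator], group_size: int = 5) -> List[List[Estimator]]:
    """Split the estimator list into consecutive groups of at most group_size."""
    return [estimators[i:i + group_size] for i in range(0, len(estimators), group_size)]


def make_subroutine(group: List[Estimator]) -> Estimator:
    """Wrap one group of estimators into a single callable that sums them."""
    def subroutine(input_vector: Sequence[float]) -> float:
        return sum(estimator(input_vector) for estimator in group)
    return subroutine


def score(input_vector: Sequence[float], subroutines: List[Estimator]) -> float:
    """Final prediction: the sum of all subroutine results."""
    return sum(subroutine(input_vector) for subroutine in subroutines)
```

For example, 12 estimators with the default group size end up in three subroutines holding 5, 5 and 2 estimators, so no single generated function has to contain the whole ensemble.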

added possibility to write generated code into file

Closed #110.

Real-life frustrating example:

import sys

from sklearn.datasets import load_boston

import lightgbm as lgb
import m2cgen as m2c

X, y = load_boston(True)
est = lgb.LGBMRegressor(n_estimators=1000).fit(X, y)

sys.setrecursionlimit(1<<30)
print(m2c.export_to_python(est))

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

m2c.export_to_python(est, 'test.txt') works fine in this scenario.

opened by StrikerRUS 12

Dart language support

For those building Flutter apps that would like to be able to utilize static models trained in scikit on-device, this tool would be a perfect fit. And if the Flutter dev team decides to add a hot code push feature to the framework, models from m2cgen could be updated on the fly.

opened by mattc-eostar 11
added support for PowerShell

With this PR Windows users will be able to execute ML models from "command line" without the need to install any programming language (PowerShell is already installed in Windows).

opened by StrikerRUS 11
Handle missing values replacement in LightGBM
Sometimes an exported LGBMRegressor model's predictions don't match the predictions of the original model. This happens when the model encounters values missing during training. A more detailed discussion can be found here: https://github.com/microsoft/LightGBM/issues/2921

This is by no means a complete fix for the problem, it only addresses this part of the LightGBM behavior: "for numerical features, if not missing is seen in training, the missing value will be converted to zero, and then check it with the threshold. So it is not always the left side."

The fix has also been tested on a fairly big regression model with numerical features, and it works as expected.

How to reproduce:

import numpy as np
import lightgbm as lgb
import m2cgen as m2c
from sklearn.datasets import load_diabetes

dataset = load_diabetes()
gbm = lgb.LGBMRegressor(num_leaves=51, learning_rate=0.05, n_estimators=100)
gbm.fit(dataset['data'], dataset['target'])

# np.nan is used instead of np.NaN, which newer NumPy releases no longer provide
test = np.array([-2.175, 0.797, np.nan, 1.193, 0.0, 0.0, 0.0, np.nan, np.nan, np.nan])
print(gbm.predict(np.array([test]))[0])

# Write the generated code to a module and score the same sample with it
code = m2c.export_to_python(gbm)
with open('model.py', 'w') as fp:
    fp.write(code)

import model as m
print(m.score(test))
opened by Aulust 10
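
The LightGBM behavior quoted in the pull request above can be expressed as a small decision rule. The sketch below is an assumption-laden illustration, not LightGBM or m2cgen code: it assumes a "value <= threshold goes left" comparison and a hypothetical default_left flag for the case where missing values were seen in training.

```python
import math


def goes_left(value: float, threshold: float,
              missing_seen_in_training: bool = False,
              default_left: bool = True) -> bool:
    """Decide the branch for a numerical split, handling NaN inputs."""
    if math.isnan(value):
        if missing_seen_in_training:
            # Assumed: the tree stores a default direction for missing values.
            return default_left
        # Quoted behavior: with no missing values seen in training,
        # NaN is converted to zero and then compared with the threshold.
        value = 0.0
    return value <= threshold
```

With a negative threshold such as -0.5, a NaN input becomes 0.0 and is sent to the right branch, which is why "missing always goes left" is not a safe assumption.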

Code generated from XGBoost model includes "None"

When transpiling XGBRegressor and XGBClassifier models such as the following basic example:

from xgboost import XGBRegressor
from sklearn import datasets
import m2cgen as m2c

iris_data = datasets.load_iris(return_X_y=True)

mod = XGBRegressor(booster="gblinear", max_depth=2)
X, y = iris_data
mod.fit(X[:120], y[:120])

code = m2c.export_to_c(mod)

print(code)

the resulting C code includes a Pythonesque None:

double score(double * input) {
    return (None) + (((((-0.391196) + ((input[0]) * (-0.0196191))) + ((input[1]) * (-0.11313))) + ((input[2]) * (0.137024))) + ((input[3]) * (0.645197)));
}

Probably I am missing some basic step?

opened by robinvanemden 10

added Visual Basic code generator

The motivation behind this PR is allowing users with poor programming skills access to strong ML models inside Office applications (mainly in Excel).

Also, if I'm not mistaken, VBA projects can be used in SOLIDWORKS.

After merging this PR users will be able to use ML models inside Excel in the following way.

Usage Example

As usual, generate a model via supported ML algorithm:

from sklearn.datasets import load_boston
from sklearn.svm import SVR

import m2cgen as m2c

X, y = load_boston(True)
X = X[:4, :3]  # three features, matching input_vector(0..2) in the generated code below
y = y[:4]

reg = SVR()
reg.fit(X, y)

After that output VBA code representation of the model via the m2cgen Python package:

print(m2c.export_to_vba(reg))

Function score(ByRef input_vector() As Double) As Double
    Dim var0 As Double
    var0 = (0) - (0.3333333333333333)
    score = ((((28.70000000001455) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.00632) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((18.0) - (input_vector(1)), 2))) + (Application.WorksheetFunction.Power((2.31) - (input_vector(2)), 2))))) * (-1.0))) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.02731) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((0.0) - (input_vector(1)), 2))) + (Application.WorksheetFunction.Power((7.07) - (input_vector(2)), 2))))) * (-1.0))) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.02729) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((0.0) - (input_vector(1)), 2))) + (Application.WorksheetFunction.Power((7.07) - (input_vector(2)), 2))))) * (1.0))) + ((Exp((var0) * (((Application.WorksheetFunction.Power((0.03237) - (input_vector(0)), 2)) + (Application.WorksheetFunction.Power((0.0) - (input_vector(1)), 2))) + (Application.WorksheetFunction.Power((2.18) - (input_vector(2)), 2))))) * (1.0))
End Function

Create an empty Visual Basic file example_module.bas and paste the copied output there.

Now open Excel, enable the Developer tab and click Developer -> Visual Basic (Alt + F11). In the VBA editor click File -> Import File and choose the previously created example_module.bas file.

After that, one more step is required: writing a proxy function that converts an Excel Range object to an Array and calls the model. For instance, for regression with row-based features placed on an Excel sheet, such a function can be:

Function SCOREROW(features As Range) As Double
    Dim arr() As Double
    ReDim Preserve arr(features.Columns.Count - 1)
    Dim i As Integer
    For i = 0 To UBound(arr)
        arr(i) = features(1, i + 1)
    Next i
    SCOREROW = score(arr)
End Function

Now this proxy function can be used on an Excel sheet like any built-in Excel function:

Let's compare Excel predictions with ones from the native Python model:

reg.predict(X)

array([27.7       , 28.70034543, 28.70034543, 29.7       ])

Seems that everything is fine!

opened by StrikerRUS 10

Fix #168. Enforce float32 type for split condition values for GBT models created using XGBoost

As it turns out, the issue reported in https://github.com/BayesWitnesses/m2cgen/issues/168 is not unique to the "hist" tree construction algorithm. It seems that with the "hist" method the likelihood of reproducing it is much higher because it relies on feature histograms. I was able to reproduce the same discrepancy with non-hist methods on a larger sample of test data.

The issue occurs due to a double precision error and reproduces every time when the feature value matches the split condition in one of the tree's nodes.

Example: feature value = 0.671, split condition = 0.671000004. When we hit this condition in the generated code the outcome of 0.671 < 0.671000004 is "true" (or "yes" branch). While in XGBoost the same condition leads to the "no" branch.

After some investigation I noticed that XGBoost's DMatrix forces all values to be float32 (https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L565). At the same time, in our assemblers we rely on default 64-bit floats. Forcing the split condition to be float32 seems to address the issue. At least I couldn't reproduce it so far.

opened by izeigerman 9
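
The precision effect described in the pull request above can be reproduced without XGBoost. The sketch below uses the standard struct module to round a Python float to 32-bit precision; to_float32 is a hypothetical helper, and the two numbers are the ones from the example in the description.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", value))[0]


feature_value = 0.671
split_condition = 0.671000004

# With 64-bit floats the feature value is strictly below the split condition.
as_float64 = feature_value < split_condition

# After rounding both sides to float32 they become the same number,
# so the strict comparison no longer holds.
as_float32 = to_float32(feature_value) < to_float32(split_condition)

print(as_float64, as_float32)
```

This is why comparing a float64 input against a threshold that was stored as float32 can select a different branch than the original library does.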
add option to save generated code into file

I'm sorry if I missed this functionality, but the CLI version certainly doesn't have it (I saw the related code only in generate_code_examples.py). I guess it would be very useful for eliminating the copy-paste phase, especially for large models.

Of course, piping is a solution, but not for development in Jupyter Notebook, for example.
enhancement good first issue

opened by StrikerRUS 9
add: Make function_name parametrized
Hello everyone,

First of all, thanks a ton for putting this tool/library together -- especially in resource-stranded environments, it does have a potential to literally save lives!

One small problem I was fighting with while using it was the score function it uses in the generated modules. When they are used as drop-in replacements for trained models, using score is a bit strange, as the API generally provides functions like predict or predict_proba. It would therefore be of great help to me if this name could be changed dynamically, so that I would not have to do so manually.

Please do let me know if something like this sounds like a sensible addition. I'd be happy to update the code so that it reflects your vision, so please feel free to just let me know whenever that may be the case.

Thanks!

Currently m2cgen generates a module in various languages that has a "score"/"Score" function/method. This is not always desirable, as many of the trained modules that are to be exported may provide their prediction via API functions with different names (such as predict).

This commit adds a way of specifying the name of the function both via the CLI and in the exporters (that is, in the export_to_ functions) by specifying the function_name option/parameter while keeping the default set to "score"/"Score" for backwards compatibility.

Signed-off-by: mr.Shu [email protected]
opened by mrshu 8
Bump scipy from 1.9.1 to 1.10.0
Bumps scipy from 1.9.1 to 1.10.0.

Release notes

Sourced from scipy's releases.

SciPy 1.10.0 Release Notes

SciPy 1.10.0 is the culmination of 6 months of hard work. It contains many new features, numerous bug-fixes, improved test coverage and better documentation. There have been a number of deprecations and API changes in this release, which are documented below. All users are encouraged to upgrade to this release, as there are a large number of bug-fixes and optimizations. Before upgrading, we recommend that users check that their own code does not use deprecated SciPy functionality (to do so, run your code with python -Wd and check for DeprecationWarnings). Our development attention will now shift to bug-fix releases on the 1.10.x branch, and on adding new features on the main branch.

This release requires Python 3.8+ and NumPy 1.19.5 or greater.

For running on PyPy, PyPy3 6.0+ is required.

Highlights of this release

A new dedicated datasets submodule (scipy.datasets) has been added, and is now preferred over usage of scipy.misc for dataset retrieval.

A new scipy.interpolate.make_smoothing_spline function was added. This function constructs a smoothing cubic spline from noisy data, using the generalized cross-validation (GCV) criterion to find the tradeoff between smoothness and proximity to data points.

scipy.stats has three new distributions, two new hypothesis tests, three new sample statistics, a class for greater control over calculations involving covariance matrices, and many other enhancements.

New features

scipy.datasets introduction

A new dedicated datasets submodule has been added. The submodule is meant for datasets that are relevant to other SciPy submodules and content (tutorials, examples, tests), as well as containing a curated set of datasets that are of wider interest. As of this release, all the datasets from scipy.misc have been added to scipy.datasets (and deprecated in scipy.misc).

The submodule is based on Pooch (a new optional dependency for SciPy), a Python package to simplify fetching data files. This move will, in a subsequent release, facilitate SciPy to trim down the sdist/wheel sizes, by decoupling the data files and moving them out of the SciPy repository, hosting them externally and

... (truncated)

Commits

dde5059 REL: 1.10.0 final [wheel build]

7856f28 Merge pull request #17696 from tylerjereddy/treddy_110_final_prep

205b624 DOC: add missing author

1ab9f1b DOC: update 1.10.0 relnotes

ac2f45f MAINT: integrate._qmc_quad: mark as private with preceding underscore

3e0ae1a REV: integrate.qmc_quad: delay release to SciPy 1.11.0

34cdf05 MAINT: FFT pybind11 fixups

843500a Merge pull request #17689 from mdhaber/gh17686

089924b REL: integrate.qmc_quad: remove from release notes

3e47110 REL: 1.10.0rc3 unreleased

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

dependencies
opened by dependabot[bot] 0
Bump numpy from 1.23.3 to 1.24.1
Bumps numpy from 1.23.3 to 1.24.1.

Release notes

Sourced from numpy's releases.

v1.24.1

NumPy 1.24.1 Release Notes

NumPy 1.24.1 is a maintenance release that fixes bugs and regressions discovered after the 1.24.0 release. The Python versions supported by this release are 3.8-3.11.

Contributors

A total of 12 people contributed to this release. People with a "+" by their names contributed a patch for the first time.

Andrew Nelson

Ben Greiner +

Charles Harris

Clément Robert

Matteo Raso

Matti Picus

Melissa Weber Mendonça

Miles Cranmer

Ralf Gommers

Rohit Goswami

Sayed Adel

Sebastian Berg

Pull requests merged

A total of 18 pull requests were merged for this release.

#22820: BLD: add workaround in setup.py for newer setuptools

#22830: BLD: CIRRUS_TAG redux

#22831: DOC: fix a couple typos in 1.23 notes

#22832: BUG: Fix refcounting errors found using pytest-leaks

#22834: BUG, SIMD: Fix invalid value encountered in several ufuncs

#22837: TST: ignore more np.distutils.log imports

#22839: BUG: Do not use getdata() in np.ma.masked_invalid

#22847: BUG: Ensure correct behavior for rows ending in delimiter in...

#22848: BUG, SIMD: Fix the bitmask of the boolean comparison

#22857: BLD: Help raspian arm + clang 13 about __builtin_mul_overflow

#22858: API: Ensure a full mask is returned for masked_invalid

#22866: BUG: Polynomials now copy properly (#22669)

#22867: BUG, SIMD: Fix memory overlap in ufunc comparison loops

#22868: BUG: Fortify string casts against floating point warnings

#22875: TST: Ignore nan-warnings in randomized out tests

#22883: MAINT: restore npymath implementations needed for freebsd

#22884: BUG: Fix integer overflow in in1d for mixed integer dtypes #22877

#22887: BUG: Use whole file for encoding checks with charset_normalizer.

Checksums

... (truncated)

Commits

a28f4f2 Merge pull request #22888 from charris/prepare-1.24.1-release

f8fea39 REL: Prepare for the NumPY 1.24.1 release.

6f491e0 Merge pull request #22887 from charris/backport-22872

48f5fe4 BUG: Use whole file for encoding checks with charset_normalizer [f2py] (#22...

0f3484a Merge pull request #22883 from charris/backport-22882

002c60d Merge pull request #22884 from charris/backport-22878

38ef9ce BUG: Fix integer overflow in in1d for mixed integer dtypes #22877 (#22878)

bb00c68 MAINT: restore npymath implementations needed for freebsd

64e09c3 Merge pull request #22875 from charris/backport-22869

dc7bac6 TST: Ignore nan-warnings in randomized out tests

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
Bump xgboost from 1.6.2 to 1.7.2
Bumps xgboost from 1.6.2 to 1.7.2.

Release notes

Sourced from xgboost's releases.

1.7.2 Patch Release

v1.7.2 (2022 Dec 8)

This is a patch release for bug fixes.

Work with newer thrust and libcudacxx (#8432)

Support null value in CUDA array interface namespace. (#8486)

Use getsockname instead of SO_DOMAIN on AIX. (#8437)

[pyspark] Make QDM optional based on a cuDF check (#8471)

[pyspark] sort qid for SparkRanker. (#8497)

[dask] Properly await async method client.wait_for_workers. (#8558)

[R] Fix CRAN test notes. (#8428)

[doc] Fix outdated document [skip ci]. (#8527)

[CI] Fix github action mismatched glibcxx. (#8551)

Artifacts

You can verify the downloaded packages by running this on your Unix shell:

echo "<hash> <artifact>" | shasum -a 256 --check

15be5a96e86c3c539112a2052a5be585ab9831119cd6bc3db7048f7e3d356bac xgboost_r_gpu_linux_1.7.2.tar.gz
0dd38b08f04ab15298ec21c4c43b17c667d313eada09b5a4ac0d35f8d9ba15d7 xgboost_r_gpu_win64_1.7.2.tar.gz

1.7.1 Patch Release

v1.7.1 (2022 November 3)

This is a patch release to incorporate the following hotfix:

Add back xgboost.rabit for backwards compatibility (#8411)

Release 1.7.0 stable

Note. The source distribution of Python XGBoost 1.7.0 was defective (#8415). Since PyPI does not allow us to replace existing artifacts, we released 1.7.0.post0 version to upload the new source distribution. Everything in 1.7.0.post0 is identical to 1.7.0 otherwise.

v1.7.0 (2022 Oct 20)

We are excited to announce the feature packed XGBoost 1.7 release. The release note will walk through some of the major new features first, then make a summary for other improvements and language-binding-specific changes.

PySpark

XGBoost 1.7 features initial support for PySpark integration. The new interface is adapted from the existing PySpark XGBoost interface developed by databricks with additional features like QuantileDMatrix and the rapidsai plugin (GPU pipeline) support. The new Spark XGBoost Python estimators not only benefit from PySpark ml facilities for powerful distributed computing but also enjoy the rest of the Python ecosystem. Users can define a custom objective, callbacks, and metrics in Python and use them with this interface on distributed clusters. The support is labeled as experimental with more features to come in future releases. For a brief introduction please visit the tutorial on XGBoost's document page. (#8355, #8344, #8335, #8284, #8271, #8283, #8250, #8231, #8219, #8245, #8217, #8200, #8173, #8172, #8145, #8117, #8131, #8088, #8082, #8085, #8066, #8068, #8067, #8020, #8385)

Due to its initial support status, the new interface has some limitations; categorical features and multi-output models are not yet supported.

Development of categorical data support

More progress on the experimental support for categorical features. In 1.7, XGBoost can handle missing values in categorical features and features a new parameter max_cat_threshold, which limits the number of categories that can be used in the split evaluation. The parameter is enabled when the partitioning algorithm is used and helps prevent over-fitting. Also, the sklearn interface can now accept the feature_types parameter to use data types other than dataframe for categorical features. (#8280, #7821, #8285, #8080, #7948, #7858, #7853, #8212, #7957, #7937, #7934)

... (truncated)

Changelog

Sourced from xgboost's changelog.

XGBoost Change Log

This file records the changes in xgboost library in reverse chronological order.

v1.7.0 (2022 Oct 20)

Experimental support for federated learning and new communication collective

An exciting addition to XGBoost is the experimental federated learning support. The federated learning is implemented with a gRPC federated server that aggregates allreduce calls, and federated clients that train on local data and use existing tree methods (approx, hist, gpu_hist). Currently, this only supports horizontal federated learning (samples are split across participants, and each participant has all the features and labels). Future plans include vertical federated learning (features split across participants), and stronger privacy guarantees with homomorphic encryption and differential privacy. See Demo with NVFlare integration for example usage with nvflare.

As part of the work, XGBoost 1.7 has replaced the old rabit module with the new collective module as the network communication interface with added support for runtime backend selection. In previous versions, the backend was defined at compile time and could not be changed once built. In this new release, users can choose between rabit and federated. (#8029, #8351, #8350, #8342, #8340, #8325, #8279, #8181, #8027, #7958, #7831, #7879, #8257, #8316, #8242, #8057, #8203, #8038, #7965, #7930, #7911)

The feature is available in the public PyPI binary package for testing.

Quantile DMatrix

Before 1.7, XGBoost had an internal data structure called DeviceQuantileDMatrix (and its distributed version). We have now extended its support to CPU and renamed it to QuantileDMatrix. This data structure is used for optimizing memory usage for the hist and gpu_hist tree methods. The new feature helps reduce CPU memory usage significantly, especially for dense data. The new QuantileDMatrix can be initialized from both CPU and GPU data, and regardless of where the data comes from, the constructed instance can be used by both the CPU algorithm and GPU algorithm including training and prediction (with some conversion overhead if the device of the data and the training algorithm don't match). Also, a new parameter ref is added to QuantileDMatrix, which can be used to construct validation/test datasets. Lastly, it's set as default in the scikit-learn interface when a supported tree method is specified by users. (#7889, #7923, #8136, #8215, #8284, #8268, #8220, #8346, #8327, #8130, #8116, #8103, #8094, #8086, #7898, #8060, #8019, #8045, #7901, #7912, #7922)

Mean absolute error

The mean absolute error is a new member of the collection of objectives in XGBoost. It's noteworthy since MAE has a zero Hessian value, which is unusual for XGBoost as XGBoost relies on Newton optimization. Without valid Hessian values, the convergence speed can be slow. As part of the support for MAE, we added line searches into the XGBoost training algorithm to overcome the difficulty of training without valid Hessian values. In the future, we will extend the line search to other objectives where it's appropriate for faster convergence speed. (#8343, #8107, #7812, #8380)
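To make the zero-Hessian problem concrete, here is a minimal sketch in plain Python. It is not XGBoost's implementation: the helper names are hypothetical, and the median-based update is only an illustrative stand-in for a direct search over the loss.

```python
import statistics

def mae_grad_hess(pred, label):
    """Gradient and Hessian of |pred - label| with respect to pred."""
    diff = pred - label
    grad = (diff > 0) - (diff < 0)  # sign of the residual: -1, 0 or 1
    hess = 0.0                      # second derivative is zero wherever defined
    return grad, hess

labels = [1.0, 2.0, 10.0]
pred = 0.0  # constant starting prediction (illustrative data)

grads, hessians = zip(*(mae_grad_hess(pred, y) for y in labels))
# A Newton step is -sum(grads) / sum(hessians); here the denominator is zero.
assert sum(hessians) == 0.0

# Minimise the loss directly instead: for MAE the best constant shift
# is the median of the residuals.
residuals = [y - pred for y in labels]
step = statistics.median(residuals)

def total_mae(p):
    return sum(abs(p - y) for y in labels)

print(total_mae(pred), total_mae(pred + step))
```

The gradient alone only says which direction to move, not how far, which is why some form of search over the step size is needed.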

XGBoost on Browser

With the help of the pyodide project, you can now run XGBoost on browsers. (#7954, #8369)

Experimental IPv6 Support for Dask

With the growing adoption of the new internet protocol, XGBoost joined the club. In the latest release, the Dask interface can be used on IPv6 clusters; see XGBoost's Dask tutorial for details. (#8225, #8234)

Optimizations

We have new optimizations for both the hist and gpu_hist tree methods to make XGBoost's training even more efficient.

Hist: Hist now supports optional by-column histogram build, which is automatically configured based on various conditions of input data. This helps the XGBoost CPU hist algorithm to scale better with different shapes of training datasets. (#8233, #8259) Also, the build histogram kernel can now better utilize CPU registers. (#8218)

GPU Hist: GPU hist performance is significantly improved for wide datasets. GPU hist now supports batched node build, which reduces kernel latency and increases throughput. The improvement is particularly significant when growing deep trees with the default depthwise policy. (#7919, #8073, #8051, #8118, #7867, #7964, #8026)

Breaking Changes

... (truncated)

Commits

62ed8b5 Bump release version to 1.7.2. (#8569)

a980e10 Properly await async method client.wait_for_workers (#8558) (#8567)

59c54e3 [pyspark] Make QDM optional based on cuDF check (#8471) (#8556)

60a8c8e [pyspark] sort qid for SparkRanker (#8497) (#8555)

58bc225 [backport] [CI] Fix github action mismatched glibcxx. (#8551) (#8552)

850b531 [backport] [doc] Fix outdated document [skip ci] (#8527) (#8553)

67b657d SO_DOMAIN do not support on IBM i, using getsockname instead (#8437) (#8500)

db14e3f Support null value in CUDA array interface. (#8486) (#8499)

9372370 Work with newer thrust and libcudacxx (#8432)

1136a7e Fix CRAN note on cleanup. (#8447)

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

dependencies
opened by dependabot[bot] 0
Bump flake8 from 5.0.4 to 6.0.0
Bumps flake8 from 5.0.4 to 6.0.0.

Commits

b9a7794 Release 6.0.0

b5cac87 Merge pull request #1748 from PyCQA/upgrade-pyflakes

489be4d upgrade pyflakes to 3.0.0

8c06197 Merge pull request #1746 from PyCQA/bump-pycodestyle

047e6f8 upgrade pycodestyle to 2.10

647996c Merge pull request #1744 from PyCQA/pre-commit-ci-update-config

646ad20 [pre-commit.ci] pre-commit autoupdate

b87034d Merge pull request #1741 from PyCQA/drop-py37

aa002ee require python 3.8.1+

16c371d Merge pull request #1739 from PyCQA/remove-optparse

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
Feature Request: support for multioutput regression

Nice library thanks!

Perhaps I missed something, but it looks like multi-output regression is unsupported? If so, is it on the roadmap? Happy to help if needed.

opened by ageron 0
Bump statsmodels from 0.13.2 to 0.13.5
Bumps statsmodels from 0.13.2 to 0.13.5.

Release notes

Sourced from statsmodels's releases.

Release 0.13.5

The statsmodels developers are happy to announce the Python 3.11 compatibility release for the 0.13 branch.

This release contains no bug fixes other than any needed to ensure statsmodels is compatible with Python 3.11. It also resolves an issue with PyPI that affects 0.13.4.

Release 0.13.4

The statsmodels developers are happy to announce the Python 3.11 compatibility release for the 0.13 branch. This release contains no bug fixes other than any needed to ensure statsmodels is compatible with Python 3.11. It also resolves an issue with the source code generation in 0.13.3 that affects installs on Python 3.11 that use the source tarball.

Release 0.13.3

The statsmodels developers are happy to announce the Python 3.11 compatibility release for the 0.13 branch. This release contains no bug fixes other than any needed to ensure statsmodels is compatible with Python 3.11.

Commits

6a9ce0a Merge pull request #8501 from bashtage/final-docs-0.13.5

e5c58f8 DOC: Add release notes for .4 and .5

9d6b9d7 Merge pull request #8492 from statsmodels/pins

1b59dcc Merge pull request #8493 from statsmodels/final-docs

9500047 DOC: Final 0.13.3 docs

6a9f391 MAINT: Refine pins

28b05f8 Merge pull request #8491 from statsmodels/pins

743a257 MAINT: Refine pins

11eb51e Merge pull request #8489 from statsmodels/pins

714e669 Set some pins

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0

Releases(v0.10.0)

v0.10.0(Apr 25, 2022)
Python 3.6 is no longer supported.

Added support for Python 3.9 and 3.10.

Trained models can now be transpiled into Rust and Elixir 🎉

Model support:

Added support for SGDRegressor from the lightning package.

Added support for extremely randomized trees in the LightGBM package.

Added support for OneClassSVM from the scikit-learn package.

Various improvements to handle the latest versions of the supported models.

Various CI/CD improvements including migration from coveralls to codecov, automated generation of the code examples and automated GitHub Release creation.

Minor codebase cleanup.

Significantly reduced the number of redundant parentheses and return statements in the generated code.

Latest Dart language versions are supported.

Programming languages can provide native implementation of sigmoid and softmax functions.

Improved code generation speed by adding new lines at the end of a generated code.

Source code(tar.gz)
Source code(zip)
m2cgen-0.10.0-py3-none-any.whl(90.07 KB)
m2cgen-0.10.0.tar.gz(54.48 KB)
v0.9.0(Sep 18, 2020)
Python 3.5 is no longer supported.

Trained models can now be transpiled into F# 🎉

Model support:

Added support for GLM models from the scikit-learn package.

Introduced support for a variety of objectives in LightGBM models.

The cauchy function is now supported for GLM models.

Improved conversion of floating point numbers into string literals. This leads to improved accuracy of results returned by generated code.

Improved handling of missing values in LightGBM models. Kudos to our first time contributor @Aulust 🎉

Various improvements of the code generation runtime.
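The note above about converting floating point numbers into string literals can be illustrated with a small sketch. This is not m2cgen's actual formatting code; it only shows, under that assumption, why the choice of literal format affects the accuracy of generated code.

```python
coef = 0.1 + 0.2  # stored as 0.30000000000000004 in a 64-bit float

short_literal = f"{coef:.6f}"  # fixed six decimals: '0.300000'
exact_literal = repr(coef)     # shortest string that round-trips exactly

# Parsing the literal back stands in for compiling the generated code.
print(short_literal, float(short_literal) == coef)
print(exact_literal, float(exact_literal) == coef)
```

A literal that does not round-trip changes the model's coefficients slightly, and those differences can add up across many terms.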

v0.8.0(Jun 18, 2020)
This release is the last one that supports Python 3.5. The next release will require Python >= 3.6.

Trained models can now be transpiled into Haskell and Ruby 🎉

Various improvements of the code generation runtime:

Introduced caching of the interpreter handler names.

A string buffer is now used to store generated code.

We moved away from using the string.Template.

The numpy dependency is no longer required at runtime for the generated Python code.

Improved model support:

Enabled multiclass support for XGBoost Random Forest models.

Added support for Boosted Random Forest models from the XGBoost package.

Added support for GLM models from the statsmodels package.

Introduced fallback expressions for a variety of functions which rely on simpler language constructs. This should simplify implementation of new interpreters since the number of functions that must be provided by the standard library or by a developer of the given interpreter has been reduced. Note that fallback expressions are optional and can be overridden by a manually written implementation or a corresponding function from the standard library. Among functions for which fallback AST expressions have been introduced are: abs, tanh, sqrt, exp, sigmoid and softmax.
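The idea of a fallback expression can be sketched in plain Python. The functions below are hypothetical illustrations, not m2cgen's AST expressions: each one is built only from arithmetic, comparisons and exp, so a target language would need very little from its standard library. Here exp is taken from the math module for brevity.

```python
import math

def abs_fallback(x):
    # Conditional instead of a library abs().
    return x if x >= 0 else -x

def sqrt_fallback(x):
    # Power operator instead of a library sqrt().
    return x ** 0.5

def sigmoid_fallback(x):
    # Two branches avoid overflow in exp for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def tanh_fallback(x):
    # Identity: tanh(x) = 2 * sigmoid(2x) - 1
    return 2.0 * sigmoid_fallback(2.0 * x) - 1.0

def softmax_fallback(xs):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(tanh_fallback(0.5), math.tanh(0.5))
print(softmax_fallback([1.0, 2.0, 3.0]))
```

An interpreter for a language with a native tanh or sigmoid would simply override the corresponding fallback.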

Kudos to @StrikerRUS who's responsible for all these amazing updates 💪
v0.7.0(Apr 7, 2020)
Bug fixes:

Thresholds for XGBoost trees are forced to be float32 now (https://github.com/BayesWitnesses/m2cgen/issues/168).

Fixed support for newer versions of XGBoost, in which the default value for the base_score parameter became None (https://github.com/BayesWitnesses/m2cgen/issues/182).
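The float32 threshold fix above matters because XGBoost stores split thresholds as 32-bit floats. The following sketch uses a hypothetical helper, to_float32, and illustrative values to show how the same comparison can take a different branch depending on the precision of the threshold. It does not reproduce the code from the linked issue.

```python
import struct

def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

threshold64 = 0.1                      # threshold kept at double precision
threshold32 = to_float32(threshold64)  # the same threshold stored as float32
feature = 0.1                          # incoming feature value

print(feature < threshold64)  # False
print(feature < threshold32)  # True
```

Generated code that compares against the double-precision value could therefore route a sample to a different leaf than the original model does.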

Models can now be transpiled into the Dart language. Kudos to @MattConflitti for this great addition 🎉

Support for the following models has been introduced:

Models from the statsmodels package are now supported. The list of added models includes: GLS, GLSAR, OLS, ProcessMLE, QuantReg and WLS.

Models from the lightning package: AdaGradRegressor/AdaGradClassifier, CDRegressor/CDClassifier, FistaRegressor/FistaClassifier, SAGARegressor/SAGAClassifier, SAGRegressor/SAGClassifier, SDCARegressor/SDCAClassifier, SGDClassifier, LinearSVR/LinearSVC and KernelSVC.

RANSACRegressor from the scikit-learn package.

The name of the scoring function can now be changed via a parameter. Thanks @mrshu 💪

The SubroutineExpr expression has been removed from the AST. The logic for splitting the generated code into subroutines now lives in the interpreters and has been completely removed from the assemblers.

v0.6.0(Feb 17, 2020)
Trained models can now be transpiled into R, PowerShell and PHP. Major effort delivered solely by @StrikerRUS.

The Java interpreter now uses heuristics to split the generated code into methods and no longer relies on SubroutineExpr from the AST.

Added support for LightGBM and XGBoost Random Forest models.

XGBoost linear models are now supported.

LassoLarsCV, Perceptron and PassiveAggressiveClassifier estimators from the scikit-learn package are now supported.

v0.5.0(Dec 1, 2019)
Quite a few awesome updates in this release. Many thanks to @StrikerRUS and @chris-smith-zocdoc for making this release happen.

Visual Basic and C# joined the list of supported languages. Thanks @StrikerRUS for all the hard work!

The numpy dependency is no longer required for generated Python code when no linear algebra is involved. Thanks @StrikerRUS for this update.

Fixed a bug where generated Java code exceeded the JVM method size limit when individual estimators of a GBT model contained a large number of leaves. Kudos to @chris-smith-zocdoc for discovering and fixing this issue.

v0.4.0(Sep 28, 2019)
JavaScript is now among the supported languages. Kudos to @bcampbell-prosper for this contribution.

v0.3.1(Aug 15, 2019)
Fixed code generation for XGBoost models when feature names are not specified in the model object (https://github.com/BayesWitnesses/m2cgen/pull/93). Thanks @akhvorov for contributing the fix.

v0.3.0(May 21, 2019)
Added support for the following SVM model implementations from scikit-learn: SVC, NuSVC, SVR and NuSVR.

v0.2.1(Apr 17, 2019)
Added support for the best_ntree_limit attribute of XGBoost models, which limits the number of estimators used during prediction. Thanks @arshamg for helping with that.
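As a rough illustration of what limiting the number of estimators means, here is a sketch with hypothetical names (predict, ntree_limit, and trees modeled as plain callables). It is not the code m2cgen generates; it only shows a boosted prediction that sums the first few trees.

```python
def predict(trees, features, base_score=0.0, ntree_limit=None):
    """Sum the outputs of the first ntree_limit trees (all trees if None)."""
    used = trees if ntree_limit is None else trees[:ntree_limit]
    return base_score + sum(tree(features) for tree in used)

# Three stand-in "trees" that each return a constant contribution.
trees = [lambda x: 0.5, lambda x: 0.25, lambda x: 0.125]

print(predict(trees, [1.0]))                 # 0.875
print(predict(trees, [1.0], ntree_limit=2))  # 0.75
```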

v0.2.0(Mar 22, 2019)
Golang joins the family of languages supported by m2cgen 🎉 Credit goes to @matbur for making such a significant contribution 🥇

For generated C code the custom assign_array function that was used to assign vector values has been replaced with plain memcpy.

v0.1.1(Mar 5, 2019)
Fixed handling of the "default_left" value in the LightGBM assembler.

v0.1.0(Feb 12, 2019)

First release.