Fwumious Wabbit, fast on-line machine learning toolkit written in Rust

Outbrain

Last update: Dec 9, 2022

Overview

Fwumious Wabbit is

a very fast machine learning tool
built with Rust
inspired by and partially compatible with Vowpal Wabbit (much love! read more about compatibility here)
currently supports logistic regression and field-aware factorization machines

Fwumious Wabbit is actively used in Outbrain for offline research, as well as for some production flows. It enables "high bandwidth research" when doing feature engineering, feature selection, hyper-parameter tuning, and the like.

Data scientists can train hundreds of models over hundreds of millions of examples in a matter of hours on a single machine.

For our tested scenarios it is almost two orders of magnitude faster than the fastest TensorFlow implementation of Logistic Regression and FFMs that we could come up with. It is an order of magnitude faster than Vowpal Wabbit for some specific use-cases.

Check out our benchmark, here's a teaser: [Figure: benchmark results chart]

Why is it faster? (see here for more details)

Only implements Logistic Regression and Field-aware Factorization Machines
Uses hashing trick, lookup table for AdaGrad and a tight encoding format for the "input cache"
Features' namespaces have to be declared up-front
Prefetching of weights from memory (avoiding pipeline stalls)
Written in Rust with heavy use of code specialization (via macros and traits)
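
As an illustration of the hashing trick listed above, here is a minimal sketch in which each (namespace, feature) pair is hashed straight into a fixed-size weight table, so no feature dictionary is needed. It assumes the standard library's `DefaultHasher` and a table of `2^bits` slots; the function names are hypothetical and this is not Fwumious Wabbit's actual hashing code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a (namespace, feature) pair to a slot in a weight table of size 2^bits.
/// Hypothetical helper for illustration; assumes bits < 64.
fn feature_index(namespace: &str, feature: &str, bits: u32) -> usize {
    let mut hasher = DefaultHasher::new();
    namespace.hash(&mut hasher);
    feature.hash(&mut hasher);
    // Keep only the low `bits` bits so the index always fits the table.
    let mask = (1u64 << bits) - 1;
    (hasher.finish() & mask) as usize
}

/// Linear part of a model: sum the weights found at the hashed slots.
/// Assumes weights.len() == 2^bits.
fn linear_score(weights: &[f32], features: &[(&str, &str)], bits: u32) -> f32 {
    features
        .iter()
        .map(|(ns, f)| weights[feature_index(ns, f, bits)])
        .sum()
}
```

Distinct features can collide in the same slot; with a large enough table this usually costs little accuracy, which is the trade-off the hashing trick accepts in exchange for speed and bounded memory.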

Comments

Java interoperation

Looking forward to a moment when there will be some way to test interoperation with Java - to use at least predict part in Java using JNI/JNA or something similar, the same as we currently use VW maven component (https://github.com/VowpalWabbit/vowpal_wabbit/tree/master/java).

opened by josepowera 4
Add an alignment check before calling slice::from_raw_parts

A Vec<u8>'s buffer is only guaranteed to be 1-aligned, but it's used here in code assuming it's 4-aligned. That might work, depending on what the allocator does, but it should be checked to avoid UB.

(Alternatively it could over-allocate the buffer and do something like align_to, but I didn't want to make broad changes.)
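
A minimal sketch of the kind of check described above, assuming the bytes are meant to be read as `u32` values. The function names are hypothetical and this is not the code from the pull request; it only shows the two options mentioned (an explicit alignment check, and `align_to`).

```rust
use std::mem::{align_of, size_of};
use std::slice;

/// View a byte buffer as u32 values, or return None if the cast would be unsound.
fn bytes_as_u32(buf: &[u8]) -> Option<&[u32]> {
    let ptr = buf.as_ptr();
    let aligned = (ptr as usize) % align_of::<u32>() == 0;
    let whole_words = buf.len() % size_of::<u32>() == 0;
    if !aligned || !whole_words {
        return None;
    }
    // SAFETY: the pointer is aligned for u32, the length covers only whole
    // u32 values inside `buf`, every bit pattern is a valid u32, and the
    // returned slice borrows from `buf`.
    Some(unsafe { slice::from_raw_parts(ptr as *const u32, buf.len() / size_of::<u32>()) })
}

/// Alternative: let `align_to` pick out the aligned middle part of the buffer.
fn aligned_middle(buf: &[u8]) -> &[u32] {
    // SAFETY: any bit pattern is a valid u32.
    let (_prefix, middle, _suffix) = unsafe { buf.align_to::<u32>() };
    middle
}
```

With the explicit check, a misaligned buffer becomes a recoverable `None` instead of undefined behaviour.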

opened by scottmcm 3
Additional tests of classification performance and weight properties
parameterized training data generator to simplify running many ad hoc experiments if required

added a bash script that trains fw on generated data, stores the outputs, computes relevant classification metrics (on training data, intentionally) and alerts the user if something is off. The checks cover properties of the prediction files as well as classification quality relative to a random baseline. Current experiments indicate that the difference in balanced accuracy on training data ((sensitivity + specificity) / 2) between fw and random is more than 0.2, hence this is the current margin required for the test to pass.

added an action that runs this for each commit, so we know where things went south if that's the case

The plan is to include this as a git action so that for each new binary we have a few learning-level sanity checks conducted. Currently, we included the main properties we are most interested in with each new version; more can be added should the need arise.
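
The balanced-accuracy check described above can be sketched as follows. The 0.2 margin and the formula come from the description; the function names and the use of boolean labels are assumptions made for illustration, and the actual check lives in a bash script.

```rust
/// Balanced accuracy = (sensitivity + specificity) / 2 for binary labels.
/// Assumes both classes occur at least once in `labels`.
fn balanced_accuracy(labels: &[bool], predictions: &[bool]) -> f64 {
    let (mut tp, mut tn, mut fp, mut fneg) = (0u64, 0u64, 0u64, 0u64);
    for (&y, &p) in labels.iter().zip(predictions) {
        match (y, p) {
            (true, true) => tp += 1,
            (false, false) => tn += 1,
            (false, true) => fp += 1,
            (true, false) => fneg += 1,
        }
    }
    let sensitivity = tp as f64 / (tp + fneg) as f64;
    let specificity = tn as f64 / (tn + fp) as f64;
    (sensitivity + specificity) / 2.0
}

/// The test passes when the model beats the random baseline by more than 0.2.
fn passes_sanity_check(model_score: f64, random_score: f64) -> bool {
    model_score - random_score > 0.2
}
```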
opened by SkBlaz 2
Use `if let` to deconstruct a single pattern, instead of `match`

Hi there, I've recently started my Rust journey and was hoping you'd be open to some ornamental changes recommended by clippy.

This PR replaces match ... Some(...) blocks with the more idiomatic if let.
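
For readers unfamiliar with the pattern, here is a small before/after sketch of the rewrite that clippy's `single_match` lint suggests. The functions are made up for illustration and are not taken from the PR.

```rust
/// Before: a `match` with one interesting arm and an empty catch-all.
fn describe_with_match(value: Option<u32>) -> String {
    let mut out = String::from("none");
    match value {
        Some(v) => out = format!("got {}", v),
        None => {}
    }
    out
}

/// After: `if let` expresses the same single-pattern check more directly.
fn describe_with_if_let(value: Option<u32>) -> String {
    let mut out = String::from("none");
    if let Some(v) = value {
        out = format!("got {}", v);
    }
    out
}
```

Both versions behave identically; the change is purely stylistic.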

opened by sed-i 2
Combine
fix a small bug with not accepting "w" as a valid character for a namespace

allow for transformations of transformed namespaces

implement Combine transformer
opened by andraztori 2
Binning

Add binning basic support with a few default binners implemented.

This allows for any kind of transformations of float values before making them categorical features to be used in LR/FFM.
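
To make the idea concrete, here is a minimal sketch of one possible binner: a float is mapped to a bounded bucket id, which is then turned into a categorical feature token. The fixed-width scheme, the function names and the token format are all assumptions for illustration and do not describe the binners shipped in this PR.

```rust
/// Hypothetical fixed-width binner: maps a float to a bucket id in 0..=max_bin.
/// Assumes bin_width > 0. NaN and non-positive values fall into bucket 0.
fn bin_fixed_width(value: f32, bin_width: f32, max_bin: u32) -> u32 {
    if value.is_nan() || value <= 0.0 {
        return 0;
    }
    let bin = (value / bin_width).floor();
    if bin >= max_bin as f32 {
        max_bin // clamp very large values into the last bucket
    } else {
        bin as u32
    }
}

/// Turn the bucket id into a categorical feature token (hypothetical format).
fn categorical_feature(namespace: &str, bin: u32) -> String {
    format!("{}_bin{}", namespace, bin)
}
```

After binning, the model sees a small set of discrete tokens per namespace, which is the form LR and FFM expect for categorical features.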

opened by andraztori 2
Reduce examples in benchmark, benchmark only fw by default
Instead of running the benchmark on 10 million train and 10 million test examples, run it on only 1 million each (this should be indicative enough and will run more quickly)

Benchmarking scripts run the benchmark only on fw, not on both fw and vw
opened by bbenshalom 1
Use `if let` to deconstruct a single pattern, instead of `match`

(Reposting #61 under a different branch name.)

Hi there, I've recently started my Rust journey and was hoping you'd be open to some ornamental changes recommended by clippy.

This PR replaces match ... Some(...) blocks with the more idiomatic if let.

opened by sed-i 1
discarding of temporary pointer as it may become invalid in case the vector is reallocated

While investigating some FW crashes caused by segmentation faults (malloc tried to allocate a block and complained that the block's CRC wasn't the same as when it was freed), I ran valgrind on our reproducible offline setup and got a hint that something bad was going on in some lines in parser.rs. The only thing I could imagine happening is the "buf" pointer somehow becoming invalid, which it seems can happen if the output_buffer vec grows and is reallocated. I got rid of it, and the valgrind complaints went away, as well as the crashes. I'm still not sure about the exact scenario, though: the vector is preallocated generously on startup, so we might want to keep looking into the input, as there may be something fishy going on there. WDYT?
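
The hazard described here can be sketched as follows: a raw pointer into a `Vec`'s buffer dangles once the `Vec` reallocates, whereas an index stays valid. This is an illustrative sketch with a hypothetical record layout (a length slot followed by the payload), not the actual parser.rs code.

```rust
/// Append a record to `out`: one header slot holding the payload length,
/// followed by the payload itself. Hypothetical layout for illustration.
fn append_record(out: &mut Vec<u32>, payload: &[u32]) {
    // Remember the *index* of the header slot rather than a pointer to it.
    let header_index = out.len();
    out.push(0); // placeholder, filled in below

    // This may grow the Vec and move its buffer to a new allocation.
    out.extend_from_slice(payload);

    // Unsound variant (do not do this): taking `out.as_mut_ptr()` before the
    // extend and writing through it afterwards would write to freed memory
    // whenever the buffer was reallocated in between.

    // Indexing goes through the Vec again, so it is valid after a move.
    out[header_index] = payload.len() as u32;
}
```

Generous preallocation makes a reallocation rare but does not rule it out, so an unusually large input could still trigger the crash with the pointer-based variant.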

opened by yonatankarni 1
Verbose namespaces

A) Introduce two new parameters --linear (to be used instead of --interaction and --keep) and --ffm_field_verbose (to be used instead of --ffm_field)

This now allows for passing feature names / namespaces as full namespace names as found in vw_namespace_map.csv.

It's a first step to unlock more flexible namespace definitions in input files.

B) Implement multi-byte namespaces in vw_namespace_map.csv and when parsing vw files

opened by andraztori 1
Version
LMK what you think - discussed this with @flaunderg - currently we publish artifacts internally by explicitly triggering a build from Jenkins. the produced artifact is put in artifactory at fw-/fw-<branch_name>- - and that's when we set a git tag "fw-<branch_name>-version" in the repo.

we suggest that - only when creating a new tag for "main" branch, two additional things will happen:

the version.rs file will be overwritten, with the auto-incremented version (current is 0.1, so next is 0.2 etc.)

the benchmark will be run and BENCHMARK.md and benchmark_results.png will be committed

so - when someone builds (or if we choose to publish artifacts from "main") - we'll be able to tell the binary version, and not just have to rely on commit # (which we can also add to the version info as with vw, btw). this way there will be less potential for conflict when we merge branches where we already published artifacts to try them out - the version will always be taken from main. if main was promoted and you pull - you'll get the updated version.
opened by yonatankarni 1
Field interactions

Implement the ability to specify per-pair field interaction parameters. The feature is turned on with --ffm_interaction_matrix; field weights can then be expressed with --ffm_interaction field_id_1:field_id_2:weight

A weight of 0 means the interaction is fully masked out. The default weight is 1.0 everywhere. Field ids are the sequential ids of fields, in the order passed by the field declaration parameters.

WARNING: since fields are not named, this means that any change to order of field declarations requires careful adjustment of interactions too...
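
A minimal sketch of how `field_id_1:field_id_2:weight` specs could be turned into an interaction matrix with a default weight of 1.0. The function names are hypothetical, and two points are assumptions not stated in the PR: field ids are treated as 0-based, and the matrix is treated as symmetric.

```rust
/// Parse one "field_id_1:field_id_2:weight" spec. Hypothetical helper.
fn parse_interaction(spec: &str) -> Result<(usize, usize, f32), String> {
    let parts: Vec<&str> = spec.split(':').collect();
    if parts.len() != 3 {
        return Err(format!("expected field_id_1:field_id_2:weight, got {:?}", spec));
    }
    let a = parts[0].parse::<usize>().map_err(|e| e.to_string())?;
    let b = parts[1].parse::<usize>().map_err(|e| e.to_string())?;
    let w = parts[2].parse::<f32>().map_err(|e| e.to_string())?;
    Ok((a, b, w))
}

/// Build a num_fields x num_fields matrix: 1.0 everywhere unless overridden.
fn build_interaction_matrix(num_fields: usize, specs: &[&str]) -> Result<Vec<Vec<f32>>, String> {
    let mut matrix = vec![vec![1.0f32; num_fields]; num_fields];
    for spec in specs {
        let (a, b, w) = parse_interaction(spec)?;
        if a >= num_fields || b >= num_fields {
            return Err(format!("field id out of range in {:?}", spec));
        }
        // Assumption: the interaction between a and b is symmetric.
        matrix[a][b] = w;
        matrix[b][a] = w;
    }
    Ok(matrix)
}
```

Because the ids are positional, reordering the field declarations silently changes which pair each spec refers to, which is the risk the warning above points out.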

ping @adischw

opened by andraztori 0
