Tiny, no-nonsense, self-contained, Tensorflow and ONNX inference

Sonos, Inc.

Last update: Jan 2, 2023

Related tags

Data processing rust tensorflow artificial-intelligence neural-networks rust-library onnx

Overview

Sonos' Neural Network inference engine.

This project used to be called tfdeploy, or Tensorflow-deploy-rust.

What ?

tract is a Neural Network inference toolkit. It can read Tensorflow 1, ONNX or NNEF, optimize them and run data through them.

Quick start

Tract in the landscape

ONNX

As of today (October 2020), tract passes successfully about 85% of ONNX backends tests. All "real life" integration tests in Onnx test suite are passing: bvlc_alexnet, densenet121, inception_v1, inception_v2, resnet50, shufflenet, squeezenet, vgg19, zfnet512.

The following operators are implemented and tested.

Abs, Acos, Acosh, Add, And, ArgMax, ArgMin, Asin, Asinh, Atan, Atanh, AveragePool, BatchNormalization, Cast, CategoryMapper, Ceil, Clip, Compress, Concat, Constant, ConstantLike, ConstantOfShape, Conv, ConvInteger, Cos, Cosh, DequantizeLinear, Div, Dropout, Elu, Equal, Erf, Exp, Expand, EyeLike, Flatten, Floor, GRU, Gather, Gemm, GlobalAveragePool, GlobalLpPool, GlobalMaxPool, Greater, GreaterOrEqual, HardSigmoid, Hardmax, Identity, InstanceNormalization, IsInf, IsNaN, LRN, LSTM, LeakyRelu, Less, LessOrEqual, Log, LogSoftmax, MatMul, MatMulInteger, Max, MaxPool, Mean, Min, Mod, Mul, Neg, NonZero, Not, Or, PRelu, Pad, ParametricSoftplus, Pow, QLinearConv, QLinearMatMul, QuantizeLinear, RNN, Reciprocal, ReduceL1, ReduceL2, ReduceLogSum, ReduceLogSumExp, ReduceMax, ReduceMean, ReduceMin, ReduceProd, ReduceSum, ReduceSumSquare, Relu, Reshape, Resize, Round, Rsqrt, ScaledTanh, Scan, Selu, Shape, Shrink, Sigmoid, Sign, Sin, Sinh, Size, Slice, Softmax, Softplus, Softsign, Split, Sqrt, Squeeze, Sub, Sum, Tan, Tanh, ThresholdedRelu, Tile, Transpose, Unsqueeze, Where, Xor

We test these operators against Onnx 1.4.1 (operator set 9), Onnx 1.5.0 (operator set 10), Onnx 1.6.0 (operator set 11), and Onnx 1.7.0 (operator set 12). Many networks in operator set 8 are also working.

TensorFlow

Even if tract is very far from supporting any arbitrary model, it can run Google Inception v3 and Snips wake word models. Missing operators are relatively easy to add. The lack of easy to reuse test suite, and the wide diversity of operators in Tensorflow make it difficult to target a full support.

The following operators are implemented and tested:

Abs, Add, AddN, AddV2, Assign, AvgPool, BatchToSpaceND, BiasAdd, BlockLSTM, Cast, Ceil, ConcatV2, Const, Conv2D, DepthwiseConv2dNative, Div, Enter, Equal, Exit, ExpandDims, FakeQuantWithMinMaxVars, Fill, FloorMod, FusedBatchNorm, GatherNd, GatherV2, Greater, GreaterEqual, Identity, Less, LessEqual, Log, LogicalAnd, LogicalOr, LoopCond, MatMul, Max, MaxPool, Maximum, Mean, Merge, Min, Minimum, Mul, Neg, NoOp, Pack, Pad, Placeholder, Pow, Prod, RandomUniform, RandomUniformInt, Range, RealDiv, Relu, Relu6, Reshape, Rsqrt, Shape, Sigmoid, Slice, Softmax, SpaceToBatchND, Squeeze, StridedSlice, Sub, Sum, Switch, Tanh, Tile, Transpose, VariableV2

TensorFlow-Lite

TensorFlow-Lite is a TensorFlow subproject that also focuses on inference on smaller devices. It uses a precompiler to transform a TensorFlow network to its own format. It only supports a subset of operators from TensorFlow though, and is only optimised for devices with Arm Neon support.

Tract supports a wider subset of TensorFlow operators, and has been optimised for CPU of the previous generation (ARM VFP), also targetting devices in the Raspberry Pi Zero family that TensorFlow Lite does not address.

NNEF

Long story short, TensorFlow and Onnx formats are good for designing and training networks. They need to move fast to follow the research field, tend to integrate new features and operators greedily. They also exhibit a high level of expressivity to facilitate network design.

On the other hand, only a subset of operators and network features actually reach production, so systems running production network do not have to deal with so many operators. Furthermore, some information required for training can be stripped from the network before going to production for prediction.

NNEF tries to bridge the gap between training frameworks and inference by proposing a format dedicated to production and prediction.

Tract supports NNEF:

tract_nnef can load and execute NNEF networks
tract supports most of the NNEF specification, the most notable exception being the ROI operators and deconvolution
tract introduces tract-OPL, a series of NNEF extensions to support other operators (or extend some operators semantics) in order to represent the full range of tract-core neural network support: any network understood by tract should be serializable to tract-OPL. This is a work in progress.
tract command line can translate networks from TensorFlow or ONNX to NNEF/OPL.

Example of supported networks

These models among others, are used to track tract performance evolution as part of the Continuous Integration jobs. See .travis/README.md and .travis/bundle-entrypoint.sh for more information.

Keyword spotting on Arm Cortex-M Microcontrollers

https://github.com/ARM-software/ML-KWS-for-MCU

ARM demonstrated the capabilited of the Cortex-M family by providing tutorials and pre-trained models for keyword spotting. While the exercise is ultimately meant for micro-controllers, tract can run the intermediate TensorFlow models.

For instance, on a Rasperry Pi Zero, the "CNN M" model runs in about 70 micro-seconds, and 11 micro-seconds on a Raspberry Pi 3.

Snips wake word models

https://arxiv.org/abs/1811.07684

Snips uses tract to run the wake word detectors. While earlier models were class-based and did not require any special treatment, tract pulsing capabilities made it possible to run WaveNet models efficiently enough for a Raspberry Pi Zero.

Inception v3

Device	Family	TensorFlow-lite	tract
Raspberry Pi Zero	Armv6 VFP	113s	39s
Raspberry Pi 2	Armv7 NEON	25s	7s
Raspberry Pi 3	aarch32 NEON	5s	5s

Notes:

while the Raspberry Pi 3 is an Armv8 device, this bench is running on Raspbian, an armv6 operating system, crippling the performance of both benches
there exists other benches on the internet that show better performance results for TensorFlow (not -Lite) on the Pi 3. They use all four cores of the device. Both TensorFlow-Lite and tract here have been made to run on a single-core.

Roadmap

One important guiding cross-concern: this library must cross-compile as easily as practical to small-ish devices (think 20$ boards).

nearly complete ONNX support, and wraps it as a backend
integrate other TF models to use as example, test and benches
- https://github.com/ARM-software/ML-KWS-for-MCU
- https://github.com/mozilla/DeepSpeech
consider acting as kaldi backend

License

Note: files in the tensorflow/protos directory are copied from the TensorFlow project and are not covered by the following licence statement.

Note: files in the onnx/protos directory are copied from the ONNX project and are not covered by the following licence statement.

Apache 2.0/MIT

All original work licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT) at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Comments

Integer-sizing a decluttered streaming TypedModel without Pulse (for non causal models)

Hey, I came across another problem trying the bidirectional LSTM model in a browser. It is the same LSTM that is now in CI (download link). Now normally I'd use code similar to this:

use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    let model = tract_onnx::onnx()
        .model_for_path("model.onnx")?
        .into_optimized()?
        .into_runnable()?;

    let input: Tensor = tract_ndarray::Array2::<u8>::zeros((1, 100)).into();
    model.run(tvec!(input))?;

    Ok(())
}

but I get an error:

➜  ~/Documents/Experiments/sblstmtest git:(master) ✗ cargo run
   Compiling sblstmtest v0.1.0 (/Users/bminixhofer/Documents/Experiments/sblstmtest)
    Finished dev [unoptimized + debuginfo] target(s) in 4.51s
     Running `target/debug/sblstmtest`
Error: TractError(Msg("Translating node #1 \"input\" Source ToTypedTranslator"), State { next_error: Some(TractError(Msg("Output type not determined"), State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })), backtrace: InternalBacktrace { backtrace: None } })

Running it without into_optimized, or with an input fact works. So I understand that the model can not be optimized because the shape of the input (batch size and seq len) is not known at the time of building. Is that correct? In practice I don't want to fix the input shape at build time because it has to work with different batch sizes.

Now so far it wouldn't be a problem, I'd just add an option optimize to the JS API to turn optimization on or off depending on whether dynamic shapes are needed during inference.

The problem comes when I try to store the model that I got by calling into_runnable without calling into_optimized before.

I get a model of type SimplePlan<InferenceFact, Box<dyn InferenceOp>, ModelImpl<InferenceFact, Box<dyn InferenceOp>>>. When I want to store such a model in a struct like:

use tract_onnx::prelude::*;

struct Model {
    inner: SimplePlan<InferenceFact, Box<dyn tract_hir::infer::ops::InferenceOp>, InferenceModel>,
}

I get an error which says that the module ops is private:

➜  ~/Documents/Experiments/sblstmtest git:(master) ✗ cargo run
   Compiling sblstmtest v0.1.0 (/Users/bminixhofer/Documents/Experiments/sblstmtest)
error[E0603]: module `ops` is private
  --> src/main.rs:4:64
   |
4  |     inner: SimplePlan<InferenceFact, Box<dyn tract_hir::infer::ops::InferenceOp>, InferenceModel>,
   |                                                                ^^^ private module
   |
note: the module `ops` is defined here
  --> /Users/bminixhofer/.cargo/registry/src/github.com-1ecc6299db9ec823/tract-hir-0.7.0/src/infer/mod.rs:12:1
   |
12 | mod ops;
   | ^^^^^^^^

error: aborting due to previous error

For more information about this error, try `rustc --explain E0603`.
error: could not compile `sblstmtest`.

To learn more, run the command again with --verbose.

So I can't store the result. Am I missing something? And if not, is there some way to work around this?

Thanks for all your help :)

opened by bminixhofer 36

Input fact propagation wonky for NNEF

Maybe I was a bit too quick to close #718 as it seems to still have some issues when running depending on exact flags I pass.

I'll post these in one go as I think they're related; but we'll see.

As a base; I'm using the image.nnef.tar we can now generate. I'm using the following base command line:

tract image.nnef.tar --nnef-tract-core --nnef-tract-onnx -i input:1,3,224,224,f32 --allow-random-input

This always works with dump (except when passing --profile), but fails run with the following error:

[2022-06-14T13:55:45.961766439Z WARN  tract::tensor] Using random input for input called "input": 1,3,224,224,F32
[2022-06-14T13:55:45.969444249Z ERROR tract] Evaluating #1 "ConstantOfShape_25" MultiBroadcastTo

    Caused by:
        Undetermined symbol in expression: N

Adding --set N=1 to the run fixes this. I'd have expect something like --override-fact input:1,3,224,224,f32 to also work correctly as a more aggressive -i input:1,3,224,224,f32.

If I attempt to optimize the graph it fails with the following wonky error where it fails to unify two compatible shapes?

[2022-06-14T13:42:11.116655254Z ERROR tract] Error at stage optimize

    Caused by:
        0: codegen node #4 "Conv_0" ConvUnary
        1: Trying to substitute a N,768,7,7,F32 by 1,768,7,7,F32.
           ModelPatch { context: ["wire_as_lazy_im2col"], dont_apply_twice: None, model: Graph { nodes: [Node { id: 0, name: "incoming-3/0", inputs: [], op: TypedSource { fact: 1,3,224,224,F32 }, outputs: [1,3,224,224,F32 >1/0] }, Node { id: 1, name: "Conv_0.matmatmul", inputs: [0/0>], op: LirMatMulUnary { c_fact: 1,768,49,F32, c_m_axis: 1, c_n_axis: 2, micro_ops: [(2359296,F32 0.0050811768, 0.002538681, -0.0051002502, -0.0015630722, 0.0034770966, 0.0017652512, -0.0231781, -0.0051574707, -0.013504028, 0.002796173, 0.00044894218, -0.0076141357..., [Store])], shape=[1], strides=[1], layout=CFcf (0xf), dynamic ndim=1, c_final_shape: 1,768,49, geometry: Concrete(ConcreteMatMulGeometry { m: 768, k: 3072, n: 49, b_storage: VirtualPacking { packer: Packer { r: 5, alignment: 4, end_padding_record: 0 }, func: LazyIm2colSpec { n_bytes_offsets: [0, ...], k_bytes_offsets: [0, 4, ...] }, k: 3072 } }), mmm: MMM (fma_mmm_f32_16x5 16x5), reshape_post: [] }, outputs: [1,768,49,F32 >2/0] }, Node { id: 2, name: "Conv_0", inputs: [1/0>], op: Reshape(2, [Val(49)], [Val(7), Val(7)]), outputs: [1,768,7,7,F32 ] }], inputs: [0/0>], outputs: [], outlet_labels: {}, properties: {} }, inputs: {}, incoming: {0/0>: 3/0>}, shunt_outlet_by: {}, obliterate: [] }`

Looks like it's somehow fails to propagate the input facts?

opened by tgolsson 32

Add support for GPU inference
Address #688 .

Tasks:

[x] GPUTensor

[x] Import

[x] Proper type for imported tensor

[x] Export

[x] Intermediate data in GPU memory

[x] Pass tensor strides as uniforms

[x] Have way of processing rank 4 tensors

[ ] Ops

[x] Validate tensor props before applying ops, at least in debug builds

[ ] Convolution

[ ] Activations

[x] tanh

[x] sigmoid

[ ] relu

[ ] Fully-connected

[ ] Pooling

[ ] Softmax

[ ] Runner for models

[ ] Managing GPU memory

[ ] Free buffers no longer in use to allow for models larger than GPU memory

[ ] Examples working

[ ] tensorflow-mobilenet-v2

[ ] Test various platforms

[ ] Linux

[x] Vulkan + RADV

[ ] Other GPUs

[ ] Various embedded systems like RPi

[ ] Windows

[ ] macOS and iOS

[ ] Android

[ ] WASM

[ ] WebGPU

[ ] WebGL
opened by sh7dm 27

Some tensorflow extensions for keras layers support

I'm trying to load a model into rust and I'm getting an error when I run the model.

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: 
TractError(Msg("Translating #30 \"global_average_pooling1d/Mean\" Unimplemented(Mean)"), 
State { next_error: Some(TractError(Msg("Operator can not be made a TypedOp."), 
State { next_error: None, backtrace: InternalBacktrace { backtrace: None } })), 
backtrace: InternalBacktrace { backtrace: None } })

It seems that the mean operation of global_average_pooling1d is not supported, does anyone know anymore about this?

opened by CharlieBickerton 25

Unimplemented Unimplemented(RandomNormalLike) ToTypedTransla

I'm trying to make Soft-Actor Critic model from stable_baselines3 work with onnx and get following error.

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Translating node #39 "RandomNormalLike_25" Unimplemented(RandomNormalLike) ToTypedTranslator

Caused by:
    0: translating op UnimplementedOp { outputs: 1, name: "RandomNormalLike", message: "NodeProto { input: [\"onnx::RandomNormalLike_102\"], output: [\"onnx::Mul_103\"], name: \"RandomNormalLike_25\", op_type: \"RandomNormalLike\", domain: \"\", attribute: [], doc_string: \"\" }" }

Do you plan to support RandomNormalLike op?

opened by stillonearth 23

BERT support
Not too sure what specific operators the BERT architecture will require but:

https://github.com/onnx/models/tree/master/text/machine_comprehension/bert-squad use OneHot which is not implemented but easy to add (as an onnx primitive)

as reported here, https://github.com/snipsco/tract/issues/313#issuecomment-661937254 we may encounter ConstantOfShape with dynamic inputs
opened by kali 22
Tree ensemble ONNX ML ops [WIP]

@kali Opening this draft PR so as to collect some potential early feedback.

The core generic tree ensemble engine seems to be working (what I've managed to test so far, I was actually super surprised that the tests vs lightgbm passed from the first try after I got it to compile, lol), now need to pin it to protobuf config, set up the rules, run a few basic benchmarks, wrap it in classifier/regressor types, etc. All features (including score post transforms except probit, different various comparison ops, inputnan handling etc) that can be provided in ONNX protobuf config are already supported here.

It's definitely not implemented in the most efficient way now, but I think it shouldn't be too bad (although before there's benchmarks, I have no idea how much worse it would be than the existing lightgbm/xgboost c++ engines).

opened by aldanor 22
WebAssembly support

Hi!

At the moment, there is no easy way to reliably run ONNX models in the browser. ONNX.js exists but is apparently unmaintained and lacks support for important operations like Conv1d and LSTM.

The alternative is Tensorflow.js which does not directly support ONNX so a model would have to be converted from ONNX to TF, then to a TFJS model, which does also not work at the moment (see https://github.com/onnx/onnx-tensorflow/issues/490).

So there is a bit of a gap in the ecosystem there.

That gap could be filled by compiling tract to WASM, and exposing a higher-level API (i. e. load and predict functions) to Javascript. WebGL support would of course be missing but that is out of scope.

I did some prototyping today, and got the latest release (tract-onnx = "0.6.3") to work in the browser without any changes. So I think a JS wrapper would not be too hard to make.

I'll start working on this in the next couple of days. Depending on how it goes and if there is interest on your side, this could be merged back into tract at a later point.

It would be great if you could officially support compiling to WASM, and add WASM to the CI (e. g. the current master branch does not compile to WASM because of the memory maps from https://github.com/snipsco/tract/commit/99c622ad8279e676fde4485ab9b4db0e537418e4).

Thanks, and please let me know what you think!

opened by bminixhofer 21
Support TreeEnsembleClassifier op

(I'm aware that it overlaps somewhat with #56, but it's a bit more specific, hence opening it as a separate issue)

Given that it's now officially possible to convert LightGBM (and xgboost) tree ensemble classifiers into ONNX, how realistic would it be to expect tract to support TreeEnsembleClassifier op in the foreseeable future? This would potentially be a huge feature, instantly unlocking whole universe of tree ensemble classifiers (and potentially regressors as well).

// I'd be glad to help if there was some guidance on what to do and where, if needed; not quite sure how much of work it is to implement this since I'm not very familiar with the internals of tract.

Thanks!

opened by aldanor 21
Internal multithreading
Hi!

From #326:

tract does not make any effort to run a computation using multiple cores, but is safe to use in multiple threads. So you may get better results by calling run::() on several inputs (or several copies) from different thread (using a parallel iterator may do the trick).

Are there any plans to support internal multithreading? Tract is already very fast. With internal multithreading it could possibly be faster than onnxruntime and ONNX.js*.

*That is, if we can exploit multithreading in the browser, but there is already a working wasm-bindgen example with rayon so I'm confident we would get there.

I'm not very familiar with parallelized implementations of neural nets but I think there are three major points where parallelization is possible:

Slicing the input in chunks that get computed on different cores e. g. with batch sizes > 1.

Computing different operators on different cores, each operator could start it's computation once all its inputs are computed.

Internal parallelization of an operation, e. g. different convolution filters on different cores.

Feel free to close this issue if this does not align with your Roadmap for tract.
opened by bminixhofer 20

MobileNet ops not supported

I wanted to run the pretrained frozen .pb models from mobilenetv1 and mobilenetv2 with

let tfd = ::tract_tensorflow::tensorflow().model_for_path(mobilenetv1_frozen).unwrap();
let plan = ::tract::SimplePlan::new(&tfd).unwrap();
let input = load_image(img);
let outputs = plan.run(tvec![input]).unwrap();

But for MobilenetV1 I get

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: TractError(Msg("Evaluating #13 \"MobilenetV1/MobilenetV1/Conv2d_0/Relu6\" Unimplemented(Relu6): unimplemented operation: Relu6"), State { next_error: None, backtrace: InternalBacktrace })', src/libcore/result.rs:997:5

and for MobilenetV2

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: TractError(Msg("Node named MobilenetV2/Conv/BatchNorm/FusedBatchNorm not found"), State { next_error: None, backtrace: InternalBacktrace })', src/libcore/result.rs:997:5

Any plan to support Relu6 or FusedBatchNorm? Would you be willing to point me where can I add those?

opened by ehsanmok 20

DepthWise conv Inner loop f16 support

https://github.com/Rikorose/DeepFilterNet/pull/211#issuecomment-1353637586

Digging in a bit into why I was seeing so many f32/f16 conversions despite the A55 supporting fp16 storage and arithmetic, it seems like this is just a limitation of Rust’s f16 support.

To fully take advantage of FP16, I think avoiding these conversions is necessary… though, I’m not sure what the best solution is…

Maybe just rewriting the inner loop in assembly for f16 when the CPU says it supports f16?

Overriding the operators in the half crate might work too.

opened by VariantXYZ 22
Instructions on training new cost_models

It would be great to take advantage of the cost_model setup for arbitrary ARM CPUs (like the a57) and generate better cost models for then, but I’m not really sure on the procedure.

Digging in a little, it looks like the cost_model binary gets run on the platform and then that data gets processed by the train script, which generates the file. Seems straightforward, but there’s a lot of parameters that I’m not really sure about…

opened by VariantXYZ 18
Supporting ARM64 CPUs on systems without /proc/cpuinfo

Related to #847

It is possible to override the detector with environment variables... It's not documented, I thought it was only making sens in qemu test contexts where detection is confused. See https://github.com/sonos/tract/blob/main/linalg/src/arm64.rs#L61/

(I just happened to hit this issue since /proc/cpuinfo didn't exist)

Which part of what getrandom does do you refer to ? The ability to register a generator at runtime ?

Right, the ability to provide a fallback implementation for unsupported targets.

https://docs.rs/getrandom/latest/getrandom/macro.register_custom_getrandom.html

I'm not a huge fan of it though, as it involves modifying the crate to add the fallback implementation for a platform.

The environment variable setup would work, but it would be more ideal to have this available during compile-time.

I was thinking about using features, but ARM is a bit strange such that in-order/out-of-order execution is actually implied by the CPU name, which is a bit of a pain. Things like fp16, neon, etc... can be handled via this though (with the caveat that target_feature can only be applied to unsafe functions). Maybe the CPU ID could be provided via a regular feature?

In a situation where a user is building on a platform that isn't windows/linux, I think it's not unreasonable to expect them to provide a .cargo/config.toml or something similar to define their setup better.

opened by VariantXYZ 8
Unnecessary copy of inputs

Hello,

In the run method of Simpleplan, the inputs parameter is a TVec. Since the tensors eventually get converted into an Arc, it means each input tensor get copied once.

Of course this doesn't matter most of the time because the input is small enough, but we have a use case in which the inputs would be huge and reused between differents runs.

Therefore I was wondering if it is possible to allow the user to pass a TVec<Arc> to the run method, hence avoiding the later conversion. In the case you don't want to change the API, there is also the option of adding a new run_with_arcs method that would have the same effect.

If you agree with one of the two possibilities, I can do the corresponding PR shortly as there is almost nothing to do.

wdyt?

opened by mbrunel 1