Open deep learning compiler stack for cpu, gpu and specialized accelerators

The Apache Software Foundation

Last update: Jan 4, 2023

Related tags

Machine learning javascript machine-learning performance deep-learning metal compiler gpu vulkan opencl tensor spirv rocm tvm

Overview

Open Deep Learning Compiler Stack

Documentation | Contributors | Community | Release Notes

Apache TVM is a compiler stack for deep learning systems. It is designed to close the gap between productivity-focused deep learning frameworks and performance- and efficiency-focused hardware backends. TVM works with deep learning frameworks to provide end-to-end compilation to different backends.

License

Contribute to TVM

TVM adopts the Apache committer model. We aim to create an open source project that is maintained and owned by the community. Check out the Contributor Guide.

Acknowledgement

We learned a lot from the following projects when building TVM.

Halide: Part of TVM's TIR and arithmetic simplification module originates from Halide. We also learned and adapted parts of the lowering pipeline from Halide.
Loopy: use of integer set analysis and its loop transformation primitives.
Theano: the design inspiration of symbolic scan operator for recurrence.

Comments

[RFC][Quantization] Support quantized models from TensorflowLite
Let me first reference @ajtulloch's comment about the quantization workflow:

1. Implement a model in a standard ML framework, generally using fp16/bfloat16/fp32 compute precision as this has highest throughput on most commonly-used training hardware.

2. (optionally) insert fake quantization (here, called simulated quantization) nodes at quantization boundaries (i.e. if your backend implements a fused Int8Conv + Int8Relu, you'd insert them after a Conv + Relu block), to simulate the quantization numerics at training time.

3. Train the model as usual

4. Implement a graph rewriting pass (i.e. TF's toco, C2's int8_converter, MXNet's quantization, etc) that rewrites the graph to target the int8 operators directly, i.e. remapping subgraphs of e.g. FP32Conv + FP32Relu to be a fused Int8ConvRelu operator. This requires computing output quantization parameters at requantization boundaries, which can be done either by

   * calibration to an example set of activations, via e.g. l-p norm or kl minimization (c2/tf/mxnet/tensorrt)

   * using activation ranges learned during training (c2/tf).

5. Using this quantized graph, evaluate various metrics to verify the quantization-induced error/loss is acceptable.

6. Deploy the quantized graph.
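
To make the simulated quantization and calibration steps above concrete, here is a minimal sketch in plain Python. It assumes an unsigned 8-bit affine scheme (real value = scale * (q - zero_point)) and simple min/max calibration; the function names are hypothetical and are not taken from TVM, TensorFlow, or any framework named above.

```python
def calibrate_min_max(samples, num_bits=8):
    """Derive (scale, zero_point) from example activations (min/max calibration)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The range must contain 0.0 so that zero is exactly representable.
    lo = min(min(samples), 0.0)
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Map a real value to an integer code, saturating at the range limits."""
    q = int(round(x / scale)) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to the real value it represents."""
    return (q - zero_point) * scale


def fake_quantize(x, scale, zero_point):
    """Simulated quantization: round-trip through the integer grid, stay in float."""
    return dequantize(quantize(x, scale, zero_point), scale, zero_point)


activations = [-1.0, 0.0, 0.5, 3.0]
scale, zero_point = calibrate_min_max(activations)
simulated = [fake_quantize(a, scale, zero_point) for a in activations]
```

Inserting `fake_quantize` at quantization boundaries during training exposes the model to the rounding error it will see after deployment, which is the purpose of step 2.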

However, frameworks such as TensorFlow already handle steps 1 through 5 well. For example, TensorFlow's quantization-aware training performs step 2 and achieves good accuracy in the end.

In industry, a common scenario is that a company splits algorithm work and engine/framework work between two teams. The algorithm team simply sends a model to the engine team, which boosts its performance. If the algorithm team can use TensorFlow's quantization-aware training, they will know the accuracy before delivering the model to the engine team, and the engine team is responsible only for boosting performance.

For this reason, I will make several PRs to support importing existing quantized models (TFLite INT8 models) into TVM. This is not a replacement for https://github.com/dmlc/tvm/pull/2116; it is a supplement to TVM's quantization.

After initial investigation and effort, on the MobileNet V1 model, INT8 gives a speedup of about 30% compared with FP32 on ARM CPU.

[x] Support TFLite FP32 Relay frontend. PR: https://github.com/dmlc/tvm/pull/2365

[ ] Support TFLite INT8 Relay frontend

[ ] Extend the attribute of the convolution and related ops to support quantization

[ ] Auto-TVM on ARM CPU can work with INT8

Any feedback is welcome.
status: RFC
opened by FrozenGene 119
[RFC] NNVMv2 IR - Relay
[RFC]: Relay a new high level IR for TVM

Relay is a new high level intermediate representation (IR) intended to act as v2.0 of NNVM.

Motivation

Computation graphs are a powerful program representation as demonstrated by the first generation of DL frameworks. Most popular frameworks have employed computation graphs as their input, intermediate representation, and execution data structure.

However, as workloads continue to evolve, the design of our high level IRs needs to evolve to better support the needs of developers and users.

Graph-level challenges such as control flow and sub-graphs have become necessary features to natively support and optimize.

The tight coupling between runtime representation and compile-time representation has limited flexibility and frustrated developers; Relay will decouple the representations.

Finally we believe the high level must be designed in tandem with the low level IR, allowing for the two layers to communicate during compilation to achieve optimal performance.

Design

The first version of NNVM set out to solve some of these challenges, and we view Relay as second generation IR designed specifically for integration into the TVM stack as the input layer. Our goal is to focus on TVM as our primary backend, easing development and maintenance for both TVM developers and current NNVM users, as well as enabling new features.

In order to address the challenges presented above we designed Relay to build on the things computation graphs are good at (pure, dataflow, compositional), and improve on the things they struggle with (control flow, subgraph, runtime/compilation distinction).

Core IR

Relay is a typed pure functional IR, with a few basic features such as functions, if-then-else control flow, recursion, operator and function calls, and variable binding.

We have iterated on Relay's design over the past 8 months. This version represents the culmination of our experiments. This PR does not contain all the pieces of the previous version; instead, we focus on introducing the core IR, its associated data structures, and a few integral passes.

The core IR is defined in just a few files:

include/tvm/relay/base.h (the base classes and common data)

include/tvm/relay/type.h (the type system and all relevant nodes)

include/tvm/relay/expr.h (the expression language)

Typing

All Relay programs are typed, similar to more conventional languages such as C++. A type system allows us to statically (i.e. at compile time) distinguish between different sorts of values. This means we know whether an expression will evaluate to a tensor, a function (e.g. (float32, float32) -> float32), or a tuple (e.g. (float32, int32)). Furthermore, our type system has the ability to be shape generic (i.e. polymorphism, templating).

Type inference and checking take the place of shape inference in traditional computation graphs style IRs.

This PR implements type inference and checking for Relay. The code can be found in src/tvm/relay/pass/type_infer.cc, with relevant helper utilities in src/tvm/relay/pass.
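
As a rough illustration of how type checking can subsume shape inference, the sketch below defines a toy tensor type and a type rule for a dense layer in plain Python. The names `TensorType` and `dense_type` are hypothetical stand-ins chosen for this example; this is not the code in type_infer.cc.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TensorType:
    """A toy tensor type: a static shape plus an element dtype."""
    shape: Tuple[int, ...]
    dtype: str


def dense_type(data: TensorType, weight: TensorType) -> TensorType:
    """Type rule for a dense op: (batch, in) x (units, in) -> (batch, units)."""
    if data.dtype != weight.dtype:
        raise TypeError(f"dtype mismatch: {data.dtype} vs {weight.dtype}")
    if len(data.shape) != 2 or len(weight.shape) != 2:
        raise TypeError("dense expects two 2-D tensors")
    if data.shape[1] != weight.shape[1]:
        raise TypeError(f"inner dimensions differ: {data.shape[1]} vs {weight.shape[1]}")
    # The output shape falls out of the type rule, so no separate shape pass is needed.
    return TensorType((data.shape[0], weight.shape[0]), data.dtype)


out = dense_type(TensorType((4, 8), "float32"), TensorType((16, 8), "float32"))
```

A mismatched program is rejected before anything runs, which is the compile-time guarantee the paragraph above describes.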

Control Flow

Relay adds a notion of control flow to the IR, in the form of simple if (cond) { true_branch } else { false_branch }. Relay requires that the condition variable computes a single boolean value controlling which branch is taken. if is an expression in Relay, meaning the result of the entire expression is the result of the branch taken.

We introduce this to add a formal way to distinguish between data flow and control flow without having to conflate the two in the representation. Because we separate the control signal, we can easily batch a program without affecting control flow.

The definition of control flow can be found in include/tvm/relay/expr.h.
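
The idea that if is an expression whose value is the value of the branch taken can be sketched with a toy AST in plain Python. The node classes and the `evaluate` helper are hypothetical and exist only to illustrate the semantics; they do not mirror the definitions in expr.h.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Const:
    value: Any


@dataclass
class Var:
    name: str


@dataclass
class If:
    cond: Any
    true_branch: Any
    false_branch: Any


def evaluate(expr: Any, env: Dict[str, Any]) -> Any:
    """Evaluate a toy expression; an If yields the value of the branch taken."""
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Var):
        return env[expr.name]
    if isinstance(expr, If):
        cond = evaluate(expr.cond, env)
        if not isinstance(cond, bool):
            raise TypeError("condition must evaluate to a single boolean")
        branch = expr.true_branch if cond else expr.false_branch
        return evaluate(branch, env)
    raise TypeError(f"unknown expression node: {expr!r}")


program = If(Var("use_left"), Const("left"), Const("right"))
result = evaluate(program, {"use_left": True})
```

Because the control signal is a separate boolean rather than a data edge, only the selected branch is evaluated.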

Abstraction

Relay supports the definition of functions which can be used to represent "sub-graphs" (i.e. chunks of reusable computation).

Relay functions are like traditional functions: they represent some set of parameters (i.e. placeholders) and a body which is a chunk of computation involving the parameters (i.e. a sub-graph). We can build a full network/model by composing together functions.

Compilation

The Relay IR is designed as a compile time representation of models. The new features are exposed only in Relay's abstract syntax tree, and used for compile time program manipulation. We do not intend to use Relay's IR as a data structure for serious interpretation or execution.

Runtime

These new features increase the expressivity of the current computation model, and one may ask how to execute programs using these features with the existing runtime. Our goal is to introduce Relay as the compiler representation in this PR, and reuse the existing runtime maintaining compatibility on both the frontend and backend. We anticipate a new version of the runtime having native support for Relay's new constructs in the future.

TVM Co-design

We made an effort to model Relay's implementation after TVM and reuse much of the existing infrastructure in order to provide better compatibility between TOPI operators and Relay programs. One big design decision is reusing the TVM node system to expose the Relay language to Python in the style of TVM. Users who are familiar with TVM's expression language should feel comfortable working with the Relay AST's definition in C++ and Python. We also share representations for many data structures. For example, tensor containers (i.e. tvm::runtime::NDArray) and generic attributes are shared between Relay and TVM.

Transitioning from NNVM

We plan on adding a guide for transitioning programs from NNVM to Relay. This is one of the remaining work items before releasing the Relay Alpha. The goal is for users to be able to use the Relay operators and builder API to construct Relay programs, and we will follow up with a compatibility layer to make transitioning from NNVM smooth.

For an implementation see #1672 which implements this bit.
status: RFC
opened by jroesch 75
[TOPI][IMAGE][RESIZE] Bilinear interpolation for resize and upsampling.
* upsampling - migrated to cpp
* bilinear resize implementation & test cases.
* upsampling testcases enhanced for NHWC

Discuss: The above implementation merges both nn (upsampling) and bilinear at the topi.cpp level as scale. How about merging upsampling & bilinear as scale at the Python tvm.topi and nnvm front end too?

We can have just "scale" at all places in this case.
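
For readers unfamiliar with the technique, bilinear resizing can be sketched in plain Python on a single-channel image stored as a list of rows. This sketch assumes an align-corners coordinate mapping (output corners land exactly on input corners); the TOPI implementation in this PR may use a different convention, and `bilinear_resize` is a hypothetical name.

```python
def bilinear_resize(image, out_h, out_w):
    """Resize a 2-D list of floats with bilinear interpolation (align-corners mapping)."""
    in_h, in_w = len(image), len(image[0])
    scale_y = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    scale_x = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for oy in range(out_h):
        y = oy * scale_y
        y0 = int(y)                   # row above the sample point
        y1 = min(y0 + 1, in_h - 1)    # row below, clamped at the border
        wy = y - y0                   # vertical interpolation weight
        row = []
        for ox in range(out_w):
            x = ox * scale_x
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


resized = bilinear_resize([[0.0, 2.0], [4.0, 6.0]], 3, 3)
# resized == [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
```

Nearest-neighbor upsampling differs only in how the source pixel is chosen, which is why a single "scale" entry point with a method option is plausible.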
status: accepted
opened by srkreddy1238 63

[CI][Docker] xgboost 1.0.1 causes segfault on test_autotvm_xgboost_model.py

With the release of XGBoost 1.0.x (i.e. xgboost-1.0.1-py3-none-manylinux1_x86_64.whl), it seems that installing TVM from scratch (rebuilding Docker containers) causes tests/python/unittest/test_autotvm_xgboost_model.py to fail with a segfault.

Investigating it a bit further, if I manually revert it to xgboost-0.90 it works fine. Using xgboost-1.0.1, this is the message I see:

tests/python/unittest/test_autotvm_xgboost_model.py::test_fit Fatal Python error: Segmentation fault

Thread 0x00007f4f98de4700 (most recent call first):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379 in _recv
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407 in _recv_bytes
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 250 in recv
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 463 in _handle_results
  File "/usr/lib/python3.6/threading.py", line 864 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f4f905e3700 (most recent call first):
  File "/usr/lib/python3.6/threading.py", line 295 in wait
  File "/usr/lib/python3.6/queue.py", line 164 in get
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 415 in _handle_tasks
  File "/usr/lib/python3.6/threading.py", line 864 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f4f8fde2700 (most recent call first):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 406 in _handle_workers
  File "/usr/lib/python3.6/threading.py", line 864 in run
  File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007f4fb514c700 (most recent call first):
  File "/usr/local/lib/python3.6/dist-packages/xgboost/core.py", line 1248 in update
  File "/usr/local/lib/python3.6/dist-packages/xgboost/training.py", line 74 in _train_internal
  File "/usr/local/lib/python3.6/dist-packages/xgboost/training.py", line 209 in train
  File "/workspace/python/tvm/autotvm/tuner/xgboost_cost_model.py", line 272 in fit_log
  File "/workspace/tests/python/unittest/test_autotvm_xgboost_model.py", line 35 in test_fit
  File "/usr/local/lib/python3.6/dist-packages/_pytest/python.py", line 167 in pytest_pyfunc_call
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/python.py", line 1445 in runtest
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 134 in pytest_runtest_call
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 210 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 237 in from_call
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 210 in call_runtest_hook
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 185 in call_and_report
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 99 in runtestprotocol
  File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 84 in pytest_runtest_protocol
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 271 in pytest_runtestloop
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 247 in _main
  File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 197 in wrap_session
  File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 240 in pytest_cmdline_main
  File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
  File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py", line 93 in main
  File "/usr/local/lib/python3.6/dist-packages/pytest/__main__.py", line 7 in <module>
  File "/usr/lib/python3.6/runpy.py", line 85 in _run_code
  File "/usr/lib/python3.6/runpy.py", line 193 in _run_module_as_main
./tests/scripts/task_python_unittest.sh: line 27: 24582 Segmentation fault      (core dumped) TVM_FFI=ctypes python3 -m pytest -v tests/python/unittest

@tqchen, I didn't see any PR or discussion about it, but are you aware of any ongoing initiative to move TVM to XGBoost 1.0.x, or shall we pin xgboost to 0.90 to prevent the error from happening? (note: I'm happy to send a patch to pin the version)

opened by leandron 61

Rebuild ci-arm, ci-cpu, and ci-gpu container
This is a tracking issue for the process of updating the TVM ci- containers to reflect the following PRs:

#8169

Steps:

[x] Pick the latest revision of main and use that specific git hash for following steps. 61a6ea185caa180afce548807c65261326589b19

[x] merge #8088

[x] merge #8245

[x] merge #8268

[x] merge #8291

[x] merge #8304

[x] merge #8306

[x] merge #8310

[x] merge #8312

[x] merge #8315

[x] merge #8316

[x] merge #8319

[x] Build docker containers, tagging them as e.g. <username>/ci-<container>:v0.<ver>

[x] ci-arm

[x] ci-cpu

[x] ci-gpu

[x] ci-qemu

[x] Push docker containers to Docker Hub

[x] Create a draft PR modifying Jenkinsfile to point all containers at <username>/ci-<container>:v0.<ver>

[x] https://github.com/apache/tvm/pull/8193

[x] Force-push the PR to ci-docker-staging branch

[x] Jenkins will notice the push and start a build here

[x] Debug the build and repeat these steps until the build passes

[x] Push the valid containers to tlcpack/ci-<container>:v0.<ver>

[x] Update the PR to point Jenkinsfile to the new containers.

[x] Merge the PR.
opened by areusch 54
[Relay] Add a PyTorch to Relay Parser

Thanks for contributing to TVM! Please refer to guideline https://docs.tvm.ai/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @ them in the pull request thread.

Note: No need to review yet, just here for visibility

Originally submitted PR in a fork of TVM but putting one here as well. May close other one depending on outcome of discussion here https://discuss.tvm.ai/t/discuss-adding-a-pytorch-frontend/5026/4.

Support PyTorch natively in TVM by providing a relay parser. Like other frontends, grab the relay module and parameters to build via: mod, params = relay.frontend.from_pytorch(trace, input_shapes)

Tested against torchvision models in the included test_forward.py. Some discussion here: https://discuss.tvm.ai/t/discuss-adding-a-pytorch-frontend/5026/4
status: accepted

opened by alexwong 53
[RELAY][OP] Relay Operator Sprint
Now that the Relay RFC is being merged and we are stabilizing the type inference interface, we should sprint to add new operators to Relay to bring it to parity with NNVM.

#1798 shows an example of how to do so for the conv2d operator.

General Steps of Porting

Implement the TypeRelation function, when necessary

Shapes are represented by IndexExpr (symbolic integer)

When possible, support symbolic shape inference

You can, however, get the integer out of a symbolic shape if you must; that will require the inference to work on concrete shapes.

Use reporter->Assign to set the inferred result

Use reporter->AssertEQ to assert symbolic integer equivalence

It will return false if there is an unsatisfied constraint

Use tvm::Attrs to replace dmlc::Parameter

We switch to directly creating Python wrappers by calling into positional functions so that the operator signature is explicit in Python

General Principles

Numpy consistency, always consistent with numpy

All binary operators broadcast

This means we will use add, subtract instead of broadcast_add, broadcast_sub ...

elemwise_add version will not be supported for now as we can just use the broadcast version

Consistent with nnvm when possible

Fields in Attrs

Use concrete types when possible(int, string, bool)

If you need None, you can use IndexExpr, which gives you that
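
The "all binary operators broadcast" principle above follows NumPy's shape rule, which can be sketched in a few lines of plain Python. `broadcast_shape` is a hypothetical helper for illustration, not a Relay type relation; it works on concrete integer shapes only.

```python
def broadcast_shape(lhs, rhs):
    """Return the NumPy-style broadcast shape of two shapes, or raise ValueError."""
    result = []
    # Align the shapes at their trailing dimensions and walk toward the front.
    for i in range(1, max(len(lhs), len(rhs)) + 1):
        a = lhs[-i] if i <= len(lhs) else 1  # missing leading dims act like 1
        b = rhs[-i] if i <= len(rhs) else 1
        if a == b or b == 1:
            result.append(a)
        elif a == 1:
            result.append(b)
        else:
            raise ValueError(f"shapes {lhs} and {rhs} are not broadcastable")
    return tuple(reversed(result))


shape = broadcast_shape((4, 1, 3), (5, 3))
# shape == (4, 5, 3)
```

With this rule built into add, subtract and the other binary operators, separate broadcast_* and elemwise_* variants become unnecessary.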

List of Operators to be covered

Generally, we need to cover everything we have so far https://docs.tvm.ai/nnvm_top.html Please use this issue to coordinate what you will be working on. As we expect things to move quickly, try to do "fine grained locking" and only claim things that you are working on right now and aim to get things in a few days.

The List

Level 1: Common Basic Ops

Enough to get MLP

[x] nn.dense

[x] nn.relu

[x] tanh

[x] sigmoid

[x] exp

[x] log

[x] sqrt

[x] add

[x] subtract

[x] multiply

[x] divide

[x] mod

[x] nn.batch_flatten

[x] concatenate

[x] nn.softmax

[x] nn.log_softmax

[x] nn.batch_norm

[x] nn.dropout

[x] expand_dims

Level 2: Convolutions

Enough to get convnet

[x] nn.conv2d

[x] nn.conv2d_transpose

[x] nn.max_pool2d

[x] nn.avg_pool2d

[x] nn.global_max_pool2d

[x] nn.global_avg_pool2d

[x] nn.pad

[x] nn.lrn

Level 3: Additional Math And Transform Operators

[x] reshape

[x] copy

[x] negative

[x] floor

[x] ceil

[x] round

[x] trunc

[x] clip

[x] abs

[x] leaky_relu

[x] split

[x] squeeze

[x] take

[x] full

[x] zeros

[x] ones

[x] transpose

[x] zeros_like

[x] ones_like

Level 4: All broadcast and reduction functions that are not in previous level

[x] pow

[x] less

[x] greater

[x] less_than

[x] greater_than

[x] right_shift

[x] left_shift

[x] maximum

[x] minimum

[x] sum

[x] max

[x] prod

[x] argmax, argmin

[x] strided_slice

[x] broadcast_to

[x] where

Level 5: Vision Operators

[x] image.resize

[x] vision.multibox_prior

[x] vision.nms

Level 10: Backend Operators

Operators necessary as an intermediate stage of optimizations, or for gradients; this list can be in flux
status: help wanted
opened by tqchen 52
[TOPI][AutoTVM] NHWC conv2d templates for ARM
Per https://github.com/dmlc/tvm/pull/3754 and https://github.com/dmlc/tvm/pull/3141#issuecomment-526434398 , we are enabling NHWC conv2d templates for ARM as a nearly final solution. The benefits include:

Enable the NHWC schedule directly. Previously, we needed to transpose between NCHW and NHWC.

AutoTVM can now tune NHWC directly. Previously, we needed to build an NCHW network to tune.

Potential performance advantage of NHWC, which is known to the community.

Joint work with @FrozenGene and @etaf.

This is a draft to loop in people who may be interested, @anijain2305 @tmoreau89. I will loop in more people when the PR is ready, thank you. :)
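
To show what the NCHW/NHWC transpose mentioned above actually moves, here is a small plain-Python sketch of the two memory layouts over a flat buffer. The helper names are hypothetical; real schedules operate on tensors, not Python lists.

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) in an NCHW buffer."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, h, w, c, H, W, C):
    """Flat offset of element (n, h, w, c) in an NHWC buffer."""
    return ((n * H + h) * W + w) * C + c


def nchw_to_nhwc(flat, N, C, H, W):
    """Copy a flat NCHW buffer into NHWC order (the transpose a native NHWC schedule avoids)."""
    out = [None] * len(flat)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src = nchw_offset(n, c, h, w, C, H, W)
                    dst = nhwc_offset(n, h, w, c, H, W, C)
                    out[dst] = flat[src]
    return out


# One image, two channels, one row, three columns.
nhwc = nchw_to_nhwc([0, 1, 2, 10, 11, 12], N=1, C=2, H=1, W=3)
# nhwc == [0, 10, 1, 11, 2, 12]
```

In NHWC the channel values of one pixel are adjacent in memory, which is one commonly cited reason the layout can vectorize well on CPUs.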
opened by zhenhuaw-me 50
[VTA][Chisel,de10nano] Chisel fixes and de10nano support
This PR provides fixes to the VTA Chisel implementation, as well as support and enhancements for the tsim and de10nano targets.

With fixes in, the deploy classification tutorial now runs correctly on the de10nano for Resnet18 and Resnet34 workloads, matching the results obtained when running with cpu, fsim, and tsim targets.

A summary of the PR contributions is reported below, more details can be found in the individual commits.

Bug fixes:

Corrupted DRAM stores and loads when crossing page boundaries.

Mismatched LoadUop state and output FSM logic.
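
The page-boundary problem named in the first bug fix is commonly avoided by splitting a transfer into chunks that each stay within one page. The plain-Python sketch below illustrates that general idea only; the 4096-byte page size and the `split_at_page_boundaries` helper are assumptions and do not describe the Chisel change in this PR.

```python
PAGE_SIZE = 4096  # assumed page size in bytes, for illustration only


def split_at_page_boundaries(addr, length, page_size=PAGE_SIZE):
    """Split a transfer of `length` bytes at `addr` into chunks that never cross a page."""
    chunks = []
    while length > 0:
        room = page_size - (addr % page_size)  # bytes left in the current page
        size = min(length, room)
        chunks.append((addr, size))
        addr += size
        length -= size
    return chunks


chunks = split_at_page_boundaries(4090, 10)
# chunks == [(4090, 6), (4096, 4)]
```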

Enhancements:

Added de10nano host FPGA programming.

Enabled de10nano user defined target frequency, tested at 100MHz.

Improved FSIM/TSIM/FPGA xref debug.
opened by pasqoc 48

TVM for ROCm 2.x is currently not working

Environment: Ubuntu 18.04 + ROCm 2.2 + TVM (built from current master with ROCM = ON)

I made sure the TVM library successfully detects and links with ROCm, and the tuning procedure runs successfully. However, executing tvm.build(s, arg_bufs, 'rocm', name='matmul') fails with the following error:

WARNING:autotvm:Too many errors happen in the tuning. Now is in debug mode
Finish loading 500 records
DEBUG:autotvm:Finish loading 500 records
Cannot find config for target=rocm, workload=('tvm_matmul_tune_op', 4, 256, 256). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=rocm, workload=('tvm_matmul_tune_op', 4, 256, 256). A fallback configuration is used, which may bring great performance regression.

Best config:
,None,None
[14:47:54] /host/docker/matmul_tvm/tvm/src/pass/vectorize_loop.cc:362: Detect vector condition in Vectorized Loop, scalarizing...
[14:47:54] /host/docker/matmul_tvm/tvm/src/pass/vectorize_loop.cc:362: Detect vector condition in Vectorized Loop, scalarizing...
Traceback (most recent call last):
  File "matmul_autotvm.py", line 260, in <module>
    search_matmul_config(4, 256, 256, 500) # m, k, n, num_trials
  File "matmul_autotvm.py", line 165, in search_matmul_config
    func = tvm.build(s, arg_bufs, 'rocm', name='matmul')
  File "/host/docker/matmul_tvm/tvm/python/tvm/build_module.py", line 617, in build
    fhost, mdev = _build_for_device(flist, tar, target_host)
  File "/host/docker/matmul_tvm/tvm/python/tvm/build_module.py", line 484, in _build_for_device
    mdev = codegen.build_module(fdevice, str(target)) if fdevice else None
  File "/host/docker/matmul_tvm/tvm/python/tvm/codegen.py", line 36, in build_module
    return _Build(lowered_func, target)
  File "/host/docker/matmul_tvm/tvm/python/tvm/_ffi/_ctypes/function.py", line 206, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (2) /host/docker/matmul_tvm/tvm/build_rocm/libtvm.so(TVMFuncCall+0x61) [0x7f9598de3f01]
  [bt] (1) /host/docker/matmul_tvm/tvm/build_rocm/libtvm.so(+0x14b2e9) [0x7f95986992e9]
  [bt] (0) /host/docker/matmul_tvm/tvm/build_rocm/libtvm.so(+0x231aaa) [0x7f959877faaa]
  File "/host/docker/matmul_tvm/tvm/src/codegen/codegen.cc", line 46
TVMError: Check failed: bf != nullptr: Target rocm is not enabled

opened by ghostplant 48

[QNN][TFLite] TFLite rounding mode support

Add tflite rounding support with corresponding test cases. The tflite rounding mode golden results are generated with a testbench using MultiplyByQuantizedMultiplier function here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/common.h#L148
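
For context, the fixed-point rounding behind a quantized multiplier can be sketched in plain Python. The sketch below is modeled on the gemmlowp-style helpers that the referenced TFLite function is generally understood to build on (a saturating rounding doubling high multiply followed by a rounding right shift); it is an approximation for illustration, not the TFLite source, and it does not emulate int32 overflow of the left-shifted input.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def saturating_rounding_doubling_high_mul(a, b):
    """High 32 bits of 2*a*b with rounding, saturating the single overflow case."""
    if a == INT32_MIN and b == INT32_MIN:
        return INT32_MAX
    ab = a * b
    nudge = (1 << 30) if ab >= 0 else 1 - (1 << 30)
    total = ab + nudge
    # Emulate C integer division by 2**31, which truncates toward zero.
    quotient = abs(total) >> 31
    return quotient if total >= 0 else -quotient


def rounding_divide_by_pot(x, exponent):
    """Divide by 2**exponent, rounding to nearest with ties away from zero."""
    mask = (1 << exponent) - 1
    remainder = x & mask
    threshold = (mask >> 1) + (1 if x < 0 else 0)
    return (x >> exponent) + (1 if remainder > threshold else 0)


def multiply_by_quantized_multiplier(x, quantized_multiplier, shift):
    """Compute approximately x * (quantized_multiplier / 2**31) * 2**shift in integers."""
    left_shift = shift if shift > 0 else 0
    right_shift = -shift if shift < 0 else 0
    high = saturating_rounding_doubling_high_mul(x * (1 << left_shift), quantized_multiplier)
    return rounding_divide_by_pot(high, right_shift)


# A multiplier of 0.5 is encoded as 2**30 (that is, 0.5 * 2**31) with shift 0.
half_of_ten = multiply_by_quantized_multiplier(10, 1 << 30, 0)
```

Bit-exact agreement with TFLite depends on matching both rounding steps, which is why golden results from a reference testbench are useful.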

@FrozenGene @anijain2305

Might help fix the problem described here: https://discuss.tvm.ai/t/supporting-bit-exact-tflite-qnn-inference/5528
status: inactive

opened by fwd4 47
[BugFix][Runtime] Fix Incorrect node information

node["attrs"] and node["shape"] may read incorrect node information due to incorrect indexing, especially nodes with multiple outputs in the graph.
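
One way such a mismatch can arise: per-output metadata such as shapes is stored once per output entry, not once per node, so a node id is only a valid index while every earlier node has exactly one output. The plain-Python sketch below uses a simplified, hypothetical graph dictionary (the field names are assumptions loosely modeled on a graph JSON with a node_row_ptr table) to show the difference.

```python
# Hypothetical, simplified graph: "split" has two outputs, the other nodes have one.
graph = {
    "nodes": ["x", "split", "flatten"],
    # node_row_ptr[i] is the index of node i's first output entry.
    "node_row_ptr": [0, 1, 3, 4],
    # One shape per output entry: x, split#0, split#1, flatten.
    "shapes": [[4, 8], [4, 4], [4, 4], [16]],
}


def entry_id(graph, node_id, output_index=0):
    """Index of a specific output of a node in the per-entry tables."""
    return graph["node_row_ptr"][node_id] + output_index


def output_shape(graph, node_id, output_index=0):
    """Shape of the given output, looked up by entry id rather than node id."""
    return graph["shapes"][entry_id(graph, node_id, output_index)]


flatten_id = graph["nodes"].index("flatten")
wrong = graph["shapes"][flatten_id]      # [4, 4]: actually split's second output
right = output_shape(graph, flatten_id)  # [16]
```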

opened by zhaojinxi 1
[Bug] Which tvm version currently supports rocm?
Expected behavior

Execute tvm code on rocm platform

Actual behavior

What actually happened

Environment

tvm 0.10 ubuntu 20.04

Steps to reproduce

Preferably a minimal script to cause the issue to occur.

Triage

The documentation does not specify which version of rocm is supported

needs-triage

type: bug needs-triage
opened by wangzy0327 1
[microTVM][Zephyr]Add project files for mlperftiny submission
This PR makes these changes:

add source/header files for generating a zephyr project which is compatible with EEMBC runner for MLPerfTiny

adjust microtvm_api_server.py and CMakeLists.template to support mlperftiny project type

adds EEMBC api files from https://github.com/mlcommons/tiny in thirdparty/tiny.

This pull request was co-authored by @alanmacd, @mkatanbaf, @guberti and @areusch as part of our effort to submit to MLPerfTiny. You can find our submission results here: https://mlcommons.org/en/inference-tiny-10/
opened by mehrdadh 1
[docs] Remove empty code blocks

This fixes some of the issues highlighted in #13668. The parser that checks that the hidden import is placed in the right spot was incorrect; this PR includes some fixes to get it working for the cases in that issue.

Example: https://pr-docs.tlcpack.ai/PR-13689/1/docs/how_to/deploy_models/deploy_prequantized.html

opened by driazati 1

Releases(v0.10.0)

v0.10.0(Oct 17, 2022)
Introduction

The TVM community has worked since the v0.9 release to deliver the following exciting new improvements!

Metaschedule

Software pipelining and padding for irregular shapes for auto tensorization

Stabilized and polished user-interfaces (e.g. database changes, tune_relay)

A new MLP-based cost model

TIR

New schedule primitive for PadEinsum

A new TIR node: DeclBuffer

INT8 Intrinsics for TensorCores for CUDA!

microTVM

Improved schedule primitives for ARM v8-m ISA

And many other general improvements to code quality, TVMScript, and more! Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.9.0...v0.10.0rc0.

RFCs

These RFCs have been merged in apache/tvm-rfcs since the last release.

Issue Triage Workflow RFC (#93) (3345cc1)

[RFC] Add Commit Message Guideline (#88) (e8a2d8b)

Add Target Features RFC (#78) (1ab898d)

[RFC] TVMScript Metaprogramming (#79) (ffbf686)

Add Target Pre-processing RFC (#71) (78423c5)

[RFC] Name mangling in IRModules (#84) (831d702)

Asynchronous stage in software pipeline (#80) (aecb219)

[RFC] Buffer Layout Padding (#77) (ca695fe)

[RFC] Create LLVM scope class for use with LLVM libraries (#83) (22d1d11)

What's Changed

Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.9.0...v0.10.0rc0.

Note that this list is not comprehensive of all PRs and discussions since v0.9. A non-truncated summary can be found here: https://github.com/apache/tvm/issues/12979

TIR

#12720 - [TIR] Implement API for padded layout transformations

#12797 - [TIR] Construct the inverse in SuggestIndexMap

#12827 - [TIR] Support pattern matching argmax/argmin generated by TOPI

#12750 - [TIR, Schedule] Add schedule primitive PadEinsum

#11639 - [TIR][Meta-Schedule] Tuple-reduction scheduling support

#12515 - [TIR][Arith] Add more strict checking in imm construction and folding.

#12717 - [TIR, Schedule] Check consumer in-bound and covered in reverse_compute_inline

#12652 - [TIR] Handle axis_separators during FlattenBuffer

#12623 - [TIR] Expose MMA-related PTX builtins

#12607 - [TIR][Schedule] enhance compute_at and reverse_compute_at primitive to choose possible position ...

Source code(tar.gz)
Source code(zip)
apache-tvm-src-v0.10.0.rc0.tar.gz(19.39 MB)
apache-tvm-src-v0.10.0.rc0.tar.gz.asc(833 bytes)
apache-tvm-src-v0.10.0.rc0.tar.gz.sha512(164 bytes)
v0.9.0(Jul 14, 2022)
Introduction

The TVM community has worked since the v0.8 release to deliver many exciting features and improvements. v0.9.0 is the first release on the new quarterly release schedule and includes many highlights, such as:

MetaSchedule's full implementation

ARM cascading scheduler for Arm Ethos(TM)-U NPUs

Collage which brings tuning to BYOC

Several microTVM improvements

New tvm.relay.build parameters - runtime=, executor=

AOT - Support for the C++ runtime (with llvm and c targets only) and support for host-driven AOT in the C runtime

Hexagon RPC support

Testing via Hexagon SDK simulator and on device via Snapdragon-based HDK boards and phones

AOT and USMP support

Threading

Initial op support

MLF - Support for multiple modules in a single MLF artifact

Several TIR schedule primitives and transforms including (abridged):

schedule.transform_layout - Applies a layout transformation to a buffer as specified by an IndexMap.

schedule.transform_block_layout - Applies a schedule transformation to a block as specified by an IndexMap.

schedule.set_axis_separators - Sets axis separators in a buffer to lower to multi-dimensional memory (e.g. texture memory).

transform.InjectSoftwarePipeline - Transforms annotated loop nest into a pipeline prologue, body and epilogue where producers and consumers are overlapped.

transform.CommonSubexprElimTIR - Implements common-subexpression elimination for TIR.

transform.InjectPTXAsyncCopy - Rewrites global to shared memory copies in CUDA with async copy when annotated tir::attr::async_scope.

transform.LowerCrossThreadReduction - Enables support for reductions across threads on GPUs.
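
The prologue/body/epilogue structure mentioned for transform.InjectSoftwarePipeline above can be illustrated with a two-stage pipeline in plain Python. This is a conceptual sketch only: `run_pipelined` and its trace are hypothetical, Python runs the steps sequentially, and the real pass rewrites annotated TIR loops.

```python
def run_pipelined(inputs, load, compute):
    """Two-stage software pipeline: the load of item i+1 is issued before compute of item i."""
    results, trace = [], []
    if not inputs:
        return results, trace

    # Prologue: fill the pipeline with the first load.
    staged = load(inputs[0])
    trace.append(("load", 0))

    # Body: each iteration issues the next load, then computes on the previous one.
    # The two steps are independent, so hardware can overlap them.
    for i in range(1, len(inputs)):
        next_staged = load(inputs[i])
        trace.append(("load", i))
        results.append(compute(staged))
        trace.append(("compute", i - 1))
        staged = next_staged

    # Epilogue: drain the pipeline with the last compute.
    results.append(compute(staged))
    trace.append(("compute", len(inputs) - 1))
    return results, trace


results, trace = run_pipelined([1, 2, 3], load=lambda x: x * 10, compute=lambda x: x + 1)
```

The trace shows producers running one step ahead of consumers, which is the overlap the transform is meant to expose.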

And many more! See the list of RFCs and PRs included in v0.9.0 for a complete list, as well as the full change list.
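
As a rough illustration of the prologue, body and epilogue structure described for transform.InjectSoftwarePipeline above, the following plain-Python sketch restructures a two-stage loop by hand. It does not call TVM, and the names load, compute, run_sequential and run_pipelined are hypothetical helpers invented for this example. Python runs the stages one after another, so the sketch shows only the loop structure, not actual overlap on hardware.

```python
# Plain-Python illustration only; this does not use TVM.

def load(src, i):
    """Producer stage: fetch one element (stands in for a memory copy)."""
    return src[i]


def compute(x):
    """Consumer stage: work on a previously fetched element."""
    return x * 2 + 1


def run_sequential(src):
    """Original loop: each iteration loads, then computes."""
    out = []
    for i in range(len(src)):
        buf = load(src, i)
        out.append(compute(buf))
    return out


def run_pipelined(src):
    """Same work, restructured into prologue, body and epilogue."""
    n = len(src)
    out = []
    if n == 0:
        return out
    # Prologue: issue the first load before any compute happens.
    buf = load(src, 0)
    # Body: load element i in the same iteration that consumes element i - 1.
    for i in range(1, n):
        next_buf = load(src, i)
        out.append(compute(buf))
        buf = next_buf
    # Epilogue: drain the last element that is still in flight.
    out.append(compute(buf))
    return out


data = [3, 1, 4, 1, 5]
sequential_result = run_sequential(data)
pipelined_result = run_pipelined(data)
```

Both functions produce the same output; only the order in which producer and consumer work is issued changes, which is what lets a real backend overlap the two stages.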

RFCs

These RFCs have been merged in apache/tvm-rfcs since the last release.

[RFC] TUNIP: TVMScript Unified Printer (#74) (48d47c5)

[RFC][Backend] RFC-CSI-NN2-Integration (#75) (cfcf114)

[RFC] Introducing DeclBuffer (#70) (87ff1fa)

[RFC][MLF] Model Library Format with Multiple Modules (#76) (f47c6ad)

[RFC] UMA Universal Modular Accelerator Interface (#60) (6990e13)

[RFC] DietCode: An Auto-Scheduler for Dynamic Tensor Programs (#72) (a518000)

[RFC] Quarterly Releases (#67) (70293c7)

RFC-BYOC-DNNL-Integration (#73) (7aed0ca)

[RFC] Relay Next Roadmap (#69) (ac15f2a)

RFC: clarifying buffer declaration and access (#63) (de4fe97)

Inclusive Language RFC (#68) (4203bd2)

[USMP] Adding U4 usecase (#65) (b9e246f)

Collage RFC (#62) (23250f5)

Replace codeowners with more relevant automation (#58) (540c1f8)

[RFC][TIR] Layout transformations on buffer access (#39) (b675ef8)

Module Based Model Runtime for AOT (#46) (d9dd6eb)

@slow test RFC (#55) (9b6203a)

[RFC][Roadmap] TVM Continuous Integration & Testing Roadmap (#54) (41e5ba0)

Bring PackedFunc into TVM Object System (#51) (2e0de6c)

[RFC][OpenCLML] OpenCLML integration as BYOC (#52) (f5ef65f)

Introduce the Arm(R) Ethos(TM)-U Cascading Scheduler (#37) (f9fa824)

[RFC][Roadmap] microTVM roadmap (#53) (1b14456)

Add Managed Jenkins Infrastructure for TVM RFC (#49) (a3a7d2c)

TVM Roadmap RFC (#50) (263335f)

[RFC] Integrate LIBXSMM with TVM. (#47) (1a3d4f1)

[RELAY][AST] Add virtual device as a first class field to Relay expressions (#45) (67c39d2)

What's Changed

Note that this list is not comprehensive of all PRs and discussions since v0.8. Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.8.0...v0.9.0.rc0.

AOT

#11208 - Calculate used memory at the callsite of primitive functions

#11365 - Fix function number datatype from char to uint16_t

#11091 - Enable A-Normal Form in the AOT executor

#10753 - Support LLVM backend with C++ runtime

#10518 - Use python temporary directory for AOT tests

#10337 - BugFix of workspace calculation

#10282 - [runtime] Add Metadata classes for AOTExecutor

#9501 - [3/3][DeviceAPI] Wire up cpacked Device API context

#9500 - [2/3][DeviceAPI] Add Hooks for Activate/Deactivate/Open/Close

#9395 - [1/3][DeviceAPI] Connecting devices structure to relevant operators

BYOC

#11474 - Two helper passes for external codegen using RelayToTIR custom pass machinery

#11144 - Remove support for run-time linked-params from codegen

#10590 - Add order to functions in C Codegen

#11638 - [DNNL][CBLAS]Unifies all MKLDNN/DNNL to DNNL

#11619 - RelayToTIR custom codegen passes can still depend on dynamic shape functions

DNNL - #11902, #11642, #11513, #11571, #11560, #11345, #11111, #10837, #10421, #9995, #9797

TensorRT - #11923, #11203, #10759, #10772, #10388

CMSIS-NN - #11732, #11625, #10939, #11013, #10817, #10563, #10224, #10148, #10100, #9338, #9531, #9409, #9331

OpenCLML - #10243

CUTLASS - #11631, #10185, #10177, #10110, #10036, #9899, #9820, #9800, #9795, #9746, #9737, #9698, #9595, #9571

CUDNN - #10997, #9986, #9948

ACL - #10801

PTX - #10855, #10339, #9909

CUBLAS - #10826, #10820

CI

#11313 - Refactor of tvm.testing.requires_* annotations

#11666 - Enable pylint for tests/python/ci

#11657 - Apply linting rules to AOT tests

#11380 - Restructure Jenkinsfile

Automation - #11813, #11775, #11480, #11437, #10833, #10056, #9973, #9934

User experience improvements - #11470, #11329, #11553, #11497, #11051, #10933, #10960, #10525, #10425, #10322, #10121, #9971, #9554, #9752, #9556

Reduce CI runtime - #11402, #11349, #11258, #11132, #10946, #10743, #10359

Code cleanups - #10968, #10740

Frontends

PaddlePaddle - #11537, #9724, #9564

TFLite - #10915, #10566

Oneflow - #11321, #11036, #8790

PyTorch - #11190, #10504, #10184, #10091

ONNX - #10949, #9438, #9186, #9493, #9475

Keras - #7006

Hexagon

#11549 - Initial clip operator for Hexagon

#11834 - Add op resize2d for hexagon

#11559 - Softmax slice op initial version

#11529 - Slice ops added - add, subtract, multiply

#11720 - [testing] add max_pool2d benchmark

#11417 - Implement avg_pool2d slice op

#11653 - Add HexagonThreadManager

#11547 - Run single RPC server on Android in each testing session

#11490 - [testing] add TVMScript elemwise-add

#11400 - [testing] refactor benchmark-table code

#11277 - moves conftest.py to tvm.contrib.hexagon so outside repos can access the testing fixtures

#11319 - Add unit tests for Hexagon Device API

#11279 - Add USMP tests

#11283 - Update Readme

#11239 - capture gtest output and return over FFI

#11175 - Add schedule and test for conv2d_transpose_nchw

#11018 - [Runtime] Add QuRT thread pool backend

#11145 - Add support for on-device unit testing using gtest

#11138 - Add test for depthwise conv2d schedule

#11016 - Add test for registered schedules

#11104 - Add mobilenet test

#11090 - Delete offload runtime, move files to right places

#11065 - AoT with LLVM Codegen on Hexagon

#11025 - Deprecate USE_HEXAGON_DEVICE, introduce USE_HEXAGON

#10604 - HVX scheduling and bench-marking of TE element-wise add

#10905 - [LLVM] Enable/test tensorized Hexagon DMA on 2d transformed layout

#10907 - Move aot/graph_executor interactions into launcher

#10919 - Register basic strategies and schedules for common operators

#10904 - Add unit tests executing 2-d VTCM usage

#10910 - Refactor to keep HexagonBuffer private to the device api

#10908 - [LLVM][CodeGen] Make CodeGenHexagon a subclass of CodeGenCPU

#10878 - Generalized HexagonBuffer::CopyTo/CopyFrom

#10846 - Support both 1-d and 2-d VTCM allocations

#10581 - Improved ergonomics of HexagonLauncher in unit tests.

#10616 - Refactor tvm.contrib.hexagon, NFC

#10612 - Deprecate SDK 3.x, rewrite HexagonSDK.cmake

#10586 - Codegen for 2d Load/Store

#10558 - Generalize builtin for Nd memory alloc with storage scope and add lowering for VTCM / Hexagon

#10543 - [Runtime][PipelineExecutor] Add the pipeline internal forwarding logic.

#10507 - Add doc on TVM - Hexagon RPC flow

#10520 - Resolve breakage in test_hexagon/test_cache_read_write

#10311 - [runtime]AOTExecutor implementation for C Codegen

#10454 - Allow execution on target or simulator from HexagonLauncher

#10365 - Lower cache_read and cache_write to Hexagon DMA via tensorize

#10361 - RPC server/client for simulator

#10302 - [CI]Add Hexagon Tests to pipeline

#10263 - [Docker]Add docker file and scripts

#10227 - Refactor Hexagon.cmake

#10217 - Adding support for Hexagon User DMA Engine

#10068 - Update hexagon API build instruction and cleanup hexagon_proxy_rpc

#9970 - Do not auto-build apps when building TVM

#9736 - Add unit tests for HexagonBuffer

#9525 - Add Hexagon VTCM and discontiguous allocation support

#9631 - Add RPC Mechanism for Hexagon

#9473 - cleanup Hexagon conv2d tests

MetaSchedule

#11884 - Postproc: Rewrite-Layout

#11848 - [OpStrategy] Support MetaSchedule Layout

#11845 - [Relay][Pass] Meta-Schedule-Layout-Rewrite

#11758 - [Runtime] Enhance Runner RandomFill

#11683 - Distributed Measurement

#11751 - [Minor] Organize Testing Scripts

#11735 - Modify Profiler Timers

#11727 - Developer Ergonomics Enhancement II

#11692 - Apply-History-Best Task Filtering

#11486 - Add Profiler Support For Tuning Efficiency Optimization

#11680 - JSONDatabase Utilities

#11641 - Generate MetaSchedule Dataset

#11622 - Developer Ergonomics Enhancement

#11604 - Resolve dependencies between header files

#11587 - Add Testing Script with ONNX Support

#11590 - Evo Independence from TaskScheduler

#11534 - No explicit unrolling for spatial PrimFunc

#11512 - Enable Task Filtering

#11177 - AutoBind rule and MutateThreadBinding

#11157 - Logging Interface Unification

#11088 - Auto tensorization for CPU / GPU dot product

#10986 - [Refactor] Introduce TuneConfig

#11020 - [Metaschedule, Refactor] Move MultiLevelTilingNode decl to a header

#10927 - [Refactor] Clarify Integration Logic

#10876 - Add utility API to ease using manual schedules

#10885 - [BugFix] Fix skipped tests

#10366 - Add Gradient Based Task Scheduler

#10823 - Fine-Grained Rewrite Unbound Block

#10793 - Add demonstration of selectively tuning relay ops with TIR schedules

#10811 - Support grouping in the cost model

#10810 - Extract task weights during task extraction

#10782 - [TIR]Estimate TIR FLOPs

#10776 - Misc updates for tuning end-to-end workloads

#10689 - Upstream the leftover changes

#10648 - [Meta Schedule] Refactor meta schedule testing utils

#10578 - New relay backend for meta schedule task extraction

#10534 - Bug Fix for Relay Integration

#10501 - Update scripts for subgraph tuning

#10497 - Refactor testing workloads

#10461 - Enable AutoTVM-style template-based search space

#10368 - Fix Cyclic Dependency in PyClass Family

#10403 - Arithmetic analysis

#10367 - Update Tuning Interfaces.

#10079 - [M4a] User-API: Tune-TE/TIR/Relay

#10081 - [M4a] Rewrite-Cooperative-Fetch

#10055 - [M4b] Testcases for TensorRT builder/runner

#10092 - [M4a] Mutator: Mutate-Tile-Size

#10096 - [M4a] Mutator: Mutate Parallel

#10071 - [M4a] PostProcessor: Rewrite-Parallel-Vectorize-Unroll

#10043 - [M4a] Schedule Rule: Multi-Level-Tiling

#10045 - Mutator: Mutate-Unroll

#10033 - [M4a] Schedule Rule: Parallelize-Vectorize-Unroll

#10027 - [M4a] PostProcessor: Rewrite-Unbound-Block

#10028 - Mutator: Mutate-Compute-Location

#9997 - [M4a] PostProcessor: Disallow-Dynamic-Loop

#9994 - [M4a] Schedule Rule: Cross-Thread-Reduction

#10013 - [M4a] PostProcessor: Rewrite Reduction Block

#9975 - [M4a] Schedule Rule: Add-RFactor

#9945 - [M4a] PostProcessor: Verify-GPU-Code

#9940 - [M4a] Schedule Rule: Random-Compute-Location

#9943 - [M4a] Schedule Rule: Auto-Inline

#9860 - [M3c] Add Per-Store-Feature

#9859 - [M3c] XGB-based Cost Model

#9836 - [M4a] Add EvolutionarySearch Search Strategy

#9799 - [M4a] Add ReplayFunc Search Strategy

#9789 - [M3c] Update TuneContext, TaskScheduler & Search Strategy Design

#9780 - [M3c] Add More Measure Callbacks

#9761 - [M4a] Add ScheduleRule class & PostOrderApply space generator

#9760 - [M3c] Random Feature Extractor

MicroTVM

#11741 - Refactor RVM scripts and fix DNS network issue

#11472 - [ARM]Add tests for arm schedules

#11634 - Update pyproject to python3.7

Zephyr support - #11650

RPC - #11227, #10967

Relay

#11825 - [relay][pass]add split infer shape with convert op layout pass

#11674 - Finish implementations of WithFields

#11481 - IndexedGraph improvements in preparation for Collage

#11432 - Plumb external codegen target via Target.current()

#11494 - [Pass] Add MaxPool, AvgPool to FoldExplicitPadding

#11183 - Add unidirectional sequence lstm

#11442 - Add 'static_library' runtime::Module

#11413 - [Topi]Support for FP16 ERF on CPU.

#11382 - Finish support for list-of-targets

#11386 - [Tests] Replace the Relay interpreter with the VM in the op tests

#11224 - Support i16, f16 scalars in Relay text

#11337 - Fix eltwise alter op layout for broadcast axis

#11199 - Flexible shape dispatch transformation

#11173 - Support 'external codegen targets'.

#10996 - Add FlattenAtrousConv transformation

#10871 - [CUDNN] Add cuDNN as a Relay partitioning target (BYOC)

#10787 - [Pass][Bugfix] Disable re-use of non-flat buffers in StorageRewrite.

#10378 - [FQ2I] Add leaky relu to FQ2I

#10400 - RelayViz graphviz renderer

#10352 - [VIRTUALDEVICE] Change syntax for device planning and store parameter virtual devices in virtual_device_ field

#10310 - [ARM_CPU] Conv2d int8 intrinsic for cortex-A72

#10085 - RelayViz interface and terminal ast-dump

#10239 - Add a conversion of individual operations in FQ2I pass.

#10236 - [Refactor] Clean up type relations that are declared as template for no reason

#10156 - Fix broadcast InferCorrectLayout

#10026 - [VM] Relay VM memory liveness/lifetime analysis

#10089 - [Pass] Add a relay pass to extract fake quantized ops

#9690 - Change function constructors to WithFields

#10069 - [DefuseOps pass] bug fix: To support function body types other…

#9954 - Add conv2d_backward_weight op (without topi)

#9838 - [FoldScaleAxis] Support dense and bias_add op in fold scale axis

#9816 - Add sliding_window operator

#9874 - Add a JSON converter for 0.7 -> 0.8 and 0.8 -> 0.9

#9735 - [AMP][Pass][Typing] Add faster type inference

#9723 - [Frontend] Add Span filling for frontends to Relay

#9749 - Fix invalid shape function for "copy" operator

#9759 - s/SEScope/VirtualDevice/g

#9734 - Support large constants saved/loaded outside of VM executable

#9613 - Re-run PlanDevices after LowerTE to flow new memory scope constraints.

#9693 - PlanDevices supports 'free' on_device annotations

#9641 - [AST] Add virtual_device as a first class field in Relay

#9483 - Switch the VM to use the LowerTE pass instead of TECompiler::{Lower,LowerShapeFunc}.

#9569 - WithFields method for Call, Function, Var, TupleGetItem, If, Let, RefCreate, RefRead, RefWrite, Match, and Clause

#9533 - WithFields for Tuples

#9550 - Prepare for switching VM to LowerTEPass.

#9542 - Prepare DeadCodeElimination for running post LowerTEPass/ManifestAlloc.

#9352 - [TVMC]Introduce executor and runtime parameters

#9457 - Add the Arm(R) Ethos(TM)-U NPU identity operator

#9326 - Switch PlanDevices pass to be w.r.t. SEScopes instead of DLDeviceTypes.

QNN - #11228, #10718, #10086, #10053, #9637, #9982

Runtime

#11334 - [PipelineExecutor] Add graph manually splitting logic into the unit test.

#11133 - [PipelineExecutor] Refactor PipelineExecutor.py and Add cross compile support for pipeline executor.

#11172 - Move WrapTimeEvaluator from RPC to profiling, NFC

#10990 - [PipelineExecutor]Add forwarding queue logic for set input.

#10953 - [Vulkan] Add RGP support to TVM for vulkan device

#10723 - [PipelineExecutor] Getting the asynchronous output

#10283 - AOTExecutor implementation and c target code-generator

#9802 - [ThreadPool]Refactor affinity function and support CPU affinity list setting.

#10234 - [Pipeline Executor] multiple threads management and the data forwarding notification mechanism.

#10326 - Improved log information with function signature

#10032 - [PackedFunc] Bring PackedFunc into TVM Object System

#10082 - [PipelineExecutor] Pipeline Executor Sequential execution

#10010 - [PipelineExecutor] Add Pipeline Executor Interface

#9846 - [Pipeline executor] Global parameters group name and runtime modules parameters map.

#9889 - [GraphExecutor] Add API get_input_info to graph_executor

#9751 - [Pipeline Executor] Add the map logic of global input and subgraph input.

TE

#11589 - Support schedulable TIR compute definitions in TOPI

#11341 - Optimized version of concatenation layer

#10561 - [TECompiler] Decouple TE compute and schedule lowering in ScheduleBuilder

TIR

#11592 - HoistExpression, generalization of HoistIfThenElse

#11870 - [Pass] Remove-Weight-Layout-Rewrite-Block

#11740 - [TIR, analysis] Add GetAutoTensorizeMappingInfo to generate transforms for auto tensorization

#11585 - Add preserve-unit-iters

#11677 - Register CUDA WMMA tensor intrinsics

#11658 - [TIR, CUDA] Add pass to replace global to shared memory copy with cp.async

#11624 - [Schedule] Allow named block and buffer arguments in Schedule

#11628 - [PASS] Refactor a couple of TIR passes - BindTarget, AnnotateEntryFunc, Filter, LowerInitBlock

#11574 - CSE pass : Restrict the equivalence to be decided by a normal form - avoids comparison of terms

#11575 - Schedule Primitive: Add-Unit-Loop

#11515 - Add schedule primitive ReIndex

#11524 - [Arith] Additional Simplifications Inside Conditionals

#11485 - Add schedule primitive TransformBlockLayout

#11495 - [Software pipeline] Fix hardcoded index in access_ptr rewriting, add a GPU test with depth 4

#11269 - [Schedule] Transform layout quality of life

#11355 - Support tensorization using ldmatrix + MMA

#11289 - [Schedule] Allowed typing.Tuple in tir.schedule._type_checker

#11317 - Support affine expressions as indices in reverse compute inline

#11235 - [Arith] Implemented padded inverses in IndexMap

#11238 - [ROOFLINE] Calculate roofline from existing TIR PrimFunc

#11225 - Add schedule primitive SetAxisSeparator

#11110 - Get read/write access precisely for opaque access.

#11106 - Enhance software pipeline validation and fix predicate of epilogue

#10843 - StmtFunctor RenewDefs

#11075 - Add function to tile a block according to a given tensor intrinsic

#11050 - Utility function to decide loop mapping for auto tensorization

#11009 - [ROCM] DP4A intrinsic support for TE/TIR

#10925 - VNNI and ARM dot product intrinsic for tensorization

#10887 - [Schedule] Relax reorder primitive's affine binding check

#10732 - [Analysis] Add SuggestIndexMap for layout rewriting

#10538 - [Schedule] Transform layout

#10638 - Change the behavior of read/write region analysis for reduction blocks.

#10705 - Use local complete block and local reduction block to identify compact dataflow

#10671 - Tuple Reduction Support in CreatePrimFunc

#9727 - [TE]Implement layout transformations, non-flat memory buffers

#10405 - [TensorIR] Update VerifyGPU

#10401 - [TensorIR] Renormalize split pattern

#10112 - [TIR, Relay] improve bfloat16 support

#8509 - Tir constants integration into compilation pipeline

#9996 - add support for multi-blocking layout and their transformation

#10066 - Add software pipelining

#10207 - Support sub warp reduction for CUDA target.

#9482 - Implementation of Common Subexpression Elimination for TIR

#9527 - Allow compute_at create block predicate for non-trivial bounds and support floordiv pattern

#10158 - [Schedule] Update compact_dataflow constraint

#9871 - [Schedule] Blockize and Tensorize

#10016 - [BugFix]Fix cross-thread reduction when single reduction loop with predicate

#9880 - Encode conditional accesses info into block read/write regions

#9699 - Affine utility support iter lowerbound and diagnostics

#9742 - [Schedule] Add Annotate/Unannotate primitive

#9738 - [TensorIR] Primitive "SetScope"

#9743 - [Schedule] Analysis functions to check if compute_inline and com…

#9689 - Allow memory (aka storage) scopes to be retrieved/applied to PrimFuncs

#9559 - [TensorIR][UX] Type annotation-based runtime type checking

#9444 - Add a 'rolling_buffer' scheduling primitive

#9360 - [TensorIR] Cross-Thread Reduction

TOPI

#11531 - TE implementation of LSTM using scan

#11161 - Add Adreno GPU target and topi supporting textures with dynamically allocated textures

#10332 - VNNI support for batch matmul

#9873 - Add support for grouped conv3d

#10230 - VNNI support for int8 dense

#10098 - [Op]5 ops can accept unsigned integers as indices

#9832 - Support grouped conv1d

#9694 - Add generic batch norm

#9233 - Cortex-M DSP support

TVMScript

#11308 - Represent ramp as index slice

#10099 - Support T.buffer_decl using data pointer from Let/Allocate

#9680 - Improve printer for TIR syntax sugar

#9492 - Add syntax sugar for T.handle and T.match_buffer

#9620 - Add for loop syntax sugar

#9543 - Misc error message improvements

#9505 - [Fix] Add type hints for more uncovered cases

USMP

#11015 - U3 use case

#10189 - Adding support for U1 usecase for constant pools

#10785 - Adding support for U4 usecase

#10193 - adding support for U2 and U3 usecases

#10005 - Add performance characteristics to PoolInfo

#9565 - [TIR]Integrating USMP to AoT Executor

#9704 - Hill Climb allocator

#9418 - [TIR]adding the pass to convert to pool offsets

#9649 - [TIR]Augmenting the algo interface with memory pressure

#9214 - [TIR]Greedy memory planning algorithm

#8468 - [TIR]Added buffer info extraction pass

microNPU

#11468 - Optimize separate padding operation for conv2d

#11453 - Add transform matrices and part matcher to identity op

#11410 - add E2E tests with cascader without striping

#11288 - Expose compute cycle annotations to TIR lowering

#10959 - Add a pass to reorder copy and compute nodes

#10509 - Add various options to the cascader

#11263 - Adding an option to enable striping

#10251 - Add support for conv2d running on two cores on U65

#10862 - Integrate the cascader

#10344 - Integrate rolling buffers in Arm(R) Ethos(TM)-U

#10824 - Some housekeeping in the test_ethosu folder

#10763 - Tweak a layout transform matrix

#10725 - Add a pass to move allocate nodes to the outer scope

#10695 - Determine block configs using the cascader

#10599 - Refactor Relay to TIR hook

#10508 - Improve cascader memory transfer estimates

#10345 - Add support for TFLite FULLY_CONNECTED

#10254 - Introduce a pass to remove redundant identity operations

#10062 - [5] Convert Proposals to te.Schedules

#9959 - [4] Add the cascader Proposal generator

#10022 - enable USMP

#10127 - Add support for LeakyReLU

#10004 - Add FreeRTOS variant of NPU demo

#10060 - Refactor type inference data type checks

#9960 - Add support for pack and unpack

#10143 - Fix layout assignment in layout optimizer pass

#9890 - [3] Plan generation for the cascader

#9855 - Add support for transpose convolution

#9841 - Add support for nearest neighbor and bilinear upsampling

#9951 - Removing constant args from PrimFunc

#9929 - Refactor base address determination to codegen

#9910 - Add support for requantize

#9831 - Move optimization passes to be a module pass and ensure they are running

#9785 - [2d] Add more Part matchers to cascader

#9778 - [2c] Add performance modelling to cascader

#9471 - [2b] Create CascaderGraphs from TE graphs

#9469 - [2a] Add CascaderGraph for cascading analysis

#9621 - Add support for SPLIT and SPLIT_V

#9508 - Update Conv2D Tests to Use TF API to Gen Test Cases

#9627 - Add support for SIGMOID

#9589 - Add support for TFLite concatenate

#9623 - Refactor codegen tests

#9561 - Add NHWC -> NHCWB16 layout transformation pass

#9576 - Mean legalization support

#9597 - Move the compilation to use Target Hooks.

#9458 - [1] Add affine analysis structures for the cascader

#9547 - Add the infrastructure for lookup table and TANH

#9521 - Support binary elementwise with non-4D inputs

#9560 - Fix incorrectly calculated stride when converting NHWC to NHCWB16

#9530 - Add unary elementwise operator infrastructure with ABS

#9514 - Adding rounding mode attribute to operators

#9515 - Allow constants to be given as input to an operator

microTVM

#11250 - [ARM] Add Relay tests for conv2d registered schedules

#11232 - [rpc] Implemented rpc logging

#11044 - Add support for host-driven AoT Executor

#11043 - Better version handling for Arduino

#10555 - Enable micro tvmc tutorial testing in CI

#10194 - [RVM] Add scripts for automated build and testing

#10144 - TVMCon 2021 Zephyr Demo with CMSIS-NN

#10024 - [tvmc] Add TVMC Micro tutorial for Zephyr

#9684 - Fix zephyr/test_zephyr_armv7m test

#9584 - [TVMC] Add TVMC test for Arduino and Zephyr

#9526 - Add minimal forwarding RPC server for host driven python execution on Hexagon

Zephyr support - #11362, #10138

Misc

#11465 - Add cooldown interval logic for the profiling functional

#11888 - [LLVM] Include LLVM headers in files that use them, not in llvm_common.h

#11646 - [Arith] Simplification of ceil, log2, and left_shift

#11464 - [MLF] Add support for multiple modules in Model Library Format

#11632 - [AutoTVM][Autoscheduler] Default build funcs inherit PassContext

#11543 - [OpenCL] Implement conv2d_winograd algorithm for Adreno

#11287 - [Arith] Merge surjective/non-surjective iter mapping detections

#11393 - Add utility to replace direct call to pytest.main

#11252 - [ROOFLINE] Roofline analysis over RPC

#11000 - [Graph Debugger] Expose way to benchmark individual nodes.

#10794 - bump PyTorch version to 1.11

#10821 - [REFACTOR] Remove legacy nnvm folder

#10798 - [Arith] Remove diagnostic ctx argument from DetectIterMap

#10567 - [Refactor] Reduced repetition in CodeGenLLVM's buffer access

#10455 - [AUTO_SCHEDULER] Add feature extraction directly from PrimFunc

#7401 - RFC: initial stab at TorchScript fallback

#10391 - [vulkan] Add integer dot product (4xint8, 4xuint8) tensorization for the vulkan SPIR-V target.

#10293 - [VirtualMachine] new method allowing to set one input tensor by its index or name

#10191 - Generate correct output tensor names in C Interface API

#9276 - Parameterize test_link_params

#9808 - [Rust] Update Rust bindings

#9553 - [PROFILING] Add ability to profile a single function_profiling

#9611 - [CMAKE] Automatically detect newly added source files

#9544 - [Target] enable -arch=sm_xx for assigning cuda target arch and deprecate autotvm.measure.set_cuda_target_arch api

Profiler - #11530, #11066

Docs - #10921, #11403, #10774, #10912, #9633, #9906, #9534, #9307, #9654, #9580

Android - #11241

ETHOSN - #11261, #10486, #10018, #9596

TVMC - #11012, #10962, #10722, #9817, #9529, #9229

apache-tvm-src-v0.9.0.tar.gz (18.82 MB)
apache-tvm-src-v0.9.0.tar.gz.asc (659 bytes)
apache-tvm-src-v0.9.0.tar.gz.sha512 (163 bytes)
v0.8.0 (Nov 24, 2021)
Overview

Accepted RFCs

Features and Improvements

TE, TIR, TVMScript

AutoTVM, AutoScheduler, Meta Schedule

Operator Coverage

Training

Relay

MicroTVM, AOT, Graph Executor and VM

Arithmetic Analysis

Frontends

Codegen Backends and Runtime

BYOC Integration with Vendor Libraries: TensorRT, ACL, VitisAI

TVMC

Rust Binding

Misc

Overview

Apache TVM v0.8 brings several major exciting experimental features, including:

PaddlePaddle frontend

TVMScript: round-trippable python-based syntax for TIR

TorchScript integration

TensorIR scheduling language

TensorRT and CUTLASS integration via BYOC

Int4 TensorCore support in AutoTVM

MicroTVM Project API and Zephyr, Arduino support

AOT executor

Robust Windows support

Affine analysis infra: iter-affine-map

Improved Vulkan backend

CUDA graph support in TVM runtime

In addition, the community has been working together to refactor and evolve the existing infrastructure, including but not limited to:

Relay compilation engine

Relay pattern language

CI and build process

Refactoring documentation and tutorials

Stabilizing AutoScheduler

Stabilizing TVMC command line driver interface

Stabilizing target system

Frontend coverage, quantization, dynamic shape, training

Full changelog: https://gist.github.com/junrushao1994/c669905dbc41edc2e691316df49d8562.

Accepted RFCs

The community has adopted a formal RFC process. Below is a list of the formal RFCs accepted by the community since then:

[RFC-0005] Meta schedule (AutoTIR)

[RFC-0006] Automatic mixed-precision pass and support

[RFC-0007] Parametrized unit tests

[RFC-0008] MicroTVM Project API

[RFC-0009] Unified static memory planner

[RFC-0010] Target-registered compiler flow customisation

[RFC-0011] Arm® Ethos-U integration

[RFC-0014] Pipeline executor

[RFC-0015] Use CMSIS-NN with TVM

[RFC-0019] Add PaddlePaddle frontend

[RFC-0020] Extend metadata in project option

[RFC-0022] TIR non-scalar constants

[RFC-0023] Adding annotation field to tir.allocate nodes

[RFC-0025] PyTorchTVM

[RFC-0027] Formalize TVM documentation organization

[RFC-0028] Command line composition from internal registry

[RFC-0029] Migrating target attributes to IRModule

[RFC-0030] Command line configuration files

[RFC-0031] C Device API

[RFC-0036] TVMScript namespace

[RFC-0041] Update TVMScript block syntax

Features and Improvements

TE, TIR, TVMScript

TVMScript parser and printer #7630 #9115 #9286

Scheduleable TIR (S-TIR) infrastructure, analysis and lowering passes #7553 #7765 #7847 #8114 #8121 #7873 #7923 #7962 #7848 #8044 #7806

S-TIR schedule primitives: compute-inline, reverse-compute-inline, fuse, split, rfactor, storage-align, vectorize, unroll, bind, reorder, cache-read, cache-write, compute-at, reverse-compute-at, decompose-reduction #8170 #8467 #8544 #8693 #8716 #8767 #8863 #8943 #9041

While loop in TIR #7425 #9004

Metaprogramming in S-TIR via specialize #8354

Support Return value in TIR #7084 #7932

Storage scope support in PointerType #8017 #8366 #8463

Creation of S-TIR via TE compute #7987
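
To make a few of the schedule primitives listed above more concrete, here is a minimal plain-Python sketch of what split, fuse and reorder mean for a two-level loop nest. It is not TVM code and does not use the S-TIR API; the function names are hypothetical and chosen only for this illustration. Each variant records the (i, j) iteration points it visits, so the transformed loops can be compared with the original.

```python
# Plain-Python illustration only; this does not use TVM or S-TIR.

def original(n, m):
    """Reference loop nest: for i in range(n): for j in range(m)."""
    return [(i, j) for i in range(n) for j in range(m)]


def after_split(n, m, factor):
    """Split loop i into (i_outer, i_inner) with inner extent `factor`."""
    visited = []
    outer_extent = (n + factor - 1) // factor  # ceiling division
    for i_outer in range(outer_extent):
        for i_inner in range(factor):
            i = i_outer * factor + i_inner
            if i < n:  # guard needed when factor does not divide n
                for j in range(m):
                    visited.append((i, j))
    return visited


def after_fuse(n, m):
    """Fuse loops i and j into a single loop of extent n * m."""
    visited = []
    for fused in range(n * m):
        i, j = divmod(fused, m)  # recover the original indices
        visited.append((i, j))
    return visited


def after_reorder(n, m):
    """Reorder the nest so that j becomes the outer loop."""
    return [(i, j) for j in range(m) for i in range(n)]


reference = original(5, 3)
split_points = after_split(5, 3, 2)
fused_points = after_fuse(5, 3)
reordered_points = after_reorder(5, 3)
```

Split and fuse visit the same points in the same order as the original nest, while reorder visits the same set of points in a different order. That is why reordering is only valid when the loop body permits it.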

AutoTVM, AutoScheduler, Meta Schedule

PopenPoolExecutor replaces the Python native library to provide better multiprocessing support and to enable auto-tuning in Jupyter notebooks for AutoTVM and AutoScheduler #6959 #8492 #8913 #8820 #8851

AutoScheduler improvement and stabilization: task scheduler, layout rewrite, early stopping, dispatching #6945 #6750 #6987 #7156 #8862 #8995 #7571 #7376 #7377 #7344 #7185

AutoScheduler support for sparse workloads #7313 #7635 #8065

AutoScheduler support for Vulkan, ROCm, Mali #7626 #7038 #7132

AutoTVM support for int4 TensorCore #7831 #8402

Meta Schedule core infrastructure, builder runner and database #8615 #8623 #8642 #8817 #9079 #9132 #9154 #9053 #9059 #9044 #9111 #9061 #9153

Operator Coverage

Operators for Int-8 vision transformer on GPU #7814

Optimizing NMS and ROI-related kernel on GPU #7257 #7172 #7136 #7796 #7463 #6516 #7440 #7666 #8174

Support and optimize sparse operators #8605 #7477 #7435 #6889 #6580 #8437

Sort-related operators and optimization #9184 #7669 #8672 #7611 #7195 #7056 #6978

Support for einsum operator #6370

Matmul, dense operators and their optimization #8921 #8527 #8234 #8250 #6616 #8229 #8401 #7404 #8669

Convolution and pooling operators and their optimization #8620 #8936 #8584 #7075 #7142 #7515 #6999 #6899 #6840 #6137 #6802 #6445 #6711 #6714 #8167 #8222 #8275 #8276 #8422 #8430 #6687 #7928 #8897

Scatter and gather operators and their optimization #8479 #7600 #7044 #7464 #7233 #6533 #6856 #6854 #7927 #8105

Prefix scan, cumsum and cumprod #7722 #7303 #7314 #7334 #7123 #6868

Dynamic shape and shape functions #7414 #6979 #6912 #6898 #6373 #8068 #7490 #7487

Miscellaneous improvements. Operators include reshape, resize, pad, PRNG, transpose, where, softmax, concat, nll_loss, space_to_batch_nd, batch_to_space_nd and slice_like; libraries include thrust, cuDNN, cuBLAS and MIOpen; schedules for generic reduction and softmax were also improved. #8592 #7375 #7287 #7184 #7131 #7086 #7083 #8030 #6851 #6477 #8346 #6759 #8028 #8056 #8369 #7468 #7458 #7194 #8138 #8543

Training

Relay AutoDiff #7677 #8318

TE AutoDiff #7321

Gradient operators #7685 #7340 #6767 #8307 #7357 #6827

Relay

Pattern language and mixed-mode visitor: matching more IR constructs, fuzzy matching; converting more passes to non-recursive. #8843 #7754 #7355 #7332 #7282 #7151 #7120 #6958 #7507 #8325 #8774 #7817 #7374 #6695 #6704

Improving or adding passes including ExtractOperators, SimplifyExpr, DynamicToStatic, DefuseOps, ConvertLayout, FoldConstant. Added a set of utilities that allows a model to be run efficiently on TensorCores #9253 #9245 #8996 #7827 #9034 #7807 #8755 #7731 #7368 #7603 #7656 #7423 #7354 #6946 #6748 #6720 #6776 #7835 #7895 #8205

TECompiler and refactoring of compilation workflow #9103 #8974 #8886 #8802 #8501 #8526 #8486 #8597 #7518 #7552 #8914 #9130

Quantization and automatic-mixed precision #8883 #8810 #8644 #7613 #8069 #8341 #8126 #8460

Parser, printer and diagnostic #7347 #6274 #6692 #8352 #8000

MicroTVM, AOT, Graph Executor and VM

Pipeline Executor #8702 #9108

CUDA graph integration in graph executor #7616

Add set_output_zero_copy in graph executor #8497

VM: memory allocation improvement, shape function improvement and misc #7746 #7451 #7413 #7210 #8040 #6938 #8661 #7676 #8285

AOT compilation and execution #8697 #7785 #8014 #8023 #8096 #8075

Project API infrastructure: #8380 #8963 #8708 #8019

MicroTVM, Zephyr, Arduino RVM, AutoTVM support #9320 #8941 #7804 #7786 #7449 #7891 #7915 #8055 #8037 #8386 #8519 #8748 #8154 #8945 #8624 #8701 #7723 #8715 #7225 #6964 #7813 #7528

The pure C runtime (CRT) #7398 #7333 #7095 #7225

Model library format #8270 #8072 #7938

Arithmetic Analysis

Tighter bounds and more simplification on cast #6771 #7045

Introducing iterator (quasi-) affine map detection #6667 #7752 #7759

Inverse of iterator affine map #8384 #8427

Subspace division in iterator affine map #7760
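
To make the iterator-map items above more concrete, here is a minimal pure-Python sketch of the underlying idea: a set of loop iterators fused into one flat index forms an affine map, that map can be inverted with floor division and modulo, and it is bijective when every flat index is hit exactly once. The helper names `fuse`, `split`, and `is_bijective` are hypothetical and are not TVM APIs; TVM's own analysis works symbolically rather than by enumeration.

```python
from itertools import product


def fuse(indices, extents):
    """Row-major fusion of iterators into one flat index (an affine map)."""
    flat = 0
    for idx, ext in zip(indices, extents):
        flat = flat * ext + idx
    return flat


def split(flat, extents):
    """Inverse of fuse: recover each iterator with modulo and floor division."""
    indices = []
    for ext in reversed(extents):
        indices.append(flat % ext)
        flat //= ext
    return tuple(reversed(indices))


def is_bijective(extents):
    """Brute-force check that fuse covers [0, prod(extents)) exactly once."""
    total = 1
    for ext in extents:
        total *= ext
    domain = product(*(range(ext) for ext in extents))
    image = sorted(fuse(idx, extents) for idx in domain)
    return image == list(range(total))
```

For example, with extents `(3, 4)` the iterators `(1, 2)` fuse to `1 * 4 + 2`, and `split` maps that flat index back to `(1, 2)`.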

Frontends

PaddlePaddle initial support #8645 #9124 #9126 #9295 #9370 #9236 #9283

ONNX support, including better handling of control flow, coverage of more operators, better dynamic shape support, more tests. #9265 #9178 #9146 #8894 #8966 #8967 #7818 #9000 #9001 #9066 #9028 #9002 #8985 #9019 #9017 #8972 #7802 #7800 #7781 #8919 #9054 #8906 #8933 #8959 #8907 #7771 #8923 #8924 #7755 #7720 #8773 #8872 #7655 #8741 #7633 #8781 #8866 #8867 #7522 #7519 #7489 #7438 #7429 #7364 #7300 #7259 #7243 #7237 #7208 #7189 #7115 #7109 #7089 #7036 #7031 #6839 #6351 #7842 #7844 #6646 #6647 #6681 #6700 #7883 #6726 #6730 #7899 #7900 #7906 #7934 #7956 #8007 #8011 #8084 #8099 #8189 #8191 #8304 #8321 #8337 #8356 #8385 #8502 #8426 #8440 #8456 #8475 #7391 #7394 #8621 #8322 #8323 #8435 #8436 #8455 #7353 #7215

TensorFlow and TFLite, including more operators, better TensorArray support and quantization #9404 #9256 #8689 #7789 #7736 #8763 #8647 #8648 #8558 #8780 #8538 #7659 #7639 #7531 #7520 #7502 #7496 #7473 #7452 #7442 #7441 #7400 #7320 #7293 #7267 #7159 #7148 #7114 #7113 #7093 #7074 #7048 #7030 #6998 #6984 #6970 #6949 #6933 #6918 #6901 #6885 #6849 #5767 #6589 #6670 #6674 #6675 #7866 #6685 #7885 #6729 #7901 #6774 #6783 #6799 #7951 #8024 #8051 #8060 #8074 #8142 #8179 #8251 #8277 #8335 #8364 #8375 #8431 #8454 #6818 #8483 #9099 #9165

PyTorch: more operators including activations, inplace operators, RNNs, NMS #9371 #9204 #9185 #9135 #9133 #9015 #8839 #8718 #8699 #8692 #7712 #8753 #7694 #8583 #7675 #7646 #7606 #7592 #7569 #7544 #7549 #7535 #7517 #7465 #7397 #7371 #7348 #7346 #7325 #7231 #7174 #7154 #7137 #7134 #7133 #7128 #7088 #7023 #6900 #6602 #7845 #6659 #6740 #6782 #6784 #7958 #8192 #8397 #8398 #8403 #8447 #6829

MXNet support. More operators and NLP model coverage in GluonNLP #7568 #7409 #7209 #7191 #7062 #6561 #6699

Misc: CoreML, Keras, DarkNet, etc. #7667 #6676 #6651 #6963 #7949 #7035 #7446 #8562 #8599

Codegen Backends and Runtime

LLVM backend: recover LLVM support on Windows; support target feature strings in function attributes; atomic support in NVPTX, ROCm; compatibility with LLVM 12+ #9305 #9223 #9138 #8860 #8958 #6763 #6698 #6717 #6738 #8293 #6907 #7051

ROCm 3.9 bitcode files search #6865

Vulkan and SPIR-V refactoring and major improvements in codegen and runtime. A critical bug fix in SPIR-V codegen allows the Vulkan backend to produce correct outputs on more hardware and drivers. Added support for querying device-specific hardware parameters and capabilities, dynamic shapes, irregular ops such as sorting and NMS, UBO, fp16, and vectorization. We can now run complicated models like MaskRCNN on Vulkan end to end. #8904 #7833 #7717 #7681 #8746 #8813 #7609 #8882 #7607 #7591 #7574 #7572 #6662 #7969 #8013 #8048 #8098 #8102 #8107 #8127 #8151 #8196 #8320 #8588 #8332 #8333 #8348 #8528

Metal language version upgrade (MTLLanguageVersion2_3), better codegen support, int64 support, various bug fixes #7830 #7819 #7714 #7118 #7116 #7105 #7980 #8054 #8175 #8202 #8206 #8313

OpenCL, VTA, Verilator: refactored code generator, better error messages, various bug fixes #7834 #7777 #7761 #7100 #6125 #6126 #6191 #8256 #8257 #8731 #8756 #8973

CUDA: enable __launch_bounds__, dynamic shared memory, TensorCore, BF16, half2, NVCC version upgrade #9341 #8678 #7561 #7273 #7146 #7147 #7099 #7065 #7033 #7014 #7907 #7964 #9087 #8135 #8137 #8457 #8466 #8571

ARM: CMSIS-NN, Ethos-N #8653 #7628 #8951 #7506 #7443 #7858 #6982 #8795 #8806 #8833 #9147 #9159 #9160 #9162 #9163 #9167 #9209 #9386 #9387

Hexagon: build, compilation, model launcher, more target options and better runtime #7784 #6718 #8821 #8822 #9033 #8823 #8859 #8865 #8915 #8954 #9024 #9025 #8960 #8986 #9010 #9011 #9189 #9220 #9355 #9356

WASM: Update support for latest emcc, add ffi test. #6751

BYOC Integration with Vendor Libraries: TensorRT, ACL, VitisAI

TensorRT initial integration, stabilization, int8 calibration, dynamism support #6395 #7702 #7595 #7581 #7412 #7372 #9047 #8073 #8808 #6905 #7967 #8005 #8172 #8461 #8506 #8607 #7205 #7026 #7016 #7011 #6955 #6872 #7253 #6805 #9324

Arm Compute Library (ACL) integration #7649 #7206 #6532 #7121 #6724 #8149 #7251 #9396

Verilator integration #7406 #7351 #7286 #8094

VitisAI integration #6343 #7350

BYOC infrastructure enhancement: improving control flow, AnnotateTarget, custom codegen #6641 #6655 #6697 #6786 #7977 #8464

TVMC

macOS support #8396

AutoScheduler support #7070

Support cross compiler options #7922

Python scripting #7823 #7698

More flexible input specification #7366 #7788

More options, --disable-pass and --config #7816 #8253

Allow passing optional arguments to importers #7674

Model library format (MLF) support #8086 #8331

More backend and library support: Metal, ACL, Vulkan, OpenCL, ROCm, Vitis AI #8282 #7508 #8359 #6831 #8896 #7577

Support for the new target system #7651 #7654 #6788 #7304 #6855

Rust Binding

Rust bindings installable via Cargo #7503 #6678 #8631 #8665

Initial support for diagnostic interface #6656

Fixes for using Python APIs from Rust #7085

Improve NDArray, GraphRt, Relay, IRModule, Array, Attrs bindings #6563 #6741 #7138 #8353 #7082

Improve error handling, error messages and fix memory leaks #8289 #6815 #8714 #8725

Misc

Enhanced CPP-RPC implementation: allow user-supplied work dir, support for a CPP-RPC server on Apple platforms, support for adb-shell style CPP-RPC #7670 #8224 #8223 #7766 #7013

Use PopenWorker to handle RPC system: #7889 #7757 #7961

Fold target host into target #7462 #7791 #7534 #8835

Target-based intrinsic lowering and legalization #7936 #7809

Add target tags for all existing CUDA GPU models #7410

Linear Congruential Random Engine #8642
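
For readers unfamiliar with the technique named in the last item, a linear congruential engine advances its state with the recurrence `state = (a * state + c) mod m`. The sketch below is a minimal Python illustration, not TVM's implementation: the class layout is hypothetical, and the constants are the well-known `minstd_rand` parameters, assumed here only for illustration since the release notes do not state which parameters TVM uses.

```python
class LinearCongruentialEngine:
    """Minimal LCG sketch: state <- (a * state + c) mod m."""

    # Assumed constants (minstd_rand); not taken from the release notes.
    MULTIPLIER = 48271
    INCREMENT = 0
    MODULUS = 2147483647  # 2**31 - 1

    def __init__(self, seed=1):
        # With INCREMENT == 0 a zero state would stay zero forever,
        # so fall back to 1 in that case.
        self.state = seed % self.MODULUS or 1

    def next(self):
        """Advance the state once and return the new value."""
        self.state = (self.MULTIPLIER * self.state + self.INCREMENT) % self.MODULUS
        return self.state
```

Two engines created with the same seed produce the same sequence, which is the property that makes this kind of generator convenient for reproducible search and tuning runs.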

Source code (tar.gz)
Source code (zip)
apache-tvm-src-v0.8.0.tar.gz (17.33 MB)
apache-tvm-src-v0.8.0.tar.gz.asc (833 bytes)
apache-tvm-src-v0.8.0.tar.gz.sha512 (159 bytes)