Open deep learning compiler stack for CPU, GPU, and specialized accelerators

Overview

Open Deep Learning Compiler Stack

Documentation | Contributors | Community | Release Notes

Apache TVM is a compiler stack for deep learning systems. It is designed to close the gap between productivity-focused deep learning frameworks and performance- and efficiency-focused hardware backends. TVM works with deep learning frameworks to provide end-to-end compilation to different backends.

License

© Contributors. Licensed under the Apache-2.0 license.

Contribute to TVM

TVM adopts the Apache committer model; we aim to create an open-source project that is maintained and owned by the community. Check out the Contributor Guide.

Acknowledgement

We learned a lot from the following projects when building TVM.

  • Halide: Part of TVM's TIR and arithmetic simplification module originates from Halide. We also learned from and adapted parts of the lowering pipeline from Halide.
  • Loopy: use of integer set analysis and its loop transformation primitives.
  • Theano: the design inspiration of symbolic scan operator for recurrence.
Comments
  • [RFC][Quantization] Support quantized models from TensorflowLite

    [RFC][Quantization] Support quantized models from TensorflowLite

    Let me first reference @ajtulloch's comment about the quantization workflow:

    1. Implement a model in a standard ML framework, generally using fp16/bfloat16/fp32 compute precision as this has highest throughput on most commonly-used training hardware.

    2. (optionally) insert fake quantization (here, called simulated quantization) nodes at quantization boundaries (i.e. if your backend implements a fused Int8Conv + Int8Relu, you'd insert them after a Conv + Relu block), to simulate the quantization numerics at training time.

    3. Train the model as usual

    4. Implement a graph rewriting pass (i.e. TF's toco, C2's int8_converter, MXNet's quantization, etc) that rewrites the graph to target the int8 operators directly — i.e. remapping subgraphs of e.g. FP32Conv + FP32Relu to be a fused Int8ConvRelu operator. This requires computing output quantization parameters at requantization boundaries, which can be done either by

    • calibration to an example set of activations, via e.g. l-p norm or kl minimization (c2/tf/mxnet/tensorrt)
    • using activation ranges learned during training (c2/tf).
    5. Using this quantized graph, evaluate various metrics to verify the quantization-induced error/loss is acceptable.

    6. Deploy the quantized graph.

    However, we already have frameworks that can do steps 1 through 5 well, such as TensorFlow. For example, TensorFlow's quantization-aware training performs step 2 and ultimately achieves good accuracy.

    In industry, one common scenario is that a company splits algorithm work and engine/framework work across two different teams. The algorithm team simply hands a model to the engine team to boost its performance. So if the algorithm team can use TensorFlow's quantization-aware training, they will know the accuracy before delivering the model to the engine team, and the engine team is only responsible for improving performance.

    For the reasons above, I will make several PRs to support importing existing quantized models (TFLite INT8 models) into TVM. This is not a replacement for https://github.com/dmlc/tvm/pull/2116; it is a supplement to TVM's quantization.

    After initial investigation and effort, INT8 achieves roughly a 30% speedup over FP32 for the MobileNet V1 model on an ARM CPU.
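
    For context, below is a hedged sketch (not part of this RFC) of the kind of TFLite import flow this work targets, using relay.frontend.from_tflite. The file name, input name, shapes, and target string are placeholder assumptions, and older tflite packages spell the loader as tflite.Model.Model.GetRootAsModel.

        # Hedged sketch of importing and compiling a TFLite model with TVM.
        import tvm
        from tvm import relay
        import tflite  # TFLite flatbuffer Python bindings

        with open("mobilenet_v1.tflite", "rb") as f:  # placeholder file name
            buf = f.read()
        tflite_model = tflite.Model.GetRootAsModel(buf, 0)

        # Convert the TFLite graph into a Relay module plus parameters.
        mod, params = relay.frontend.from_tflite(
            tflite_model,
            shape_dict={"input": (1, 224, 224, 3)},  # placeholder input name/shape
            dtype_dict={"input": "float32"},         # an INT8 model would use uint8/int8
        )

        # Compile for an ARM CPU (illustrative target string).
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build(mod, target="llvm -mtriple=aarch64-linux-gnu", params=params)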

    • [x] Support TFLite FP32 Relay frontend. PR: https://github.com/dmlc/tvm/pull/2365

    • [ ] Support TFLite INT8 Relay frontend

    • [ ] Extend the attribute of the convolution and related ops to support quantization

    • [ ] Auto-TVM on ARM CPU can work with INT8

    Welcome any feedback.

    status: RFC 
    opened by FrozenGene 119
  • [RFC] NNVMv2 IR - Relay

    [RFC] NNVMv2 IR - Relay

    [RFC]: Relay, a new high-level IR for TVM

    Relay is a new high level intermediate representation (IR) intended to act as v2.0 of NNVM.

    Motivation

    Computation graphs are a powerful program representation as demonstrated by the first generation of DL frameworks. Most popular frameworks have employed computation graphs as their input, intermediate representation, and execution data structure.

    However, as workloads continue to evolve, the design of our high-level IRs needs to evolve to better support the needs of developers and users.

    Graph-level challenges such as control flow and sub-graphs have become necessary features to natively support and optimize.

    The tight coupling between runtime representation and compile-time representation has limited flexibility and frustrated developers; Relay will decouple the representations.

    Finally, we believe the high-level IR must be designed in tandem with the low-level IR, allowing the two layers to communicate during compilation to achieve optimal performance.

    Design

    The first version of NNVM set out to solve some of these challenges, and we view Relay as a second-generation IR designed specifically for integration into the TVM stack as its input layer. Our goal is to focus on TVM as our primary backend, easing development and maintenance for both TVM developers and current NNVM users, as well as enabling new features.

    In order to address the challenges presented above we designed Relay to build on the things computation graphs are good at (pure, dataflow, compositional), and improve on the things they struggle with (control flow, subgraph, runtime/compilation distinction).

    Core IR

    Relay is a typed pure functional IR, with a few basic features such as functions, if-then-else control flow, recursion, operator and function calls, and variable binding.

    We have iterated on Relay's design over the past 8 months. This version represents the culmination of our experiments. This PR does not contain all the pieces of the previous version; instead, we focus on introducing the core IR, its associated data structures, and a few integral passes.

    The core IR is defined in just a few files:

    • include/tvm/relay/base.h (the base classes and common data)
    • include/tvm/relay/type.h (the type system and all relevant nodes)
    • include/tvm/relay/expr.h (the expression language)

    Typing

    All Relay programs are typed, similar to more conventional languages such as C++. A type system allows us to statically (i.e., at compile time) distinguish between different sorts of values. This means we know whether an expression will evaluate to a tensor, a function (e.g., (float32, float32) -> float32), or a tuple (float32, int32). Furthermore, our type system is shape-generic (i.e., it supports polymorphism/templating over shapes).

    Type inference and checking take the place of shape inference in traditional computation-graph-style IRs.

    This PR implements type inference and checking for Relay; the code can be found in src/tvm/relay/pass/type_infer.cc, with relevant helper utilities in src/tvm/relay/pass.
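
    As an illustration, here is a hedged sketch (not part of this PR) of invoking type inference from Python on a small Relay function; it uses the present-day module/pass spelling, and the names and shapes are arbitrary.

        # Hedged sketch: type inference fills in checked_type on every node.
        import tvm
        from tvm import relay

        x = relay.var("x", shape=(3, 4), dtype="float32")
        w = relay.var("w", shape=(8, 4), dtype="float32")
        f = relay.Function([x, w], relay.nn.dense(x, w))

        mod = tvm.IRModule.from_expr(f)
        mod = relay.transform.InferType()(mod)
        # Prints a function type such as:
        # fn (Tensor[(3, 4), float32], Tensor[(8, 4), float32]) -> Tensor[(3, 8), float32]
        print(mod["main"].checked_type)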

    Control Flow

    Relay adds a notion of control flow to the IR, in the form of a simple if (cond) { true_branch } else { false_branch }. Relay requires that the condition evaluate to a single boolean value controlling which branch is taken. if is an expression in Relay, meaning the result of the entire expression is the result of the branch taken.

    We introduce this to add a formal way to distinguish between data flow and control flow without having to conflate the two in the representation. Because we separate the control signal, we can easily batch a program without affecting control flow.

    The definition of control flow can be found in include/tvm/relay/expr.h.
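
    The snippet below is a hedged sketch (not from this PR) of building such an if expression through the Python API; the variable names and shapes are arbitrary.

        # Hedged sketch: `if` is an expression whose value is the taken branch.
        from tvm import relay

        x = relay.var("x", shape=(10,), dtype="float32")
        cond = relay.var("cond", shape=(), dtype="bool")

        body = relay.If(cond, relay.add(x, x), relay.multiply(x, x))
        func = relay.Function([x, cond], body)
        print(func)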

    Abstraction

    Relay supports the definition of functions, which can be used to represent "sub-graphs" (i.e., chunks of reusable computation).

    Relay functions are like traditional functions: they have a set of parameters (i.e., placeholders) and a body, which is a chunk of computation involving those parameters (i.e., the sub-graph). We can build a full network/model by composing functions together.

    Compilation

    The Relay IR is designed as a compile time representation of models. The new features are exposed only in Relay's abstract syntax tree, and used for compile time program manipulation. We do not intend to use Relay's IR as a data structure for serious interpretation or execution.

    Runtime

    These new features increase the expressivity of the current computation model, and one may ask how to execute programs that use them with the existing runtime. Our goal is to introduce Relay as the compiler representation in this PR and reuse the existing runtime, maintaining compatibility on both the frontend and backend. We anticipate that a future version of the runtime will natively support Relay's new constructs.

    TVM Co-design

    We made an effort to model Relay's implementation after TVM and reuse much of the existing infrastructure in order to provide better compatibility between TOPI operators and Relay programs. One big design decision is reusing the TVM node system to expose the Relay language to Python in the style of TVM. Users who are familiar with TVM's expression language should feel comfortable working with the Relay AST's definition in C++ and Python. We also share representations for many data structures; for example, tensor containers (i.e., tvm::runtime::NDArray) and generic attributes are shared between Relay and TVM.
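
    As a hedged illustration of this point (not from the PR), the Relay AST can be inspected directly from Python because it is exposed through the TVM node system; the function below is arbitrary.

        # Hedged sketch: Relay AST nodes behave like ordinary Python objects.
        from tvm import relay

        x = relay.var("x", shape=(3,), dtype="float32")
        f = relay.Function([x], relay.exp(x))

        print(type(f.body))           # a relay Call node wrapping the "exp" operator
        print(f.body.op.name)         # "exp"
        print(f.params[0].name_hint)  # "x"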

    Transitioning from NNVM

    We plan on adding a guide for transitioning programs from NNVM to Relay. This is one of the remaining work items before releasing the Relay Alpha. The goal is that users can use the Relay operators and builder API to construct Relay programs, and we will follow up with a compatibility layer to make transitioning from NNVM smooth.

    For an implementation, see #1672.

    status: RFC 
    opened by jroesch 75
  • [TOPI][IMAGE][RESIZE] Bilinear interpolation for resize and upsampling.

    [TOPI][IMAGE][RESIZE] Bilinear interpolation for resize and upsampling.

        * upsampling - migrated to cpp
        * bilinear resize implementation & test cases.
        * upsampling testcases enhanced for NHWC
    

    Discuss: The above implementation merges both nn (upsampling) and bilinear as scale at the topi cpp level. How about merging upsampling & bilinear as scale at the Python tvm.topi and NNVM frontend levels too?

    In that case we could have just "scale" everywhere.

    status: accepted 
    opened by srkreddy1238 63
  • [CI][Docker] xgboost 1.0.1 causes segfault on test_autotvm_xgboost_model.py

    [CI][Docker] xgboost 1.0.1 causes segfault on test_autotvm_xgboost_model.py

    With the release of XGBoost 1.0.x (i.e., xgboost-1.0.1-py3-none-manylinux1_x86_64.whl), it seems that installing TVM from scratch (rebuilding the Docker containers) causes tests/python/unittest/test_autotvm_xgboost_model.py to fail with a segfault.

    Investigating a bit further, if I manually revert to xgboost-0.90 it works fine. With xgboost-1.0.1, this is the message I see:

    tests/python/unittest/test_autotvm_xgboost_model.py::test_fit Fatal Python error: Segmentation fault
    
    Thread 0x00007f4f98de4700 (most recent call first):
      File "/usr/lib/python3.6/multiprocessing/connection.py", line 379 in _recv
      File "/usr/lib/python3.6/multiprocessing/connection.py", line 407 in _recv_bytes
      File "/usr/lib/python3.6/multiprocessing/connection.py", line 250 in recv
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 463 in _handle_results
      File "/usr/lib/python3.6/threading.py", line 864 in run
      File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
      File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap
    
    Thread 0x00007f4f905e3700 (most recent call first):
      File "/usr/lib/python3.6/threading.py", line 295 in wait
      File "/usr/lib/python3.6/queue.py", line 164 in get
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 415 in _handle_tasks
      File "/usr/lib/python3.6/threading.py", line 864 in run
      File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
      File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap
    
    Thread 0x00007f4f8fde2700 (most recent call first):
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 406 in _handle_workers
      File "/usr/lib/python3.6/threading.py", line 864 in run
      File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
      File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap
    
    Current thread 0x00007f4fb514c700 (most recent call first):
      File "/usr/local/lib/python3.6/dist-packages/xgboost/core.py", line 1248 in update
      File "/usr/local/lib/python3.6/dist-packages/xgboost/training.py", line 74 in _train_internal
      File "/usr/local/lib/python3.6/dist-packages/xgboost/training.py", line 209 in train
      File "/workspace/python/tvm/autotvm/tuner/xgboost_cost_model.py", line 272 in fit_log
      File "/workspace/tests/python/unittest/test_autotvm_xgboost_model.py", line 35 in test_fit
      File "/usr/local/lib/python3.6/dist-packages/_pytest/python.py", line 167 in pytest_pyfunc_call
      File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
      File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
      File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
      File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
      File "/usr/local/lib/python3.6/dist-packages/_pytest/python.py", line 1445 in runtest
      File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 134 in pytest_runtest_call
      File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
      File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
      File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
      File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
      File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 210 in <lambda>
      File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 237 in from_call
      File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 210 in call_runtest_hook
      File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 185 in call_and_report
      File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 99 in runtestprotocol
      File "/usr/local/lib/python3.6/dist-packages/_pytest/runner.py", line 84 in pytest_runtest_protocol
      File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
      File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
      File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
      File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
      File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 271 in pytest_runtestloop
      File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
      File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
      File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
      File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
      File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 247 in _main
      File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 197 in wrap_session
      File "/usr/local/lib/python3.6/dist-packages/_pytest/main.py", line 240 in pytest_cmdline_main
      File "/usr/local/lib/python3.6/dist-packages/pluggy/callers.py", line 187 in _multicall
      File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 87 in <lambda>
      File "/usr/local/lib/python3.6/dist-packages/pluggy/manager.py", line 93 in _hookexec
      File "/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py", line 286 in __call__
      File "/usr/local/lib/python3.6/dist-packages/_pytest/config/__init__.py", line 93 in main
      File "/usr/local/lib/python3.6/dist-packages/pytest/__main__.py", line 7 in <module>
      File "/usr/lib/python3.6/runpy.py", line 85 in _run_code
      File "/usr/lib/python3.6/runpy.py", line 193 in _run_module_as_main
    ./tests/scripts/task_python_unittest.sh: line 27: 24582 Segmentation fault      (core dumped) TVM_FFI=ctypes python3 -m pytest -v tests/python/unittest
    

    @tqchen, I didn't see any PR or discussion about it, but are you aware of any ongoing initiative to move TVM to XGBoost 1.0.x, or shall we pin xgboost to 0.90 to prevent the error from happening? (Note: I'm happy to send a patch to pin the version.)

    opened by leandron 61
  • Rebuild ci-arm, ci-cpu, and ci-gpu container

    Rebuild ci-arm, ci-cpu, and ci-gpu container

    This is a tracking issue for the process of updating the TVM ci- containers to reflect the following PRs:

    #8169

    Steps:

    • [x] Pick the latest revision of main and use that specific git hash for following steps. 61a6ea185caa180afce548807c65261326589b19
      • [x] merge #8088
      • [x] merge #8245
      • [x] merge #8268
      • [x] merge #8291
      • [x] merge #8304
      • [x] merge #8306
      • [x] merge #8310
      • [x] merge #8312
      • [x] merge #8315
      • [x] merge #8316
      • [x] merge #8319
    • [x] Build docker containers, tagging them as e.g. <username>/ci-<container>:v0.<ver>
    • [x] Push docker containers to Docker Hub
    • [x] Create a draft PR modifying Jenkinsfile to point all containers at <username>/ci-<container>:v0.<ver>
      • [x] https://github.com/apache/tvm/pull/8193
    • [x] Force-push the PR to ci-docker-staging branch
    • [x] Jenkins will notice the push and start a build here
    • [x] Debug the build and repeat these steps until the build passes
    • [x] Push the valid containers to tlcpack/ci-<container>:v0.<ver>
    • [x] Update the PR to point Jenkinsfile to the new containers.
    • [x] Merge the PR.
    opened by areusch 54
  • [Relay] Add a PyTorch to Relay Parser

    [Relay] Add a PyTorch to Relay Parser

    Thanks for contributing to TVM! Please refer to guideline https://docs.tvm.ai/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @ them in the pull request thread.

    Note: No need to review yet, just here for visibility

    Originally submitted PR in a fork of TVM but putting one here as well. May close other one depending on outcome of discussion here https://discuss.tvm.ai/t/discuss-adding-a-pytorch-frontend/5026/4.

    Support PyTorch natively in TVM by providing a Relay parser. Like other frontends, grab the Relay module and parameters to build via: mod, params = relay.frontend.from_pytorch(trace, input_shapes)

    Tested against torchvision models in the included test_forward.py. Some discussion here: https://discuss.tvm.ai/t/discuss-adding-a-pytorch-frontend/5026/4
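
    For illustration, here is a hedged sketch (not from the PR) of the intended usage with a traced torchvision model; the input name/shape format is an assumption and may differ between TVM versions.

        # Hedged sketch: trace a PyTorch model and convert it to Relay.
        import torch
        import torchvision
        from tvm import relay

        model = torchvision.models.resnet18().eval()
        dummy = torch.randn(1, 3, 224, 224)
        trace = torch.jit.trace(model, dummy).eval()

        # Map graph input name(s) to shapes, as expected by the frontend.
        input_shapes = [("input0", (1, 3, 224, 224))]
        mod, params = relay.frontend.from_pytorch(trace, input_shapes)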

    status: accepted 
    opened by alexwong 53
  • [RELAY][OP] Relay Operator Sprint

    [RELAY][OP] Relay Operator Sprint

    Now that the Relay RFC is being merged and we are stabilizing the type inference interface, we should sprint to add new operators to Relay to bring it to parity with NNVM.

    #1798 shows an example of how to do so for the conv2d operator.

    General Steps of Porting

    • Implement the TypeRelation function, when necessary
      • The shapes are represented by IndexExpr (symbolic integers)
        • When possible, support symbolic shape inference
        • You can, however, extract a concrete integer from a symbolic shape if you must; that will require inference to work on concrete shapes.
      • Use reporter->Assign to set the inferred result
      • Use reporter->AssertEQ to assert symbolic integer equivalence
        • It will return false if there is an unsatisfied constraint
    • Use tvm::Attrs to replace dmlc::Parameter
    • We switch to directly creating Python wrappers by calling into positional functions so that the operator signature is explicit in Python

    General Principles

    • NumPy consistency: always be consistent with NumPy
      • All binary operators broadcast (see the sketch after this list)
      • This means we will use add, subtract instead of broadcast_add, broadcast_sub ...
      • The elemwise_add variant will not be supported for now, since we can just use the broadcasting version
    • Be consistent with NNVM when possible
    • Fields in Attrs
      • Use concrete types when possible (int, string, bool)
      • If you need None, you can use IndexExpr, which gives you that
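
    As referenced above, here is a hedged sketch (not from this issue) of the broadcasting principle with the Relay Python API; the shapes are arbitrary.

        # Hedged sketch: a single relay.add broadcasts like NumPy, so no
        # separate broadcast_add / elemwise_add variants are needed.
        from tvm import relay

        x = relay.var("x", shape=(4, 1), dtype="float32")
        y = relay.var("y", shape=(1, 8), dtype="float32")
        z = relay.add(x, y)  # broadcasts to shape (4, 8) during type inference
        f = relay.Function([x, y], z)
        print(f)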

    List of Operators to be covered

    Generally, we need to cover everything we have so far: https://docs.tvm.ai/nnvm_top.html. Please use this issue to coordinate what you will be working on. As we expect things to move quickly, try to do "fine-grained locking": only claim things that you are working on right now, and aim to get them in within a few days.

    The List

    Level 1: Common Basic Ops

    Enough to get MLP

    • [x] nn.dense
    • [x] nn.relu
    • [x] tanh
    • [x] sigmoid
    • [x] exp
    • [x] log
    • [x] sqrt
    • [x] add
    • [x] subtract
    • [x] multiply
    • [x] divide
    • [x] mod
    • [x] nn.batch_flatten
    • [x] concatenate
    • [x] nn.softmax
    • [x] nn.log_softmax
    • [x] nn.batch_norm
    • [x] nn.dropout
    • [x] expand_dims

    Level 2: Convolutions

    Enough to get convnet

    • [x] nn.conv2d
    • [x] nn.conv2d_transpose
    • [x] nn.max_pool2d
    • [x] nn.avg_pool2d
    • [x] nn.global_max_pool2d
    • [x] nn.global_avg_pool2d
    • [x] nn.pad
    • [x] nn.lrn

    Level 3: Additional Math And Transform Operators

    • [x] reshape
    • [x] copy
    • [x] negative
    • [x] floor
    • [x] ceil
    • [x] round
    • [x] trunc
    • [x] clip
    • [x] abs
    • [x] leaky_relu
    • [x] tranpose
    • [x] split
    • [x] squeeze
    • [x] take
    • [x] full
    • [x] zeros
    • [x] ones
    • [x] transpose
    • [x] zeros_like
    • [x] ones_like

    Level 4: All broadcast and reduction functions that are not in previous level

    • [x] pow
    • [x] less
    • [x] greater
    • [x] less_than
    • [x] greater_than
    • [x] right_shift
    • [x] left_shift
    • [x] maximum
    • [x] minimum
    • [x] sum
    • [x] max
    • [x] prod
    • [x] argmax, argmin
    • [x] strided_slice
    • [x] broadcast_to
    • [x] where

    Level 5: Vision Operators

    • [x] image.resize
    • [x] vision.multibox_prior
    • [x] vision.nms

    Level 10: Backend Operators

    Operators necessary as intermediate stages of optimization, or for gradients; these can be in flux.

    status: help wanted 
    opened by tqchen 52
  • [TOPI][AutoTVM] NHWC conv2d templates for ARM

    [TOPI][AutoTVM] NHWC conv2d templates for ARM

    Per https://github.com/dmlc/tvm/pull/3754 and https://github.com/dmlc/tvm/pull/3141#issuecomment-526434398 , we are enabling NHWC conv2d templates for ARM as a nearly final solution. The benefits include:

    • Enable NHWC schedules directly. Previously, we needed to transpose between NCHW and NHWC.
    • AutoTVM can now tune NHWC directly. Previously, we needed to build an NCHW network to tune.
    • Potential performance advantage of NHWC, which is known to the community.

    Joint work with @FrozenGene and @etaf.

    This is a draft to loop in people who may be interested: @anijain2305 @tmoreau89. Will loop in more when the PR is ready, thank you. :)

    opened by zhenhuaw-me 50
  • [VTA][Chisel,de10nano] Chisel fixes and de10nano support

    [VTA][Chisel,de10nano] Chisel fixes and de10nano support

    This PR provides fixes to the VTA Chisel implementation, as well as support and enhancements for the tsim and de10nano targets.

    With fixes in, the deploy classification tutorial now runs correctly on the de10nano for Resnet18 and Resnet34 workloads, matching the results obtained when running with cpu, fsim, and tsim targets.

    A summary of the PR contributions is reported below, more details can be found in the individual commits.

    Bug fixes:

    • Corrupted DRAM stores and loads when crossing page boundaries.

    • Mismatched LoadUop state and output FSM logic.

    Enhancements:

    • Added de10nano host FPGA programming.

    • Enabled de10nano user defined target frequency, tested at 100MHz.

    • Improved FSIM/TSIM/FPGA xref debug.

    opened by pasqoc 48
  • TVM for ROCm 2.x is currently not working

    TVM for ROCm 2.x is currently not working

    Environment: Ubuntu 18.04 + ROCm 2.2 + TVM (built from current master with ROCM = ON)

    I have ensured that the TVM library successfully detects and links with ROCm, and the tuning procedure runs successfully. However, executing tvm.build(s, arg_bufs, 'rocm', name='matmul') fails with the following error:

    WARNING:autotvm:Too many errors happen in the tuning. Now is in debug mode
    Finish loading 500 records
    DEBUG:autotvm:Finish loading 500 records
    Cannot find config for target=rocm, workload=('tvm_matmul_tune_op', 4, 256, 256). A fallback configuration is used, which may bring great performance regression.
    WARNING:autotvm:Cannot find config for target=rocm, workload=('tvm_matmul_tune_op', 4, 256, 256). A fallback configuration is used, which may bring great performance regression.
    
    Best config:
    ,None,None
    [14:47:54] /host/docker/matmul_tvm/tvm/src/pass/vectorize_loop.cc:362: Detect vector condition in Vectorized Loop, scalarizing...
    [14:47:54] /host/docker/matmul_tvm/tvm/src/pass/vectorize_loop.cc:362: Detect vector condition in Vectorized Loop, scalarizing...
    Traceback (most recent call last):
      File "matmul_autotvm.py", line 260, in <module>
        search_matmul_config(4, 256, 256, 500) # m, k, n, num_trials
      File "matmul_autotvm.py", line 165, in search_matmul_config
        func = tvm.build(s, arg_bufs, 'rocm', name='matmul')
      File "/host/docker/matmul_tvm/tvm/python/tvm/build_module.py", line 617, in build
        fhost, mdev = _build_for_device(flist, tar, target_host)
      File "/host/docker/matmul_tvm/tvm/python/tvm/build_module.py", line 484, in _build_for_device
        mdev = codegen.build_module(fdevice, str(target)) if fdevice else None
      File "/host/docker/matmul_tvm/tvm/python/tvm/codegen.py", line 36, in build_module
        return _Build(lowered_func, target)
      File "/host/docker/matmul_tvm/tvm/python/tvm/_ffi/_ctypes/function.py", line 206, in __call__
        raise get_last_ffi_error()
    tvm._ffi.base.TVMError: Traceback (most recent call last):
      [bt] (2) /host/docker/matmul_tvm/tvm/build_rocm/libtvm.so(TVMFuncCall+0x61) [0x7f9598de3f01]
      [bt] (1) /host/docker/matmul_tvm/tvm/build_rocm/libtvm.so(+0x14b2e9) [0x7f95986992e9]
      [bt] (0) /host/docker/matmul_tvm/tvm/build_rocm/libtvm.so(+0x231aaa) [0x7f959877faaa]
      File "/host/docker/matmul_tvm/tvm/src/codegen/codegen.cc", line 46
    TVMError: Check failed: bf != nullptr: Target rocm is not enabled
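
    The "Target rocm is not enabled" check usually indicates that libtvm.so was built without the ROCm codegen. A hedged diagnostic sketch (not from this issue) for a quick sanity check of ROCm support in the local build:

        # Hedged sketch: check that a ROCm device is visible and that the ROCm
        # runtime was compiled into the local TVM build.
        import tvm

        dev = tvm.rocm(0)
        print("ROCm device present:", dev.exist)
        print("ROCm runtime registered:",
              tvm.get_global_func("device_api.rocm", allow_missing=True) is not None)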
    
    opened by ghostplant 48
  • [QNN][TFLite] TFLite rounding mode support

    [QNN][TFLite] TFLite rounding mode support

    Thanks for contributing to TVM! Please refer to guideline https://docs.tvm.ai/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @ them in the pull request thread.

    Add TFLite rounding support with corresponding test cases. The TFLite rounding-mode golden results are generated with a testbench using the MultiplyByQuantizedMultiplier function here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/common.h#L148

    @FrozenGene @anijain2305

    Might help fix the problem described here: https://discuss.tvm.ai/t/supporting-bit-exact-tflite-qnn-inference/5528

    status: inactive 
    opened by fwd4 47
  • [BugFix][Runtime] Fix Incorrect node information

    [BugFix][Runtime] Fix Incorrect node information

    node["attrs"] and node["shape"] may read incorrect node information due to incorrect indexing, especially nodes with multiple outputs in the graph.

    opened by zhaojinxi 1
  • [Bug] Which tvm version currently supports rocm?

    [Bug] Which tvm version currently supports rocm?

    Expected behavior

    Execute tvm code on rocm platform

    Actual behavior

    What actually happened

    Environment

    tvm 0.10 ubuntu 20.04

    Steps to reproduce

    Preferably a minimal script to cause the issue to occur.

    Triage

    The supported ROCm version is not specified in the documentation.

    • needs-triage
    type: bug needs-triage 
    opened by wangzy0327 1
  • [microTVM][Zephyr]Add project files for mlperftiny submission

    [microTVM][Zephyr]Add project files for mlperftiny submission

    This PR makes these changes:

    1. adds source/header files for generating a Zephyr project that is compatible with the EEMBC runner for MLPerfTiny
    2. adjusts microtvm_api_server.py and CMakeLists.template to support the mlperftiny project type
    3. adds EEMBC API files from https://github.com/mlcommons/tiny under thirdparty/tiny.

    This pull request was co-authored by @alanmacd, @mkatanbaf, @guberti and @areusch as part of our effort to submit to MLPerfTiny. You can find our submission results here: https://mlcommons.org/en/inference-tiny-10/

    opened by mehrdadh 1
  • [docs] Remove empty code blocks

    [docs] Remove empty code blocks

    This fixes some of the issues highlighted in #13668. The parser that checks that the hidden import is placed in the right spot was incorrect; this includes some fixes to get it working for the cases in that issue.

    Example: https://pr-docs.tlcpack.ai/PR-13689/1/docs/how_to/deploy_models/deploy_prequantized.html

    opened by driazati 1
Releases (v0.10.0)
  • v0.10.0 (Oct 17, 2022)

    Introduction

    The TVM community has worked since the v0.9 release to deliver the following exciting new improvements!

    • Metaschedule
      • Software pipelining and padding for irregular shapes for auto tensorization
      • Stabilized and polished user interfaces (e.g., database changes, tune_relay)
      • A new MLP-based cost model
    • TIR
      • New schedule primitive for PadEinsum
      • A new TIR node: DeclBuffer
      • INT8 Intrinsics for TensorCores for CUDA!
    • microTVM
      • Improved schedule primitives for ARM v8-m ISA

    And many other general improvements to code quality, TVMScript, and more! Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.9.0...v0.10.0rc0.

    RFCs

    These RFCs have been merged in apache/tvm-rfcs since the last release.

    What's Changed

    Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.9.0...v0.10.0rc0.

    Note that this list is not comprehensive of all PRs and discussions since v0.9. A non-truncated summary can be found here: https://github.com/apache/tvm/issues/12979

    TIR

    • #12720 - [TIR] Implement API for padded layout transformations
    • #12797 - [TIR] Construct the inverse in SuggestIndexMap
    • #12827 - [TIR] Support pattern matching argmax/argmin generated by TOPI
    • #12750 - [TIR, Schedule] Add schedule primitive PadEinsum
    • #11639 - [TIR][Meta-Schedule] Tuple-reduction scheduling support
    • #12515 - [TIR][Arith] Add more strict checking in imm construction and folding.
    • #12717 - [TIR, Schedule] Check consumer in-bound and covered in reverse_compute_inline
    • #12652 - [TIR] Handle axis_separators during FlattenBuffer
    • #12623 - [TIR] Expose MMA-related PTX builtins
    • #12607 - [TIR][Schedule] enhance compute_at and reverse_compute_at primitive to choose possible position ...
    Source code(tar.gz)
    Source code(zip)
    apache-tvm-src-v0.10.0.rc0.tar.gz(19.39 MB)
    apache-tvm-src-v0.10.0.rc0.tar.gz.asc(833 bytes)
    apache-tvm-src-v0.10.0.rc0.tar.gz.sha512(164 bytes)
  • v0.9.0 (Jul 14, 2022)

    Introduction

    The TVM community has worked since the v0.8 release to deliver many exciting features and improvements. v0.9.0 is the first release on the new quarterly release schedule and includes many highlights, such as:

    • MetaSchedule's full implementation
    • ARM cascading scheduler for Arm Ethos(TM)-U NPUs
    • Collage which brings tuning to BYOC
    • Several microTVM improvements
    • New tvm.relay.build parameters: runtime= and executor= (see the sketch after this list)
    • AOT - Support for the C++ runtime (with llvm and c targets only) and support for host-driven AOT in the C runtime
    • Hexagon RPC support
      • Testing via Hexagon SDK simulator and on device via Snapdragon-based HDK boards and phones
      • AOT and USMP support
      • Threading
      • Initial op support
    • MLF - Support for multiple modules in a single MLF artifact
    • Several TIR schedule primitives and transforms including (abridged):
      • schedule.transform_layout - Applies a layout transformation to a buffer as specified by an IndexMap.
      • schedule.transform_block_layout - Applies a schedule transformation to a block as specified by an IndexMap.
      • schedule.set_axis_separators - Sets axis separators in a buffer to lower to multi-dimensional memory (e.g. texture memory).
      • transform.InjectSoftwarePipeline - Transforms annotated loop nest into a pipeline prologue, body and epilogue where producers and consumers are overlapped.
      • transform.CommonSubexprElimTIR - Implements common-subexpression elimination for TIR.
      • transform.InjectPTXAsyncCopy - Rewrites global to shared memory copies in CUDA with async copy when annotated tir::attr::async_scope.
      • transform.LowerCrossThreadReduction - Enables support for reductions across threads on GPUs.
    • And many more! See the list of RFCs and PRs included in v0.9.0 for a complete list, as well as the full change list.
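
    As referenced in the highlights above, here is a hedged sketch (not from the release notes) of the new runtime=/executor= parameters to tvm.relay.build; the tiny module is a placeholder, and the graph/cpp values shown are the explicit defaults, with Executor("aot") and Runtime("crt") as the newly supported alternatives.

        # Hedged sketch: selecting executor and runtime explicitly in relay.build.
        import tvm
        from tvm import relay
        from tvm.relay.backend import Executor, Runtime

        x = relay.var("x", shape=(1, 4), dtype="float32")
        mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))

        lib = relay.build(
            mod,
            target="llvm",
            executor=Executor("graph"),  # Executor("aot") selects the AOT executor
            runtime=Runtime("cpp"),      # Runtime("crt") selects the C runtime
        )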

    RFCs

    These RFCs have been merged in apache/tvm-rfcs since the last release.

    What's Changed

    Note that this list is not comprehensive of all PRs and discussions since v0.8. Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.8.0...v0.9.0.rc0.

    AOT

    • #11208 - Calculate used memory at the callsite of primitive functions
    • #11365 - Fix function number datatype from char to uint16_t
    • #11091 - Enable A-Normal Form in the AOT executor
    • #10753 - Support LLVM backend with C++ runtime
    • #10518 - Use python temporary directory for AOT tests
    • #10337 - BugFix of workspace calculation
    • #10282 - [runtime] Add Metadata classes for AOTExecutor
    • #9501 - [3/3][DeviceAPI] Wire up cpacked Device API context
    • #9500 - [2/3][DeviceAPI] Add Hooks for Activate/Deactivate/Open/Close
    • #9395 - [1/3][DeviceAPI] Connecting devices structure to relevant operators

    BYOC

    • #11474 - Two helper passes for external codegen using RelayToTIR custom pass machinery
    • #11144 - Remove support for run-time linked-params from codegen
    • #10590 - Add order to functions in C Codegen
    • #11638 - [DNNL][CBLAS]Unifles all MKLDNN/DNNL to DNNL
    • #11619 - RelayToTIR custom codegen passes can still depend on dynamic shape functions
    • DNNL - #11902, #11642, #11513, #11571, #11560, #11345, #11111, #10837, #10421, #9995, #9797
    • TensorRT - #11923, #11203, #10759, #10772, #10388
    • CMSIS-NN - #11732, #11625, #10939, #11013, #10817, #10563, #10224, #10148, #10100, #9338, #9531, #9409, #9331
    • OpenCLML - #10243
    • CUTLASS - #11631, #10185, #10177, #10110, #10036, #9899, #9820, #9800, #9795, #9746, #9737, #9698, #9595, #9571
    • CUDNN - #10997, #9986, #9948
    • ACL - #10801
    • PTX - #10855, #10339, #9909
    • CUBLAS - #10826, #10820

    CI

    • #11313 - Refactor of tvm.testing.requires_* annotations
    • #11666 - Enable pylint for tests/python/ci
    • #11657 - Apply linting rules to AOT tests
    • #11380 - Restructure Jenkinsfile
    • Automation - #11813, #11775, #11480, #11437, #10833, #10056, #9973, #9934
    • User experience improvements - #11470, #11329, #11553, #11497, #11051, #10933, #10960, #10525, #10425, #10322, #10121, #9971, #9554, #9752, #9556
    • Reduce CI runtime - #11402, #11349, #11258, #11132, #10946, #10743, #10359
    • Code cleanups - #10968, #10740

    Frontends

    • PaddlePaddle - #11537, #9724, #9564
    • TFLite - #10915, #10566
    • Oneflow - #11321, #11036, #8790
    • PyTorch - #11190, #10504, #10184, #10091
    • ONNX - #10949, #9438, #9186, #9493, #9475
    • Keras - #7006

    Hexagon

    • #11549 - Initial clip operator for Hexagon
    • #11834 - Add op resize2d for hexagon
    • #11559 - Softmax slice op initial version
    • #11529 - Slice ops added - add, subtract, multiply
    • #11720 - [testing] add max_pool2d benchmark
    • #11417 - Implement avg_pool2d slice op
    • #11653 - Add HexagonThreadManager
    • #11547 - Run single RPC server on Android in each testing session
    • #11490 - [testing] add TVMScript elemwise-add
    • #11400 - [testing] refactor benchmark-table code
    • #11277 - moves conftest.py to tvm.contrib.hexagon so outside repos can access the testing fixtures
    • #11319 - Add unit tests for Hexagon Device API
    • #11279 - Add USMP tests
    • #11283 - Update Readme
    • #11239 - capture gtest output and return over FFI
    • #11175 - Add schedule and test for conv2d_transpose_nchw
    • #11018 - [Runtime] Add QuRT thread pool backend
    • #11145 - Add support for on-device unit testing using gtest
    • #11138 - Add test for depthwise conv2d schedule
    • #11016 - Add test for registered schedules
    • #11104 - Add mobilenet test
    • #11090 - Delete offload runtime, move files to right places
    • #11065 - AoT with LLVM Codegen on Hexagon
    • #11025 - Deprecate USE_HEXAGON_DEVICE, introduce USE_HEXAGON
    • #10604 - HVX scheduling and bench-marking of TE element-wise add
    • #10905 - [LLVM] Enable/test tensorized Hexagon DMA on 2d transformed layout
    • #10907 - Move aot/graph_executor interactions into launcher
    • #10919 - Register basic strategies and schedules for common operators
    • #10904 - Add unit tests executing 2-d VTCM usage
    • #10910 - Refactor to keep HexagonBuffer private to the device api
    • #10908 - [LLVM][CodeGen] Make CodeGenHexagon a subclass of CodeGenCPU
    • #10878 - Generalized HexagonBuffer::CopyTo/CopyFrom
    • #10846 - Support both 1-d and 2-d VTCM allocations
    • #10581 - Improved ergonomics of HexagonLauncher in unit tests.
    • #10616 - Refactor tvm.contrib.hexagon, NFC
    • #10612 - Deprecate SDK 3.x, rewrite HexagonSDK.cmake
    • #10586 - Codegen for 2d Load/Store
    • #10558 - Generalize builtin for Nd memory alloc with storage scope and add lowering for VTCM / Hexagon
    • #10543 - [Runtime][PipelineExecutor] Add the pipeline internal forwarding logic.
    • #10507 - Add doc on TVM - Hexagon RPC flow
    • #10520 - Resolve breakage in test_hexagon/test_cache_read_write
    • #10311 - [runtime]AOTExecutor implementation for C Codegen
    • #10454 - Allow execution on target or simulator from HexagonLauncher
    • #10365 - Lower cache_read and cache_write to Hexagon DMA via tensorize
    • #10361 - RPC server/client for simulator
    • #10302 - [CI]Add Hexagon Tests to pipeline
    • #10263 - [Docker]Add docker file and scripts
    • #10227 - Refactor Hexagon.cmake
    • #10217 - Adding support for Hexagon User DMA Engine
    • #10068 - Update hexagon API build instruction and cleanup hexagon_proxy_rpc
    • #9970 - Do not auto-build apps when building TVM
    • #9736 - Add unit tests for HexagonBuffer
    • #9525 - Add Hexagon VTCM and discontiguous allocation support
    • #9631 - Add RPC Mechanism for Hexagon
    • #9473 - cleanup Hexagon conv2d tests

    MetaSchedule

    • #11884 - Postproc: Rewrite-Layout
    • #11848 - [OpStrategy] Support MetaSchedule Layout
    • #11845 - [Relay][Pass] Meta-Schedule-Layout-Rewrite
    • #11758 - [Runtime] Enhance Runner RandomFill
    • #11683 - Distributed Measurement
    • #11751 - [Minor] Organize Testing Scripts
    • #11735 - Modify Profiler Timers
    • #11727 - Developer Ergonomics Enhancement II
    • #11692 - Apply-History-Best Task Filtering
    • #11486 - Add Profiler Support For Tuning Efficiency Optimization
    • #11680 - JSONDatabase Utilities
    • #11641 - Generate MetaSchedule Dataset
    • #11622 - Developer Ergonomics Enhancement
    • #11604 - Resolve dependencies between header files
    • #11587 - Add Testing Script with ONNX Support
    • #11590 - Evo Independence from TaskScheduler
    • #11534 - No explicit unrolling for spatial PrimFunc
    • #11512 - Enable Task Filtering
    • #11177 - AutoBind rule and MutateThreadBinding
    • #11157 - Logging Interface Unification
    • #11088 - Auto tensorization for CPU / GPU dot product
    • #10986 - [Refactor] Introduce TuneConfig
    • #11020 - [Metaschedule, Refactor] Move MultiLevelTilingNode decl to a header
    • #10927 - [Refactor] Clarify Integration Logic
    • #10876 - Add utility API to ease using manual schedules
    • #10885 - [BugFix] Fix skipped tests
    • #10366 - Add Gradient Based Task Scheduler
    • #10823 - Fine-Grained Rewrite Unbound Block
    • #10793 - Add demonstration of selectively tuning relay ops with TIR schedules
    • #10811 - Support grouping in the cost model
    • #10810 - Extract task weights during task extraction
    • #10782 - [TIR]Estimate TIR FLOPs
    • #10776 - Misc updates for tuning end-to-end workloads
    • #10689 - Upstream the leftover changes
    • #10648 - [Meta Schedule] Refactor meta schedule testing utils
    • #10578 - New relay backend for meta schedule task extraction
    • #10534 - Bug Fix for Relay Integration
    • #10501 - Update scripts for subgraph tuning
    • #10497 - Refactor testing workloads
    • #10461 - Enable AutoTVM-style template-based search space
    • #10368 - Fix Cyclic Dependency in PyClass Family
    • #10403 - Arithmetic analysis
    • #10367 - Update Tuning Interfaces.
    • #10079 - [M4a] User-API: Tune-TE/TIR/Relay
    • #10081 - [M4a] Rewrite-Cooperative-Fetch
    • #10055 - [M4b] Testcases for TensorRT builder/runner
    • #10092 - [M4a] Mutator: Mutate-Tile-Size
    • #10096 - [M4a] Mutator: Mutate Parallel
    • #10071 - [M4a] PostProcessor: Rewrite-Parallel-Vectorize-Unroll
    • #10043 - [M4a] Schedule Rule: Multi-Level-Tiling
    • #10045 - Mutator: Mutate-Unroll
    • #10033 - [M4a] Schedule Rule: Parallelize-Vectorize-Unroll
    • #10027 - [M4a] PostProcessor: Rewrite-Unbound-Block
    • #10028 - Mutator: Mutate-Compute-Location
    • #9997 - [M4a] PostProcessor: Disallow-Dynamic-Loop
    • #9994 - [M4a] Schedule Rule: Cross-Thread-Reduction
    • #10013 - [M4a] PostProcessor: Rewrite Reduction Block
    • #9975 - [M4a] Schedule Rule: Add-RFactor
    • #9945 - [M4a] PostProcessor: Verify-GPU-Code
    • #9940 - [M4a] Schedule Rule: Random-Compute-Location
    • #9943 - [M4a] Schedule Rule: Auto-Inline
    • #9860 - [M3c] Add Per-Store-Feature
    • #9859 - [M3c] XGB-based Cost Model
    • #9836 - [M4a] Add EvolutionarySearch Search Strategy
    • #9799 - [M4a] Add ReplayFunc Search Strategy
    • #9789 - [M3c] Update TuneContext, TaskScheduler & Search Strategy Design
    • #9780 - [M3c] Add More Measure Callbacks
    • #9761 - [M4a] Add ScheduleRule class & PostOrderApply space generator
    • #9760 - [M3c] Random Feature Extractor

    MicroTVM

    • #11741 - Refactor RVM scripts and fix DNS network issue
    • #11472 - [ARM]Add tests for arm schedules
    • #11634 - Update pyproject to python3.7
    • Zephyr support - #11650
    • RPC - #11227, #10967

    Relay

    • #11825 - [realy][pass]add split infer shape with convert op layout pass
    • #11674 - Finish implementations of WithFields
    • #11481 - IndexedGraph improvements in preparation for Collage
    • #11432 - Plumb external codegen target via Target.current()
    • #11494 - [Pass] Add MaxPool, AvgPool to FoldExplicitPadding
    • #11183 - Add unidirectional sequence lstm
    • #11442 - Add 'static_library' runtime::Module
    • #11413 - [Topi]Support for FP16 ERF on CPU.
    • #11382 - Finish support for list-of-targets
    • #11386 - [Tests] Replace the Relay interpreter with the VM in the op tests
    • #11224 - Support i16, f16 scalars in Relay text
    • #11337 - Fix eltwise alter op layout for broadcast axis
    • #11199 - Flexible shape dispatch transformation
    • #11173 - Support 'external codegen targets'.
    • #10996 - Add FlattenAtrousConv transformation
    • #10871 - [CUDNN] Add cuDNN as a Relay partitioning target (BYOC)
    • #10787 - [Pass][Bugfix] Disable re-use of non-flat buffers in StorageRewrite.
    • #10378 - [FQ2I] Add leaky relu to FQ21
    • #10400 - RelayViz graphviz renderer
    • #10352 - [VIRTUALDEVICE] Change syntax for device planning and store parameter virtual devices in virtual_device_ field
    • #10310 - [ARM_CPU] Conv2d int8 intrinsic for cortex-A72
    • #10085 - RelayViz interface and terminal ast-dump
    • #10239 - Add a conversion of individual operations in FQ2I pass.
    • #10236 - [Refactor] Clean up type relations that are declared as template for no reason
    • #10156 - Fix broadcast InferCorrectLayout
    • #10026 - [VM] Relay VM memory liveness/lifetime analysis
    • #10089 - [Pass] Add a relay pass to extract fake quantized ops
    • #9690 - Change function constructors to WithFields
    • #10069 - [DefuseOps pass] bug fix: To support function body types other…
    • #9954 - Add conv2d_backward_weight op (without topi)
    • #9838 - [FoldScaleAxis] Support dense and bias_add op in fold scale axis
    • #9816 - Add sliding_window operator
    • #9874 - Add a JSON converter for 0.7 -> 0.8 and 0.8 -> 0.9
    • #9735 - [AMP][Pass][Typing] Add faster type inference
    • #9723 - [Frontend] Add Span filling for frontends to Relay
    • #9749 - Fix invalid shape function for "copy" operator
    • #9759 - s/SEScope/VirtualDevice/g
    • #9734 - Support large constants saved/loaded outside of VM executable
    • #9613 - Re-run PlanDevices after LowerTE to flow new memory scope constraints.
    • #9693 - PlanDevices supports 'free' on_device annotations
    • #9641 - [AST] Add virtual_device as a first class field in Relay
    • #9483 - Switch the VM to use the LowerTE pass instead of TECompiler::{Lower,LowerShapeFunc}.
    • #9569 - WithFields method for Call, Function, Var, TupleGetItem, If, Let, RefCreate, RefRead, RefWrite, Match, and Clause
    • #9533 - WithFields for Tuples
    • #9550 - Prepare for switching VM to LowerTEPass.
    • #9542 - Prepare DeadCodeElimination for running post LowerTEPass/ManifestAlloc.
    • #9352 - [TVMC]Introduce executor and runtime parameters
    • #9457 - Add the Arm(R) Ethos(TM)-U NPU identity operator
    • #9326 - Switch PlanDevices pass to be w.r.t. SEScopes instead of DLDeviceTypes.
    • QNN - #11228, #10718, #10086, #10053, #9637, #9982

    Runtime

    • #11334 - [PipelineExecutor] Add graph manually splitting logic into the unit test.
    • #11133 - [PipelineExecutor] Refactor PipelineExecutor.py and Add cross compile support for pipeline executor.
    • #11172 - Move WrapTimeEvaluator from RPC to profiling, NFC
    • #10990 - [PipelineExecutor]Add forwarding queue logic for set input.
    • #10953 - [Vulkan] Add RGP support to TVM for vulkan device
    • #10723 - [PipelineExecutor] Getting the asynchronous output
    • #10283 - AOTExecutor implementation and c target code-generator
    • #9802 - [ThreadPool]Refactor affinity function and support CPU affinity list setting.
    • #10234 - [Pipeline Executor] multiple threads management and the data forwarding notification mechanism.
    • #10326 - Improved log information with function signature
    • #10032 - [PackedFunc] Bring PackedFunc into TVM Object System
    • #10082 - [PipelineExecutor] Pipeline Executor Sequential execution
    • #10010 - [PipelineExecutor] Add Pipeline Executor Interface
    • #9846 - [Pipeline executor] Global parameters group name and runtime modules parameters map.
    • #9889 - [GraphExecutor] Add API get_input_info to graph_executor
    • #9751 - [Pipeline Executor] Add the map logic of global input and subgraph input.

    TE

    • #11589 - Support schedulable TIR compute definitions in TOPI
    • #11341 - Optimized version of concatenation layer
    • #10561 - [TECompiler] Decouple TE compute and schedule lowering in ScheduleBuilder

    TIR

    • #11592 - HoistExpression, generalization of HoistIfThenElse
    • #11870 - [Pass] Remove-Weight-Layout-Rewrite-Block
    • #11740 - [TIR, analysis] Add GetAutoTensorizeMappingInfo to generate transforms for auto tensorization
    • #11585 - Add preserve-unit-iters
    • #11677 - Register CUDA WMMA tensor intrinsics
    • #11658 - [TIR, CUDA] Add pass to replace global to shared memory copy with cp.async
    • #11624 - [Schedule] Allow named block and buffer arguments in Schedule
    • #11628 - [PASS] Refactor a couple of TIR passes - BindTarget, AnnotateEntryFunc, Filter, LowerInitBlock
    • #11574 - CSE pass : Restrict the equivalence to be decided by a normal form - avoids comparison of terms
    • #11575 - Schedule Primitive: Add-Unit-Loop
    • #11515 - Add schedule primitive ReIndex
    • #11524 - [Arith] Additional Simplifications Inside Conditionals
    • #11485 - Add schedule primitive TransformBlockLayout
    • #11495 - [Software pipeline] Fix hardcoded index in access_ptr rewriting, add a GPU test with depth 4
    • #11269 - [Schedule] Transform layout quality of life
    • #11355 - Support tensorization using ldmatrix + MMA
    • #11289 - [Schedule] Allowed typing.Tuple in tir.schedule._type_checker
    • #11317 - Support affine expressions as indices in reverse compute inline
    • #11235 - [Arith] Implemented padded inverses in IndexMap
    • #11238 - [ROOFLINE] Calculate roofline from existing TIR PrimFunc
    • #11225 - Add schedule primitive SetAxisSeparator
    • #11110 - Get read/write access precisely for opaque access.
    • #11106 - Enhance software pipeline validation and fix predicate of epilogue
    • #10843 - StmtFunctor RenewDefs
    • #11075 - Add function to tile a block according to a given tensor intrinsic
    • #11050 - Utility function to decide loop mapping for auto tensorization
    • #11009 - [ROCM] DP4A intrinsic support for TE/TIR
    • #10925 - VNNI and ARM dot product intrinsic for tensorization
    • #10887 - [Schedule] Relax reorder primitive's affine binding check
    • #10732 - [Analysis] Add SuggestIndexMap for layout rewriting
    • #10538 - [Schedule] Transform layout
    • #10638 - Change the behavior of read/write region analysis for reduction blocks.
    • #10705 - Use local complete block and local reduction block to identify compact dataflow
    • #10671 - Tuple Reduction Support in CreatePrimFunc
    • #9727 - [TE]Implement layout transformations, non-flat memory buffers
    • #10405 - [TensorIR] Update VerifyGPU
    • #10401 - [TensorIR] Renormalize split pattern
    • #10112 - [TIR, Relay] improve bfloat16 support
    • #8509 - Tir constants integration into compilation pipeline
    • #9996 - add support for multi-blocking layout and their transformation
    • #10066 - Add software pipelining
    • #10207 - Support sub warp reduction for CUDA target.
    • #9482 - Implementation of Common Subexpression Elimination for TIR
    • #9527 - Allow compute_at create block predicate for non-trivial bounds and support floordiv pattern
    • #10158 - [Schedule] Update compact_dataflow constraint
    • #9871 - [Schedule] Blockize and Tensorize
    • #10016 - [BugFix]Fix cross-thread reduction when single reduction loop with predicate
    • #9880 - Encode conditional accesses info into block read/write regions
    • #9699 - Affine utility support iter lowerbound and diagnostics
    • #9742 - [Schedule] Add Annotate/Unannotate primitive
    • #9738 - [TensorIR] Primitive "SetScope"
    • #9743 - [Schedule] Analysis functions to check if compute_inline and com…
    • #9689 - Allow memory (aka storage) scopes to be retrieved/applied to PrimFuncs
    • #9559 - [TensorIR][UX] Type annotation-based runtime type checking
    • #9444 - Add a 'rolling_buffer' scheduling primitive
    • #9360 - [TensorIR] Cross-Thread Reduction

    TOPI

    • #11531 - TE implementation of LSTM using scan
    • #11161 - Add Adreno GPU target and topi supporting textures with dynamically allocated textures
    • #10332 - VNNI support for batch matmul
    • #9873 - Add support for groupped conv3d
    • #10230 - VNNI support for int8 dense
    • #10098 - [Op]5 ops can accept unsigned integers as indices
    • #9832 - Support grouped conv1d
    • #9694 - Add generic batch norm
    • #9233 - Cortex-M DSP support

    TVMScript

    • #11308 - Represent ramp as index slice
    • #10099 - Support T.buffer_decl using data pointer from Let/Allocate
    • #9680 - Improve printer for TIR syntax sugar
    • #9492 - Add syntax sugar for T.handle and T.match_buffer
    • #9620 - Add for loop syntax sugar
    • #9543 - Misc error message improvements
    • #9505 - [Fix] Add type hints for more uncovered cases

    USMP

    • #11015 - U3 use case
    • #10189 - Adding support for U1 usecase for constant pools
    • #10785 - Adding support for U4 usecase
    • #10193 - adding support for U2 and U3 usecases
    • #10005 - Add performance characteristics to PoolInfo
    • #9565 - [TIR]Integrating USMP to AoT Executor
    • #9704 - Hill Climb allocator
    • #9418 - [TIR]adding the pass to convert to pool offsets
    • #9649 - [TIR]Augmenting the algo interface with memory pressure
    • #9214 - [TIR]Greedy memory planning algorithm
    • #8468 - [TIR]Added buffer info extraction pass

    microNPU

    • #11468 - Optimize separate padding operation for conv2d
    • #11453 - Add transform matrices and part matcher to identity op
    • #11410 - add E2E tests with cascader wo striping
    • #11288 - Expose compute cycle annotations to TIR lowering
    • #10959 - Add a pass to reorder copy and compute nodes
    • #10509 - Add various options to the cascader
    • #11263 - Adding a option to enable striping
    • #10251 - Add support for conv2d running on two cores on U65
    • #10862 - Integrate the cascader
    • #10344 - Integrate rolling buffers in Arm(R) Ethos(TM)-U
    • #10824 - Some housekeeping in the test_ethosu folder
    • #10763 - Tweak a layout transform matrix
    • #10725 - Add a pass to move allocate nodes to the outer scope
    • #10695 - Determine block configs using the cascader
    • #10599 - Refactor Relay to TIR hook
    • #10508 - Improve cascader memory transfer estimates
    • #10345 - Add support for TFLite FULLY_CONNECTED
    • #10254 - Introduce a pass to remove redundant identity operations
    • #10062 - [5] Convert Proposals to te.Schedules
    • #9959 - [4] Add the cascader Proposal generator
    • #10022 - enable USMP
    • #10127 - Add support for LeakyReLU
    • #10004 - Add FreeRTOS variant of NPU demo
    • #10060 - Refactor type inference data type checks
    • #9960 - Add support for pack and unpack
    • #10143 - Fix layout assignment in layout optimizer pass
    • #9890 - [3] Plan generation for the cascader
    • #9855 - Add support for transpose convolution
    • #9841 - Add support for nearest neighbor and bilinear upsampling
    • #9951 - Removing constant args from PrimFunc
    • #9929 - Refactor base address determination to codegen
    • #9910 - Add support for requantize
    • #9831 - Move optimization passes to be a module pass and ensure they are running
    • #9785 - [2d] Add more Part matchers to cascader
    • #9778 - [2c] Add performance modelling to cascader
    • #9471 - [2b] Create CascaderGraphs from TE graphs
    • #9469 - [2a] Add CascaderGraph for cascading analysis
    • #9621 - Add support for SPLIT and SPLIT_V
    • #9508 - Update Conv2D Tests to Use TF API to Gen Test Cases
    • #9627 - Add support for SIGMOID
    • #9589 - Add support for TFLite concatenate
    • #9623 - Refactor codegen tests
    • #9561 - Add NHWC -> NHCWB16 layout transformation pass
    • #9576 - Mean legalization support
    • #9597 - Move the compilation to use Target Hooks.
    • #9458 - [1] Add affine analysis structures for the cascader
    • #9547 - Add the infrastructure for lookup table and TANH
    • #9521 - Support binary elementwise with non-4D inputs
    • #9560 - Fix incorrectly calculated stride when converting NHWC to NHCWB16
    • #9530 - Add unary elementwise operator infrastructure with ABS
    • #9514 - Adding rounding mode attribute to operators
    • #9515 - Allow constants to be given as input to an operator
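    To show where the pieces above sit in a user flow, the hedged sketch below partitions a quantized Relay module for the Arm(R) Ethos(TM)-U NPU; it assumes `mod` and `params` were produced by the TFLite frontend and that the `partition_for_ethosu` helper is available under tvm.relay.op.contrib.ethosu:

      # Hedged sketch: marking Ethos(TM)-U-supported subgraphs for offload.
      # `mod` and `params` are assumed to come from the TFLite frontend
      # (relay.frontend.from_tflite) and to be int8-quantized.
      import tvm
      from tvm.relay.op.contrib.ethosu import partition_for_ethosu

      mod = partition_for_ethosu(mod, params)   # unsupported ops stay on the CPU
      print(mod)                                # partitioned module with NPU regions

      # Compilation then proceeds through relay.build with the AoT executor and
      # C runtime (as in the USMP sketch above); the cascader, copy/compute
      # reordering and layout passes listed above run inside that pipeline.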

    microTVM

    • #11250 - [ARM] Add Relay tests for conv2d registered schedules
    • #11232 - [rpc] Implemented rpc logging
    • #11044 - Add support for host-driven AoT Executor
    • #11043 - Better version handling for Arduino
    • #10555 - Enable micro tvmc tutorial testing in CI
    • #10194 - [RVM] Add scripts for automated build and testing
    • #10144 - TVMCon 2021 Zephyr Demo with CMSIS-NN
    • #10024 - [tvmc] Add TVMC Micro tutorial for Zephyr
    • #9684 - Fix zephyr/test_zephyr_armv7m test
    • #9584 - [TVMC] Add TVMC test for Arduino and Zephyr
    • #9526 - Add minimal forwarding RPC server for host driven python execution on Hexagon
    • Zephyr support - #11362, #10138

    Misc

    • #11465 - Add cooldown interval logic for the profiling functionality
    • #11888 - [LLVM] Include LLVM headers in files that use them, not in llvm_common.h
    • #11646 - [Arith] Simplification of ceil, log2, and left_shift
    • #11464 - [MLF] Add support for multiple modules in Model Library Format
    • #11632 - [AutoTVM][Autoscheduler] Default build funcs inherit PassContext
    • #11543 - [OpenCL] Implement conv2d_winograd algorithm for Adreno
    • #11287 - [Arith] Merge surjective/non-surjective iter mapping detections
    • #11393 - Add utility to replace direct call to pytest.main
    • #11252 - [ROOFLINE] Roofline analysis over RPC
    • #11000 - [Graph Debugger] Expose way to benchmark individual nodes.
    • #10794 - bump PyTorch version to 1.11
    • #10821 - [REFACTOR] Remove legacy nnvm folder
    • #10798 - [Arith] Remove diagnostic ctx argument from DetectIterMap
    • #10567 - [Refactor] Reduced repetition in CodeGenLLVM's buffer access
    • #10455 - [AUTO_SCHEDULER] Add feature extraction directly from PrimFunc
    • #7401 - RFC: initial stab at TorchScript fallback
    • #10391 - [vulkan] Add integer dot product (4xint8, 4xuint8) tensorization for the vulkan SPIR-V target.
    • #10293 - [VirtualMachine] New method for setting one input tensor by its index or name
    • #10191 - Generate correct output tensor names in C Interface API
    • #9276 - Parameterize test_link_params
    • #9808 - [Rust] Update Rust bindings
    • #9553 - [PROFILING] Add ability to profile a single function
    • #9611 - [CMAKE] Automatically detect newly added source files
    • #9544 - [Target] Enable -arch=sm_xx for assigning the CUDA target arch and deprecate the autotvm.measure.set_cuda_target_arch API (see the sketch after this list)
    • Profiler - #11530, #11066
    • Docs - #10921, #11403, #10774, #10912, #9633, #9906, #9534, #9307, #9654, #9580
    • Android - #11241
    • ETHOSN - #11261, #10486, #10018, #9596
    • TVMC - #11012, #10962, #10722, #9817, #9529, #9229
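    As a small illustration of #9544 above, the CUDA compute architecture is now given as a target attribute instead of through the deprecated autotvm helper (a hedged sketch; sm_80 is only an example value):

      # Hedged sketch: setting the CUDA arch on the target itself
      # (replaces the deprecated autotvm.measure.set_cuda_target_arch API).
      import tvm

      target = tvm.target.Target("cuda -arch=sm_80")
      print(target.attrs["arch"])   # expected to show "sm_80"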
    Release artifacts: apache-tvm-src-v0.9.0.tar.gz (18.82 MB), apache-tvm-src-v0.9.0.tar.gz.asc (659 bytes), apache-tvm-src-v0.9.0.tar.gz.sha512 (163 bytes)
  • v0.8.0 (Nov 24, 2021)

    Overview

    Apache TVM v0.8 brings several major exciting experimental features, including:

    • PaddlePaddle frontend
    • TVMScript: round-trippable python-based syntax for TIR
    • TorchScript integration
    • TensorIR scheduling language
    • TensorRT and CUTLASS integration via BYOC
    • Int4 TensorCore support in AutoTVM
    • MicroTVM Project API and Zephyr, Arduino support
    • AOT executor
    • Robust Windows support
    • Affine analysis infra: iter-affine-map
    • Improved Vulkan backend
    • CUDA graph support in TVM runtime

    In addition, the community has been working together to refactor and evolve the existing infrastructure, including but not limited to:

    • Relay compilation engine
    • Relay pattern language
    • CI and build process
    • Refactoring documentation and tutorials
    • Stabilizing AutoScheduler
    • Stabilizing the TVMC command line driver interface
    • Stabilizing the target system
    • Frontend coverage, quantization, dynamic shape, training

    Full changelog: https://gist.github.com/junrushao1994/c669905dbc41edc2e691316df49d8562.

    Accepted RFCs

    The community has adopted a formal RFC process. Below is a list of the formal RFCs accepted since the process was introduced:

    • [RFC-0005] Meta schedule (AutoTIR)
    • [RFC-0006] Automatic mixed-precision pass and support
    • [RFC-0007] Parametrized unit tests
    • [RFC-0008] MicroTVM Project API
    • [RFC-0009] Unified static memory planner
    • [RFC-0010] Target-registered compiler flow customisation
    • [RFC-0011] Arm® Ethos-U integration
    • [RFC-0014] Pipeline executor
    • [RFC-0015] Use CMSIS-NN with TVM
    • [RFC-0019] Add PaddlePaddle frontend
    • [RFC-0020] Extend metadata in project option
    • [RFC-0022] TIR non-scalar constants
    • [RFC-0023] Adding annotation field to tir.allocate nodes
    • [RFC-0025] PyTorchTVM
    • [RFC-0027] Formalize TVM documentation organization
    • [RFC-0028] Command line composition from internal registry
    • [RFC-0029] Migrating target attributes to IRModule
    • [RFC-0030] Command line configuration files
    • [RFC-0031] C Device API
    • [RFC-0036] TVMScript namespace
    • [RFC-0041] Update TVMScript block syntax

    Features and Improvements

    TE, TIR, TVMScript

    AutoTVM, AutoScheduler, Meta Schedule

    Operator Coverage

    Training

    Relay

    MicroTVM, AOT, Graph Executor and VM

    Arithmetic Analysis

    • Tighter bounds and more simplification on cast #6771 #7045
    • Introducing iterator (quasi-) affine map detection #6667 #7752 #7759
    • Inverse of iterator affine map #8384 #8427
    • Subspace division in iterator affine map #7760
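    As a small, hedged illustration of the arithmetic machinery these items build on, the public Analyzer API exposes the same kind of simplification and constant-bound reasoning:

      # Hedged sketch: integer simplification and bound analysis with tvm.arith.
      import tvm
      from tvm import tir

      i = tir.Var("i", "int32")
      analyzer = tvm.arith.Analyzer()
      analyzer.update(i, tvm.arith.ConstIntBound(0, 15))   # constrain 0 <= i <= 15

      print(analyzer.simplify((i * 4 + 2) // 4))   # folds to i
      print(analyzer.const_int_bound(i * 4 + 2))   # bound [2, 62]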

    Frontends

    Codegen Backends and Runtime

    BYOC Integration with Vendor Libraries: TensorRT, ACL, VitisAI

    TVMC

    Rust Binding

    Misc

    • Enhanced CPP-RPC implementation: allow a user-supplied work dir, support the CPP-RPC server on Apple, and support adb-shell-style CPP-RPC #7670 #8224 #8223 #7766 #7013
    • Use PopenWorker to handle RPC system: #7889 #7757 #7961
    • Fold target host into target #7462 #7791 #7534 #8835 (see the sketch after this list)
    • Target-based intrinsic lowering and legalization #7936 #7809
    • Add target tags for all existing CUDA GPU models #7410
    • Linear Congruential Random Engine #8642
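    The target refactor above can be illustrated with a short, hedged sketch: the host is folded into the Target object itself, and common GPUs can be selected by tag (the tag name below is an example; tag availability may vary by release):

      # Hedged sketch: folded target host and target tags.
      import tvm

      # Device target with its host attached, replacing a separate target_host argument.
      gpu_target = tvm.target.Target("cuda", host="llvm")
      print(gpu_target.host)

      # A registered tag expanding to a fully configured CUDA target (example tag).
      tagged = tvm.target.Target("nvidia/geforce-rtx-3070")
      print(tagged.attrs)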
    Release artifacts: apache-tvm-src-v0.8.0.tar.gz (17.33 MB), apache-tvm-src-v0.8.0.tar.gz.asc (833 bytes), apache-tvm-src-v0.8.0.tar.gz.sha512 (159 bytes)
Owner: The Apache Software Foundation