Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.


The Rust CUDA Project

An ecosystem of libraries and tools for writing and executing extremely fast GPU code fully in Rust

Guide | Getting Started | Features

⚠️ The project is still in early development, expect bugs, safety issues, and things that don't work ⚠️


The Rust CUDA Project is a project aimed at making Rust a tier-1 language for extremely fast GPU computing using the CUDA Toolkit. It provides tools for compiling Rust to extremely fast PTX code as well as libraries for using existing CUDA libraries with it.


Historically, general purpose high performance GPU computing has been done using the CUDA toolkit. The CUDA toolkit primarily provides a way to use Fortran/C/C++ code for GPU computing in tandem with CPU code with a single source. It also provides many libraries, tools, forums, and documentation to supplement the single-source CPU/GPU code.

CUDA is exclusively an NVIDIA-only toolkit. Many tools have been proposed for cross-platform GPU computing such as OpenCL, Vulkan Computing, and HIP. However, CUDA remains the most used toolkit for such tasks by far. This is why it is imperative to make Rust a viable option for use with the CUDA toolkit.

However, CUDA with Rust has been a historically very rocky road. The only viable option until now has been to use the LLVM PTX backend, however, the LLVM PTX backend does not always work and would generate invalid PTX for many common Rust operations, and in recent years it has been shown time and time again that a specialized solution is needed for Rust on the GPU with the advent of projects such as rust-gpu (for Rust -> SPIR-V).

Our hope is that with this project we can push the Rust GPU computing industry forward and make Rust an excellent language for such tasks. Rust offers plenty of benefits such as __restrict__ performance benefits for every kernel, An excellent module/crate system, delimiting of unsafe areas of CPU/GPU code with unsafe, high level wrappers to low level CUDA libraries, etc.


The scope of the Rust CUDA Project is quite broad, it spans the entirety of the CUDA ecosystem, with libraries and tools to make it usable using Rust. Therefore, the project contains many crates for all corners of the CUDA ecosystem.

The current line-up of libraries is the following:

  • rustc_codegen_nvvm Which is a rustc backend that targets NVVM IR (a subset of LLVM IR) for the libnvvm library.
    • Generates highly optimized PTX code which can be loaded by the CUDA Driver API to execute on the GPU.
    • For the near future it will be CUDA-only, but it may be used to target amdgpu in the future.
  • cuda_std for GPU-side functions and utilities, such as thread index queries, memory allocation, warp intrinsics, etc.
    • Not a low level library, provides many utility functions to make it easier to write cleaner and more reliable GPU kernels.
    • Closely tied to rustc_codegen_nvvm which exposes GPU features through it internally.
  • cust for CPU-side CUDA features such as launching GPU kernels, GPU memory allocation, device queries, etc.
    • High level with features such as RAII and Rust Results that make it easier and cleaner to manage the interface to the GPU.
    • A high level wrapper for the CUDA Driver API, the lower level version of the more common CUDA Runtime API used from C++.
    • Provides much more fine grained control over things like kernel concurrency and module loading than the C++ Runtime API.
  • gpu_rand for GPU-friendly random number generation, currently only implements xoroshiro RNGs from rand_xoshiro.
  • optix for CPU-side hardware raytracing and denoising using the CUDA OptiX library.

In addition to many "glue" crates for things such as high level wrappers for certain smaller CUDA libraries.

Related Projects

Other projects related to using Rust on the GPU:

  • 2016: glassful Subset of Rust that compiles to GLSL.
  • 2017: inspirv-rust Experimental Rust MIR -> SPIR-V Compiler.
  • 2018: nvptx Rust to PTX compiler using the nvptx target for rustc (using the LLVM PTX backend).
  • 2020: accel Higher level library that relied on the same mechanism that nvptx does.
  • 2020: rlsl Experimental Rust -> SPIR-V compiler (predecessor to rust-gpu)
  • 2020: rust-gpu Rustc codegen backend to compile Rust to SPIR-V for use in shaders, similar mechanism as our project.


Licensed under either of

at your discretion.


Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

  • First release tracking issue

    First release tracking issue

    Small issue to track the things that need to be done before the project will be released.

    • [x] Get CGUs to work, needs work on the codegen to merge llvm ir modules, probably not too hard.
    • [ ] ~~Add a couple more basic examples.~~
    • [ ] ~~Update nightly close to release.~~
    • [ ] ~~Merge #1~~
    • [x] Make i128 work, shouldn't be too hard, just removing special casing for it since we already handle irregular ints just fine.
    • [ ] ~~[meta] Add proper labels and maybe templates~~
    • [x] Add the raytracer i posted to the examples
    • [x] Actually pass codegen args in cuda_builder
    • [ ] ~~Make the readme~~
    • [x] implement the ABI calculation logic, probably the hardest thing here
    • [x] Use prebuilt LLVM.
    opened by RDambrosio016 0
  • Fix

    Fix "Getting Started" in README


    opened by amadeusine 0
  • Build guide as mdbook

    Build guide as mdbook

    Note: set the Pages source to the gh-pages branch with / as the directory.

    opened by amadeusine 0
  • Atomics design doc and discussion

    Atomics design doc and discussion

    This issue serves as a design document and a discussion on how atomics will/should be implemented.

    CUDA Background

    CUDA has had atomics for basically forever in the form of a few functions like atomicAdd, atomicCAS, etc. See the docs on it here. It also has _system and _block variants of them.

    This has always been the overwhelmingly popular way of doing atomic things in CUDA, and for a while it was the only way, until compute 7.x. sm_70 introduced the .sem qualifier on the atom PTX instruction. This allowed users to specify a specific ordering for atomic operations.

    CUDA decided to implement this by replicating std::atomic as its own thing called cuda::std::atomic. Atomic provides a generic container for atomic operations on types such as int. It offers atomic operations with user-specified orderings.

    Usage of cuda::std::atomic

    Despite NVIDIA pushing for users to use atomic, it has not seen wide adoption, presumably because of the following reasons:

    • cuda::std::atomic is a mess of templates and inheritance because CUDA wanted to make it compatible with the GPU, the CPU (with every compiler's weird atomic semantics), and user-defined functions. This yields weird errors and confusing dependency graphs.
    • Every CUDA example, sample, docs, tutorials, course, etc uses atomicAdd and similar. Unless you are deeply knowledgeable about CUDA you would not switch to atomic, if you even knew it existed.
    • atomic has had a rocky past in terms of it sometimes working or not working, for example, CUDA 10.2 had many issues with std::atomic
    • atomic for some reason does not support float add, i am totally unsure why, the PTX ISA has instructions for it

    Importance of great atomics

    Atomics are the core of many algorithms, therefore it is imperative for a project of this scale to implement them once and implement them well. Otherwise a poor implementation of them might mean users being stuck with such an implementation forever, as with CUDA's case. Therefore, i believe we should take our time with atomics and implement them once and do it well.

    Low level implementation

    The low level implementation of such atomics is not very difficult, it can mostly be taken from how cuda::std::atomic does it at the low level. It implements them in the following way:

    If the CUDA Arch is >= 7.0 then it uses specialized PTX instructions with asm:

    template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_acq_rel_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.acq_rel.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
    template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_acquire_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.acquire.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
    template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_relaxed_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.relaxed.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
    template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_release_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.release.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }
    template<class _CUDA_A, class _CUDA_B, class _CUDA_C> static inline __device__ void __cuda_fetch_add_volatile_32_device(_CUDA_A __ptr, _CUDA_B& __dst, _CUDA_C __op) { asm volatile("atom.add.gpu.u32 %0,[%1],%2;" : "=r"(__dst) : "l"(__ptr),"r"(__op) : "memory"); }

    With seqcst additionally containing a fence before it:

            switch (__memorder) {
              case __ATOMIC_SEQ_CST: __cuda_fence_sc_device();
              case __ATOMIC_CONSUME:
              case __ATOMIC_ACQUIRE: __cuda_fetch_add_acquire_32_device(__ptr, __tmp, __tmp); break;
              case __ATOMIC_ACQ_REL: __cuda_fetch_add_acq_rel_32_device(__ptr, __tmp, __tmp); break;
              case __ATOMIC_RELEASE: __cuda_fetch_add_release_32_device(__ptr, __tmp, __tmp); break;
              case __ATOMIC_RELAXED: __cuda_fetch_add_relaxed_32_device(__ptr, __tmp, __tmp); break;
              default: assert(0);

    This can very easily be replicated by us since we have full support for inline asm.

    Otherwise, if the arch is less than 7.0, it "emulates" it with barriers:

            switch (__memorder) {
              case __ATOMIC_SEQ_CST:
              case __ATOMIC_ACQ_REL: __cuda_membar_device();
              case __ATOMIC_CONSUME:
              case __ATOMIC_ACQUIRE: __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); __cuda_membar_device(); break;
              case __ATOMIC_RELEASE: __cuda_membar_device(); __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); break;
              case __ATOMIC_RELAXED: __cuda_fetch_add_volatile_32_device(__ptr, __tmp, __tmp); break;
              default: assert(0);

    You can find the code for this in CUDA_ROOT\include\cuda\std\detail\libcxx\include\support\atomic\atomic_cuda_generated.h for CUDA 11.5, and CUDA_ROOT\include\cuda\std\detail\__atomic_generated for older versions.

    That file provides functions as intrinsics that the rest of libcu++ build off of:

    template<class _Type, typename cuda::std::enable_if<sizeof(_Type)==4, int>::type = 0>
    __device__ _Type __atomic_fetch_add_cuda(volatile _Type *__ptr, _Type __val, int __memorder, __thread_scope_device_tag) {
      /* ... */

    Rust Intrinsic implementation

    I propose we follow a similar approach of raw unsafe intrinsics for:

    • 32 bit and 64 bit operations, loads, stores, compare_exchange, exchange, and fetch_{add, and, max, min, or, sub, xor}
    • block, device, and system operations

    sm_70+ intrinsics are implemented in cuda_std::atomic::intrinsics::sm_70, emulated intrinsics are in cuda_std;:atomic::intrinsics::emulated.

    Wrappers of the sm-specific intrinsics are in cuda_std::atomic::intrinsics. For example:

    pub unsafe fn atomic_fetch_add_f32_device(ptr: *const f32, ordering: Ordering, val: f32) -> f32;

    High level types

    And finally, we expose high level types in cuda_std::atomic such as AtomicF32, AtomicF64, etc.

    Block atomics (BlockAtomicF32) will need to be unsafe, this is because for device atomics, it is up to the caller of the kernels to ensure buffers and kernels do not contain data races, and systems prevent this. However, block atomics do not, it would be very easy to accidentally cause data races if the accesses are not intra-threadblock.

    Atomic types will expose operations that they specifically allow, for example, per the ISA spec:

    • Every type has fetch_and, fetch_or, fetch_xor, compare_and_swap, and exchange.
    • Signed and unsigned Integers have fetch_add, fetch_inc, fetch_dec, fetch_min, and fetch_max.
    • Unsigned integers have fetch_inc and fetch_add that clamp to [0..b] (unsure if this means 0..MAX or something else).
    • Floats have fetch_add

    Compatibility with core atomics

    Core exposes atomics with a couple of things:

    • Every target has a cfg on whether an atomic width is supported on the target. I have not checked what atomic sizes core thinks nvptx has.
    • The codegen then generates atomic instructions as such:
        fn atomic_load(
            &mut self,
            ty: &'ll Type,
            ptr: &'ll Value,
            _order: AtomicOrdering,
            _size: Size,
        ) -> &'ll Value {
            /* ... */

    In addition to atomic_store, atomic_rmw, atomic_cmpxchg, and a couple more. We currently trap in all of these functions, partly because libnvvm doesn't support atomic instructions for many types, and partly because we want to bikeshed how to implement them nicely.

    However, as expected, things are not quite the same on the CPU and the GPU, there are some very important differences:

    • CUDA has 32 bit and 64 bit atomics (16 bit too if you count f16), while core expects 8 bit atomics too (that is, unless its cfg-gated).
    • Core expects some operations to be available while they are not available, such as fetch_nand, we could implement this as a CAS loop but its a bit of an opaque behavior so im not too happy to do that.
    • CUDA has (limited) float atomics, which are the most used types of atomics by far, since GPU code often deals with floats. Core does not have atomic floats, so we would need a separate type, causing lots of confusion.
    • CUDA as previously mentioned has block and system atomics, which are unique to it.

    Because of these limitations, we have a few options for implementing atomics:

    • Try our best to support core atomics, emulate anything thats missing with CAS loops, make AtomicF32 and AtomicF64 different types in cuda_std. Add block and system atomics as their own types in cuda_std::atomic. This maintains compat with core but splits up atomic types, which is not ideal.
    • Don't try to support core intrinsics at all, add everything in cuda_std::atomic, add only the methods that cuda natively supports without CAS loops. Don't try to make the atomics work on the CPU. This is easiest, has the nicest API, but doesn't work on the CPU.

    Implementation Roadmap

    Atomics will likely be implemented incrementally, most of the work is transferring over the raw intrinsics, after that, the hard part is done and we can just focus on the stable public API.

    Device float atomics will be first, since it is by far the most used type of intrinsic. After that, the order will probably follow:

    Integer Device Atomics -> Float System Atomics -> Integer System Atomics -> Float Block Atomics -> Integer Block Atomics -> Anything that's missing


    I'd love to hear any feedback you have! We must make sure this is implemented once and implemented correctly.

    A-atomics C-discussion 
    opened by RDambrosio016 0
  • Building repo, getting

    Building repo, getting "undefined symbol: setupterm"

    Bear with me here as I'm on NixOS so installing the dependencies has been a journey. I've cloned this repo and am just trying to run cargo build. I've gotten as far as installing CUDA and OptiX, to the point where it's actually building the path_tracer crate, but now I'm getting some scary codegen errors from rustc:

    error: failed to run custom build command for `path_tracer v0.1.0 ($REPO_ROOT/examples/cuda/cpu/path_tracer)`
    Caused by:
      process didn't exit successfully: `$REPO_ROOT/target/debug/build/path_tracer-970d6a9b9c38170f/build-script-build` (exit status: 101)
      --- stdout
      --- stderr
      warning: $REPO_ROOT/crates/cust/Cargo.toml: `default-features = [".."]` was found in [features]. Did you mean to use `default = [".."]`?
      error: failed to run `rustc` to learn about target-specific information
      Caused by:
        process didn't exit successfully: `rustc - --crate-name ___ --print=file-names -Zcodegen-backend=$REPO_ROOT/target/debug/deps/ -Cllvm-args=-arch=compute_61 --target nvptx64-nvidia-cuda --crate-type bin --crate-type rlib --crate-type dylib --crate-type cdylib --crate-type staticlib --crate-type proc-macro --print=sysroot --print=cfg` (exit status: 1)
        --- stderr
        error: couldn't load codegen backend "$REPO_ROOT/target/debug/deps/": "$REPO_ROOT/target/debug/deps/ undefined symbol: setupterm"
      thread 'main' panicked at 'Did not find output file in rustc output', crates/cuda_builder/src/
      stack backtrace:
         0: rust_begin_unwind
                   at /rustc/4e89811b46323f432544f9c4006e40d5e5d7663f/library/std/src/
         1: core::panicking::panic_fmt
                   at /rustc/4e89811b46323f432544f9c4006e40d5e5d7663f/library/core/src/
         2: core::panicking::panic_display
                   at /rustc/4e89811b46323f432544f9c4006e40d5e5d7663f/library/core/src/
         3: core::option::expect_failed
                   at /rustc/4e89811b46323f432544f9c4006e40d5e5d7663f/library/core/src/
         4: core::option::Option<T>::expect
                   at /rustc/4e89811b46323f432544f9c4006e40d5e5d7663f/library/core/src/
         5: cuda_builder::get_last_artifact
                   at $REPO_ROOT/crates/cuda_builder/src/
         6: cuda_builder::invoke_rustc
                   at $REPO_ROOT/crates/cuda_builder/src/
         7: cuda_builder::CudaBuilder::build
                   at $REPO_ROOT/crates/cuda_builder/src/
         8: build_script_build::main
                   at ./
         9: core::ops::function::FnOnce::call_once
                   at /rustc/4e89811b46323f432544f9c4006e40d5e5d7663f/library/core/src/ops/
      note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
    warning: build failed, waiting for other jobs to finish...
    error: build failed

    Any tips on what do here?


    • CUDA 11.4.2
    • OptiX 7.3.0
    • LLVM 7.1.0
    • NVidia driver 470.63.01
    • rustc 1.57.0-nightly (4e89811b4 2021-10-16)
    bug C-rustc_codegen_nvvm 
    opened by dbeckwith 15
  • GPU kernel debugging documentation

    GPU kernel debugging documentation

    Super excited to try out this project, I've just been reading through the Guide so far. I couldn't find any pages on debugging GPU kernels at runtime (there's a page on debugging the codegen but not the live kernel itself). I think it would be great if debugging was mentioned somewhere, at least to say if it isn't well-supported yet or maybe link to some external resources on CUDA debugging.

    A-debuginfo A-docs 
    opened by dbeckwith 1
  • OptiX (CPU) tracking issue

    OptiX (CPU) tracking issue

    Issue to track what needs to be done to complete the CPU (host) side of OptiX:

    (me) Denoising:

    • [ ] optixDenoiserComputeIntensity
    • [ ] optixDenoiserComputeAverageColor
    • [ ] setup_state_with_buffers
    • [ ] Some way of running the denoiser with the same input and output image.
    • [ ] AOV models
    • [ ] Tiled denoising

    (@anderslanglands) Actual Raytracing:

    • [ ] port remaining iw examples
    • [ ] port SDK examples necessary to test remaining features (e.g. motion blur, curves)
    • [ ] implement instance transforms, motion blur
    C-OptiX (Host) T-tracking-issue 
    opened by RDambrosio016 0
  • First pass

    First pass

    Just enough of optix to get example 02 working - setting up a pipeline and a ShaderBindingTable, launching an optix kernel.

    opened by anderslanglands 2
Rust GPU
Fast GPU computing for the Rust language
Rust GPU
A real-time implementation of "Ray Tracing in One Weekend" using nannou and rust-gpu.

Real-time Ray Tracing with nannou & rust-gpu An attempt at a real-time implementation of "Ray Tracing in One Weekend" by Peter Shirley. This was a per

null 76 Nov 9, 2021
Open deep learning compiler stack for cpu, gpu and specialized accelerators

Open Deep Learning Compiler Stack Documentation | Contributors | Community | Release Notes Apache TVM is a compiler stack for deep learning systems. I

The Apache Software Foundation 7.4k Nov 29, 2021
Rust based Cross-GPU Machine Learning

HAL : Hyper Adaptive Learning Rust based Cross-GPU Machine Learning. Why Rust? This project is for those that miss strongly typed compiled languages.

Jason Ramapuram 79 Oct 28, 2021
Open Machine Intelligence Framework for Hackers. (GPU/CPU)

Leaf • Introduction Leaf is a open Machine Learning Framework for hackers to build classical, deep or hybrid machine learning applications. It was ins

Autumn 5.5k Nov 29, 2021
MesaTEE GBDT-RS : a fast and secure GBDT library, supporting TEEs such as Intel SGX and ARM TrustZone

MesaTEE GBDT-RS : a fast and secure GBDT library, supporting TEEs such as Intel SGX and ARM TrustZone MesaTEE GBDT-RS is a gradient boost decision tre

MesaLock Linux 172 Nov 17, 2021
A fast, safe and easy to use reinforcement learning framework in Rust.

RSRL (api) Reinforcement learning should be fast, safe and easy to use. Overview rsrl provides generic constructs for reinforcement learning (RL) expe

Thomas Spooner 112 Oct 31, 2021
K-dimensional tree in Rust for fast geospatial indexing and lookup

kdtree K-dimensional tree in Rust for fast geospatial indexing and nearest neighbors lookup Crate Documentation Usage Benchmark License Usage Add kdtr

Rui Hu 115 Nov 17, 2021
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Main features: Train new vocabularies and tok

Hugging Face 5k Nov 27, 2021
Fast hierarchical agglomerative clustering in Rust.

kodama This crate provides a fast implementation of agglomerative hierarchical clustering. This library is released under the MIT license. The ideas a

Diffeo 54 Nov 26, 2021
Fwumious Wabbit, fast on-line machine learning toolkit written in Rust

Fwumious Wabbit is a very fast machine learning tool built with Rust inspired by and partially compatible with Vowpal Wabbit (much love! read more abo

Outbrain 104 Nov 14, 2021
Rust wrapper for the Fast Artificial Neural Network library

fann-rs Rust wrapper for the Fast Artificial Neural Network (FANN) library. This crate provides a safe interface to FANN on top of the low-level bindi

Andreas Fackler 11 Sep 10, 2020
Instance Distance is a fast pure-Rust implementation of the Hierarchical Navigable Small Worlds paper

Fast approximate nearest neighbor searching in Rust, based on HNSW index

Instant Domain Search, Inc. 115 Nov 25, 2021
l2 is a fast, Pytorch-style Tensor+Autograd library written in Rust

l2 • ?? A Pytorch-style Tensor+Autograd library written in Rust Installation • Contributing • Authors • License • Acknowledgements Made by Bilal Khan

Bilal Khan 139 Nov 14, 2021
FFSVM stands for "Really Fast Support Vector Machine"

In One Sentence You trained a SVM using libSVM, now you want the highest possible performance during (real-time) classification, like games or VR. Hig

Ralf Biedert 47 Nov 21, 2021
convolutions-rs is a crate that provides a fast, well-tested convolutions library for machine learning

convolutions-rs convolutions-rs is a crate that provides a fast, well-tested convolutions library for machine learning written entirely in Rust with m

null 1 Nov 19, 2021
Msgpack serialization/deserialization library for Python, written in Rust using PyO3, and rust-msgpack. Reboot of orjson.[Python]

ormsgpack ormsgpack is a fast msgpack library for Python. It is a fork/reboot of orjson It serializes faster than msgpack-python and deserializes a bi

Aviram Hassan 47 Nov 3, 2021
Narwhal and Tusk A DAG-based Mempool and Efficient BFT Consensus.

This repo contains a prototype of Narwhal and Tusk. It supplements the paper Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus.

Facebook Research 30 Nov 8, 2021
Distributed compute platform implemented in Rust, and powered by Apache Arrow.

Ballista: Distributed Compute Platform Overview Ballista is a distributed compute platform primarily implemented in Rust, powered by Apache Arrow. It

Ballista 2.3k Nov 20, 2021
Tensors and differentiable operations (like TensorFlow) in Rust

autograd Differentiable operations and tensors backed by ndarray. Motivation Machine learning is one of the field where Rust lagging behind other lang

Ryo ASAKURA 340 Nov 22, 2021