Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.


The Rust CUDA Project

An ecosystem of libraries and tools for writing and executing extremely fast GPU code fully in Rust

⚠️ The project is still in early development; expect bugs, safety issues, and things that don't work ⚠️


The Rust CUDA Project aims to make Rust a tier-1 language for extremely fast GPU computing using the CUDA Toolkit. It provides tools for compiling Rust to extremely fast PTX code, as well as libraries for using existing CUDA libraries from Rust.


Historically, general-purpose high-performance GPU computing has been done using the CUDA toolkit. The CUDA toolkit primarily provides a way to use Fortran/C/C++ code for GPU computing in tandem with CPU code from a single source. It also provides many libraries, tools, forums, and documentation to supplement the single-source CPU/GPU code.

CUDA is an NVIDIA-only toolkit. Many tools have been proposed for cross-platform GPU computing, such as OpenCL, Vulkan compute, and HIP. However, CUDA remains by far the most used toolkit for such tasks. This is why it is imperative to make Rust a viable option for use with the CUDA toolkit.

However, CUDA with Rust has historically been a very rocky road. Until now, the only viable option has been the LLVM PTX backend. That backend does not always work, and it generates invalid PTX for many common Rust operations. In recent years, projects such as rust-gpu (for Rust -> SPIR-V) have shown time and time again that Rust on the GPU needs a specialized solution.

Our hope is that with this project we can push the Rust GPU computing industry forward and make Rust an excellent language for such tasks. Rust offers plenty of benefits, such as __restrict__ performance benefits for every kernel, an excellent module/crate system, delimiting of unsafe areas of CPU/GPU code with unsafe, high-level wrappers for low-level CUDA libraries, etc.


The scope of the Rust CUDA Project is quite broad: it spans the entirety of the CUDA ecosystem, with libraries and tools to make it usable from Rust. Therefore, the project contains many crates for all corners of the CUDA ecosystem.

The current line-up of libraries is the following:

  • rustc_codegen_nvvm, a rustc backend that targets NVVM IR (a subset of LLVM IR) for the libnvvm library.
    • Generates highly optimized PTX code which can be loaded by the CUDA Driver API to execute on the GPU.
    • For the near future it will be CUDA-only, but it may be used to target amdgpu in the future.
  • cuda_std for GPU-side functions and utilities, such as thread index queries, memory allocation, warp intrinsics, etc.
    • Not a low-level library: it provides many utility functions that make it easier to write cleaner and more reliable GPU kernels.
    • Closely tied to rustc_codegen_nvvm which exposes GPU features through it internally.
  • cust for CPU-side CUDA features such as launching GPU kernels, GPU memory allocation, device queries, etc.
    • High level with features such as RAII and Rust Results that make it easier and cleaner to manage the interface to the GPU.
    • A high level wrapper for the CUDA Driver API, the lower level version of the more common CUDA Runtime API used from C++.
    • Provides much more fine grained control over things like kernel concurrency and module loading than the C++ Runtime API.
  • gpu_rand for GPU-friendly random number generation, currently only implements xoroshiro RNGs from rand_xoshiro.
  • optix for CPU-side hardware raytracing and denoising using the CUDA OptiX library.

In addition, there are many "glue" crates for things such as high-level wrappers for certain smaller CUDA libraries.
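To make the "RAII and Rust Results" point about cust concrete, here is a minimal sketch of the pattern in plain Rust. It does not call CUDA or cust at all: FakeDevice, Allocation, and AllocError are hypothetical names invented for this illustration, and the "device" is just a byte counter.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical error type; a real wrapper would carry a driver error code.
#[derive(Debug, PartialEq)]
pub enum AllocError {
    OutOfMemory,
}

/// Hypothetical stand-in for a GPU: it only tracks how many bytes are in use.
pub struct FakeDevice {
    pub capacity: usize,
    pub used: Cell<usize>,
}

/// An allocation that gives its bytes back when it goes out of scope.
pub struct Allocation {
    device: Rc<FakeDevice>,
    bytes: usize,
}

impl Allocation {
    /// Failure is reported through a Result instead of a status code.
    pub fn new(device: &Rc<FakeDevice>, bytes: usize) -> Result<Self, AllocError> {
        let used = device.used.get();
        if used + bytes > device.capacity {
            return Err(AllocError::OutOfMemory);
        }
        device.used.set(used + bytes);
        Ok(Allocation { device: Rc::clone(device), bytes })
    }
}

impl Drop for Allocation {
    /// RAII: the release happens automatically, so it cannot be forgotten.
    fn drop(&mut self) {
        self.device.used.set(self.device.used.get() - self.bytes);
    }
}
```

The caller never writes an explicit free, and an allocation failure has to be handled (or propagated with ?) before the value can be used.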

Related Projects

Other projects related to using Rust on the GPU:

  • 2016: glassful Subset of Rust that compiles to GLSL.
  • 2017: inspirv-rust Experimental Rust MIR -> SPIR-V Compiler.
  • 2018: nvptx Rust to PTX compiler using the nvptx target for rustc (using the LLVM PTX backend).
  • 2020: accel Higher level library that relied on the same mechanism that nvptx does.
  • 2020: rlsl Experimental Rust -> SPIR-V compiler (predecessor to rust-gpu)
  • 2020: rust-gpu Rustc codegen backend to compile Rust to SPIR-V for use in shaders, similar mechanism as our project.


Licensed under either of

at your discretion.


Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

  • 0.3 (Feb 7, 2022)

    [Images: mpm_cuda_2d, path_tracer]

    Top: upcoming MPM engine that runs on CPU and GPU using rust-cuda, Bottom: toy path tracer that can run on CPU, GPU, and GPU (hardware raytracing) using recent experiments with OptiX

    Today marks an exciting milestone for the Rust CUDA Project. Over the past couple of months, we have made significant advancements in supporting many of the fundamental CUDA ecosystem libraries. The main changes in this release are the changes to cust that make future library support possible, but we will also highlight some of the WIP experiments we have been conducting.

    Cust changes

    This release is likely to be the biggest and most breaking change to cust ever. We had to fundamentally rework how many things work in order to:

    • Fix some unsoundness.
    • Remove some outdated and inconsistent things.
    • Rework how contexts work to be interoperable with the runtime API.

    This release is therefore guaranteed to break your code. However, the changes should not break too much unless you did a lot of lower-level work with device memory constructs.

    Cust 0.3 changes


    This release is gigantic, so here are the main things you need to worry about:

    Context::create_and_push(FLAGS, device) -> Context::new(device).
    Module::from_str(PTX) -> Module::from_ptx(PTX, &[]).

    Context handling overhaul

    The way that contexts are handled in cust has been completely overhauled: cust now uses primary context handling instead of the normal driver API context APIs. This is aimed at future-proofing cust for libraries such as cuBLAS and cuFFT, as well as simplifying the context handling APIs overall. This does mean that the API changed a bit:

    • create_and_push is now new and it only takes a device, not a device and flags.
    • set_flags is now used for setting context flags.
    • ContextStack, UnownedContext, and other legacy APIs are gone.

    The old context handling is fully present in cust::context::legacy for anyone who needs it for specific reasons. If you use quick_init, you don't need to worry about any breaking changes; the API is the same.
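    For readers unfamiliar with the term: a primary context is the single per-device context that the CUDA runtime API also uses, which is why retaining it (rather than creating a separate context) lets driver-API code share state with runtime-API libraries. The sketch below models only that sharing behaviour in plain Rust, with no CUDA calls. PrimaryContext and ContextRegistry are hypothetical names for this illustration and are not cust types.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, Weak};

/// Hypothetical model of a per-device primary context.
pub struct PrimaryContext {
    pub device: u32,
}

/// Hypothetical registry: at most one live context per device ordinal.
pub struct ContextRegistry {
    contexts: Mutex<HashMap<u32, Weak<PrimaryContext>>>,
}

impl ContextRegistry {
    pub fn new() -> Self {
        ContextRegistry { contexts: Mutex::new(HashMap::new()) }
    }

    /// Every caller asking for the same device gets the same context.
    /// The context is torn down when the last Arc is dropped, which is
    /// a rough analogue of retain/release reference counting.
    pub fn retain(&self, device: u32) -> Arc<PrimaryContext> {
        let mut map = self.contexts.lock().unwrap();
        if let Some(existing) = map.get(&device).and_then(Weak::upgrade) {
            return existing;
        }
        let ctx = Arc::new(PrimaryContext { device });
        map.insert(device, Arc::downgrade(&ctx));
        ctx
    }
}
```

    In this model, two independent libraries that both call retain(0) end up holding the same context, so neither can destroy resources the other still needs.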


    DeviceCopy has now been split into its own crate, cust_core. The crate is #![no_std], which allows you to pull in cust_core in GPU crates for deriving DeviceCopy without cfg shenanigans.


    • DeviceBox::wrap, use DeviceBox::from_raw.
    • DeviceSlice::as_ptr and DeviceSlice::as_mut_ptr. Use DeviceSlice::as_device_ptr then DevicePointer::as_(mut)_ptr.
    • DeviceSlice::chunks and consequently DeviceChunks.
    • DeviceSlice::chunks_mut and consequently DeviceChunksMut.
    • DeviceSlice::from_slice and DeviceSlice::from_slice_mut because it was unsound.
    • DevicePointer::as_raw_mut (use DevicePointer::as_mut_ptr).
    • DevicePointer::wrap (use DevicePointer::from_raw).
    • DeviceSlice no longer implements Index and IndexMut; switching away from [T] made these impossible to implement. Instead, you can now use DeviceSlice::index, which behaves the same.
    • vek is no longer re-exported.
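    A short sketch may help explain why a pointer-plus-length slice cannot implement Index: the trait's index method must return a reference to a value that already exists somewhere, while a sub-slice of a (pointer, length) pair is a freshly computed value. The types below are a hypothetical model written for this note (DevPtr and DevSlice are not cust's definitions), and the address is just a number, not real GPU memory.

```rust
use std::marker::PhantomData;
use std::mem::size_of;
use std::ops::Range;

/// Hypothetical model of a typed device address.
#[repr(C)]
pub struct DevPtr<T> {
    pub addr: u64,
    _marker: PhantomData<*mut T>,
}

impl<T> DevPtr<T> {
    pub fn new(addr: u64) -> Self {
        DevPtr { addr, _marker: PhantomData }
    }

    /// Advance by `count` elements of T.
    pub fn add(&self, count: usize) -> Self {
        DevPtr::new(self.addr + (count * size_of::<T>()) as u64)
    }
}

/// Hypothetical model of a device slice: a pointer and a length, by value.
#[repr(C)]
pub struct DevSlice<T> {
    pub ptr: DevPtr<T>,
    pub len: usize,
}

impl<T> DevSlice<T> {
    pub fn new(addr: u64, len: usize) -> Self {
        DevSlice { ptr: DevPtr::new(addr), len }
    }

    /// Returns a new sub-slice by value. `std::ops::Index` could not do
    /// this, because it has to hand back a `&Output` borrowed from `self`.
    pub fn index(&self, range: Range<usize>) -> DevSlice<T> {
        assert!(range.start <= range.end && range.end <= self.len, "range out of bounds");
        DevSlice { ptr: self.ptr.add(range.start), len: range.end - range.start }
    }
}
```

    With this shape, slice.index(2..5) plays the role that &slice[2..5] would play for a host slice.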


    • Module::from_str, use Module::from_ptx and pass &[] for options.
    • Module::load_from_string, use Module::from_ptx_cstr.


    • cust::memory::LockedBox, same as LockedBuffer except for single elements.
    • cust::memory::cuda_malloc_async.
    • cust::memory::cuda_free_async.
    • impl AsyncCopyDestination<LockedBox<T>> for DeviceBox<T> for async HtoD/DtoH memcpy.
    • DeviceBox::new_async.
    • DeviceBox::drop_async.
    • DeviceBox::zeroed_async.
    • DeviceBox::uninitialized_async.
    • DeviceBuffer::uninitialized_async.
    • DeviceBuffer::drop_async.
    • DeviceBuffer::zeroed.
    • DeviceBuffer::zeroed_async.
    • DeviceBuffer::cast.
    • DeviceBuffer::try_cast.
    • DeviceSlice::set_8 and DeviceSlice::set_8_async.
    • DeviceSlice::set_16 and DeviceSlice::set_16_async.
    • DeviceSlice::set_32 and DeviceSlice::set_32_async.
    • DeviceSlice::set_zero and DeviceSlice::set_zero_async.
    • the bytemuck feature which is enabled by default.
    • mint integration behind impl_mint.
    • half integration behind impl_half.
    • glam integration behind impl_glam.
    • experimental linux external memory import APIs through cust::external::ExternalMemory.
    • DeviceBuffer::as_slice.
    • DeviceVariable, a simple wrapper around DeviceBox<T> and T which allows easy management of a CPU and GPU version of a type.
    • DeviceMemory, a trait describing any region of GPU memory that can be described with a pointer + a length.
    • memcpy_htod, a wrapper around cuMemcpyHtoD_v2.
    • mem_get_info to query the amount of free and total memory.
    • DevicePointer::as_ptr and DevicePointer::as_mut_ptr for *const T and *mut T.
    • DevicePointer::from_raw for CUdeviceptr -> DevicePointer<T> with a safe function.
    • DevicePointer::cast.
    • dependency on cust_core for DeviceCopy.
    • ModuleJitOption, JitFallback, JitTarget, and OptLevel for specifying options when loading a module. Note that ModuleJitOption::MaxRegisters does not seem to work currently, but NVIDIA is looking into it. You can achieve the same goal by compiling the PTX to a cubin using nvcc and then loading that: nvcc --cubin foo.ptx -maxrregcount=REGS
    • Module::from_fatbin.
    • Module::from_cubin.
    • Module::from_ptx and Module::from_ptx_cstr.
    • Stream, Module, Linker, Function, Event, UnifiedBox, ArrayObject, LockedBuffer, LockedBox, DeviceSlice, DeviceBuffer, and DeviceBox all now impl Send and Sync, which makes it much easier to write multi-GPU code. The CUDA API is fully thread-safe except for graph objects.
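    As an illustration of what Send and Sync buy you, the sketch below shares handle-like values across one worker thread per device using only the standard library. StreamHandle is a hypothetical stand-in that wraps a raw pointer (and is therefore neither Send nor Sync until we opt in); it is not cust's Stream, and no GPU work is performed.

```rust
use std::ffi::c_void;
use std::thread;

/// Hypothetical stand-in for a driver handle such as a stream.
pub struct StreamHandle {
    pub device_ordinal: usize,
    pub raw: *mut c_void,
}

// SAFETY: an assumption of this sketch. A real wrapper may only do this
// when the underlying API is thread-safe for the wrapped object.
unsafe impl Send for StreamHandle {}
unsafe impl Sync for StreamHandle {}

/// Compile-time check: this only type-checks for Send + Sync types.
pub fn assert_send_sync<T: Send + Sync>() {}

/// Spawns one scoped worker per handle and collects each worker's result
/// in the same order as the input handles.
pub fn run_per_device(handles: &[StreamHandle]) -> Vec<usize> {
    thread::scope(|scope| {
        let workers: Vec<_> = handles
            .iter()
            // Sharing `&StreamHandle` with another thread requires Sync.
            .map(|handle| scope.spawn(move || handle.device_ordinal))
            .collect();
        workers
            .into_iter()
            .map(|worker| worker.join().expect("worker panicked"))
            .collect()
    })
}
```

    Without the two unsafe impl lines, the call to scope.spawn would fail to compile, which is the situation multi-GPU code was in before this change.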


    • zeroed functions on DeviceBox and others are no longer unsafe and instead now require T: Zeroable. The functions are only available with the bytemuck feature.
    • Stream::add_callback now internally uses cuLaunchHostFunc anticipating the deprecation and removal of cuStreamAddCallback per the driver docs. This does however mean that the function no longer takes a device status as a parameter and does not execute on context error.
    • Linker::complete now only returns the built cubin, and not the cubin and a duration.
    • Features such as vek for implementing DeviceCopy are now impl_cratename, e.g. impl_vek, impl_half, etc.
    • DevicePointer::as_raw now returns a CUdeviceptr instead of a *const T.
    • num-complex integration is now behind impl_num_complex, not num-complex.
    • DeviceBox now requires T: DeviceCopy (previously it didn't but almost all its methods did).
    • DeviceBox::from_raw now takes a CUdeviceptr instead of a *mut T.
    • DeviceBox::as_device_ptr now requires &self instead of &mut self.
    • DeviceBuffer now requires T: DeviceCopy.
    • DeviceBuffer is now repr(C) and is represented by a DevicePointer<T> and a usize.
    • DeviceSlice now requires T: DeviceCopy.
    • DeviceSlice is now represented as a DevicePointer<T> and a usize (and is repr(C)) instead of [T] which was definitely unsound.
    • DeviceSlice::as_ptr and DeviceSlice::as_ptr_mut now both return a DevicePointer<T>.
    • DeviceSlice is now Clone and Copy.
    • DevicePointer::as_raw now returns a CUdeviceptr, not a *const T (use DevicePointer::as_ptr).
    • Fixed typo in CudaError, InvalidSouce is now InvalidSource, no more invalid sauce 🍅🥣
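    The move from unsafe zeroed functions to a T: Zeroable bound can be sketched as follows. Zeroable here is a local marker trait modelled on the one in bytemuck, defined inline so the example stays self-contained; zeroed_value is a hypothetical helper and not a cust function.

```rust
/// Local stand-in for a marker trait like bytemuck's `Zeroable`:
/// implementors promise that the all-zero bit pattern is a valid value.
pub unsafe trait Zeroable: Sized {}

unsafe impl Zeroable for u32 {}
unsafe impl Zeroable for f32 {}
unsafe impl<T: Zeroable, const N: usize> Zeroable for [T; N] {}

/// Safe to call: the unsafety has moved into the trait implementations,
/// so the caller no longer has to justify the zero bit pattern.
pub fn zeroed_value<T: Zeroable>() -> T {
    // SAFETY: `T: Zeroable` guarantees all-zero bytes form a valid `T`.
    unsafe { std::mem::zeroed() }
}
```

    A type such as &u32 has no Zeroable impl, so zeroed_value::<&u32>() is rejected at compile time instead of producing a null reference.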

    Line tables

    The libnvvm codegen can now generate line tables while optimizing (previously it could generate debug info but not optimize), which allows you to debug and profile kernels much better in tools like Nsight Compute. You can enable debug info creation using .debug(DebugInfo::LineTables) with cuda_builder.



    Using the generous work of @anderslanglands, we were able to get rust-cuda to target hardware raytracing completely in Rust (both for the host and the device). The toy path tracer example has been ported to use hardware RT as a backend. However, optix and optix_device are not published yet since they are still highly experimental.

    [Image: Screenshot_564] Using hardware RT to render a simple mesh


    Work on supporting cuBLAS through a high-level wrapper library has started. A lot of work needed to be done in cust to interoperate with cuBLAS, which is a runtime-API-based library. This required some changes to how cust handles contexts, to avoid dropping context resources that cuBLAS was using. The library is not yet published, but it will be once it is more complete. cuBLAS is a big piece of neural network training on the GPU, so it is critical to support it.


    @frjnn has been generously working on wrapping the cuDNN library. cuDNN is the primary tool used to train neural networks on the GPU, and the primary tool used by PyTorch and TensorFlow. High-level bindings to cuDNN are a major step toward making machine learning in Rust a viable option. This work is still very much in progress, so it is not published yet. It will be published once it is usable, and will likely first be used in neuronika for GPU neural network training.


    Work on supporting GPU-side atomics in cuda_std has started. Some preliminary work is already published in cuda_std; however, it is still very much in progress and subject to change. Atomics are a difficult issue due to the vast number of options available for GPU atomics, including:

    • Different atomic scopes: device, system, or block.
    • Specialized or emulated instructions, depending on the compute capability target.
    • Hardware float atomics (which core does not have).

    You can read more about it here.
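    To illustrate the "emulated" case and the missing float atomics in core, here is one common way to emulate an atomic f32 add: a compare-and-swap loop over the float's bit pattern stored in an AtomicU32. This is a host-side sketch using std only. It is not how cuda_std implements atomics, and atomic_add_f32 is a hypothetical helper name.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Atomically adds `value` to the f32 stored (as raw bits) in `cell`
/// and returns the previous value.
pub fn atomic_add_f32(cell: &AtomicU32, value: f32) -> f32 {
    let mut current = cell.load(Ordering::Relaxed);
    loop {
        let updated = (f32::from_bits(current) + value).to_bits();
        // Retry until no other thread changed the cell between our read
        // and our write; the weak variant may also fail spuriously.
        match cell.compare_exchange_weak(current, updated, Ordering::AcqRel, Ordering::Relaxed) {
            Ok(previous) => return f32::from_bits(previous),
            Err(actual) => current = actual,
        }
    }
}
```

    On hardware with a native float atomic add, the whole loop collapses into a single instruction, which is why the choice depends on the compute capability target.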

  • 0.2 (Dec 5, 2021)

    This release marks the start of fixing many of the fundamental issues in the codegen, as well as implementing some of the most needed features for writing performant kernels.

    This release mostly covers quality of life changes, bug fixes, and some performance improvements.


    The required nightly has been updated to 12/4/21. This fixes rust-analyzer sometimes not working.

    PTX Backend

    DCE (Dead Code Elimination)

    DCE has been implemented. We switched to an alternative way of linking dependencies together, which drastically reduces the amount of work libnvvm has to do and removes any globals or functions not directly or indirectly used by kernels. This reduced the PTX size of the path tracer example from about 20 kloc to 2.3 kloc.
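    The "not directly or indirectly used by kernels" rule is a reachability computation over the call graph, with the kernels as roots. The sketch below shows that general idea on a toy graph of function names. It is a generic illustration and not the codegen's actual linking pass; reachable_from_kernels and the function names in the usage note are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns every function reachable from the given kernel entry points.
/// `calls` maps a function name to the names of the functions it calls.
/// Anything absent from the returned set is dead and can be dropped.
pub fn reachable_from_kernels<'a>(
    calls: &BTreeMap<&'a str, Vec<&'a str>>,
    kernels: &[&'a str],
) -> BTreeSet<&'a str> {
    let mut live = BTreeSet::new();
    let mut worklist: Vec<&'a str> = kernels.to_vec();
    while let Some(name) = worklist.pop() {
        // `insert` returns false if we already visited this function,
        // which also keeps recursion cycles from looping forever.
        if live.insert(name) {
            if let Some(callees) = calls.get(name) {
                worklist.extend(callees.iter().copied());
            }
        }
    }
    live
}
```

    For a graph where a render_kernel calls trace_ray and shade, and a separate unused_debug_dump calls format_helper, only the first group survives; the second is never handed to the optimizer at all.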

    Address Spaces

    CUDA Address Spaces have been mostly implemented. Any user-defined static that does not rely on interior mutability will be placed in the constant address space (__constant__); otherwise it will be placed in the generic address space (which is global for globals). This also allowed us to implement basic static shared memory support.
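    In Rust terms, "relies on interior mutability" means the static can change through a shared reference, for example because it contains an atomic or a cell. The two statics below show the distinction in ordinary host-side Rust. The comments about address spaces restate the rule described above and are not output from the codegen; FILTER_WEIGHTS, LAUNCH_COUNTER, and the helper functions are hypothetical.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// No interior mutability: the value can never change after compilation,
// so by the rule above it is a candidate for the constant address space.
pub static FILTER_WEIGHTS: [f32; 3] = [0.25, 0.5, 0.25];

// Interior mutability: an atomic can be modified through `&self`, so by
// the rule above it would stay in the generic (global) address space.
pub static LAUNCH_COUNTER: AtomicU32 = AtomicU32::new(0);

/// Reads only the immutable static.
pub fn weighted_sum(samples: [f32; 3]) -> f32 {
    samples
        .iter()
        .zip(FILTER_WEIGHTS.iter())
        .map(|(sample, weight)| sample * weight)
        .sum()
}

/// Mutates the static through a shared reference and returns the old value.
pub fn bump_counter() -> u32 {
    LAUNCH_COUNTER.fetch_add(1, Ordering::Relaxed)
}
```

    The distinction matters because constant memory is read-only from the kernel's point of view, so only data that provably never changes can live there.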

    Libm override

    The codegen automatically overrides calls to libm with calls to libdevice. This is to allow existing no_std crates to take advantage of architecture-optimized math intrinsics. This can be disabled from cuda_builder if you need strict determinism. This also reduces PTX size a good amount in math-heavy kernels (3.8kloc to 2.3kloc in our path tracer). It also reduces register usage by a little bit, which can yield performance gains.


    • Added address space query and conversion functions in cuda_std::ptr.
    • Added #[externally_visible] for making sure the codegen does not eliminate a function even if it is not used by a kernel.
    • Added #[address_space(...)] for making the codegen put a static in a specific address space, mostly internal and unsafe.
    • Added basic static shared memory support with cuda_std::shared_array!


    Cust 0.2 was actually released some time ago but these were the changes in 0.2 and 0.2.1:

    • Added Device::as_raw.
    • Added MemoryAdvise for unified memory advising.
    • Added MemoryAdvise::prefetch_host and MemoryAdvise::prefetch_device for telling CUDA to explicitly fetch unified memory somewhere.
    • Added MemoryAdvise::advise_read_mostly.
    • Added MemoryAdvise::preferred_location and MemoryAdvise::unset_preferred_location. Note that advising APIs are only present on high end GPUs such as V100s.
    • StreamFlags::NON_BLOCKING has been temporarily disabled because of soundness concerns.
    • Changed GpuBox::as_device_ptr and GpuBuffer::as_device_ptr to take &self instead of &mut self.
    • Renamed DBuffer -> DeviceBuffer. This was the name in rustacuda; it was changed at some point, but we have since reconsidered and think that change may have been the wrong choice.
    • Renamed DBox -> DeviceBox.
    • Renamed DSlice -> DeviceSlice.
    • Removed GpuBox::as_device_ptr_mut and GpuBuffer::as_device_ptr_mut.
    • Removed the accidentally added vek default feature.
    • vek feature now uses default-features = false, this also means Rgb and Rgba no longer implement DeviceCopy.
