Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.

Rust GPU

Last update: Dec 30, 2022

The Rust CUDA Project

An ecosystem of libraries and tools for writing and executing extremely fast GPU code fully in Rust

⚠️ The project is still in early development. Expect bugs, safety issues, and things that don't work ⚠️

Goal

The Rust CUDA Project is a project aimed at making Rust a tier-1 language for extremely fast GPU computing using the CUDA Toolkit. It provides tools for compiling Rust to extremely fast PTX code as well as libraries for using existing CUDA libraries with it.

Background

Historically, general purpose high performance GPU computing has been done using the CUDA toolkit. The CUDA toolkit primarily provides a way to use Fortran/C/C++ code for GPU computing in tandem with CPU code with a single source. It also provides many libraries, tools, forums, and documentation to supplement the single-source CPU/GPU code.

CUDA is an NVIDIA-only toolkit. Many tools have been proposed for cross-platform GPU computing, such as OpenCL, Vulkan Computing, and HIP. However, CUDA remains by far the most used toolkit for such tasks. This is why it is imperative to make Rust a viable option for use with the CUDA toolkit.

However, CUDA with Rust has historically been a very rocky road. Until now, the only viable option has been the LLVM PTX backend. That backend does not always work, and it generates invalid PTX for many common Rust operations. In recent years it has been shown time and time again, with the advent of projects such as rust-gpu (for Rust -> SPIR-V), that Rust on the GPU needs a specialized solution.

Our hope is that with this project we can push the Rust GPU computing industry forward and make Rust an excellent language for such tasks. Rust offers plenty of benefits, such as __restrict__ performance benefits for every kernel, an excellent module/crate system, delimiting of unsafe areas of CPU/GPU code with unsafe, high-level wrappers for low-level CUDA libraries, etc.

Structure

The scope of the Rust CUDA Project is quite broad. It spans the entirety of the CUDA ecosystem, with libraries and tools to make it usable from Rust. Therefore, the project contains many crates for all corners of the CUDA ecosystem.

The current line-up of libraries is the following:

- rustc_codegen_nvvm, a rustc backend that targets NVVM IR (a subset of LLVM IR) for the libnvvm library.
  - Generates highly optimized PTX code which can be loaded by the CUDA Driver API to execute on the GPU.
  - For the near future it will be CUDA-only, but it may be used to target amdgpu in the future.
- cuda_std, for GPU-side functions and utilities, such as thread index queries, memory allocation, warp intrinsics, etc.
  - Not a low-level library; it provides many utility functions that make it easier to write cleaner and more reliable GPU kernels.
  - Closely tied to rustc_codegen_nvvm, which exposes GPU features through it internally.
- cust, for CPU-side CUDA features such as launching GPU kernels, GPU memory allocation, device queries, etc.
  - High level, with features such as RAII and Rust Results that make it easier and cleaner to manage the interface to the GPU.
  - A high-level wrapper for the CUDA Driver API, the lower-level counterpart of the more common CUDA Runtime API used from C++.
  - Provides much more fine-grained control over things like kernel concurrency and module loading than the C++ Runtime API.
- gpu_rand, for GPU-friendly random number generation. It currently only implements xoroshiro RNGs from rand_xoshiro.
- optix, for CPU-side hardware raytracing and denoising using the CUDA OptiX library.

In addition, there are many "glue" crates for things such as high-level wrappers for certain smaller CUDA libraries.

Related Projects

Other projects related to using Rust on the GPU:

2016: glassful Subset of Rust that compiles to GLSL.
2017: inspirv-rust Experimental Rust MIR -> SPIR-V Compiler.
2018: nvptx Rust to PTX compiler using the nvptx target for rustc (using the LLVM PTX backend).
2020: accel Higher level library that relied on the same mechanism that nvptx does.
2020: rlsl Experimental Rust -> SPIR-V compiler (predecessor to rust-gpu)
2020: rust-gpu Rustc codegen backend to compile Rust to SPIR-V for use in shaders, similar mechanism as our project.

License

Licensed under either of

Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your discretion.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Comments

Add example fails to build

Hello, I was trying to build the example, and after troubleshooting a few issues as I worked through the getting started, I've hit a wall that I don't think I can get past without hints, at least with my current knowledge.

Before I show the output of the build failure, here is some background:

System is Ubuntu 20.04

>>> nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

I took the prebuilt llvm from https://github.com/rust-gpu/rustc_codegen_nvvm-llvm/releases/download/LLVM-7.1.0/ and set LLVM_CONFIG to it (I was hitting this before but no longer)

I needed to set CUDA_ROOT and CUDA_PATH to /usr/local/cuda-11.5/bin

I also have libnvvm.so under /usr/local/cuda-11.5/nvvm/lib64, which is now in my PATH

My toolchain is defaulted to nightly-2021-12-04-x86_64-unknown-linux-gnu to be sure, but I also put the toolchain file in the root of Rust-CUDA in the add example root

I don't see anything from the getting started that I seem to have missed, so, finally, the output of the build now is

   Compiling rustc_codegen_nvvm v0.2.2 (/home/nick/Projects/Rust-CUDA/crates/rustc_codegen_nvvm)
The following warnings were emitted during compilation:

warning: c++: error: unrecognized command line option ‘-Wcovered-switch-default’; did you mean ‘-Wno-switch-default’?
warning: c++: error: unrecognized command line option ‘-Wcovered-switch-default’; did you mean ‘-Wno-switch-default’?
warning: c++: error: unrecognized command line option ‘-Wstring-conversion’; did you mean ‘-Wsign-conversion’?
warning: c++: error: unrecognized command line option ‘-Wstring-conversion’; did you mean ‘-Wsign-conversion’?

error: failed to run custom build command for `rustc_codegen_nvvm v0.2.2 (/home/nick/Projects/Rust-CUDA/crates/rustc_codegen_nvvm)`

Caused by:
  process didn't exit successfully: `/home/nick/Projects/Rust-CUDA/target/debug/build/rustc_codegen_nvvm-95f9f0253c1f9f5e/build-script-build` (exit status: 1)
  --- stdout
  cargo:rerun-if-env-changed=LLVM_CONFIG
  cargo:rerun-if-env-changed=USE_PREBUILT_LLVM
  cargo:rustc-cfg=llvm_component="bitreader"
  cargo:rustc-cfg=llvm_component="bitwriter"
  cargo:rustc-cfg=llvm_component="ipo"
  cargo:rustc-cfg=llvm_component="lto"
  cargo:rustc-cfg=llvm_component="nvptx"
  cargo:rerun-if-env-changed=LLVM_RUSTLLVM
  cargo:rerun-if-changed=rustc_llvm_wrapper
  TARGET = Some("x86_64-unknown-linux-gnu")
  OPT_LEVEL = Some("3")
  HOST = Some("x86_64-unknown-linux-gnu")
  CXX_x86_64-unknown-linux-gnu = None
  CXX_x86_64_unknown_linux_gnu = None
  HOST_CXX = None
  CXX = None
  CXXFLAGS_x86_64-unknown-linux-gnu = None
  CXXFLAGS_x86_64_unknown_linux_gnu = None
  HOST_CXXFLAGS = None
  CXXFLAGS = None
  CRATE_CC_NO_DEFAULTS = None
  DEBUG = Some("true")
  CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
  running: "c++" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-g" "-fno-omit-frame-pointer" "-m64" "-I" "rustc_llvm_wrapper/rustllvm.h" "-I/usr/bin/llvm-linux-x86_64/include" "-fPIC" "-fvisibility-inlines-hidden" "-Werror=date-time" "-Werror=unguarded-availability-new" "-std=c++11" "-Wall" "-Wextra" "-Wno-unused-parameter" "-Wwrite-strings" "-Wcast-qual" "-Wmissing-field-initializers" "-pedantic" "-Wno-long-long" "-Wcovered-switch-default" "-Wnon-virtual-dtor" "-Wdelete-non-virtual-dtor" "-Wstring-conversion" "-ffunction-sections" "-fdata-sections" "-O3" "-DNDEBUG" "-fno-exceptions" "-fno-rtti" "-D_GNU_SOURCE" "-D__STDC_CONSTANT_MACROS" "-D__STDC_FORMAT_MACROS" "-D__STDC_LIMIT_MACROS" "-DLLVM_COMPONENT_BITREADER" "-DLLVM_COMPONENT_BITWRITER" "-DLLVM_COMPONENT_IPO" "-DLLVM_COMPONENT_LTO" "-DLLVM_COMPONENT_NVPTX" "-o" "/home/nick/Projects/Rust-CUDA/target/debug/build/rustc_codegen_nvvm-8af0dea16b2e8647/out/rustc_llvm_wrapper/RustWrapper.o" "-c" "rustc_llvm_wrapper/RustWrapper.cpp"
  running: "c++" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-g" "-fno-omit-frame-pointer" "-m64" "-I" "rustc_llvm_wrapper/rustllvm.h" "-I/usr/bin/llvm-linux-x86_64/include" "-fPIC" "-fvisibility-inlines-hidden" "-Werror=date-time" "-Werror=unguarded-availability-new" "-std=c++11" "-Wall" "-Wextra" "-Wno-unused-parameter" "-Wwrite-strings" "-Wcast-qual" "-Wmissing-field-initializers" "-pedantic" "-Wno-long-long" "-Wcovered-switch-default" "-Wnon-virtual-dtor" "-Wdelete-non-virtual-dtor" "-Wstring-conversion" "-ffunction-sections" "-fdata-sections" "-O3" "-DNDEBUG" "-fno-exceptions" "-fno-rtti" "-D_GNU_SOURCE" "-D__STDC_CONSTANT_MACROS" "-D__STDC_FORMAT_MACROS" "-D__STDC_LIMIT_MACROS" "-DLLVM_COMPONENT_BITREADER" "-DLLVM_COMPONENT_BITWRITER" "-DLLVM_COMPONENT_IPO" "-DLLVM_COMPONENT_LTO" "-DLLVM_COMPONENT_NVPTX" "-o" "/home/nick/Projects/Rust-CUDA/target/debug/build/rustc_codegen_nvvm-8af0dea16b2e8647/out/rustc_llvm_wrapper/PassWrapper.o" "-c" "rustc_llvm_wrapper/PassWrapper.cpp"
  cargo:warning=c++: error: unrecognized command line option ‘-Wcovered-switch-default’; did you mean ‘-Wno-switch-default’?
  cargo:warning=c++: error: unrecognized command line option ‘-Wcovered-switch-default’; did you mean ‘-Wno-switch-default’?
  cargo:warning=c++: error: unrecognized command line option ‘-Wstring-conversion’; did you mean ‘-Wsign-conversion’?
  cargo:warning=c++: error: unrecognized command line option ‘-Wstring-conversion’; did you mean ‘-Wsign-conversion’?
  exit status: 1
  exit status: 1

  --- stderr


  error occurred: Command "c++" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-g" "-fno-omit-frame-pointer" "-m64" "-I" "rustc_llvm_wrapper/rustllvm.h" "-I/usr/bin/llvm-linux-x86_64/include" "-fPIC" "-fvisibility-inlines-hidden" "-Werror=date-time" "-Werror=unguarded-availability-new" "-std=c++11" "-Wall" "-Wextra" "-Wno-unused-parameter" "-Wwrite-strings" "-Wcast-qual" "-Wmissing-field-initializers" "-pedantic" "-Wno-long-long" "-Wcovered-switch-default" "-Wnon-virtual-dtor" "-Wdelete-non-virtual-dtor" "-Wstring-conversion" "-ffunction-sections" "-fdata-sections" "-O3" "-DNDEBUG" "-fno-exceptions" "-fno-rtti" "-D_GNU_SOURCE" "-D__STDC_CONSTANT_MACROS" "-D__STDC_FORMAT_MACROS" "-D__STDC_LIMIT_MACROS" "-DLLVM_COMPONENT_BITREADER" "-DLLVM_COMPONENT_BITWRITER" "-DLLVM_COMPONENT_IPO" "-DLLVM_COMPONENT_LTO" "-DLLVM_COMPONENT_NVPTX" "-o" "/home/nick/Projects/Rust-CUDA/target/debug/build/rustc_codegen_nvvm-8af0dea16b2e8647/out/rustc_llvm_wrapper/RustWrapper.o" "-c" "rustc_llvm_wrapper/RustWrapper.cpp" with args "c++" did not execute successfully (status code exit status: 1).

opened by CircArgs 8

Expose the null stream in cust

Are there any reasons for currently not exposing the null stream in cust?

Would there be any problems to implementing Default::default() for Stream as the null stream?

A lot of traditional C++ cuda code uses the null stream. I think exposing it in rust as well can lower the complexity of having to manage streams yourself. It can also make the on-boarding experience close to what it is in C++.
question wontfix C-cust

opened by kjetilkjeka 5
Implement DeviceCopy for Complex numbers

This PR adds an optional implementation of DeviceCopy for Complex numbers. This can be very valuable for scientific computing. The num_complex::Complex type is already Copy and #[repr(C)].
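
The argument in this PR (a type that is Copy and #[repr(C)] is a reasonable DeviceCopy candidate) can be sketched with std only. Everything below is a stand-in: DeviceCopySketch assumes DeviceCopy is an unsafe marker trait with a Copy supertrait, and the local Complex only mimics the layout described for num_complex::Complex. It is not the code from the PR.

```rust
/// Hypothetical stand-in for cust's `DeviceCopy` marker trait.
/// Implementors promise the value can be copied to the GPU byte for byte.
unsafe trait DeviceCopySketch: Copy {}

/// Local mimic of a complex number with a fixed, C-compatible layout.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct Complex<T> {
    re: T,
    im: T,
}

unsafe impl DeviceCopySketch for f32 {}
unsafe impl DeviceCopySketch for f64 {}

// SAFETY (for this sketch): `Complex<T>` is `repr(C)` and holds only two
// `T` fields, so it contains no host pointers as long as `T` has none.
unsafe impl<T: DeviceCopySketch> DeviceCopySketch for Complex<T> {}

/// Compile-time check: only accepts values whose type carries the marker.
fn assert_device_copy<T: DeviceCopySketch>(_value: &T) {}
```

Calling assert_device_copy(&Complex { re: 1.0f32, im: -2.0 }) compiles, while passing a Vec<f32> is rejected at compile time because Vec is not Copy.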

opened by sebcrozet 4
Requesting help for updating Rust-CUDA to newer rustc version.

Hello everyone. It has been quite a while since the rust-toolchain was last updated. I've been trying to get something up, but am kind of stuck due to lack of knowledge about the rustc codegen API. Can someone maybe take a look at my fork at https://github.com/apriori/Rust-CUDA/tree/update-to-latest-nightly and assist in its completion?

opened by apriori 3
internal compiler error: unknown intrinsic 'raw_eq'
I am interested in trying ed25519 signature verification on CUDA, and was trying to compile the ed25519-dalek crate.

More specifically, for now I just added it as a dependency to gpu/add example to see if there will be any compilation errors.

I had to remove the code related to secret key generation (since I am only interested in verification) and the zeroize crate, since it was causing obvious errors (see https://github.com/andll/ed25519-dalek/commit/132de2d6735117262802845370b098745bf7a68c), but after that I get a compiler error that I don't understand or know how to fix:

error: internal compiler error: crates/rustc_codegen_nvvm/src/intrinsic.rs:430:18: unknown intrinsic 'raw_eq'
thread 'rustc' panicked at 'Box<dyn Any>', compiler/rustc_errors/src/lib.rs:1170:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: `ed25519-dalek` (lib) generated 3 warnings
error: could not compile `ed25519-dalek`; 3 warnings emitted
warning: build failed, waiting for other jobs to finish...
error: atomic fence is not supported, use cuda_std intrinsics instead

I am also not sure whether 'unknown intrinsic' and 'atomic fence is not supported' refer to the same problem or are different issues.

The problem with these errors is that they don't point to the source code that causes them, so someone like me who is not familiar with compiler internals doesn't even know where to look.

Do you have any advice on what in the source can cause this, so that I can try to remove it or replace it with something else?

Is there a way to pinpoint the specific source line that causes the issue?
bug C-rustc_codegen_nvvm
opened by andll 3

Error: a PTX JIT compilation failed

Platform: Jetson Nano 2Gi
Arch: aarch64/arm64
OS: Linux Ubuntu 18.04 / Tegra

# Same output with -sass, -elf, and pretty much any of the other opts/flags for cuobjdump.
cuobjdump -ptx `which cns-rt`
cuobjdump info    : File '/usr/local/bin/cns-rt' does not contain device code

cuda-memcheck --report-api-errors all cns-rt
========= CUDA-MEMCHECK
========= Program hit CUDA_ERROR_INVALID_PTX (error 218) due to "a PTX JIT compilation failed" on CUDA API call to cuModuleLoadData.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 (cuModuleLoadData + 0x114) [0x1d449c]
=========     Host Frame:cns-rt [0x90dc]
=========     Host Frame:cns-rt [0x8d50]
=========     Host Frame:cns-rt [0x7e38]
=========     Host Frame:cns-rt [0x8e04]
=========     Host Frame:cns-rt [0x8e1c]
=========     Host Frame:cns-rt [0x1ea98]
=========     Host Frame:cns-rt [0x82ec]
=========     Host Frame:/lib/aarch64-linux-gnu/libc.so.6 (__libc_start_main + 0xe0) [0x20720]
=========     Host Frame:cns-rt [0x7afc]
=========
Error: "a PTX JIT compilation failed"
========= ERROR SUMMARY: 1 error

EDIT (added the PTX):

//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-30521435
// Cuda compilation tools, release 11.4, V11.4.152
// Based on NVVM 7.0.1
//

.version 7.4
.target sm_61
.address_size 64

	// .globl	add

.visible .entry add(
	.param .u64 add_param_0,
	.param .u64 add_param_1,
	.param .u64 add_param_2,
	.param .u64 add_param_3,
	.param .u64 add_param_4
)
{
	.reg .pred 	%p<3>;
	.reg .f32 	%f<4>;
	.reg .b32 	%r<5>;
	.reg .b64 	%rd<14>;


	ld.param.u64 	%rd2, [add_param_0];
	ld.param.u64 	%rd6, [add_param_1];
	ld.param.u64 	%rd3, [add_param_2];
	ld.param.u64 	%rd4, [add_param_3];
	ld.param.u64 	%rd5, [add_param_4];
	mov.u32 	%r1, %ntid.x;
	mov.u32 	%r2, %ctaid.x;
	mov.u32 	%r3, %tid.x;
	mad.lo.s32 	%r4, %r1, %r2, %r3;
	cvt.u64.u32 	%rd1, %r4;
	setp.ge.u64 	%p1, %rd1, %rd6;
	@%p1 bra 	$L__BB0_4;

	setp.lt.u64 	%p2, %rd1, %rd4;
	@%p2 bra 	$L__BB0_3;
	bra.uni 	$L__BB0_2;

$L__BB0_3:
	cvta.to.global.u64 	%rd7, %rd5;
	shl.b64 	%rd8, %rd1, 2;
	add.s64 	%rd9, %rd7, %rd8;
	cvta.to.global.u64 	%rd10, %rd3;
	add.s64 	%rd11, %rd10, %rd8;
	ld.global.nc.f32 	%f1, [%rd11];
	cvta.to.global.u64 	%rd12, %rd2;
	add.s64 	%rd13, %rd12, %rd8;
	ld.global.nc.f32 	%f2, [%rd13];
	add.f32 	%f3, %f2, %f1;
	st.global.f32 	[%rd9], %f3;

$L__BB0_4:
	ret;

$L__BB0_2:
	trap;

}
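
For readers less used to PTX, here is a rough CPU-side replay of what a single thread appears to do in the listing above. It is a reading aid under assumptions: the parameter roles (a pointer and length for each of two inputs, then an output pointer) are inferred from how the registers are used, and the names add_thread and Exit are made up for this sketch.

```rust
/// How one emulated thread leaves the kernel.
#[derive(Debug, PartialEq)]
enum Exit {
    Ret,  // reached `ret` at $L__BB0_4
    Trap, // reached `trap` at $L__BB0_2
}

/// Replays the `add` entry point for one thread on the CPU.
/// `block_dim`, `block_idx`, and `thread_idx` stand for %ntid.x, %ctaid.x, and %tid.x.
fn add_thread(
    block_dim: u32,
    block_idx: u32,
    thread_idx: u32,
    a: &[f32],
    b: &[f32],
    c: &mut [f32],
) -> Exit {
    // mad.lo.s32 %r4, %r1, %r2, %r3, then cvt.u64.u32 (zero extension)
    let idx = block_dim.wrapping_mul(block_idx).wrapping_add(thread_idx) as usize;
    // setp.ge.u64 %p1, %rd1, %rd6: threads past the end of `a` just return
    if idx >= a.len() {
        return Exit::Ret;
    }
    // setp.lt.u64 %p2, %rd1, %rd4: an index outside `b` ends in `trap`
    if idx >= b.len() {
        return Exit::Trap;
    }
    // The PTX does not bounds-check the output pointer; this host version
    // panics instead if `c` is too short.
    c[idx] = a[idx] + b[idx];
    Exit::Ret
}
```

Running it for every thread index of a launch reproduces the element-wise sum. A `b` shorter than `a` produces Exit::Trap for the affected indices.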

An important note is that this is all compiled in an Ubuntu 18.04 arm64 container with CUDA 11.4, but the binary is then moved to the L4T-runtime container (which is needed for the Jetson device), which only supports CUDA 10.2. The docs in the Getting Started section of this repo seem to indicate that such a setup should be fine ... though I may have misinterpreted that statement.

Any ideas on what is causing this issue?

opened by thedodd 3

DeviceCopy vs. Copy
In the Kernel ABI documentation we learn that we can pass any struct that implements Copy.

However, this seems not to work because cust still has assert_impl_devicecopy, that fails on compile (due to generics):

fn assert_impl_devicecopy<T: $crate::memory::DeviceCopy>(_val: T) {}
if false {
    $(
        assert_impl_devicecopy($arg);
    )*
}

(In functions.rs)

Is this an old remnant, or do we need to use DeviceCopy?
question
opened by tmrlvi 3
Installing llvm 7 on Ubuntu 22.04

I'm currently running Ubuntu 22.04 and have had quite a bit of trouble getting llvm installed. It seems it's not supported anymore, and 20.04 was the last version that supported it. Given 22.04 is the newest LTS version, and llvm 7 is required for this project to work, this seems like a massive problem. Am I missing some more straightforward way of installing this?

opened by jafioti 2
Derive DeviceCopy invariants?

I see that there is no impl DeviceCopy for core/alloc Vec, but I do see the impls for the vek::* types. Is there any particular concern with slapping on an unsafe impl DeviceCopy for the core/alloc Vec type?

I have some data structures which are pure data, no references, nothing like that, but there are vector fields. Would it be sound to impl DeviceCopy for such a type? I’m wondering what the actual requirements are. Just no references and no_std?
question C-cust

opened by thedodd 2
docs: add `output` file hack to tips

When I built my project, rustc_codegen_nvvm rebuilt on every subsequent build of my program, even if I just changed the CPU crate code. Someone on the discord server shared their trick on how to fix it, by touching the output file in the target or rustc_codegen_nvvm and it worked for me, so I added it to the guide.

opened by Nilstrieb 2
Fixes to Dockerfile

Apologies for the minor PRs in succession. I realized there were some minor issues with the Dockerfile that are now fixed and running the add example successfully seems to confirm it is working properly.

Details: The primary issue was that building succeeded, but running failed to create a context. This was somehow related to ld not finding the shared objects despite being in LD_LIBRARY_PATH. I reverted to ldconfig and this fixed the issue.

Starting this PR as a draft to hear opinions on whether the Dockerfile should optionally include optix with a build argument. I have not investigated this much but I can look into it.

opened by CircArgs 2
Port to more current rust-nightly
also port from register_attr to register_tool (approach shamelessly taken from rust-gpu)

add_gpu example working

Deactivate warp submodule as it does trigger some codegen issues

breaking:

Inject "no_std" and "register_tool" crate flags via cuda_builder. The user no longer has to define the respective cfg_attr section in GPU code. Leaving them in GPU code will result in a compile error from cargo.

to be further tested:

more complex examples/projects
opened by apriori 1
[cust_std] Vec struct segfaults unpredictably

I sadly cannot give a reproducible example for this, because it happens unpredictably, but essentially, whenever I use the Vec struct in my kernel code, it ends up giving me a LaunchFailure error after some time.

opened by TudbuT 0
Dynamic Parallelism | implementation strategy

Well ... once again, I find myself in need of another feature. This time, dynamic parallelism.

Looks like this is also part of the C++ runtime API, similar to cooperative groups, for which I already have a PR.

I'm considering using a similar strategy for implementing this feature. I would love to just pin down the PTX, but that has proven to be a bit unclear; however, I will definitely start my search in the PTX ISA and see if there are any quick wins. If not, then probably a similar approach as was taken with the cooperative groups API.

Thoughts?

opened by thedodd 2
Support NixOS / add path fallback logic via `which nvcc`

Right now the crate find_cuda_helper() has a few hard-coded paths it checks on Linux (e.g., /opt/cuda or /usr/local/cuda).

On NixOS however the CUDA toolkit gets installed to a location like /nix/store/n3mnxpif0zxs4ws1pw8spj68l0gzcr9z-cudatoolkit-11.7.0, including libs and includes, and all bin commands are made available in the current PATH.

So to compile CUDA crates on NixOS there should be a fallback in find_cuda_lib_dirs() that, for example, searches for the presence of nvcc (e.g., by running which nvcc) and uses its parent as the return value of that method. (Note, not saying this is the best solution for NixOS, please chime in if you know something better).
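
A minimal sketch of that fallback, using only std. The function names are hypothetical and are not part of find_cuda_helper. It assumes the usual <root>/bin/nvcc layout, so the toolkit root is taken to be two levels above the executable.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

/// Searches the directories of a PATH-style value for a file named `exe`
/// and returns the first match, roughly what `which` does.
/// This sketch does not check the executable bit.
fn find_in_path(exe: &str, path_var: &OsStr) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(exe))
        .find(|candidate| candidate.is_file())
}

/// Derives a toolkit root from the nvcc location, assuming `<root>/bin/nvcc`.
fn cuda_root_from_nvcc(nvcc: &Path) -> Option<PathBuf> {
    nvcc.parent()?.parent().map(Path::to_path_buf)
}

/// Fallback to try once the hard-coded locations have all failed.
fn cuda_root_fallback() -> Option<PathBuf> {
    let path_var = env::var_os("PATH")?;
    let nvcc = find_in_path("nvcc", &path_var)?;
    cuda_root_from_nvcc(&nvcc)
}
```

If nvcc on PATH is a symlink into the store, resolving it with std::fs::canonicalize before taking the parent directories may be needed. Whether that is the right behaviour on NixOS is an open question here.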

opened by ralfbiedert 0
Can't install on WSL
I've spent the last several hours trying to install Rust-CUDA on Windows 11 WSL, Ubuntu 20.04, in a Razer Blade 14 notebook, to no success. The install instructions seem to be outdated and/or unclear. I've installed everything following the GUIDE, which was particularly hard because 1. I'm on WSL, 2. it assumes I know what LLVM_CONFIG is, which I don't; 3. it assumes I know what is libnvvm, which I don't. Regardless, after a lot of struggle, I've managed to install nVidia's drivers and CUDA (on Windows side), and CUDA Toolkit (on WSL), following this guide. Right now, when I run cargo run --bin add on Rust-CUDA, I get the following error:

v@MaiaRazerBlade14:~/Rust-CUDA$ cargo run --bin add
   Compiling nvvm v0.1.1 (/home/v/Rust-CUDA/crates/nvvm)
   Compiling cust_raw v0.11.3 (/home/v/Rust-CUDA/crates/cust_raw)
   Compiling curl-sys v0.4.56+curl-7.83.1
   Compiling curl v0.4.44
   Compiling xz2 v0.1.7
error: failed to run custom build command for `nvvm v0.1.1 (/home/v/Rust-CUDA/crates/nvvm)`

Caused by:
  process didn't exit successfully: `/home/v/Rust-CUDA/target/debug/build/nvvm-f71aa668ff49a3a5/build-script-build` (exit status: 101)
  --- stderr
  thread 'main' panicked at 'Failed to find CUDA ROOT, make sure the CUDA SDK is installed and CUDA_PATH or CUDA_ROOT are set!', crates/find_cuda_helper/src/lib.rs:198:10
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...
error: build failed

After inspecting the code of find_cuda_helper, I've learned it is looking for the following file: /usr/local/cuda/lib/cuda.h. That file is not present. Instead, the contents of /usr/local/cuda-11.8 are:

v@MaiaRazerBlade14:/usr/local/cuda$ ls
cuda-keyring_1.0-1_all.deb  doc  gds  nsight-systems-2022.4.2  targets  version.json

There is an include directory at /usr/local/cuda/targets/x86_64-linux, but that directory does not have a cuda.h file. Instead, cuda.h can be found at /usr/include/cuda.h, but that is not inside a cuda directory, just the isolated file. I also have no idea what libnvvm is, but the following file is present: ./usr/lib/x86_64-linux-gnu/libnvvm.so. Perhaps the way I installed the toolkit results in a different layout from what Rust-CUDA expects? I've managed to get the demo working on WSL via Docker, which, in theory, implies I should be able to install it directly, but I'm not sure how to proceed.
opened by VictorTaelin 1

Releases(0.3)

0.3(Feb 7, 2022)
Top: upcoming MPM engine that runs on CPU and GPU using rust-cuda, Bottom: toy path tracer that can run on CPU, GPU, and GPU (hardware raytracing) using recent experiments with OptiX

Today marks an exciting milestone for the Rust CUDA Project. Over the past couple of months, we have made significant advancements in supporting many of the fundamental CUDA ecosystem libraries. The main changes in this release are the changes to cust that make future library support possible, but we will also be highlighting some of the WIP experiments we have been conducting.

Cust changes

This release is likely to be the biggest and most breaking change to cust ever. We had to fundamentally rework how many things work in order to:

Fix some unsoundness.

Remove some outdated and inconsistent things.

Rework how contexts work to be interoperable with the runtime API.

Therefore, this release is guaranteed to break your code. However, the changes should not break too much unless you did a lot of lower-level work with device memory constructs.

Cust 0.3 changes

TLDR

This release is gigantic, so here are the main things you need to worry about:

Context::create_and_push(FLAGS, device) -> Context::new(device).
Module::from_str(PTX) -> Module::from_ptx(PTX, &[]).

Context handling overhaul

The way that contexts are handled in cust has been completely overhauled. It now uses primary context handling instead of the normal driver API context APIs. This is aimed at future-proofing cust for libraries such as cuBLAS and cuFFT, as well as simplifying the context handling APIs overall. This does mean that the API changed a bit:

create_and_push is now new and it only takes a device, not a device and flags.

set_flags is now used for setting context flags.

ContextStack, UnownedContext, and other legacy APIs are gone.

The old context handling is fully present in cust::context::legacy for anyone who needs it for specific reasons. If you use quick_init, you don't need to worry about any breaking changes; the API is the same.

cust_core

DeviceCopy has now been split into its own crate, cust_core. The crate is #![no_std], which allows you to pull in cust_core in GPU crates for deriving DeviceCopy without cfg shenanigans.

Removed

DeviceBox::wrap, use DeviceBox::from_raw.

DeviceSlice::as_ptr and DeviceSlice::as_mut_ptr. Use DeviceSlice::as_device_ptr then DevicePointer::as_(mut)_ptr.

DeviceSlice::chunks and consequently DeviceChunks.

DeviceSlice::chunks_mut and consequently DeviceChunksMut.

DeviceSlice::from_slice and DeviceSlice::from_slice_mut, because they were unsound.

DevicePointer::as_raw_mut (use DevicePointer::as_mut_ptr).

DevicePointer::wrap (use DevicePointer::from_raw).

DeviceSlice no longer implements Index and IndexMut; switching away from [T] made this impossible to implement. Instead, you can now use DeviceSlice::index, which behaves the same.

vek is no longer re-exported.
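
One way to see why Index no longer fits: Index::index must return a reference (&Self::Output), but a sub-slice of a pointer-plus-length value is a freshly computed value with nowhere to live. The sketch below illustrates that idea with a made-up DeviceSliceSketch type. It is not cust's definition, and the real DeviceSlice::index signature may differ.

```rust
use std::marker::PhantomData;
use std::ops::Range;

/// Simplified stand-in for a device slice: a raw device address plus an
/// element count. The address is never dereferenced on the host.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct DeviceSliceSketch<T> {
    ptr: u64,
    len: usize,
    _marker: PhantomData<T>,
}

impl<T> DeviceSliceSketch<T> {
    fn new(ptr: u64, len: usize) -> Self {
        DeviceSliceSketch { ptr, len, _marker: PhantomData }
    }

    /// Returns the sub-slice covering `range` as a new value.
    /// Panics when the range is out of bounds, like slice indexing does.
    fn index(&self, range: Range<usize>) -> Self {
        assert!(
            range.start <= range.end && range.end <= self.len,
            "range out of bounds for device slice"
        );
        // Advance the device address by whole elements of `T`.
        let offset = (range.start * std::mem::size_of::<T>()) as u64;
        DeviceSliceSketch::new(self.ptr + offset, range.end - range.start)
    }
}
```

Because the result is returned by value, it can be handed to a kernel launch directly. The size of the struct also does not depend on T, since it only stores an address and a length.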

Deprecated

Module::from_str, use Module::from_ptx and pass &[] for options.

Module::load_from_string, use Module::from_ptx_cstr.

Added

cust::memory::LockedBox, same as LockedBuffer except for single elements.

cust::memory::cuda_malloc_async.

cust::memory::cuda_free_async.

impl AsyncCopyDestination<LockedBox<T>> for DeviceBox<T> for async HtoD/DtoH memcpy.

DeviceBox::new_async.

DeviceBox::drop_async.

DeviceBox::zeroed_async.

DeviceBox::uninitialized_async.

DeviceBuffer::uninitialized_async.

DeviceBuffer::drop_async.

DeviceBuffer::zeroed.

DeviceBuffer::zeroed_async.

DeviceBuffer::cast.

DeviceBuffer::try_cast.

DeviceSlice::set_8 and DeviceSlice::set_8_async.

DeviceSlice::set_16 and DeviceSlice::set_16_async.

DeviceSlice::set_32 and DeviceSlice::set_32_async.

DeviceSlice::set_zero and DeviceSlice::set_zero_async.

the bytemuck feature which is enabled by default.

mint integration behind impl_mint.

half integration behind impl_half.

glam integration behind impl_glam.

experimental linux external memory import APIs through cust::external::ExternalMemory.

DeviceBuffer::as_slice.

DeviceVariable, a simple wrapper around DeviceBox<T> and T which allows easy management of a CPU and GPU version of a type.

DeviceMemory, a trait describing any region of GPU memory that can be described with a pointer + a length.

memcpy_htod, a wrapper around cuMemcpyHtoD_v2.

mem_get_info to query the amount of free and total memory.

DevicePointer::as_ptr and DevicePointer::as_mut_ptr for *const T and *mut T.

DevicePointer::from_raw for CUdeviceptr -> DevicePointer<T> with a safe function.

DevicePointer::cast.

dependency on cust_core for DeviceCopy.

ModuleJitOption, JitFallback, JitTarget, and OptLevel for specifying options when loading a module. Note that ModuleJitOption::MaxRegisters does not seem to work currently, but NVIDIA is looking into it. You can achieve the same goal by compiling the ptx to cubin using nvcc then loading that: nvcc --cubin foo.ptx -maxrregcount=REGS

Module::from_fatbin.

Module::from_cubin.

Module::from_ptx and Module::from_ptx_cstr.

Stream, Module, Linker, Function, Event, UnifiedBox, ArrayObject, LockedBuffer, LockedBox, DeviceSlice, DeviceBuffer, and DeviceBox all now impl Send and Sync. This makes it much easier to write multi-GPU code. The CUDA API is fully thread-safe except for graph objects.
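
The nvcc workaround mentioned above for ModuleJitOption::MaxRegisters could be scripted from Rust, for example in a build script. The helper below is a hypothetical sketch that only assembles the quoted command line; it does not run nvcc and is not part of cust.

```rust
use std::path::Path;
use std::process::Command;

/// Builds (without running) `nvcc --cubin <ptx> -maxrregcount=<max_registers>`.
fn nvcc_cubin_command(ptx: &Path, max_registers: u32) -> Command {
    let mut cmd = Command::new("nvcc");
    cmd.arg("--cubin")
        .arg(ptx)
        .arg(format!("-maxrregcount={}", max_registers));
    cmd
}
```

Calling .status() on the returned Command runs the compilation. The resulting cubin can then be loaded with Module::from_cubin instead of passing the PTX to Module::from_ptx.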

Changed

zeroed functions on DeviceBox and others are no longer unsafe and instead now require T: Zeroable. The functions are only available with the bytemuck feature.

Stream::add_callback now internally uses cuLaunchHostFunc anticipating the deprecation and removal of cuStreamAddCallback per the driver docs. This does however mean that the function no longer takes a device status as a parameter and does not execute on context error.

Linker::complete now only returns the built cubin, and not the cubin and a duration.

Features such as vek for implementing DeviceCopy are now impl_cratename, e.g. impl_vek, impl_half, etc.

num-complex integration is now behind impl_num_complex, not num-complex.

DeviceBox now requires T: DeviceCopy (previously it didn't but almost all its methods did).

DeviceBox::from_raw now takes a CUdeviceptr instead of a *mut T.

DeviceBox::as_device_ptr now requires &self instead of &mut self.

DeviceBuffer now requires T: DeviceCopy.

DeviceBuffer is now repr(C) and is represented by a DevicePointer<T> and a usize.

DeviceSlice now requires T: DeviceCopy.

DeviceSlice is now represented as a DevicePointer<T> and a usize (and is repr(C)) instead of [T] which was definitely unsound.

DeviceSlice::as_ptr and DeviceSlice::as_ptr_mut now both return a DevicePointer<T>.

DeviceSlice is now Clone and Copy.

DevicePointer::as_raw now returns a CUdeviceptr, not a *const T (use DevicePointer::as_ptr).

Fixed typo in CudaError, InvalidSouce is now InvalidSource, no more invalid sauce 🍅🥣
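
The move from unsafe zeroed functions to a T: Zeroable bound can be pictured on the host with std only. ZeroableSketch below is a stand-in for bytemuck::Zeroable and zeroed_host is a made-up function; neither is cust's API. The point is that the unsafe promise moves from every call site into the trait implementation.

```rust
/// Stand-in for `bytemuck::Zeroable`: implementors promise that the
/// all-zero bit pattern is a valid value of the type.
unsafe trait ZeroableSketch: Sized {}

// Plain numeric types are valid when all bytes are zero.
unsafe impl ZeroableSketch for u32 {}
unsafe impl ZeroableSketch for f32 {}

/// Safe to call: the trait bound carries the proof obligation.
fn zeroed_host<T: ZeroableSketch>() -> T {
    // SAFETY: `T: ZeroableSketch` guarantees all-zero bytes form a valid `T`.
    unsafe { std::mem::zeroed() }
}
```

A type such as &u32 cannot soundly implement the trait, since a null reference is invalid, so it is rejected at compile time instead of causing undefined behaviour at run time.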

Line tables

The libnvvm codegen can now generate line tables while optimizing (previously it could generate debug info but not optimize), which allows you to debug and profile kernels much better in tools like Nsight Compute. You can enable debug info creation using .debug(DebugInfo::LineTables) with cuda_builder.

OptiX

Thanks to the generous work of @anderslanglands, we were able to get rust-cuda to target hardware raytracing completely in Rust (both for the host and the device). The toy path tracer example has been ported to use hardware RT as a backend. However, optix and optix_device are not published on crates.io yet, since they are still highly experimental.

using hardware rt to render a simple mesh

cuBLAS

Work on supporting cuBLAS through a high-level wrapper library has started. A lot of work was needed in cust to interoperate with cuBLAS, which is a runtime API based library. This required some changes to how cust handles contexts to avoid dropping context resources cuBLAS was using. The library is not yet published but will be once it is more complete. cuBLAS is a big piece of neural network training on the GPU, so it is critical to support it.

cuDNN

@frjnn has been generously working on wrapping the cuDNN library. cuDNN is the primary tool used to train neural networks on the GPU, and the primary tool used by PyTorch and TensorFlow. High-level bindings to cuDNN are a major step toward making machine learning in Rust a viable option. This work is still at an early stage, so it is not published yet. It will be published once it is usable and will likely first be used in neuronika for GPU neural network training.

Atomics

Work on supporting GPU-side atomics in cuda_std has started. Some preliminary work is already published in cuda_std; however, it is still at an early stage and subject to change. Atomics are a difficult issue due to the large number of options available for GPU atomics, including:

Different atomic scopes: device, system, or block.

Specialized or emulated instructions, depending on the compute capability target.

Hardware float atomics (which core does not have).

You can read more about it here.
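
To illustrate the float atomics point, here is a minimal host-side sketch of how a float add can be emulated when only integer atomics are available, as is the case in core. It is not cuda_std's API and says nothing about how the project implements GPU atomics; atomic_add_f32 is a hypothetical helper that stores the f32 as raw bits in an AtomicU32 and retries with compare-exchange.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

/// Atomically adds `value` to the f32 stored as raw bits in `cell`
/// and returns the previous value (hypothetical helper for this sketch).
fn atomic_add_f32(cell: &AtomicU32, value: f32) -> f32 {
    let mut current = cell.load(Ordering::Relaxed);
    loop {
        // Compute the new value from the bits we last observed.
        let new = (f32::from_bits(current) + value).to_bits();
        match cell.compare_exchange_weak(current, new, Ordering::AcqRel, Ordering::Relaxed) {
            // Nobody raced us: the add is committed.
            Ok(previous) => return f32::from_bits(previous),
            // Another thread changed the value (or the weak CAS failed
            // spuriously): retry with the freshly observed bits.
            Err(actual) => current = actual,
        }
    }
}

fn main() {
    let cell = AtomicU32::new(0.0f32.to_bits());

    // Four threads each add 1.0 a thousand times.
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                for _ in 0..1000 {
                    atomic_add_f32(&cell, 1.0);
                }
            });
        }
    });

    let total = f32::from_bits(cell.load(Ordering::Relaxed));
    assert_eq!(total, 4000.0);
    println!("total = {total}");
}
```

A retry loop like this is the kind of emulation that a dedicated hardware float-atomic instruction makes unnecessary, which is part of why the choice of instruction depends on the target.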
0.2 (Dec 5, 2021)
This release marks the start of fixing many of the fundamental issues in the codegen, as well as implementing some of the most needed features for writing performant kernels.

This release mostly covers quality of life changes, bug fixes, and some performance improvements.

Nightly

The required nightly has been updated to 12/4/21. This fixes rust-analyzer sometimes not working.

PTX Backend

DCE (Dead Code Elimination)

DCE has been implemented. We switched to an alternative way of linking dependencies together, which drastically reduces the amount of work libnvvm has to do and removes any globals or functions not directly or indirectly used by kernels. This reduced the PTX size of the path tracer example from about 20 kloc to 2.3 kloc.
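
The idea of keeping only what kernels use, directly or indirectly, can be sketched as a reachability pass over a call graph. This is a toy model under stated assumptions, not the codegen's actual implementation: the graph representation, the function reachable_from_kernels, and all symbol names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Returns every symbol reachable from the given kernel entry points.
/// Anything not in the returned set could be dropped before optimization.
fn reachable_from_kernels<'a>(
    call_graph: &HashMap<&'a str, Vec<&'a str>>,
    kernels: &[&'a str],
) -> HashSet<&'a str> {
    let mut live = HashSet::new();
    let mut worklist: Vec<&'a str> = kernels.to_vec();

    while let Some(symbol) = worklist.pop() {
        // `insert` returns false if the symbol was already visited.
        if live.insert(symbol) {
            if let Some(uses) = call_graph.get(symbol) {
                worklist.extend(uses.iter().copied());
            }
        }
    }
    live
}

fn main() {
    // Each entry maps a symbol to the functions or globals it uses.
    let mut graph: HashMap<&str, Vec<&str>> = HashMap::new();
    graph.insert("render_kernel", vec!["trace_ray", "LOOKUP_TABLE"]);
    graph.insert("trace_ray", vec!["intersect"]);
    graph.insert("intersect", vec![]);
    graph.insert("LOOKUP_TABLE", vec![]);
    graph.insert("unused_helper", vec!["intersect"]);

    let live = reachable_from_kernels(&graph, &["render_kernel"]);

    let mut dead: Vec<&str> = graph.keys().copied().filter(|k| !live.contains(k)).collect();
    dead.sort();
    println!("eliminated: {dead:?}");
}
```

In this model unused_helper is eliminated even though it calls a live function, because nothing reachable from a kernel refers to it.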

Address Spaces

CUDA address spaces have been mostly implemented. Any user-defined static that does not rely on interior mutability is placed in the constant address space (__constant__); otherwise it is placed in the generic address space (which is global for globals). This also allowed us to implement basic static shared memory support.
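
As a rough illustration of the placement rule, the two statics below differ only in whether they use interior mutability. The snippet is plain host Rust so it can run anywhere; the comments describe where each static would be expected to land under the rule stated above, which is an inference from this changelog rather than verified codegen output. All names are hypothetical.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// No interior mutability: under the rule above, a static like this would be
// a candidate for the constant address space (__constant__).
static WEIGHTS: [f32; 4] = [0.1, 0.2, 0.3, 0.4];

// Interior mutability (an atomic): under the rule above, a static like this
// would be placed in the generic address space instead.
static CALL_COUNT: AtomicU32 = AtomicU32::new(0);

/// Computes a weighted sum using the read-only table and bumps the counter.
fn weighted_sum(samples: &[f32; 4]) -> f32 {
    CALL_COUNT.fetch_add(1, Ordering::Relaxed);
    samples.iter().zip(WEIGHTS.iter()).map(|(s, w)| s * w).sum()
}

fn main() {
    let total = weighted_sum(&[1.0, 1.0, 1.0, 1.0]);
    println!("sum = {total}, calls = {}", CALL_COUNT.load(Ordering::Relaxed));
}
```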

Libm override

The codegen automatically overrides calls to libm with calls to libdevice. This allows existing no_std crates to take advantage of architecture-optimized math intrinsics. It can be disabled from cuda_builder if you need strict determinism. The override also noticeably reduces PTX size in math-heavy kernels (3.8 kloc to 2.3 kloc in our path tracer) and slightly reduces register usage, which can yield performance gains.

cuda_std

Added address space query and conversion functions in cuda_std::ptr.

Added #[externally_visible] for making sure the codegen does not eliminate a function if it is not used by a kernel.

Added #[address_space(...)] for making the codegen put a static in a specific address space, mostly internal and unsafe.

Added basic static shared memory support with cuda_std::shared_array!.

Cust

Cust 0.2 was actually released some time ago, but these were the changes in 0.2 and 0.2.1:

Added Device::as_raw.

Added MemoryAdvise for unified memory advising.

Added MemoryAdvise::prefetch_host and MemoryAdvise::prefetch_device for telling CUDA to explicitly fetch unified memory somewhere.

Added MemoryAdvise::advise_read_mostly.

Added MemoryAdvise::preferred_location and MemoryAdvise::unset_preferred_location. Note that advising APIs are only present on high end GPUs such as V100s.

StreamFlags::NON_BLOCKING has been temporarily disabled because of soundness concerns.

Changed GpuBox::as_device_ptr and GpuBuffer::as_device_ptr to take &self instead of &mut self.

Renamed DBuffer -> DeviceBuffer. This is the name it had in rustacuda. It was changed at some point, but we have since reconsidered and decided that the change may have been the wrong choice.

Renamed DBox -> DeviceBox.

Renamed DSlice -> DeviceSlice.

Removed GpuBox::as_device_ptr_mut and GpuBuffer::as_device_ptr_mut.

Removed the accidentally added vek default feature.

The vek feature now uses default-features = false. This also means Rgb and Rgba no longer implement DeviceCopy.
