Extendable HPC-Framework for CUDA, OpenCL and common CPU

Overview

Collenchyma • Join the chat at https://gitter.im/autumnai/collenchyma

Collenchyma is an extensible, pluggable, backend-agnostic framework for parallel, high-performance computations on CUDA, OpenCL and common host CPU. It is fast, easy to build and provides an extensible Rust struct to execute operations on almost any machine, even if it does not have CUDA- or OpenCL-capable devices.

Collenchyma abstracts over the different computation languages (Native, OpenCL, Cuda) and lets you run high-performance code, thanks to easy parallelization, on servers, desktops or mobile devices, without the need to adapt your code for the machine you deploy to. Collenchyma does not require CUDA or OpenCL on the machine and automatically falls back to the native host CPU, making your application highly flexible and fast to build.
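
To make the fallback idea concrete, here is a minimal sketch, assuming a collenchyma build with CUDA support enabled; the Backend constructors are the ones used in the Examples section below, and dispatching actual work to whichever backend you end up with is omitted:

extern crate collenchyma as co;
use co::prelude::*;

fn main() {
    // Try to create a CUDA backend; if no CUDA-capable device or driver is
    // available, gracefully fall back to the native host CPU backend.
    match Backend::<Cuda>::default() {
        Ok(_gpu_backend) => println!("Running computations on the CUDA backend."),
        Err(_) => {
            let _cpu_backend = Backend::<Native>::default().unwrap();
            println!("No CUDA device available, falling back to the native CPU backend.");
        }
    }
}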

Collenchyma was started at Autumn to support the Machine Intelligence Framework Leaf with backend-agnostic, state-of-the-art performance.

  • Parallelizing Performance
    Collenchyma makes it easy to parallelize computations on your machine, putting all the available cores of your CPUs/GPUs to use. Collenchyma provides optimized operations through Plugins that you can use right away to speed up your application.

  • Easily Extensible
    Writing custom operations for GPU execution becomes easy with Collenchyma, as it already takes care of framework peculiarities, memory management, safety and other overhead. Collenchyma provides Plugins (see examples below) that you can use to extend the Collenchyma backend with your own, business-specific operations.

  • Butter-smooth Builds
    As Collenchyma does not require the installation of various frameworks and libraries, it will not add significantly to the build time of your application. Collenchyma checks at run-time if these frameworks can be used and gracefully falls back to the standard, native host CPU if they are not. No long and painful build procedures for you or your users.

For more information, see the documentation.

Disclaimer: Collenchyma is currently in an early and very active stage of development. If you are experiencing bugs that are not simply due to features that have not been implemented yet, feel free to create an issue.

Getting Started

If you're using Cargo, just add Collenchyma to your Cargo.toml:

[dependencies]
collenchyma = "0.0.8"

If you're using Cargo Edit, you can call:

$ cargo add collenchyma

Plugins

You can easily extend Collenchyma's Backend with more backend-agnostic operations through Plugins. Plugins provide a set of related operations - BLAS would be a good example. To extend Collenchyma's Backend with operations from a Plugin, just add the desired Plugin crate to your Cargo.toml file (see the snippet below). Here is a list of available Collenchyma Plugins that you can use right away in your own application, or take as a starting point if you would like to create your own Plugin.

  • BLAS - Collenchyma plugin for backend-agnostic Basic Linear Algebra Subprogram Operations.
  • NN - Collenchyma plugin for Neural Network related algorithms.
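
For example, pulling in the NN Plugin is a one-line addition to your dependencies (the version is illustrative - check crates.io for the latest release):

[dependencies]
collenchyma = "0.0.8"
collenchyma-nn = "*"

After that, the Plugin's operations (such as the sigmoid used in the example below) are available on the Backend.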

You can easily write your own backend-agnostic, parallel operations and provide them to others via a Plugin - we are happy to feature your Plugin here, just send us a PR.

Examples

Collenchyma comes without any operations. The following example therefore assumes that you have added both collenchyma and the Collenchyma Plugin collenchyma-nn to your Cargo manifest.

extern crate collenchyma as co;
extern crate collenchyma_nn as nn;
use co::prelude::*;
use nn::*;

fn write_to_memory<T: Copy>(mem: &mut MemoryType, data: &[T]) {
    if let &mut MemoryType::Native(ref mut mem) = mem {
        let mut mem_buffer = mem.as_mut_slice::<T>();
        for (index, datum) in data.iter().enumerate() {
            mem_buffer[index] = *datum;
        }
    }
}

fn main() {
    // Initialize a CUDA Backend.
    let backend = Backend::<Cuda>::default().unwrap();
    // Initialize two SharedTensors.
    let mut x = SharedTensor::<f32>::new(backend.device(), &(1, 1, 3)).unwrap();
    let mut result = SharedTensor::<f32>::new(backend.device(), &(1, 1, 3)).unwrap();
    // Fill `x` with some data.
    let payload: &[f32] = &::std::iter::repeat(1f32).take(x.capacity()).collect::<Vec<f32>>();
    let native = Backend::<Native>::default().unwrap();
    x.add_device(native.device()).unwrap(); // Add native host memory
    x.sync(native.device()).unwrap(); // Sync to native host memory
    write_to_memory(x.get_mut(native.device()).unwrap(), payload); // Write to native host memory.
    x.sync(backend.device()).unwrap(); // Sync the data to the CUDA device.
    // Run the sigmoid operation, provided by the NN Plugin, on your CUDA enabled GPU.
    backend.sigmoid(&mut x, &mut result).unwrap();
    // See the result.
    result.add_device(native.device()).unwrap(); // Add native host memory
    result.sync(native.device()).unwrap(); // Sync the result to host memory.
    println!("{:?}", result.get(native.device()).unwrap().as_native().unwrap().as_slice::<f32>());
}

Contributing

Want to contribute? Awesome! We have instructions to help you get started contributing code or documentation, as well as high-priority issues that we could use your help with.

We have a mostly real-time collaboration culture, which happens here on GitHub and in the Collenchyma Gitter channel. You can also reach out to the maintainers {@MJ, @hobofan}.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as below, without any additional terms or conditions.

Changelog

You can find the release history in the root file CHANGELOG.md.

A changelog is a log or record of all the changes made to a project, such as a website or software project, usually including such records as bug fixes, new features, etc. - Wikipedia

We are using Clog, the Rust tool for auto-generating CHANGELOG files.

License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
  • MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Comments
  • Move to stable

    Move to stable

    This PR removes unnecessary use of feature flags that bar building with stable Rust compilers ([1])([2]).

    Reasons being:

    • associated consts (Tracking issue: https://github.com/rust-lang/rust/issues/29646) are currently buggy and can be replaced by an fn in that case.
    • associated_type_defaults can be removed without breakage
    • unboxed_closures can be removed without breakage
    • StaticMutex has an uncertain future (https://github.com/rust-lang/rust/issues/27717) and can be emulated in that case by using lazy_static! (correct me if I'm wrong)

    Finally, I must admit that I didn't get the test suite running quickly.

    ([1]) Outstanding: this doesn't quite work on stable yet, as some APIs in use are currently making their way through beta, so they are not feature gated, but also not available in 1.4.0. 1.5.0 beta works, and as 1.5.0 is 2 weeks away, this is probably not worth the effort.

    ([2]) rblas is not on stable yet, see https://github.com/mikkyang/rust-blas/pull/12 for that. You can use that version of rust-blas by checking it out from my https://github.com/skade/rust-blas/ and dropping the following .cargo/config in your repository:

    paths = ["/path/to/rblas/checkout"]
    
    opened by skade 7
  • feat/tensor: implement IntoTensorDesc for [usize; N], N=1...6

    feat/tensor: implement IntoTensorDesc for [usize; N], N=1...6

    Currently dimensions of a tensor can be specified with usize, tuples, Vecs and slices:

    SharedTensor::new(backend, &10)
    SharedTensor::new(backend, &(10, 2))
    SharedTensor::new(backend, &vec![10, 2])
    

    In cases like this, vec! causes an unneeded allocation and is a bit more verbose than necessary, and the usize/tuple syntax looks somewhat irregular. It would be nice to be able to express tensor creation like this:

    SharedTensor::new(backend, &[10, 2])
    

    But Rust doesn't autocoerce &[usize; _] into &[usize]. This patch adds explicit implementations to make this use case work.
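
    The implementations themselves are mechanical. Here is a minimal sketch of the approach; the trait below is a simplified stand-in, so its exact shape is an assumption rather than collenchyma's real IntoTensorDesc:

    // Simplified stand-in for the tensor-description trait.
    pub trait IntoTensorDesc {
        fn into(&self) -> Vec<usize>;
    }

    // One explicit impl per fixed array length, so that `&[10, 2]` can be used
    // directly without coercing the array reference into a slice.
    macro_rules! impl_into_tensor_desc_for_arrays {
        ($($n:expr),*) => {
            $(
                impl IntoTensorDesc for [usize; $n] {
                    fn into(&self) -> Vec<usize> {
                        self.to_vec()
                    }
                }
            )*
        };
    }

    impl_into_tensor_desc_for_arrays!(1, 2, 3, 4, 5, 6);

    fn main() {
        // With the impls above, an array literal works as a shape descriptor.
        let dims = IntoTensorDesc::into(&[10usize, 2]);
        println!("{:?}", dims); // [10, 2]
    }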

    Package version is also bumped to make it possible to depend on this feature.

    opened by alexandermorozov 5
  • Use size types that depend on the target architecture.

    Use size types that depend on the target architecture.

    The build broke on my side due to a mismatched type between u32 and u64 in the CUDA bindings.

    This replaces u32 and u64 with usize (or libc::size_t) where the expected type is pointer-sized. It should make the build more robust across different target architectures.
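
    As an illustration of the idea, a hedged sketch - the extern function below is hypothetical, not one of the actual CUDA bindings in this crate:

    extern "C" {
        // Hypothetical binding: the byte count, which is pointer-sized on the C
        // side (size_t), is declared as `usize` instead of a hard-coded u32 or
        // u64, so the signature is correct on both 32-bit and 64-bit targets.
        fn device_mem_alloc(ptr: *mut *mut std::os::raw::c_void, bytes: usize) -> i32;
    }

    fn main() {
        // Sizes stay `usize` throughout the Rust API, so no lossy casts are
        // needed at the FFI boundary regardless of the target architecture.
        let bytes: usize = 1024 * std::mem::size_of::<f32>();
        println!("would request {} bytes via device_mem_alloc", bytes);
    }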

    opened by pdib 5
  • Relicense under dual MIT/Apache-2.0

    Relicense under dual MIT/Apache-2.0

    The project maintainer explicitly asked for this issue to be opened.

    TL;DR the Rust ecosystem is largely Apache-2.0. Being available under that license is good for interoperation. The MIT license as an add-on can be nice for GPLv2 projects to use your code.

    Why?

    The MIT license requires reproducing countless copies of the same copyright header with different names in the copyright field, for every MIT library in use. The Apache license does not have this drawback. However, this is not the primary motivation for me creating these issues. The Apache license also has protections from patent trolls and an explicit contribution licensing clause. However, the Apache license is incompatible with GPLv2. This is why Rust is dual-licensed as MIT/Apache (the "primary" license being Apache, MIT only for GPLv2 compat), and doing so would be wise for this project. This also makes this crate suitable for inclusion and unrestricted sharing in the Rust standard distribution and other projects using dual MIT/Apache, such as my personal ulterior motive, the Robigalia project.

    Some ask, "Does this really apply to binary redistributions? Does MIT really require reproducing the whole thing?" I'm not a lawyer, and I can't give legal advice, but some Google Android apps include open source attributions using this interpretation. Others also agree with it. But, again, the copyright notice redistribution is not the primary motivation for the dual-licensing. It's the stronger protections for licensees and better interoperation with the wider Rust ecosystem.

    How?

    To do this, get explicit approval from each contributor of copyrightable work (as not all contributions qualify for copyright, due to not being a "creative work", e.g. a typo fix) and then add the following to your README:

    ## License
    
    Licensed under either of
    
     * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
     * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
    
    at your option.
    
    ### Contribution
    
    Unless you explicitly state otherwise, any contribution intentionally submitted
    for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
    additional terms or conditions.
    

    and in your license headers, if you have them, use the following boilerplate (based on that used in Rust):

    // Copyright 2016 collenchyma Developers
    //
    // Licensed under the Apache License, Version 2.0, <LICENSE-APACHE or
    // http://apache.org/licenses/LICENSE-2.0> or the MIT license <LICENSE-MIT or
    // http://opensource.org/licenses/MIT>, at your option. This file may not be
    // copied, modified, or distributed except according to those terms.
    

    It's commonly asked whether license headers are required. I'm not comfortable making an official recommendation either way, but the Apache license recommends it in their appendix on how to use the license.

    Be sure to add the relevant LICENSE-{MIT,APACHE} files. You can copy these from the Rust repo for a plain-text version.

    And don't forget to update the license metadata in your Cargo.toml to:

    license = "MIT OR Apache-2.0"
    

    I'll be going through projects which agree to be relicensed and have approval by the necessary contributors and doing these changes, so feel free to leave the heavy lifting to me!

    Contributor checkoff

    To agree to relicensing, comment with :

    I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to chose either at their option.
    

    Or, if you're a contributor, you can check the box in this repo next to your name. My scripts will pick this exact phrase up and check your checkbox, but I'll come through and manually review this issue later as well.

    • [x] @MichaelHirn
    • [x] @hobofan
    • [x] @homu
    • [x] @skade
    • [x] @pdib
    opened by emberian 4
  • for macos

    for macos

    For OS X, you need the link kind key set to "framework".

    This patch, together with adding a build.rs like the one below, made compilation succeed on a Mac.

    fn main() {
        // Set this to the directory that contains your CUDA libraries.
        let your_cuda_lib_path = "";
        println!("cargo:rustc-link-search=native={}", your_cuda_lib_path);
    }
    
    opened by y-ich 3
  • Release numbers for the benchmarks against other frameworks

    Release numbers for the benchmarks against other frameworks

    Hi, I watched the talk at Rust Berlin, but I did not find any published numbers about the performance of the framework. The talk also claimed it is twice as fast as TensorFlow and Caffe, but it was not specified on which device, and I did not hear any comparison with Torch, Theano and MXNet, which are de facto the current leaders in ML performance. It would be really great if you could publish the actual numbers, as well as the code for the other frameworks you are benchmarking against.

    opened by botev 3
  • Non-exhaustive pattern: type &device::DeviceType is non-empty

    Non-exhaustive pattern: type &device::DeviceType is non-empty

    I am getting the following error when I try to build collenchyma on Windows 10:

    C:\Users\Maplicant\.cargo\registry\src\github.com-1ecc6299db9ec823\collenchyma-0.0.8\src\tensor.rs:326:21: 348:22 error: non-exhaustive patterns: type &device::DeviceType is non-empty [E0002]
    C:\Users\Maplicant\.cargo\registry\src\github.com-1ecc6299db9ec823\collenchyma-0.0.8\src\tensor.rs:326                     match destination {
                                                                                                                         ^
    C:\Users\Maplicant\.cargo\registry\src\github.com-1ecc6299db9ec823\collenchyma-0.0.8\src\tensor.rs:326:21: 348:22 help: run `rustc --explain E0002` to see a detailed explanation
    C:\Users\Maplicant\.cargo\registry\src\github.com-1ecc6299db9ec823\collenchyma-0.0.8\src\tensor.rs:326:21: 348:22 help: Please ensure that all possible cases are being handled; possibly adding wildcards or more match arms.
    C:\Users\Maplicant\.cargo\registry\src\github.com-1ecc6299db9ec823\collenchyma-0.0.8\src\tensor.rs:326                     match destination {
                                                                                                                         ^
    error: aborting due to previous error
    

    I'm guessing this has to do with the match statement at tensor.rs:342 not having a default case, but I'm not sure.
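
    For reference, a minimal reproduction of this class of error - not the actual code in tensor.rs, and the enum below is simplified: a match on an enum that does not cover every variant fails to compile until a wildcard or additional arms are added.

    enum DeviceType {
        Native,
        Cuda,
        OpenCl,
    }

    fn describe(device: &DeviceType) -> &'static str {
        match device {
            &DeviceType::Native => "host CPU",
            &DeviceType::Cuda => "CUDA device",
            // Without this wildcard (or an explicit OpenCl arm) the match is
            // non-exhaustive and rustc rejects it.
            _ => "other device",
        }
    }

    fn main() {
        println!("{}", describe(&DeviceType::Cuda));
    }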

    opened by ghost 0
  • Backend decoupling and typed memory interface

    Backend decoupling and typed memory interface

    I've implemented the feature related to decoupling from #37. The main commit can be viewed here. Below is the commit message for convenience:

    Change the SharedTensor::read() signature from fn read(&self, device: &DeviceType) -> Result<&MemoryType, ...> into fn read<D: IDevice>(&self, device: &D) -> Result<&D::M, ...>. The new signature provides a type-level guarantee that if a Cuda device is passed into read(), it will return Cuda memory (and not Native or OpenCL). The previously required additional unwraps (.as_native().unwrap()) are no longer needed, and the code is clearer and more concise.

    Internally SharedTensor uses the Any type to store objects of different types uniformly. Synchronization between memories is also done through a type-erased interface. This makes it possible to define a new Framework in an external crate, or to extract the Cuda and OpenCL frameworks into their own crates, though error types would require some additional work.
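
    A rough sketch of the idea (names and signatures simplified; IDevice, NativeDevice and NativeMemory below are stand-ins, not the exact collenchyma types):

    use std::any::Any;
    use std::collections::HashMap;

    trait IDevice {
        // Each device type knows its concrete memory type.
        type M: Any;
        fn id(&self) -> usize;
    }

    struct NativeDevice;
    struct NativeMemory(Vec<u8>);

    impl IDevice for NativeDevice {
        type M = NativeMemory;
        fn id(&self) -> usize { 0 }
    }

    struct SharedTensor {
        // Copies for all devices are stored type-erased behind `Any` ...
        copies: HashMap<usize, Box<dyn Any>>,
    }

    impl SharedTensor {
        // ... but `read` recovers the concrete memory type from the device
        // type, so callers no longer need `.as_native().unwrap()`-style casts.
        fn read<D: IDevice>(&self, device: &D) -> Option<&D::M> {
            self.copies.get(&device.id())?.downcast_ref::<D::M>()
        }
    }

    fn main() {
        let mut tensor = SharedTensor { copies: HashMap::new() };
        tensor.copies.insert(0, Box::new(NativeMemory(vec![0u8; 4])));
        let mem = tensor.read(&NativeDevice).unwrap();
        println!("native memory holds {} bytes", mem.0.len());
    }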

    Use of "dynamic typing" has drawbacks -- mainly slightly larger runtime overhead. Before this patch benchmarks showed that SharedTensor::read() takes 19-22ns, now it takes 23-26ns. For comparison, minimal synchronized CUDA operation will take about 10-40us. Small NN layers on CPU are much faster, e.g. 10-input softmax layer takes about 500ns. Still, in typical NNs overhead looks negligible, and I think it's fair tradeoff for code clarity and better decoupling.

    Here are actual benches, before:

    test bench_shared_tensor_access_time_first                            ... bench:          19 ns/iter (+/- 2)
    test bench_shared_tensor_access_time_second                           ... bench:          21 ns/iter (+/- 0)
    

    after:

    test bench_shared_tensor_access_time_first                        ... bench:          23 ns/iter (+/- 0)
    test bench_shared_tensor_access_time_second                       ... bench:          26 ns/iter (+/- 3)
    

    What's your opinion on it?

    opened by alexandermorozov 0
  • Refactor synchronization

    Refactor synchronization

    I've implemented the memory access API and synchronization based on bitmasks. Tensor/TensorView and decoupling aren't implemented.
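
    To illustrate what bitmask-based bookkeeping can look like, here is a toy sketch of the general idea (an assumption for illustration, not the actual PR code): bit i of the mask is set when the copy on device i is up to date.

    struct SyncState {
        up_to_date: u64,
    }

    impl SyncState {
        // After a mutable access on `device_index`, only that copy stays valid.
        fn mark_dirty_except(&mut self, device_index: usize) {
            self.up_to_date = 1 << device_index;
        }

        // A completed transfer marks the destination copy as current.
        fn mark_synced(&mut self, device_index: usize) {
            self.up_to_date |= 1 << device_index;
        }

        fn needs_sync(&self, device_index: usize) -> bool {
            self.up_to_date & (1 << device_index) == 0
        }
    }

    fn main() {
        let mut state = SyncState { up_to_date: 1 }; // only device 0 is current
        assert!(state.needs_sync(1));
        state.mark_synced(1);       // e.g. after copying data to device 1
        assert!(!state.needs_sync(1));
        state.mark_dirty_except(1); // mutable access on device 1
        assert!(state.needs_sync(0));
        println!("bitmask bookkeeping behaves as expected");
    }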

    Native and CUDA pass all tests. OpenCL compiles but segfaults on my machine, both with this PR and without it.

    PR isn't ready to be merged yet -- I'd like to fix plugins and Leaf first to see that there are no unexpected problems.

    opened by alexandermorozov 0
  • Support Windows

    Support Windows

    With CUDA 7.0/7.5 I get this error:

    collenchyma> cargo test
         Running target\debug\backend_specs-5e76dbe75e34191a.exe

    running 5 tests
    test backend_spec::native::it_can_use_ibackend_trait_object ... ok
    test backend_spec::native::it_can_create_default_backend ... ok

    After that, backend_specs-randumnumber.exe stopped working.

    GDB output:

    running 5 tests
    [New Thread 16076.0x1804]
    [New Thread 16076.0x3c24]
    [New Thread 16076.0x315c]
    [New Thread 16076.0x34c0]
    [New Thread 16076.0x32d4]
    [New Thread 16076.0x42ec]
    [New Thread 16076.0x34e4]
    [New Thread 16076.0x2a64]
    [New Thread 16076.0xe3c]

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 16076.0x34e4]
    0x000000000051b670 in cuInit ()

    Stack frame:

    #0  0x000000000051b670 in cuInit ()
    #1  0x00000000004874ec in collenchyma::frameworks::cuda::api::driver::utils::API::ffi_init ()
        at src\frameworks\cuda\api\driver/utils.rs:15
    #2  0x0000000000472fcb in collenchyma::frameworks::cuda::api::driver::utils::API::init ()
        at src\frameworks\cuda\api\driver/utils.rs:11
    #3  0x0000000000472bfd in collenchyma::frameworks::cuda::Cuda.IFramework::new ()
        at src\frameworks\cuda/mod.rs:46
    #4  0x0000000000406dfe in backend_specs::backend::IBackend::default<co::backend::Backend<co::frameworks::cuda::Cuda>> ()
        at src/backend.rs:106
    #5  0x0000000000406cf0 in backend_specs::backend_spec::cuda::it_can_create_default_backend ()
        at tests/backend_specs.rs:37
    #6  0x0000000000429e29 in boxed::F.FnBox$LT$A$GT$::call_box::h809104141212590368 ()
    #7  0x000000000042c844 in sys_common::unwind::try::try_fn::h15534802711051563033 ()
    #8  0x00000000004c471b in sys_common::unwind::try::inner_try::h3ae51230bca914caH9s ()
    #9  0x000000000042cbfb in boxed::F.FnBox$LT$A$GT$::call_box::h11947310459896948609 ()
    #10 0x00000000004d522e in sys::thread::Thread::new::thread_start::h24036cb9e2bd1c0fLey ()
    #11 0x00007ffcfe848102 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\system32\kernel32.dll
    #12 0x00007ffd00b0c5b4 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
    #13 0x0000000000000000 in ?? ()

    Backtrace stopped: previous frame inner to this frame (corrupt stack?)

    opened by gallexme 2