Extendable HPC-Framework for CUDA, OpenCL and common CPU

Overview

Collenchyma • Join the chat at https://gitter.im/autumnai/collenchyma

Collenchyma is an extensible, pluggable, backend-agnostic framework for parallel, high-performance computations on CUDA, OpenCL and common host CPU. It is fast, easy to build and provides an extensible Rust struct to execute operations on almost any machine, even if it does not have CUDA- or OpenCL-capable devices.

Collenchyma abstracts over the different computation languages (Native, OpenCL, Cuda) and lets you run high-performance code, thanks to easy parallelization, on servers, desktops or mobile devices, without the need to adapt your code for the machine you deploy to. Collenchyma does not require CUDA or OpenCL on the machine and automatically falls back to the native host CPU, making your application highly flexible and fast to build.
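
To make the fallback idea concrete, here is a minimal sketch, assuming a collenchyma build with CUDA support enabled; the Backend constructors are the ones used in the Examples section below, and dispatching actual work to whichever backend you end up with is omitted:

extern crate collenchyma as co;
use co::prelude::*;

fn main() {
    // Try to create a CUDA backend; if no CUDA-capable device or driver is
    // available, gracefully fall back to the native host CPU backend.
    match Backend::<Cuda>::default() {
        Ok(_gpu_backend) => println!("Running computations on the CUDA backend."),
        Err(_) => {
            let _cpu_backend = Backend::<Native>::default().unwrap();
            println!("No CUDA device available, falling back to the native CPU backend.");
        }
    }
}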

Collenchyma was started at Autumn to support the Machine Intelligence Framework Leaf with backend-agnostic, state-of-the-art performance.

  • Parallelizing Performance
    Collenchyma makes it easy to parallelize computations on your machine, putting all the available cores of your CPUs/GPUs to use. Collenchyma provides optimized operations through Plugins that you can use right away to speed up your application.

  • Easily Extensible
    Writing custom operations for GPU execution becomes easy with Collenchyma, as it already takes care of framework peculiarities, memory management, safety and other overhead. Collenchyma provides Plugins (see examples below) that you can use to extend the Collenchyma backend with your own, business-specific operations.

  • Butter-smooth Builds
    As Collenchyma does not require the installation of various frameworks and libraries, it will not add significantly to the build time of your application. Collenchyma checks at run-time if these frameworks can be used and gracefully falls back to the standard, native host CPU if they are not. No long and painful build procedures for you or your users.

For more information, see the documentation.

Disclaimer: Collenchyma is currently in an early and very active stage of development. If you are experiencing bugs that are not simply due to features that have not been implemented yet, feel free to create an issue.

Getting Started

If you're using Cargo, just add Collenchyma to your Cargo.toml:

[dependencies]
collenchyma = "0.0.8"

If you're using Cargo Edit, you can call:

$ cargo add collenchyma

Plugins

You can easily extend Collenchyma's Backend with more backend-agnostic operations through Plugins. Plugins provide a set of related operations - BLAS would be a good example. To extend Collenchyma's Backend with operations from a Plugin, just add the desired Plugin crate to your Cargo.toml file (see the snippet below). Here is a list of available Collenchyma Plugins that you can use right away in your own application, or take as a starting point if you would like to create your own Plugin.

  • BLAS - Collenchyma plugin for backend-agnostic Basic Linear Algebra Subprogram Operations.
  • NN - Collenchyma plugin for Neural Network related algorithms.
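
For example, pulling in the NN Plugin is a one-line addition to your dependencies (the version is illustrative - check crates.io for the latest release):

[dependencies]
collenchyma = "0.0.8"
collenchyma-nn = "*"

After that, the Plugin's operations (such as the sigmoid used in the example below) are available on the Backend.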

You can easily write your own backend-agnostic, parallel operations and provide them to others via a Plugin - we are happy to feature your Plugin here, just send us a PR.

Examples

Collenchyma comes without any operations. The following example therefore assumes that you have added both collenchyma and the Collenchyma Plugin collenchyma-nn to your Cargo manifest.

extern crate collenchyma as co;
extern crate collenchyma_nn as nn;
use co::prelude::*;
use nn::*;

fn write_to_memory<T: Copy>(mem: &mut MemoryType, data: &[T]) {
    if let &mut MemoryType::Native(ref mut mem) = mem {
        let mut mem_buffer = mem.as_mut_slice::<T>();
        for (index, datum) in data.iter().enumerate() {
            mem_buffer[index] = *datum;
        }
    }
}

fn main() {
    // Initialize a CUDA Backend.
    let backend = Backend::<Cuda>::default().unwrap();
    // Initialize two SharedTensors.
    let mut x = SharedTensor::<f32>::new(backend.device(), &(1, 1, 3)).unwrap();
    let mut result = SharedTensor::<f32>::new(backend.device(), &(1, 1, 3)).unwrap();
    // Fill `x` with some data.
    let payload: &[f32] = &::std::iter::repeat(1f32).take(x.capacity()).collect::<Vec<f32>>();
    let native = Backend::<Native>::default().unwrap();
    x.add_device(native.device()).unwrap(); // Add native host memory
    x.sync(native.device()).unwrap(); // Sync to native host memory
    write_to_memory(x.get_mut(native.device()).unwrap(), payload); // Write to native host memory.
    x.sync(backend.device()).unwrap(); // Sync the data to the CUDA device.
    // Run the sigmoid operation, provided by the NN Plugin, on your CUDA enabled GPU.
    backend.sigmoid(&mut x, &mut result).unwrap();
    // See the result.
    result.add_device(native.device()).unwrap(); // Add native host memory
    result.sync(native.device()).unwrap(); // Sync the result to host memory.
    println!("{:?}", result.get(native.device()).unwrap().as_native().unwrap().as_slice::<f32>());
}

Contributing

Want to contribute? Awesome! We have instructions to help you get started contributing code or documentation, as well as high-priority issues that we could use your help with.

We have a mostly real-time collaboration culture, which happens here on GitHub and in the Collenchyma Gitter channel. You can also reach out to the maintainers {@MJ, @hobofan}.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as below, without any additional terms or conditions.

Changelog

You can find the release history in the root file CHANGELOG.md.

A changelog is a log or record of all the changes made to a project, such as a website or software project, usually including such records as bug fixes, new features, etc. - Wikipedia

We are using Clog, the Rust tool for auto-generating CHANGELOG files.

License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
  • MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Comments
  • Move to stable

    Move to stable

    This PR removes unnecessary use of feature flags that bar building with stable Rust compilers ([1])([2]).

    Reasons being:

    • associated consts (Tracking issue: https://github.com/rust-lang/rust/issues/29646) are currently buggy and can be replaced by an fn in that case.
    • associated_type_defaults can be removed without breakage
    • unboxed_closures can be removed without breakage
    • StaticMutex has an uncertain future (https://github.com/rust-lang/rust/issues/27717) and can be emulated in that case by using lazy_static! (correct me if I'm wrong)

    Finally, I must admit that I didn't get the test suite running quickly.

    ([1]) Outstanding: this doesn't quite work on stable yet, as some APIs in use are currently making their way through beta, so they are not feature gated, but also not available in 1.4.0. 1.5.0 beta works, and as 1.5.0 is 2 weeks away, this is probably not worth the effort.

    ([2]) rblas is not on stable yet, see https://github.com/mikkyang/rust-blas/pull/12 for that. You can use that version of rust-blas by checking it out from my https://github.com/skade/rust-blas/ and dropping the following .cargo/config in your repository:

    paths = ["/path/to/rblas/checkout"]
    
    opened by skade 7
  • feat/tensor: implement IntoTensorDesc for [usize; N], N=1...6

    feat/tensor: implement IntoTensorDesc for [usize; N], N=1...6

    Currently dimensions of a tensor can be specified with usize, tuples, Vecs and slices:

    SharedTensor::new(backend, &10)
    SharedTensor::new(backend, &(10, 2))
    SharedTensor::new(backend, &vec![10, 2])
    

    In cases like this, vec! causes an unneeded allocation and is a bit more verbose than necessary, and the usize/tuple syntax looks somewhat irregular. It would be nice to be able to express tensor creation like this:

    SharedTensor::new(backend, &[10, 2])
    

    But Rust doesn't autocoerce &[usize; _] into &[usize]. This patch adds explicit implementations to make this use case work.
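
    The implementations themselves are mechanical. Here is a minimal sketch of the approach; the trait below is a simplified stand-in, so its exact shape is an assumption rather than collenchyma's real IntoTensorDesc:

    // Simplified stand-in for the tensor-description trait.
    pub trait IntoTensorDesc {
        fn into(&self) -> Vec<usize>;
    }

    // One explicit impl per fixed array length, so that `&[10, 2]` can be used
    // directly without coercing the array reference into a slice.
    macro_rules! impl_into_tensor_desc_for_arrays {
        ($($n:expr),*) => {
            $(
                impl IntoTensorDesc for [usize; $n] {
                    fn into(&self) -> Vec<usize> {
                        self.to_vec()
                    }
                }
            )*
        };
    }

    impl_into_tensor_desc_for_arrays!(1, 2, 3, 4, 5, 6);

    fn main() {
        // With the impls above, an array literal works as a shape descriptor.
        let dims = IntoTensorDesc::into(&[10usize, 2]);
        println!("{:?}", dims); // [10, 2]
    }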

    Package version is also bumped to make it possible to depend on this feature.

    opened by alexandermorozov 5
  • Use size types that depend on the target architecture.

    Use size types that depend on the target architecture.

    The build broke on my side due to a mismatched type between u32 and u64 in the CUDA bindings.

    This replaces u32 and u64 with usize (or libc::size_t) where the expected type is pointer-sized. It should make the build more robust across different target architectures.
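
    As an illustration of the idea, a hedged sketch - the extern function below is hypothetical, not one of the actual CUDA bindings in this crate:

    extern "C" {
        // Hypothetical binding: the byte count, which is pointer-sized on the C
        // side (size_t), is declared as `usize` instead of a hard-coded u32 or
        // u64, so the signature is correct on both 32-bit and 64-bit targets.
        fn device_mem_alloc(ptr: *mut *mut std::os::raw::c_void, bytes: usize) -> i32;
    }

    fn main() {
        // Sizes stay `usize` throughout the Rust API, so no lossy casts are
        // needed at the FFI boundary regardless of the target architecture.
        let bytes: usize = 1024 * std::mem::size_of::<f32>();
        println!("would request {} bytes via device_mem_alloc", bytes);
    }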

    opened by pdib 5
  • Relicense under dual MIT/Apache-2.0

    Relicense under dual MIT/Apache-2.0

    The project maintainer explicitly asked for this issue to be opened.

    TL;DR the Rust ecosystem is largely Apache-2.0. Being available under that license is good for interoperation. The MIT license as an add-on can be nice for GPLv2 projects to use your code.

    Why?

    The MIT license requires reproducing countless copies of the same copyright header with different names in the copyright field, for every MIT library in use. The Apache license does not have this drawback. However, this is not the primary motivation for me creating these issues. The Apache license also has protections from patent trolls and an explicit contribution licensing clause. However, the Apache license is incompatible with GPLv2. This is why Rust is dual-licensed as MIT/Apache (the "primary" license being Apache, MIT only for GPLv2 compat), and doing so would be wise for this project. This also makes this crate suitable for inclusion and unrestricted sharing in the Rust standard distribution and other projects using dual MIT/Apache, such as my personal ulterior motive, the Robigalia project.

    Some ask, "Does this really apply to binary redistributions? Does MIT really require reproducing the whole thing?" I'm not a lawyer, and I can't give legal advice, but some Google Android apps include open source attributions using this interpretation. Others also agree with it. But, again, the copyright notice redistribution is not the primary motivation for the dual-licensing. It's the stronger protections for licensees and better interoperation with the wider Rust ecosystem.

    How?

    To do this, get explicit approval from each contributor of copyrightable work (as not all contributions qualify for copyright, due to not being a "creative work", e.g. a typo fix) and then add the following to your README:

    ## License
    
    Licensed under either of
    
     * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
     * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
    
    at your option.
    
    ### Contribution
    
    Unless you explicitly state otherwise, any contribution intentionally submitted
    for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
    additional terms or conditions.
    

    and in your license headers, if you have them, use the following boilerplate (based on that used in Rust):

    // Copyright 2016 collenchyma Developers
    //
    // Licensed under the Apache License, Version 2.0, <LICENSE-APACHE or
    // http://apache.org/licenses/LICENSE-2.0> or the MIT license <LICENSE-MIT or
    // http://opensource.org/licenses/MIT>, at your option. This file may not be
    // copied, modified, or distributed except according to those terms.
    

    It's commonly asked whether license headers are required. I'm not comfortable making an official recommendation either way, but the Apache license recommends it in their appendix on how to use the license.

    Be sure to add the relevant LICENSE-{MIT,APACHE} files. You can copy these from the Rust repo for a plain-text version.

    And don't forget to update the license metadata in your Cargo.toml to:

    license = "MIT OR Apache-2.0"
    

    I'll be going through projects which agree to be relicensed and have approval by the necessary contributors and doing these changes, so feel free to leave the heavy lifting to me!

    Contributor checkoff

    To agree to relicensing, comment with :

    I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to chose either at their option.
    

    Or, if you're a contributor, you can check the box in this repo next to your name. My scripts will pick this exact phrase up and check your checkbox, but I'll come through and manually review this issue later as well.

    • [x] @MichaelHirn
    • [x] @hobofan
    • [x] @homu
    • [x] @skade
    • [x] @pdib
    opened by emberian 4
  • for macos

    for macos

    For OS X, you need the link kind key set to "framework".

    This patch, together with adding a build.rs like the one below, made compilation succeed on a Mac.

    fn main() {
        // Set this to the directory that contains your CUDA libraries.
        let your_cuda_lib_path = "";
        println!("cargo:rustc-link-search=native={}", your_cuda_lib_path);
    }
    
    opened by y-ich 3
  • Release numbers for the benchmarks against other frameworks

    Release numbers for the benchmarks against other frameworks

    Hi, I watched the talk at Rust Berlin, but I did not find any published numbers about the performance of the framework. The talk also claimed it is twice as fast as TensorFlow and Caffe, but it was not specified on which device, and I did not hear any comparison with Torch, Theano and MXNet, which are de facto the current leaders in ML performance. It would be really great if you could publish the actual numbers, as well as the code for the other frameworks you are benchmarking against.

    opened by botev 3
  • Non-exhaustive pattern: type &device::DeviceType is non-empty

    Non-exhaustive pattern: type &device::DeviceType is non-empty

    I am getting the following error when I try to build collenchyma on Windows 10:

    C:\Users\Maplicant\.cargo\registry\src\github.com-1ecc6299db9ec823\collenchyma-0.0.8\src\tensor.rs:326:21: 348:22 error: non-exhaustive patterns: type &device::DeviceType is non-empty [E0002]
    C:\Users\Maplicant\.cargo\registry\src\github.com-1ecc6299db9ec823\collenchyma-0.0.8\src\tensor.rs:326                     match destination {
                                                                                                                         ^
    C:\Users\Maplicant\.cargo\registry\src\github.com-1ecc6299db9ec823\collenchyma-0.0.8\src\tensor.rs:326:21: 348:22 help: run `rustc --explain E0002` to see a detailed explanation
    C:\Users\Maplicant\.cargo\registry\src\github.com-1ecc6299db9ec823\collenchyma-0.0.8\src\tensor.rs:326:21: 348:22 help: Please ensure that all possible cases are being handled; possibly adding wildcards or more match arms.
    C:\Users\Maplicant\.cargo\registry\src\github.com-1ecc6299db9ec823\collenchyma-0.0.8\src\tensor.rs:326                     match destination {
                                                                                                                         ^
    error: aborting due to previous error
    

    I'm guessing this has to do with the match statement at tensor.rs:342 not having a default case, but I'm not sure.
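
    For reference, a minimal reproduction of this class of error - not the actual code in tensor.rs, and the enum below is simplified: a match on an enum that does not cover every variant fails to compile until a wildcard or additional arms are added.

    enum DeviceType {
        Native,
        Cuda,
        OpenCl,
    }

    fn describe(device: &DeviceType) -> &'static str {
        match device {
            &DeviceType::Native => "host CPU",
            &DeviceType::Cuda => "CUDA device",
            // Without this wildcard (or an explicit OpenCl arm) the match is
            // non-exhaustive and rustc rejects it.
            _ => "other device",
        }
    }

    fn main() {
        println!("{}", describe(&DeviceType::Cuda));
    }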

    opened by ghost 0
  • Backend decoupling and typed memory interface

    Backend decoupling and typed memory interface

    I've implemented the feature related to decoupling from #37. The main commit can be viewed here. Below is the commit message for convenience:

    Change the SharedTensor::read() signature from fn read(&self, device: &DeviceType) -> Result<&MemoryType, ...> into fn read<D: IDevice>(&self, device: &D) -> Result<&D::M, ...>. The new signature provides a type-level guarantee that if a Cuda device is passed into read(), it will return Cuda memory (and not Native or OpenCL). The previously required additional unwraps (.as_native().unwrap()) are no longer needed, and the code is clearer and more concise.

    Internally SharedTensor uses the Any type to store objects of different types uniformly. Synchronization between memories is also done through a type-erased interface. This makes it possible to define a new Framework in an external crate, or to extract the Cuda and OpenCL frameworks into their own crates, though error types would require some additional work.
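
    A rough sketch of the idea (names and signatures simplified; IDevice, NativeDevice and NativeMemory below are stand-ins, not the exact collenchyma types):

    use std::any::Any;
    use std::collections::HashMap;

    trait IDevice {
        // Each device type knows its concrete memory type.
        type M: Any;
        fn id(&self) -> usize;
    }

    struct NativeDevice;
    struct NativeMemory(Vec<u8>);

    impl IDevice for NativeDevice {
        type M = NativeMemory;
        fn id(&self) -> usize { 0 }
    }

    struct SharedTensor {
        // Copies for all devices are stored type-erased behind `Any` ...
        copies: HashMap<usize, Box<dyn Any>>,
    }

    impl SharedTensor {
        // ... but `read` recovers the concrete memory type from the device
        // type, so callers no longer need `.as_native().unwrap()`-style casts.
        fn read<D: IDevice>(&self, device: &D) -> Option<&D::M> {
            self.copies.get(&device.id())?.downcast_ref::<D::M>()
        }
    }

    fn main() {
        let mut tensor = SharedTensor { copies: HashMap::new() };
        tensor.copies.insert(0, Box::new(NativeMemory(vec![0u8; 4])));
        let mem = tensor.read(&NativeDevice).unwrap();
        println!("native memory holds {} bytes", mem.0.len());
    }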

    Use of "dynamic typing" has drawbacks -- mainly slightly larger runtime overhead. Before this patch benchmarks showed that SharedTensor::read() takes 19-22ns, now it takes 23-26ns. For comparison, minimal synchronized CUDA operation will take about 10-40us. Small NN layers on CPU are much faster, e.g. 10-input softmax layer takes about 500ns. Still, in typical NNs overhead looks negligible, and I think it's fair tradeoff for code clarity and better decoupling.

    Here are actual benches, before:

    test bench_shared_tensor_access_time_first                            ... bench:          19 ns/iter (+/- 2)
    test bench_shared_tensor_access_time_second                           ... bench:          21 ns/iter (+/- 0)
    

    after:

    test bench_shared_tensor_access_time_first                        ... bench:          23 ns/iter (+/- 0)
    test bench_shared_tensor_access_time_second                       ... bench:          26 ns/iter (+/- 3)
    

    What's your opinion on it?

    opened by alexandermorozov 0
  • Refactor synchronization

    Refactor synchronization

    I've implemented the memory access API and synchronization based on bitmasks. Tensor/TensorView and decoupling aren't implemented.
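
    To illustrate what bitmask-based bookkeeping can look like, here is a toy sketch of the general idea (an assumption for illustration, not the actual PR code): bit i of the mask is set when the copy on device i is up to date.

    struct SyncState {
        up_to_date: u64,
    }

    impl SyncState {
        // After a mutable access on `device_index`, only that copy stays valid.
        fn mark_dirty_except(&mut self, device_index: usize) {
            self.up_to_date = 1 << device_index;
        }

        // A completed transfer marks the destination copy as current.
        fn mark_synced(&mut self, device_index: usize) {
            self.up_to_date |= 1 << device_index;
        }

        fn needs_sync(&self, device_index: usize) -> bool {
            self.up_to_date & (1 << device_index) == 0
        }
    }

    fn main() {
        let mut state = SyncState { up_to_date: 1 }; // only device 0 is current
        assert!(state.needs_sync(1));
        state.mark_synced(1);       // e.g. after copying data to device 1
        assert!(!state.needs_sync(1));
        state.mark_dirty_except(1); // mutable access on device 1
        assert!(state.needs_sync(0));
        println!("bitmask bookkeeping behaves as expected");
    }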

    Native and CUDA pass all tests. OpenCL compiles but segfaults on my machine, both with this PR and without it.

    PR isn't ready to be merged yet -- I'd like to fix plugins and Leaf first to see that there are no unexpected problems.

    opened by alexandermorozov 0
  • Support Windows

    Support Windows

    With CUDA 7.0/7.5 I get this error:

    collenchyma> cargo test
         Running target\debug\backend_specs-5e76dbe75e34191a.exe

    running 5 tests
    test backend_spec::native::it_can_use_ibackend_trait_object ... ok
    test backend_spec::native::it_can_create_default_backend ... ok

    After that, backend_specs-randumnumber.exe stopped working.

    GDB output:

    running 5 tests
    [New Thread 16076.0x1804]
    [New Thread 16076.0x3c24]
    [New Thread 16076.0x315c]
    [New Thread 16076.0x34c0]
    [New Thread 16076.0x32d4]
    [New Thread 16076.0x42ec]
    [New Thread 16076.0x34e4]
    [New Thread 16076.0x2a64]
    [New Thread 16076.0xe3c]

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 16076.0x34e4]
    0x000000000051b670 in cuInit ()

    Stack frame:

    #0  0x000000000051b670 in cuInit ()
    #1  0x00000000004874ec in collenchyma::frameworks::cuda::api::driver::utils::API::ffi_init ()
        at src\frameworks\cuda\api\driver/utils.rs:15
    #2  0x0000000000472fcb in collenchyma::frameworks::cuda::api::driver::utils::API::init ()
        at src\frameworks\cuda\api\driver/utils.rs:11
    #3  0x0000000000472bfd in collenchyma::frameworks::cuda::Cuda.IFramework::new ()
        at src\frameworks\cuda/mod.rs:46
    #4  0x0000000000406dfe in backend_specs::backend::IBackend::default<co::backend::Backend<co::frameworks::cuda::Cuda>> ()
        at src/backend.rs:106
    #5  0x0000000000406cf0 in backend_specs::backend_spec::cuda::it_can_create_default_backend ()
        at tests/backend_specs.rs:37
    #6  0x0000000000429e29 in boxed::F.FnBox$LT$A$GT$::call_box::h809104141212590368 ()
    #7  0x000000000042c844 in sys_common::unwind::try::try_fn::h15534802711051563033 ()
    #8  0x00000000004c471b in sys_common::unwind::try::inner_try::h3ae51230bca914caH9s ()
    #9  0x000000000042cbfb in boxed::F.FnBox$LT$A$GT$::call_box::h11947310459896948609 ()
    #10 0x00000000004d522e in sys::thread::Thread::new::thread_start::h24036cb9e2bd1c0fLey ()
    #11 0x00007ffcfe848102 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\system32\kernel32.dll
    #12 0x00007ffd00b0c5b4 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
    #13 0x0000000000000000 in ?? ()

    Backtrace stopped: previous frame inner to this frame (corrupt stack?)

    opened by gallexme 2