luminal

Deep learning at the speed of light.
use luminal::prelude::*;

// Setup graph and tensors
let mut cx = Graph::new();
let a = cx.new_tensor::<R2<3, 1>>();
let b = cx.new_tensor::<R2<1, 4>>();

// Do stuff...
let c = a.matmul(b);

// Set inputs and mark outputs
a.set(vec![1.0, 2.0, 3.0]);
b.set(vec![1.0, 2.0, 3.0, 3.0]);
c.mark();

// Optimize and run graph
cx.optimize(GenericOptimizer::default());
cx.execute();

// Get result
println!("Result: {:?}", c.retrieve().unwrap().data);

Why does this look so different from other DL libraries?

Most deep learning libraries are eager-first, meaning each op call directly operates on the data. So when you see x + y, the addition actually happens right there. This is great for debugging: it works exactly as most developers expect.
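To make the contrast concrete, here's a minimal sketch of eager semantics (plain Rust, not any particular library's API): the add executes the moment the line runs.

// Eager style: each op computes immediately.
let x = vec![1.0_f32, 2.0, 3.0];
let y = vec![4.0_f32, 5.0, 6.0];

// The addition actually executes right here, producing a new buffer.
let z: Vec<f32> = x.iter().zip(&y).map(|(a, b)| a + b).collect();
println!("{:?}", z); // [5.0, 7.0, 9.0]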

However, this isn't great for performance, because what reads naturally to a developer isn't what runs fastest on the machine, for much the same reason that no one writes assembly by hand anymore. Most libraries try to fix this by tacking on operator fusion or JIT compilation after the fact, reshaping execution into something better for the machine. It turns out this is super difficult, even for PyTorch!

Luminal takes a different approach, more similar to XLA and tinygrad. Here everything is static. When you write out an expression like x + y, no actual computation happens. The operation is recorded to a directed acyclic computation graph for execution later; only once graph.execute() is run does the computation happen. But isn't that just lazy execution? Yes it is! The difference is that in luminal everything is done this way: entire neural networks are built up as one or a few static computation graphs, and executed later.
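If the pattern is unfamiliar, here's a toy sketch of graph recording (purely illustrative, nothing like luminal's actual internals): ops append nodes to a graph, and no arithmetic happens until execute walks the nodes.

// A toy lazy-execution graph, for illustration only.
enum Node {
    Input(Vec<f32>),
    Add(usize, usize), // indices of the two operand nodes
}

#[derive(Default)]
struct ToyGraph {
    nodes: Vec<Node>,
}

impl ToyGraph {
    fn input(&mut self, data: Vec<f32>) -> usize {
        self.nodes.push(Node::Input(data));
        self.nodes.len() - 1
    }
    // "x + y" just records an Add node; no arithmetic happens here.
    fn add(&mut self, x: usize, y: usize) -> usize {
        self.nodes.push(Node::Add(x, y));
        self.nodes.len() - 1
    }
    // All computation happens here, walking nodes in insertion order
    // (operands always precede the ops that use them).
    fn execute(&self, out: usize) -> Vec<f32> {
        let mut results: Vec<Vec<f32>> = Vec::new();
        for node in &self.nodes {
            let value = match node {
                Node::Input(data) => data.clone(),
                Node::Add(x, y) => results[*x]
                    .iter()
                    .zip(&results[*y])
                    .map(|(a, b)| a + b)
                    .collect(),
            };
            results.push(value);
        }
        results[out].clone()
    }
}

fn main() {
    let mut g = ToyGraph::default();
    let x = g.input(vec![1.0, 2.0]);
    let y = g.input(vec![3.0, 4.0]);
    let z = g.add(x, y); // still nothing computed
    assert_eq!(g.execute(z), vec![4.0, 6.0]);
}

Because the whole graph exists before anything runs, a compiler pass gets to see every op at once, which is exactly what the next section exploits.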

But Why?

A consequence of this is that the actual computation that gets run can be radically different from the code that was written. Since we have an entire neural network fully represented in a compute graph, our optimizers have global knowledge and can do much more aggressive optimization, with no sync points to work around.

Of course, we can still split the network into multiple separate graphs if we want to insert dynamic control flow part-way through. This means the approach doesn't preclude optimizations like KV caching, because the KV-cached forward pass is just a separate graph!
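As a rough sketch of that split, reusing only the calls from the example above (the second graph's shapes and weights are made up, and feeding retrieved data straight back in like this is an assumption about the API):

// Graph 1: e.g. the initial forward pass.
let mut cx1 = Graph::new();
let a = cx1.new_tensor::<R2<3, 1>>();
let b = cx1.new_tensor::<R2<1, 4>>();
let c = a.matmul(b);
a.set(vec![1.0, 2.0, 3.0]);
b.set(vec![1.0, 2.0, 3.0, 3.0]);
c.mark();
cx1.optimize(GenericOptimizer::default());
cx1.execute();
let intermediate = c.retrieve().unwrap().data; // 3x4 result

// Arbitrary host-side control flow can run here (sampling, branching, ...)
// before handing the result to a second, separately optimized graph.
let mut cx2 = Graph::new();
let d = cx2.new_tensor::<R2<3, 4>>();
let e = cx2.new_tensor::<R2<4, 2>>();
let f = d.matmul(e);
d.set(intermediate);
e.set(vec![0.5; 8]); // hypothetical weights
f.mark();
cx2.optimize(GenericOptimizer::default());
cx2.execute();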

Some huge benefits are now unlocked:

  • Aggressive kernel fusion (a sketch of the idea follows this list)
  • Shape-specific kernels compiled at runtime
  • Devices and dtypes are handled through optimizers (just run the CUDA optimizer to convert the graph to use CUDA kernels, then the fp16 optimizer to convert to half-precision kernels)
  • Networks can be written in generic code, but compiled and run fast on hyper-specific architectures (try writing a PyTorch network that works with both TF32 dtypes and TPUs; get ready for if-statement hell...)
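For a feel of what kernel fusion buys, here's the idea in plain Rust (no graph machinery, just the before/after of the transformation):

// Unfused: y = x * 2 + 1 as two separate ops means two passes over the
// data and a temporary buffer in between.
fn unfused(x: &[f32]) -> Vec<f32> {
    let tmp: Vec<f32> = x.iter().map(|v| v * 2.0).collect(); // kernel 1
    tmp.iter().map(|v| v + 1.0).collect()                    // kernel 2
}

// Fused: one pass, no intermediate allocation. A compiler with the whole
// graph in view can spot the mul -> add chain and emit this automatically.
fn fused(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| v * 2.0 + 1.0).collect()
}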

RISC-style architecture

Luminal can be run on new accelerators by implementing 13 primitive ops.

Accelerators are free to implement their own custom ops, along with their own optimizers that convert luminal's primitive ops to their bespoke ops.
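The exact trait isn't reproduced here, but a backend's obligations look roughly like this sketch (the trait name, method names, and op list below are hypothetical; the real definitions live in the luminal source):

// Hypothetical sketch of a backend's surface area.
trait PrimitiveOps {
    type Buffer;

    // Unary ops
    fn exp2(&mut self, x: &Self::Buffer) -> Self::Buffer;
    fn log2(&mut self, x: &Self::Buffer) -> Self::Buffer;
    fn recip(&mut self, x: &Self::Buffer) -> Self::Buffer;

    // Binary ops
    fn add(&mut self, a: &Self::Buffer, b: &Self::Buffer) -> Self::Buffer;
    fn mul(&mut self, a: &Self::Buffer, b: &Self::Buffer) -> Self::Buffer;

    // Reductions
    fn sum_reduce(&mut self, x: &Self::Buffer, dim: usize) -> Self::Buffer;
    fn max_reduce(&mut self, x: &Self::Buffer, dim: usize) -> Self::Buffer;

    // ...and so on, up to the full set of 13 primitives.
}

Everything higher-level (matmul, softmax, attention) lowers to compositions of the primitives, which is what keeps new backends small.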

Compile-time Shape Checks

All operations are shape checked at compile time, so no more shape mismatches! All credit for this goes to dfdx.
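As a sketch of what that buys you, reusing the tensor types from the example at the top (the mismatched line is left commented out because it should not compile):

let mut cx = Graph::new();
let a = cx.new_tensor::<R2<3, 1>>();
let b = cx.new_tensor::<R2<2, 4>>(); // inner dims are 1 vs 2

// This fails to compile rather than panicking at runtime: matmul requires
// the inner dimensions to match, and the types say they don't.
// let c = a.matmul(b);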

View the Graph

Once you've written all your computation code, run cx.display_graph() to see the entire computation graph in all its glory. Pretty messy looking! Now run cx.optimize(GenericOptimizer::default()) and display the graph again. Much better.
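In code, that's just:

cx.display_graph();                           // the raw primitive-op graph
cx.optimize(GenericOptimizer::default());
cx.display_graph();                           // the same graph, post-optimization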

Where are we?

Currently luminal is extremely alpha. Please don't use this in prod.

Some things on the roadmap:

  • Write common-sense CUDA ops and optimizer (matmuls, mul-add, etc.)
  • Create reasonable library of NN modules
  • Build benchmarking suite to test against other libs
  • Write specialized CUDA kernels for full transformer architecture (FlashAttention, etc.)
  • Match PT 2.0 perf on LLM training
  • Build dyson swarm