luminal
Deep learning at the speed of light.
```rust
use luminal::prelude::*;

// Setup graph and tensors
let mut cx = Graph::new();
let a = cx.new_tensor::<R2<3, 1>>();
let b = cx.new_tensor::<R2<1, 4>>();

// Do stuff...
let c = a.matmul(b);

// Set inputs and mark outputs
a.set(vec![1.0, 2.0, 3.0]);
b.set(vec![1.0, 2.0, 3.0, 3.0]);
c.mark();

// Optimize and run graph
cx.optimize(GenericOptimizer::default());
cx.execute();

// Get result
println!("Result: {:?}", c.retrieve().unwrap().data);
```
Why does this look so different from other DL libraries?
Most deep learning libraries are eager-first, meaning each op call directly operates on the data. So when you see `x + y`, the addition actually happens right there. This is great for debugging because it works exactly as most developers expect.

However, it isn't great for performance, because what makes sense for a developer doesn't make sense for the machine, in the same way that no one writes assembly by hand. Most libraries try to fix this by tacking on operator fusion or JIT compilation to change the compilation flow into something better for the machine. Turns out this is super difficult, even for PyTorch!
Luminal takes a different approach, more similar to XLA and tinygrad. Here everything's static. When you write out an expression like `x + y`, no actual computation happens. The operation is recorded to a directed acyclic computation graph for execution later. Only once `graph.execute()` runs does the computation happen. But isn't that just lazy execution? Yes it is! But in luminal everything is done this way. All neural networks are built up as one or a few static computation graphs, and executed later.
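To make the lazy model concrete, here is a minimal, self-contained sketch of the idea. This is a toy illustration, not luminal's internals: each op only appends a node to a graph, and nothing is computed until an explicit execute call walks the DAG.

```rust
// Toy lazy-evaluation graph (illustrative only, not luminal's actual design).
#[derive(Clone, Copy)]
enum Node {
    Input(f32),
    Add(usize, usize), // indices of the two operand nodes
}

struct ToyGraph {
    nodes: Vec<Node>,
}

impl ToyGraph {
    fn new() -> Self {
        ToyGraph { nodes: Vec::new() }
    }

    fn input(&mut self, value: f32) -> usize {
        self.nodes.push(Node::Input(value));
        self.nodes.len() - 1
    }

    // "x + y" in a lazy framework just records an Add node and returns its id.
    fn add(&mut self, lhs: usize, rhs: usize) -> usize {
        self.nodes.push(Node::Add(lhs, rhs));
        self.nodes.len() - 1
    }

    // Only here does any arithmetic actually happen.
    fn execute(&self, id: usize) -> f32 {
        match self.nodes[id] {
            Node::Input(v) => v,
            Node::Add(a, b) => self.execute(a) + self.execute(b),
        }
    }
}

fn main() {
    let mut graph = ToyGraph::new();
    let x = graph.input(1.0);
    let y = graph.input(2.0);
    let z = graph.add(x, y);          // records the op, computes nothing yet
    println!("{}", graph.execute(z)); // 3.0, computed only now
}
```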
But Why?
A consequence of this is that the actual computation that gets run can be radically different from the code that was written. Since we have an entire neural network fully represented in a compute graph, our optimizers have global knowledge and can do much more aggressive optimization without any sync points.

Of course, we can still split the network into multiple separate graphs if we want to insert dynamic control flow part-way through, which means this method doesn't preclude optimizations like KV caching, because the KV-cached forward pass is just a separate graph!
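As a rough sketch of that pattern, using only the calls from the example above (shapes, weights, and the output handoff are illustrative assumptions, not prescribed API), ordinary Rust control flow can sit between two separately optimized graphs:

```rust
use luminal::prelude::*;

// First graph: build, optimize, and run it on its own.
let mut prefill = Graph::new();
let x = prefill.new_tensor::<R2<1, 4>>();
let w = prefill.new_tensor::<R2<4, 4>>();
let h = x.matmul(w);
x.set(vec![1.0; 4]);
w.set(vec![0.5; 16]);
h.mark();
prefill.optimize(GenericOptimizer::default());
prefill.execute();
let hidden = h.retrieve().unwrap().data;

// Any host-side control flow can run here (sampling, cache updates, ...),
// then the first graph's output is fed into a second, separate graph.
let mut decode = Graph::new();
let h_in = decode.new_tensor::<R2<1, 4>>();
let w2 = decode.new_tensor::<R2<4, 4>>();
let y = h_in.matmul(w2);
h_in.set(hidden);
w2.set(vec![0.25; 16]);
y.mark();
decode.optimize(GenericOptimizer::default());
decode.execute();
println!("{:?}", y.retrieve().unwrap().data);
```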
Some huge benefits are now unlocked:
- Aggressive kernel fusion
- Shape-specific kernels compiled at runtime
- Devices and dtypes are handled through optimizers (just run the CUDA optimizer to convert the graph to use CUDA kernels, then the fp16 optimizer to convert to half-precision kernels; see the sketch after this list)
- Networks can be written in generic code, but compiled and run fast on hyper-specific architectures (try writing a PyTorch network that works with both TF32 dtypes and TPUs; get ready for if-statement hell...)
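The optimizer-stacking idea looks roughly like the snippet below. `CudaOptimizer` and `Fp16Optimizer` are placeholder names used for illustration, not necessarily luminal's real types; only `GenericOptimizer` appears in the example at the top.

```rust
// Hypothetical sketch: each optimizer is a graph-to-graph rewrite, applied in sequence.
cx.optimize(GenericOptimizer::default()); // generic cleanup and fusion
cx.optimize(CudaOptimizer::default());    // rewrite primitive ops into CUDA kernels (placeholder name)
cx.optimize(Fp16Optimizer::default());    // rewrite fp32 kernels into half precision (placeholder name)
```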
RISC-style architecture
Luminal can be run on new accelerators by implementing 13 primitive ops.
Accelerators are free to implement their own custom ops, and their own optimizers to convert luminal primitive ops to their bespoke ops.
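Conceptually, a backend implements each primitive op behind a common interface. The sketch below is a hypothetical illustration of that shape; the trait name, signature, and tensor representation are assumptions, not luminal's actual API.

```rust
// Hypothetical backend-op interface (illustrative only).
trait BackendOp {
    fn process(&self, inputs: Vec<Vec<f32>>) -> Vec<f32>;
}

struct MyAcceleratorAdd;

impl BackendOp for MyAcceleratorAdd {
    fn process(&self, inputs: Vec<Vec<f32>>) -> Vec<f32> {
        // On real hardware this would launch a kernel; here it's a CPU stand-in.
        inputs[0].iter().zip(&inputs[1]).map(|(a, b)| a + b).collect()
    }
}
```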
Compile-time Shape Checks
All operations are shape-checked at compile time, so no more shape mismatches! All credit for this goes to dfdx.
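Reusing the type-level shapes from the opening example, this is roughly what that buys you: a matmul whose inner dimensions don't line up simply won't compile.

```rust
use luminal::prelude::*;

let mut cx = Graph::new();
let a = cx.new_tensor::<R2<3, 1>>();
let b = cx.new_tensor::<R2<1, 4>>();

let c = a.matmul(b);    // (3, 1) x (1, 4): shapes line up, compiles fine
// let d = b.matmul(a); // (1, 4) x (3, 1): inner dims 4 and 3 mismatch, compile error
```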
View the Graph
Once you've written all your computation code, run `cx.display_graph()` to see the entire computation graph in all its glory. Pretty messy looking! Now run `cx.optimize(GenericOptimizer::default())` and display the graph again. Much better.
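In other words (the comments describe the before/after effect claimed above):

```rust
cx.display_graph();                       // raw graph: lots of primitive ops
cx.optimize(GenericOptimizer::default()); // fuse and simplify
cx.display_graph();                       // same computation, far fewer nodes
```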
Where are we?
Currently luminal is extremely alpha. Please don't use this in prod.
Some things on the roadmap:
- Write common-sense CUDA ops and an optimizer (matmuls, mul-add, etc.)
- Create reasonable library of NN modules
- Build benchmarking suite to test against other libs
- Write specialized CUDA kernels for full transformer architecture (FlashAttention, etc.)
- Match PT 2.0 perf on LLM training
- Build Dyson swarm