LLaMa 7b with CUDA acceleration implemented in rust. Minimal GPU memory needed!

Corey Lowman

Last update: May 8, 2023

Related tags

Machine learning rust deep-learning neural-network cuda inference rust-lang llama language-model

Overview

LLaMa 7b in rust

This repo contains the popular LLaMa 7b language model, fully implemented in the rust programming language!

Uses dfdx tensors and CUDA acceleration.

This runs LLaMa directly in f16. Most CPUs have no hardware acceleration for f16 arithmetic, so running on CPU is slow. Using CUDA is strongly recommended.
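As a rough illustration of what running in f16 means for memory: each f16 weight takes 2 bytes. The sketch below assumes a nominal 7 billion parameters (the exact count is not stated here) and counts weights only, ignoring activations and any caches. The function names are hypothetical and are not part of this repo.

```rust
/// Bytes needed to hold `num_params` weights at `bytes_per_param` bytes each.
/// For f16, `bytes_per_param` is 2; for f32 it would be 4.
fn weight_bytes(num_params: u64, bytes_per_param: u64) -> u64 {
    num_params * bytes_per_param
}

/// Converts a byte count to GiB (1 GiB = 2^30 bytes).
fn to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1u64 << 30) as f64
}
```

Under these assumptions, weight_bytes(7_000_000_000, 2) is 14,000,000,000 bytes, which to_gib reports as roughly 13 GiB.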

Here is the 7b model running on an A10 GPU:

How To Run

(Once) Setting up model weights

Download model weights

1. Install git lfs. On ubuntu you can run sudo apt install git-lfs
2. Activate git lfs with git lfs install.
3. Run one of the following commands to download the model weights in pytorch format:
   1. LLaMa 7b (~25 GB): git clone https://huggingface.co/decapoda-research/llama-7b-hf
   2. LLaMa 13b (~75 GB): git clone https://huggingface.co/decapoda-research/llama-13b-hf
   3. LLaMa 65b (~244 GB): git clone https://huggingface.co/decapoda-research/llama-65b-hf

Convert the model

1. (Optional) Run python3.x -m venv <my_env_name> to create a python virtual environment, where x is your preferred python version
2. (Optional, requires 1.) Run source <my_env_name>/bin/activate (or <my_env_name>\Scripts\activate if on Windows) to activate the environment
3. Run pip install numpy torch
4. Run python convert.py to convert the model weights to a format the rust code can read:
   a. LLaMa 7b: python convert.py
   b. LLaMa 13b: python convert.py llama-13b-hf
   c. LLaMa 65b: python convert.py llama-65b-hf

(Once) Compile

You can compile with normal rust commands:

With cuda:

cargo build --release -F cuda

Without cuda:

cargo build --release

Run the executable

With default args:

./target/release/llama-dfdx --model <model-dir> generate "<prompt>"
./target/release/llama-dfdx --model <model-dir> chat
./target/release/llama-dfdx --model <model-dir> file <path to prompt file>

To see what commands/custom args you can use:

./target/release/llama-dfdx --help

Comments

Can't use args
Running .\target\release\llama-dfdx.exe chat -n=1024 always results in:

error: unexpected argument '-n' found

Usage: llama-dfdx.exe chat

For more information, try '--help'.

Same for no subcommand/generate/on wsl. Any ideas?
opened by M1ngXU 2

Error running

When trying to run the program, I get this error:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Driver(DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain."))', /home/opfromthestart/.cargo/git/checkouts/dfdx-318e6e5ad83eea79/19da9fe/src/tensor_ops/select_and_gather/mod.rs:155:30

opened by opfromthestart 1

Adds small improvements to README

Just some small, newbie-friendly additions to the README. Feel free to take only whatever inspiration you find reasonable for the future; there is no need to merge if you think this makes no sense. A PR just seemed the fastest way to suggest this (and, if it makes sense to you, the fastest way to get these additions in).

opened by kstavro 0

Auto determine how much of the model to load into RAM

Use cases:

You can fit the whole model into GPU ram

You can fit part of the model into GPU ram

You need to keep all the model weights on disk

In all these cases, we should be able to detect how much GPU ram is available, and determine the max amount of model to store that way. More advanced use cases of sharing GPU with other applications may need manual control over the memory, but that can be done later.
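One way the three cases above could be told apart is sketched below. This is only an illustration under stated assumptions, not how the repo does it: the Placement type, the plan_placement function, and the reserve_bytes headroom parameter are all hypothetical. The sketch assumes the free GPU memory has already been queried (for example through the CUDA driver's cuMemGetInfo) and that layers are loaded onto the GPU in order until the budget runs out.

```rust
/// Where the weights end up for a given amount of free GPU memory.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum Placement {
    /// Every layer fits in GPU memory.
    AllOnGpu,
    /// Only the first `gpu_layers` layers fit; the rest stay on disk.
    Partial { gpu_layers: usize },
    /// Not even the first layer fits.
    AllOnDisk,
}

/// Greedily assigns leading layers to the GPU until the budget is exhausted.
/// `reserve_bytes` is headroom kept free for activations and other buffers.
fn plan_placement(free_gpu_bytes: u64, reserve_bytes: u64, layer_bytes: &[u64]) -> Placement {
    // saturating_sub avoids underflow when the reserve exceeds the free memory.
    let mut budget = free_gpu_bytes.saturating_sub(reserve_bytes);
    let mut gpu_layers = 0;
    for &size in layer_bytes {
        if size > budget {
            break;
        }
        budget -= size;
        gpu_layers += 1;
    }
    // Note: an empty layer list is reported as AllOnGpu.
    if gpu_layers == layer_bytes.len() {
        Placement::AllOnGpu
    } else if gpu_layers == 0 {
        Placement::AllOnDisk
    } else {
        Placement::Partial { gpu_layers }
    }
}
```

For example, with three layers of 4 bytes each, 10 free bytes, and a 1 byte reserve, the plan is Partial with 2 layers on the GPU. Manual control for GPUs shared with other applications could be layered on top by letting the user override free_gpu_bytes.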
opened by coreylowman 0

Add instructions for Alpaca 7b weights to README

Alpaca 7b should have the exact same structure, so as long as you can convert the weights into the same format with convert.py, it should be runnable out of the box.

opened by coreylowman 1

Asynchronous CUDA, NPP and TensorRT for Rust.

Asynchronous CUDA, NPP and TensorRT ℹ️ Introduction The async-cuda family of libraries is an experimental set of libraries for interacting with the GP

4 Jun 19, 2023

A Rusty CUDA wrapper

cuda-oxide cuda-oxide is a safe wrapper for CUDA. With cuda-oxide you can execute and coordinate CUDA kernels. Safety Philosophy cuda-oxide does not o

30 Dec 7, 2022

Rust+OpenCL+AVX2 implementation of LLaMA inference code

RLLaMA RLLaMA is a pure Rust implementation of LLaMA large language model inference.. Supported features Uses either f16 and f32 weights. LLaMA-7B, LL

344 Apr 16, 2023

Rust based Cross-GPU Machine Learning

HAL : Hyper Adaptive Learning Rust based Cross-GPU Machine Learning. Why Rust? This project is for those that miss strongly typed compiled languages.

83 Dec 20, 2022

A real-time implementation of "Ray Tracing in One Weekend" using nannou and rust-gpu.

Real-time Ray Tracing with nannou & rust-gpu An attempt at a real-time implementation of "Ray Tracing in One Weekend" by Peter Shirley. This was a per

89 Dec 23, 2022

Ecosystem of libraries and tools for writing and executing extremely fast GPU code fully in Rust.

2.1k Jan 5, 2023

Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.

The Rust CUDA Project An ecosystem of libraries and tools for writing and executing extremely fast GPU code fully in Rust Guide | Getting Started | Fe

2.1k Dec 30, 2022

How to: Run Rust code on your NVIDIA GPU

Status This documentation about an unstable feature is UNMAINTAINED and was written over a year ago. Things may have drastically changed since then; r

343 Dec 22, 2022

🐉 Making Rust a first-class language and ecosystem for GPU shaders 🚧

🐉 rust-gpu Rust as a first-class language and ecosystem for GPU graphics & compute shaders Current Status 🚧 Note: This project is still heavily in d

5.5k Jan 9, 2023

A Demo server serving Bert through ONNX with GPU written in Rust with <3

Demo BERT ONNX server written in rust This demo showcase the use of onnxruntime-rs on BERT with a GPU on CUDA 11 served by actix-web and tokenized wit

28 Jan 1, 2023

Wonnx - a GPU-accelerated ONNX inference run-time written 100% in Rust, ready for the web

Wonnx is a GPU-accelerated ONNX inference run-time written 100% in Rust, ready for the web. Supported Platforms (enabled by wgpu) API Windows Linux &

354 Jan 6, 2023

A gpu accelerated (optional) neural network Rust crate.

Intricate A GPU accelerated library that creates/trains/runs neural networks in pure safe Rust code. Architechture overview Intricate has a layout ver

11 Dec 26, 2022

A repo for learning how to parallelize computations in the GPU using Apple's Metal, in Rust.

Metal playground in rust Made for learning how to parallelize computations in the GPU using Apple's Metal, in Rust, via the metal crate. Overview The

5 Feb 20, 2023

A fun, hackable, GPU-accelerated, neural network library in Rust, written by an idiot

Tensorken: A Fun, Hackable, GPU-Accelerated, Neural Network library in Rust, Written by an Idiot (work in progress) Understanding deep learning from t

44 May 6, 2023

Signed distance functions + Rust (CPU & GPU) = ❤️❤️

sdf-playground Signed distance functions + Rust (CPU & GPU) = ❤️❤️ Platforms: Windows, Mac & Linux. About sdf-playground is a demo showcasing how you

5 Nov 16, 2023

rust-gpu CLI driver

rust-gpu-driver Experiment to make rust-gpu more accessible as a GPU shading language in various projects. DISCLAIMER: This is an unstable experiment

9 Feb 16, 2024

Open Machine Intelligence Framework for Hackers. (GPU/CPU)

Leaf • Introduction Leaf is a open Machine Learning Framework for hackers to build classical, deep or hybrid machine learning applications. It was ins

5.5k Jan 1, 2023

Open deep learning compiler stack for cpu, gpu and specialized accelerators

Open Deep Learning Compiler Stack Documentation | Contributors | Community | Release Notes Apache TVM is a compiler stack for deep learning systems. I

8.9k Jan 4, 2023

Damavand is a quantum circuit simulator. It can run on laptops or High Performance Computing architectures, such CPU distributed architectures or multi GPU distributed architectures.

Damavand is a code that simulates quantum circuits. In order to learn more about damavand, refer to the documentation. Development status Core feature

6 Mar 29, 2022
