LLaMa 7b with CUDA acceleration, implemented in Rust. Minimal GPU memory needed!

Overview

LLaMa 7b in Rust

This repo contains the popular LLaMa 7b language model, fully implemented in the Rust programming language!

Uses dfdx tensors and CUDA acceleration.

This runs LLaMa directly in f16. Since f16 has no hardware acceleration on CPU, using CUDA is heavily recommended.
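To give a feel for the tensor layer this sits on, here is a minimal illustrative sketch of an f16 op with dfdx. This is not code from this repo; it assumes dfdx 0.13's AutoDevice, sample_normal, and to_dtype APIs, with f16 re-exported from dfdx::dtypes:

use dfdx::prelude::*;
use dfdx::dtypes::f16;

fn main() {
    // AutoDevice is Cuda when built with the `cuda` feature, Cpu otherwise.
    let dev = AutoDevice::default();

    // Sample a small weight matrix and input in f32, then convert to f16,
    // the dtype the LLaMa weights are stored and computed in.
    let w: Tensor<Rank2<64, 64>, f32, _> = dev.sample_normal();
    let x: Tensor<Rank1<64>, f32, _> = dev.sample_normal();
    let (w, x) = (w.to_dtype::<f16>(), x.to_dtype::<f16>());

    // One f16 vector-matrix product, the core op of every transformer layer.
    let y = x.matmul(w);
    println!("output shape: {:?}", y.shape());
}

On CPU, f16 arithmetic like this falls back to software emulation, which is the slowdown the paragraph above refers to.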

[Demo: the 7b model running on an A10 GPU]

How To Run

(Once) Setting up model weights

Download model weights

  1. Install git lfs. On Ubuntu you can run sudo apt install git-lfs
  2. Activate git lfs with git lfs install.
  3. Run one of the following commands to download the model weights in PyTorch format:
    1. LLaMa 7b (~25 GB): git clone https://huggingface.co/decapoda-research/llama-7b-hf
    2. LLaMa 13b (~75 GB): git clone https://huggingface.co/decapoda-research/llama-13b-hf
    3. LLaMa 65b (~244 GB): git clone https://huggingface.co/decapoda-research/llama-65b-hf

Convert the model

  1. (Optional) Run python3.x -m venv <my_env_name> to create a Python virtual environment, where x is your preferred Python version
  2. (Optional, requires 1.) Run source <my_env_name>/bin/activate (or <my_env_name>\Scripts\activate if on Windows) to activate the environment
  3. Run pip install numpy torch
  4. Run python convert.py to convert the model weights into a format the Rust binary can load:
    1. LLaMa 7b: python convert.py
    2. LLaMa 13b: python convert.py llama-13b-hf
    3. LLaMa 65b: python convert.py llama-65b-hf

(Once) Compile

You can compile with the usual Rust commands:

With CUDA:

cargo build --release -F cuda

Without CUDA:

cargo build --release

Run the executable

With default args:

./target/release/llama-dfdx --model <model-dir> generate "<prompt>"
./target/release/llama-dfdx --model <model-dir> chat
./target/release/llama-dfdx --model <model-dir> file <path to prompt file>

To see the available commands and custom args:

./target/release/llama-dfdx --help

Comments
  • Can't use args

    Running .\target\release\llama-dfdx.exe chat -n=1024 always results in:

    error: unexpected argument '-n' found
    
    Usage: llama-dfdx.exe chat
    
    For more information, try '--help'.
    

    The same happens with no subcommand, with generate, and on WSL. Any ideas?

    opened by M1ngXU 2
  • Error running

    When trying to run the program, I get this error:

    thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Driver(DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain."))', /home/opfromthestart/.cargo/git/checkouts/dfdx-318e6e5ad83eea79/19da9fe/src/tensor_ops/select_and_gather/mod.rs:155:30
    
    opened by opfromthestart 1
  • Adds small improvements to README

    Just some small, newbie-friendly additions to the README. Feel free to take from it only whatever inspiration you find reasonable; no need to merge if you think it makes no sense. This was just the fastest way to suggest the changes (and, if they make sense to you, the fastest way to get them added).

    opened by kstavro 0
  • Auto determine how much of the model to load into RAM

    Use cases:

    1. You can fit the whole model into GPU RAM
    2. You can fit part of the model into GPU RAM
    3. You need to keep all the model weights on disk

    In all these cases, we should be able to detect how much GPU RAM is available and use that to decide the maximum amount of the model to keep on the GPU (a rough sketch of the detection step appears right after this comments section). More advanced use cases that share the GPU with other applications may need manual control over the memory, but that can be done later.

    opened by coreylowman 0
  • Add instructions for Alpaca 7b weights to README

    Alpaca 7b should have the exact same structure, so as long as you can convert the weights into the same format with convert.py, it should be runnable out of the box.

    opened by coreylowman 1
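For the "Auto determine how much of the model to load into RAM" idea above, here is a rough sketch of the detection step. Everything in it is a hypothetical stand-in: free_gpu_memory and layers_to_offload are illustrative names (a real version could query free bytes with CUDA's cuMemGetInfo, e.g. via the cudarc crate), and the layer sizes are ballpark figures, not measurements:

// Hypothetical helper: a real implementation would ask the driver for free
// GPU bytes (CUDA's cuMemGetInfo, e.g. through the cudarc crate).
fn free_gpu_memory() -> usize {
    8 * 1024 * 1024 * 1024 // stand-in value: 8 GiB free
}

// Decide how many transformer layers fit on the GPU; the rest would stay in
// CPU RAM or on disk, covering all three use cases above.
fn layers_to_offload(num_layers: usize, bytes_per_layer: usize) -> usize {
    // Keep ~1 GiB of headroom for activations and the KV cache.
    let budget = free_gpu_memory().saturating_sub(1024 * 1024 * 1024);
    (budget / bytes_per_layer).min(num_layers)
}

fn main() {
    // LLaMa 7b: 32 layers at very roughly 400 MB each in f16 (illustrative).
    let on_gpu = layers_to_offload(32, 400 * 1024 * 1024);
    println!("loading {on_gpu}/32 layers onto the GPU");
}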
Owner

Corey Lowman