A fast llama2 decoder in pure Rust.

Overview

llama2.rs 🤗

This is a Rust implementation of Llama2 inference on CPU.

The goal is to be as fast as possible.

It has the following features:

  • Support for 4-bit GPT-Q Quantization
  • Batched prefill of prompt tokens
  • SIMD support for fast CPU inference
  • Memory mapping, loads 70B instantly.
  • Static size checks for safety
  • Support for Grouped Query Attention (needed for big Llamas)
  • Python calling API

It can run 70B Llama2 at about 1 tok/s and 7B Llama2 at about 9 tok/s (on my Intel i9 desktop).

To build, you'll need the nightly toolchain, which is used by default:

> rustup toolchain install nightly # to get nightly
> ulimit -s 10000000 # Increase your stack memory limit. 

You can load models from the Hugging Face hub. For example, this exports a 70B model with 4-bit quantization and group size 64:

> pip install -r requirements.export.txt
> python export.py l70b.act64.bin TheBloke/llama-2-70b-Guanaco-QLoRA-GPTQ gptq-4bit-64g-actorder_True

The library needs to be recompiled to match the model. You can do this with cargo.

To run:

> cargo run --release --features 70B,group_64 -- -c llama2-70b-q.bin -t 0.0 -s 11 -p "The only thing"                                                                                                                                 
The only thing that I can think of is that the          
achieved tok/s: 0.89155835

Honestly, not so bad for running on CPU, and significantly faster than llama2.c.

Here's a run of 13B quantized:

> cargo run --release --features 13B,group_128 -- -c l13orca.act.bin -t 0.0 -s 25 -p "Hello to all the cool people out there who "
Hello to all the cool people out there who are reading this. I hope you are having a great day. I am here
achieved tok/s: 5.1588936

Here's a run of 7B quantized:

> cargo run --release --features 7B,group_128 -- -c l7.ack.bin -t 0.0 -s 25 -p "Hello to all the cool people out there who "
Hello to all the cool people out there who are reading this. I am a newbie here and I am looking for some
achieved tok/s: 9.048136

Python

To run in Python, you first need to compile from the main directory with the python feature enabled.

cargo build --release --features 7B,group_128,python
pip install .

You can then run the following code.

import llama2_rs

def test_llama2_13b_4_128act_can_generate():
    # Paths below are examples; point them at your exported model and tokenizer.
    model = llama2_rs.LlamaModel("lorca13b.act132.bin", False)
    tokenizer = llama2_rs.Tokenizer("tokenizer.bin")
    random = llama2_rs.Random()
    response = llama2_rs.generate(
        model,
        tokenizer,
        "Tell me zero-cost abstractions in Rust ",
        50,
        random,
        0.0,
    )
    print(response)

Todos

Configuration

To make the model as fast as possible, the configuration is fixed at compile time, so you need to compile a new version to adapt to other Llama versions. The settings currently live in .cargo/config. Loading will fail if they disagree with the binary model file being loaded. To turn quantization off, set quant="no".
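
As a hypothetical illustration (the constant names here are made up; the real ones live in the crate's source), the compile-time features pick fixed model dimensions roughly like this:

// Illustrative sketch only: real names and values differ.
#[cfg(feature = "7B")]
pub const DIM: usize = 4096;
#[cfg(feature = "13B")]
pub const DIM: usize = 5120;
#[cfg(feature = "70B")]
pub const DIM: usize = 8192;

// The GPT-Q group size must match how the model was exported.
#[cfg(feature = "group_64")]
pub const GROUP_SIZE: usize = 64;
#[cfg(feature = "group_128")]
pub const GROUP_SIZE: usize = 128;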

See Also

Originally a Rust port of Karpathy's llama2.c, but it now has a bunch more features to make it scale up to 70B.

Also check out:

How does it work?

Started as a port of the original code, with extra type information to make it easier to extend.
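
For a flavor of what that type information buys, here is a minimal sketch (hypothetical constants and field names, not the crate's actual definitions): weights are fixed-size array types behind a memory map, so shapes are checked at compile time and loading a checkpoint is just an mmap.

use std::fs::File;
use memmap2::{Mmap, MmapOptions};

const DIM: usize = 4096;
const VOCAB: usize = 32000;

#[repr(C)]
struct TWeights {
    token_embedding: [[f32; DIM]; VOCAB],
    // ...remaining tensors, all with statically known shapes...
}

fn load(path: &str) -> std::io::Result<(Mmap, &'static TWeights)> {
    let file = File::open(path)?;
    let mmap = unsafe { MmapOptions::new().map(&file)? };
    // Reinterpret the mapped bytes as the weight struct. The Mmap must be
    // kept alive alongside the reference for this to remain valid.
    let weights = unsafe { &*(mmap.as_ptr() as *const TWeights) };
    Ok((mmap, weights))
}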

There are some dependencies:

  • memmap2 for memory mapping.
  • rayon for parallel computation.
  • clap for command-line args.
  • PyO3 for Python calling.
  • portable_simd for SIMD support (sketched below).
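
For a rough sense of how rayon and portable_simd fit together, here is a minimal sketch (not the crate's actual kernel; it assumes the row length is a multiple of the lane count) where rayon parallelizes over output rows and portable_simd vectorizes each dot product:

#![feature(portable_simd)]
use std::simd::f32x8;
use rayon::prelude::*;

// y = W·x with W stored row-major as rows of length n.
fn matvec(y: &mut [f32], w: &[f32], x: &[f32], n: usize) {
    y.par_iter_mut().enumerate().for_each(|(i, yi)| {
        let row = &w[i * n..(i + 1) * n];
        let mut acc = f32x8::splat(0.0);
        for (rc, xc) in row.chunks_exact(8).zip(x.chunks_exact(8)) {
            acc += f32x8::from_slice(rc) * f32x8::from_slice(xc);
        }
        *yi = acc.to_array().iter().sum::<f32>();
    });
}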

Authors

Llama2.rs is written by @srush and @rachtsingh.

Comments
  • Working CUDA version

    Introduces a CUDA version of the GPTQ kernel. This uses Triton to write a custom kernel for this code. On my machine it is about 2x faster than CPU :( I was expecting it to be faster, but I guess the CPU is pretty good. Main features:

    • The Triton kernel is implemented in Python and exported as a PTX file.
    • Called from Rust using fp16 and the cust framework. These are gated in the installation.
    • Had to move things around to make them both work. The main call is behind a new gpu feature.

    Right now this copies a bunch of code from the GPTQ data structure. There is likely a cleaner way to do this.

    This almost killed me. Triton is so buggy and hard to use.

    There is also a test that checks that the CPU and GPU give the same results. However I need to read up on Rust testing to understand how to make this a real test.

    opened by srush 7
  • Build script

    This one is maybe just an idea - Cargo doesn't seem to support key/value features, but that's the best way to let crates that depend on this choose the model size. So here I set Cargo features (e.g. "7B") and then the build.rs converts that to a k/v pair (model_size = "7B"). I'm not sure the conversion is necessary but I didn't want to mess too much with your existing setup.
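
    A rough sketch of the mechanism (illustrative, not necessarily the PR's exact build.rs): Cargo exposes each enabled feature to build scripts as a CARGO_FEATURE_* environment variable, which the script can re-emit as a key/value cfg.

    use std::env;

    fn main() {
        // Cargo sets CARGO_FEATURE_<NAME> for every enabled feature.
        let model_size = if env::var("CARGO_FEATURE_70B").is_ok() {
            "70B"
        } else if env::var("CARGO_FEATURE_13B").is_ok() {
            "13B"
        } else {
            "7B"
        };
        // Source code can then match on #[cfg(model_size = "7B")] and friends.
        println!("cargo:rustc-cfg=model_size=\"{}\"", model_size);
    }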

    Separately, this sets AVX512 as an enableable feature that errors if the host machine doesn't support AVX512. My home computer doesn't have it, so I haven't checked it.

    opened by rachtsingh 5
  • Python Versions

    I am trying to run export.py, but I'm running into the following error:

    Traceback (most recent call last):
      File "/home/veda/Downloads/llama2.rs/export.py", line 134, in <module>
        load_and_export("", output_path)
      File "/home/veda/Downloads/llama2.rs/export.py", line 126, in load_and_export
        export(model, output_path)
      File "/home/veda/Downloads/llama2.rs/export.py", line 57, in export
        hidden_dim = model.layers[0].mlp.up_proj.build()[0].shape[1]
      File "/home/veda/miniconda3/envs/llama2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
        raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
    AttributeError: 'GeneralQuantLinear' object has no attribute 'build'

    I believe this is related to having the incorrect version of some library, but I can't determine the correct version (I'm not really much of a Python person). My versions are:

    Python 3.10.9
    torch==2.1.0.dev20230812+rocm5.5 
    transformers==4.31.0
    auto-gptq==0.3.0
    

    I would suggest either updating the readme to include the exact versions to use, or adding a requirements.txt.

    opened by VedaRePowered 4
  • Non-mmap'ed weights

    @rachtsingh

    For the CUDA code, I convert the model to a different form, let's call it QCudaWeights. However, since it is not mmap'ed, it now has an owner, and I'm not sure how to create a LlamaModel from this.

    pub struct LlamaModel {
        mmap: Mmap,                     // keeps the memory-mapped file alive
        pub config: Config,
        pub weights: &'static TWeights, // borrowed out of the mmap above
    }
    

    Any ideas for how to have a LlamaModel that can handle either owned or non-owned weights? Mmap is pretty interesting as a concept that breaks borrowing in Rust.

    opened by srush 2
  • [wip] Cuda

    Playing around with a CUDA kernel for GPT-Q in triton. Was able to get a version working but it was too slow. Will probably get around to finishing this next week.

    opened by srush 2
  • Configure model size constants using cfg attrs

    Hey Sasha,

    I was trying out llama2.rs and wanted to swap between the 7B/13B versions on the fly, and I think using conditional compilation here makes it a bit easier.

    I also spent some time trying to optimize the SIMD but it seems really fast! I couldn't find any easy optimizations.

    opened by rachtsingh 2
  • Attempt to go even faster

    Two main things I did:

    • Since all the dimensions you're vectorizing exactly match the lane sizes, you can move from chunks() to chunks_exact(). This avoids a branch, letting the compiler drop the cmp/jmp and the epilogue it would otherwise generate "just in case".

    • Emit a prefetch (x86_64 only, _mm_prefetch) in the innermost loop doing the dequantization from zero point + scaling. It seems the compiler (or the CPU prefetcher) is not catching the linear access over qzeros and generates cache misses. With the prefetch instruction we ask to load the next chunk into the L1/L2[/L3] caches ahead of time (see the sketch below).
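
    An illustrative sketch of the two ideas (not the PR's actual diff; the loop body is a stand-in for the real dequantization math):

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

    fn hot_loop(qzeros: &[i32]) -> i64 {
        let mut acc = 0i64;
        // chunks_exact() promises full chunks, so no remainder branch is emitted.
        for (i, chunk) in qzeros.chunks_exact(8).enumerate() {
            #[cfg(target_arch = "x86_64")]
            unsafe {
                // Hint the next chunk of qzeros toward L1 while we process this one.
                let next = qzeros.as_ptr().add((i + 1) * 8) as *const i8;
                _mm_prefetch::<_MM_HINT_T0>(next);
            }
            acc += chunk.iter().map(|&z| z as i64).sum::<i64>();
        }
        acc
    }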

    opened by mfuntowicz 2
  • Made Python support optional

    This wraps a bunch of the Python scaffolding behind a python feature so you can compile the binary without PyO3 at all. The default pip install . and maturin build commands use the python feature, so there's no change necessary to the README, I think.

    Shrinks the binary from 38MB to 31MB on my machine.
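
    Roughly, the gating looks like this (a sketch, not the PR's exact code, shown with the PyO3 0.19-era module signature):

    // Only compiled when building with --features python.
    #[cfg(feature = "python")]
    mod python_bindings {
        use pyo3::prelude::*;

        #[pymodule]
        fn llama2_rs(_py: Python<'_>, _m: &PyModule) -> PyResult<()> {
            // Register LlamaModel, Tokenizer, generate, etc. on _m here.
            Ok(())
        }
    }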

    opened by rachtsingh 1
  • Python interface

    Does the following:

    1. Creates a Python package llama2_rs in addition to the existing binary (also called llama2_rs) using PyO3, which exposes an interface for doing inference from Python. You can see an example in tests/test_model_can_run.py
    2. Moves some of the code needed by both paths (e.g. prefill and generate) into a new inference.rs so it can be used from both places.
    3. Wraps up the &TWeights into a new struct LlamaModel that holds the mmap as well.

    Notes:

    • Includes the change in the export.py PR since it was useful for testing.
    • I think some of the names could be changed, in particular maybe just llama2 instead of llama2_rs?
    • Note the pinned auto-gptq version... I thought it safest until we figure out what's going on with https://github.com/srush/llama2.rs/issues/23
    • cargo build --release does still generate a binary here, but if you want the Python package installed you want to run: pip install maturin; maturin develop --release, so if you decide to merge, might need a docs change.

    Next up: write a check that the matvec results are the same in AutoGPTQ and here?

    opened by rachtsingh 1
  • readme commands doesn't work

    • There is no command for installing llama2-70b-q.bin, and this doesn't work:

    pip install torch transformers [auto-gptq](https://github.com/PanQiWei/AutoGPTQ) python export.py llama2-70b-q.bin

    • This works:

    pip install torch transformers auto-gptq python export.py llama2-70b-q.bin

    opened by AsureDay 1
  • Max vocab size

    Cuts off some of the additional tokens that are added for llama fine tunes (will support in a future PR).

    https://github.com/srush/llama2.rs/issues/22

    opened by srush 0