Run LLaMA inference on CPU, with Rust 🦀🚀🦙

Overview

LLaMA-rs

Do the LLaMA thing, but now in Rust 🦀 🚀 🦙

A llama riding a crab, AI-generated

Image by @darthdeus, using Stable Diffusion

Gif showcasing language generation using llama-rs

LLaMA-rs is a Rust port of the llama.cpp project. This allows running inference for Facebook's LLaMA model on a CPU with good performance using full precision, f16 or 4-bit quantized versions of the model.

Just like its C++ counterpart, it is powered by the ggml tensor library, achieving the same performance as the original code.

Getting started

Make sure you have Rust 1.65.0 or above and a C toolchain¹ set up.

llama-rs is a Rust library, while llama-cli is a CLI application that wraps llama-rs and offers basic inference capabilities.
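
If you want to embed llama-rs directly rather than go through llama-cli, the general shape is sketched below. The names (Model::load, start_session, inference_with_prompt) come up in the issues further down this page, but the exact signatures here are assumptions for illustration, not the crate's documented API:

// Illustrative-only sketch of embedding llama-rs as a library; signatures are assumed.
use std::convert::Infallible;

fn main() {
    // Load a ggml-format model (path and context length are placeholders).
    let model = llama_rs::Model::load("ggml-model-q4_0.bin", 512, |_progress| {})
        .expect("could not load model");

    // Start a session and stream tokens for a prompt as they are generated.
    let mut session = model.start_session(Default::default());
    session
        .inference_with_prompt::<Infallible>(
            &model,
            &Default::default(),
            "Tell me how cool the Rust programming language is:",
            &mut rand::thread_rng(),
            |token| {
                print!("{token}");
                Ok(())
            },
        )
        .expect("inference failed");
}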

The following instructions explain how to build llama-cli.

NOTE: For best results, make sure to build and run in release mode. Debug builds are going to be very slow.

Building using cargo

Run

cargo install --git https://github.com/rustformers/llama-rs llama-cli

to install llama-cli to your Cargo bin directory, which rustup is likely to have added to your PATH.

It can then be run through llama-cli.

Building from repository

Clone the repository, and then build it with

cargo build --release --bin llama-cli

The resulting binary will be at target/release/llama-cli[.exe].

It can also be run directly through Cargo, using

cargo run --release --bin llama-cli -- <ARGS>

This is useful for development.

Getting the weights

In order to run the inference code in llama-rs, a copy of the model's weights is required. Currently, the only legal source to get the weights is this repository. Note that the choice of words may or may not hint at the existence of other kinds of sources.

After acquiring the weights, it is necessary to convert them into a format that is compatible with ggml. To achieve this, follow the steps outlined below:

Warning

To run the Python scripts, a Python version of 3.9 or 3.10 is required. 3.11 is unsupported at the time of writing.

# Convert the model to f16 ggml format
python3 scripts/convert-pth-to-ggml.py /path/to/your/models/7B/ 1

# Quantize the model
# NOTE: quantization currently requires llama.cpp; this will be fixed once #84 is merged.

Note

The llama.cpp repository has additional information on how to obtain and run specific models, with some caveats:

Currently, llama-rs supports both the old (unversioned) and the new (versioned) ggml formats, but not the mmap-ready version that was recently merged.

Support for other open source models is currently planned. For models whose weights can be legally distributed, this section will be updated with scripts to make the install process as user-friendly as possible. Due to the model's licensing requirements, this is currently not possible with LLaMA itself, and a more lengthy setup is required.

Running

For example, try the following prompt:

llama-cli infer -m <path>/ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"

Some additional things to try:

  • Use --help to see a list of available options.

  • If you have the alpaca-lora weights, try repl mode!

    llama-cli repl -m <path>/ggml-alpaca-7b-q4.bin -f examples/alpaca_prompt.txt

    Gif showcasing alpaca repl mode

  • Sessions can be loaded (--load-session) or saved (--save-session) to file. To automatically load and save the same session, use --persist-session. This can be used to cache prompts to reduce load time, too:

    Gif showcasing prompt caching

    (This GIF shows an older version of the flags, but the mechanics are still the same.)

Docker

# To build (This will take some time, go grab some coffee):
docker build -t llama-rs .

# To run with prompt:
docker run --rm --name llama-rs -it -v ${PWD}/data:/data -v ${PWD}/examples:/examples llama-rs infer -m data/gpt4all-lora-quantized-ggml.bin -p "Tell me how cool the Rust programming language is:"

# To run with prompt file and repl (will wait for user input):
docker run --rm --name llama-rs -it -v ${PWD}/data:/data -v ${PWD}/examples:/examples llama-rs repl -m data/gpt4all-lora-quantized-ggml.bin -f examples/alpaca_prompt.txt

Q&A

Why did you do this?

It was not my choice. Ferris appeared to me in my dreams and asked me to rewrite this in the name of the Holy crab.

Seriously now.

Come on! I don't want to get into a flame war. You know how it goes, something something memory something something cargo is nice, don't make me say it, everybody knows this already.

I insist.

Sheesh! Okaaay. After seeing the huge potential of llama.cpp, the first thing I did was to see how hard it would be to turn it into a library to embed in my projects. I started digging into the code, and realized the heavy lifting is done by ggml (a C library, easy to bind to Rust) and the whole project was just around 2k lines of C++ code (not so easy to bind). After a couple of (failed) attempts to build an HTTP server into the tool, I realized I'd be much more productive if I just ported the code to Rust, where I'm more comfortable.

Is this the real reason?

Haha. Of course not. I just like collecting imaginary internet points, in the form of little stars, that people seem to give to me whenever I embark on pointless quests for rewriting X thing, but in Rust.

How is this different from llama.cpp?

This is a reimplementation of llama.cpp that does not share any code with it outside of ggml. This was done for a variety of reasons:

  • llama.cpp requires a C++ compiler, which can cause problems for cross-compilation to more esoteric platforms. An example of such a platform is WebAssembly, which can require a non-standard compiler SDK.
  • Rust is easier to work with from a development and open-source perspective; it offers better tooling for writing "code in the large" with many other authors. Additionally, we can benefit from the larger Rust ecosystem with ease.
  • We would like to make ggml an optional backend (see this issue).

In general, we hope to build a solution for model inferencing that is as easy to use and deploy as any other Rust crate.

Footnotes

  1. A modern-ish C toolchain is required to compile ggml. A C++ toolchain should not be necessary.

Comments
  • Copy v_transposed like llama.cpp

    See https://github.com/ggerganov/llama.cpp/pull/439

    Closes #67

    I'm not necessarily proposing to merge this, just putting it here in case it's useful.


    From my very, very, very unscientific testing, it seems like this does very slightly increase memory usage and also increases token generation time a little bit.

    These notes are super unscientific, but included just in case they're useful. Also note that these tests were run on a machine running other applications like VS Code, web browsers, etc. The tests with the 30B model are close to my machine's memory limit (32GB), so they may have caused some swapping as well.

    The differences are definitely in the margin of error just because the tests weren't very controlled. (It also did seem like it made more of a difference with 12 threads vs 6. My CPU only has 6 physical cores.)

    ==== 7B 12t
    new 
            Maximum resident set size (kbytes): 4261744
    feed_prompt_duration: 5502ms
    prompt_tokens: 18
    predict_duration: 14414ms
    predict_tokens: 50
    per_token_duration: 288.280ms
    
    6t
           Maximum resident set size (kbytes): 4361328
    feed_prompt_duration: 3942ms
    prompt_tokens: 18
    predict_duration: 47036ms
    predict_tokens: 146
    per_token_duration: 322.164ms
    
    old 12t
            Maximum resident set size (kbytes): 4253076
    feed_prompt_duration: 4119ms
    prompt_tokens: 18
    predict_duration: 12705ms
    predict_tokens: 50
    per_token_duration: 254.100ms
    
    old 6t
          Maximum resident set size (kbytes): 4290144
    feed_prompt_duration: 4001ms
    prompt_tokens: 18
    predict_duration: 39464ms
    predict_tokens: 146
    per_token_duration: 270.301ms
    
            
    -------------
    new 13B 12t
            Maximum resident set size (kbytes): 8326708
    feed_prompt_duration: 8033ms
    prompt_tokens: 18
    predict_duration: 83420ms
    predict_tokens: 146
    per_token_duration: 571.370ms
    
    new 13B 6t
        Maximum resident set size (kbytes): 8173012
    feed_prompt_duration: 7985ms
    prompt_tokens: 18
    predict_duration: 42496ms
    predict_tokens: 82
    per_token_duration: 518.244ms
    
    feed_prompt_duration: 8160ms
    prompt_tokens: 18
    predict_duration: 41615ms
    predict_tokens: 82
    per_token_duration: 507.500ms
    
    
    old 13B 12t
            Maximum resident set size (kbytes): 8210536
    feed_prompt_duration: 7813ms
    prompt_tokens: 18
    predict_duration: 71144ms
    predict_tokens: 146
    per_token_duration: 487.288ms
    
    6t
    feed_prompt_duration: 9226ms
    prompt_tokens: 18
    predict_duration: 39793ms
    predict_tokens: 82
    per_token_duration: 485.280ms
    
    ----
    
    new 30B 6t
            Maximum resident set size (kbytes): 20291036
    feed_prompt_duration: 18688ms
    prompt_tokens: 18
    predict_duration: 97053ms
    predict_tokens: 82
    per_token_duration: 1183.573ms
    
    old
            Maximum resident set size (kbytes): 20257344
    feed_prompt_duration: 19693ms
    prompt_tokens: 18
    predict_duration: 93953ms
    predict_tokens: 82
    per_token_duration: 1145.768ms
    
    
    opened by KerfuffleV2 19
  • Partially convert `pth` to `ggml`

    I'm still working on adding the weights to the file; right now it only adds params and tokens (the md5 hash matches the llama.cpp-generated file without the weights). There's also no quantizing yet. I also added generate and convert subcommands to the CLI. Let me know if anything needs changing 🙂

    Partially resolves https://github.com/rustformers/llama-rs/issues/21

    opened by karelnagel 18
  • Embedding extraction

    Implements #56.

    I ported the llama.cpp code to allow extracting word embeddings and logits from a call to evaluate. I validated this using an ad_hoc_test (currently hard-coded in main) and the results seem to make sense: the more similar two words are, the higher the dot product of their embeddings, which is exactly how embeddings should work (see the sketch after this item).

    This serves as a proof of concept, but we need to discuss the API before we can merge. Currently, I added an EvaluateOutputRequest struct so we can expand it in the future, allowing retrieval of other interesting bits of the inference process; these values are not easily obtainable using the regular APIs (i.e. feed_prompt, infer_next_token). I'm not sure if that's a problem: are we OK with users having to drop down to the lower-level evaluate function when they need to retrieve this kind of information?

    On a different note, I would really like for someone with a bit of understanding to validate that the results here are correct. Perhaps @hlhr202 can shed some light there?

    Finally, should we consider exposing this to llama-cli at all?

    opened by setzer22 15
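
    As an illustration of the similarity check described above, here is a self-contained sketch; the vectors are toy values standing in for embeddings returned from evaluate:

    // Toy demonstration: more similar words should give a larger (normalized) dot product.
    fn dot(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }

    fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
        dot(a, b) / (dot(a, a).sqrt() * dot(b, b).sqrt())
    }

    fn main() {
        // Placeholder embeddings; in practice these would come from the model.
        let cat = [0.9_f32, 0.1, 0.3];
        let dog = [0.8_f32, 0.2, 0.25];
        let car = [0.1_f32, 0.9, 0.7];

        println!("cat~dog: {:.3}", cosine_similarity(&cat, &dog)); // expected: high
        println!("cat~car: {:.3}", cosine_similarity(&cat, &car)); // expected: low
    }
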
  • fix(llama): buffer tokens until valid UTF-8

    As discussed on Discord and in #11.

    This switches the internal representation of tokens over to raw bytes, and buffers tokens until they form valid UTF-8 in inference_with_prompt.

    Open questions:

    1. Should we use smallvec or similar for tokens? We're going to be making a lot of unnecessary tiny allocations as-is.
    2. FnMut as a bound is OK, right?
    opened by philpax 12
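
    For reference, the buffering idea described in the previous item reads roughly like this in plain std Rust (a sketch of the approach, not the PR's actual code):

    // Accumulate raw token bytes and only emit the prefix that is already valid UTF-8.
    fn drain_valid_utf8(buffer: &mut Vec<u8>) -> String {
        // Find how much of the buffer is valid UTF-8 so far.
        let valid = match std::str::from_utf8(buffer) {
            Ok(s) => s.len(),
            Err(e) => e.valid_up_to(),
        };
        // Emit the valid prefix; keep any trailing partial code point buffered.
        let out = String::from_utf8(buffer[..valid].to_vec()).expect("prefix is valid UTF-8");
        buffer.drain(..valid);
        out
    }

    fn main() {
        let mut buffer = Vec::new();
        // "é" is 0xC3 0xA9 in UTF-8; pretend the tokenizer splits it across two tokens.
        for token_bytes in [&b"caf"[..], &[0xC3u8][..], &[0xA9u8][..]] {
            buffer.extend_from_slice(token_bytes);
            print!("{}", drain_valid_utf8(&mut buffer));
        }
        println!(); // prints "café" without ever emitting broken UTF-8
    }
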
  • cross build failed since ggml-sys 0.1.0

    Hi, llama-node author here. I'd like to know how to cross-compile ggml-sys. I'm compiling for Apple arm64 on Ubuntu WSL2.

    command is

    cargo build --release --target aarch64-apple-darwin
    

    the printed error

    error: failed to run custom build command for `ggml-sys v0.1.0 (https://github.com/rustformers/llama-rs.git?branch=main#57440bff)`
    
    Caused by:
      process didn't exit successfully: `/home/hlhr202/workspace/llama-node/target/release/build/ggml-sys-fa5fcd66bb36a9eb/build-script-build` (exit status: 1)
      --- stdout
      cargo:rerun-if-changed=ggml
      OPT_LEVEL = Some("3")
      TARGET = Some("aarch64-apple-darwin")
      HOST = Some("x86_64-unknown-linux-gnu")
      cargo:rerun-if-env-changed=CC_aarch64-apple-darwin
      CC_aarch64-apple-darwin = None
      cargo:rerun-if-env-changed=CC_aarch64_apple_darwin
      CC_aarch64_apple_darwin = None
      cargo:rerun-if-env-changed=TARGET_CC
      TARGET_CC = None
      cargo:rerun-if-env-changed=CC
      CC = None
      RUSTC_LINKER = Some("/home/hlhr202/.cache/napi-rs-nodejs/zig-linker-aarch64-apple-darwin.sh")
      cargo:rerun-if-env-changed=CROSS_COMPILE
      CROSS_COMPILE = None
      cargo:rerun-if-env-changed=CFLAGS_aarch64-apple-darwin
      CFLAGS_aarch64-apple-darwin = None
      cargo:rerun-if-env-changed=CFLAGS_aarch64_apple_darwin
      CFLAGS_aarch64_apple_darwin = None
      cargo:rerun-if-env-changed=TARGET_CFLAGS
      TARGET_CFLAGS = None
      cargo:rerun-if-env-changed=CFLAGS
      CFLAGS = None
      cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
      CRATE_CC_NO_DEFAULTS = None
      DEBUG = Some("false")
      CARGO_CFG_TARGET_FEATURE = Some("aes,crc,dit,dotprod,dpb,dpb2,fcma,fhm,flagm,fp16,frintts,jsconv,lor,lse,neon,paca,pacg,pan,pmuv3,ras,rcpc,rcpc2,rdm,sb,sha2,sha3,ssbs,vh")
      cargo:rustc-link-lib=framework=Accelerate
      running: "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-arch" "arm64" "-I" "include" "-mcpu=native" "-pthread" "-DGGML_USE_ACCELERATE" "-DNDEBUG" "-o" "/home/hlhr202/workspace/llama-node/target/aarch64-apple-darwin/release/build/ggml-sys-132533d640c61337/out/ggml/ggml.o" "-c" "ggml/ggml.c"
      cargo:warning=cc: warning: ‘-mcpu=’ is deprecated; use ‘-mtune=’ or ‘-march=’ instead
      cargo:warning=cc: error: unrecognized command-line option ‘-arch’
      exit status: 1
    
      --- stderr
    
    
      error occurred: Command "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-arch" "arm64" "-I" "include" "-mcpu=native" "-pthread" "-DGGML_USE_ACCELERATE" "-DNDEBUG" "-o" "/home/hlhr202/workspace/llama-node/target/aarch64-apple-darwin/release/build/ggml-sys-132533d640c61337/out/ggml/ggml.o" "-c" "ggml/ggml.c" with args "cc" did not execute successfully (status code exit status: 1).
    
    
    warning: build failed, waiting for other jobs to finish...
    You are cross compiling aarch64-apple-darwin on linux host
    Internal Error: Command failed: cargo build --release --target aarch64-apple-darwin
        at checkExecSyncError (node:child_process:828:11)
        at Object.execSync (node:child_process:902:15)
        at BuildCommand.<anonymous> (/home/hlhr202/workspace/llama-node/packages/core/node_modules/@napi-rs/cli/scripts/index.js:11474:30)
        at Generator.next (<anonymous>)
        at fulfilled (/home/hlhr202/workspace/llama-node/packages/core/node_modules/@napi-rs/cli/scripts/index.js:3509:58)
    
    opened by hlhr202 12
  • 30B model doesn't load

    Following the same steps works for the 7B and 13B models; with the 30B model I get

    thread 'main' panicked at 'Could not load model: Tensor tok_embeddings.weight has the wrong size in model file', llama-rs/src/main.rs:39:10
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    
    opened by RCasatta 12
  • Let's collaborate

    [apologies for early send, accidentally hit enter]

    Hey there! Turns out we think on extremely similar wavelengths - I did the exact same thing as you, for the exact same reasons (libraryification), and through the use of similar abstractions: https://github.com/philpax/ggllama

    Couple of differences I spotted on my quick perusal:

    • My version builds on both Windows and Linux, but fails to infer correctly past the first round. Windows performance is also pretty crappy because ggml doesn't support multithreading on Windows.
    • I use PhantomData with the Tensors to prevent them outliving the Context they're spawned from.
    • I vendored llama.cpp in so that I could track it more directly and use its ggml.c/h, and to make it obvious which version I was porting.

    Given yours actually works, I think that it's more promising :p

    What are your immediate plans, and what do you want people to help you out with? My plan was to get it working, then librarify it, make a standalone Discord bot with it as a showcase, and then investigate using a Rust-native solution for the tensor manipulation (burn, ndarray, arrayfire, etc) to free it from the ggml dependency.

    opened by philpax 12
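
    For readers unfamiliar with the PhantomData trick mentioned above, here is a minimal sketch of the pattern (the types are placeholders, not the actual bindings): the marker ties each Tensor's lifetime to the Context it came from, so it cannot outlive the backing allocation.

    use std::marker::PhantomData;

    struct Context {
        // In real bindings this would own the ggml context/allocation.
        _buffer: Vec<u8>,
    }

    struct Tensor<'ctx> {
        // Raw handle into the context's memory (placeholder here).
        _ptr: *mut u8,
        // Zero-sized marker that borrows the context for 'ctx.
        _marker: PhantomData<&'ctx Context>,
    }

    impl Context {
        fn new_tensor(&self) -> Tensor<'_> {
            Tensor {
                _ptr: std::ptr::null_mut(),
                _marker: PhantomData,
            }
        }
    }

    fn main() {
        let ctx = Context { _buffer: vec![0; 16] };
        let tensor = ctx.new_tensor();
        // drop(ctx); // would not compile: `tensor` still borrows `ctx`
        let _ = tensor;
    }
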
  •  llama-cli: Could not load model: InvalidMagic { path: ... }

    The model successfully runs on llama.cpp but not with llama-rs.

    Command:

    cargo run --release -- -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"
    
    PS C:\Users\Usuário\Desktop\llama-rs> cargo run --release -- -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"
        Finished release [optimized] target(s) in 2.83s
         Running `target\release\llama-cli.exe -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"`
    thread 'main' panicked at 'Could not load model: InvalidMagic { path: "C:\\Users\\Usuário\\Downloads\\LLaMA\\7B\\ggml-model-q4_0.bin" }', llama-cli\src\main.rs:147:10
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    error: process didn't exit successfully: `target\release\llama-cli.exe -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"` (exit code: 101)
    
    opened by mguinhos 11
  • Good ideas from llama.cpp

    I've been tracking the llama.cpp repo. I'll use this issue to list any good ideas / things we should be aware of to keep up with in Rust land:

    • [ ] GPTQ quantization :eyes: https://github.com/ggerganov/llama.cpp/issues/9
    • [ ] Not sure how that is even possible (isn't the task I/O bound?), but people are claiming great speedups when loading the model in parallel. This should be pretty easy to implement using rayon (see the sketch after this item). https://github.com/ggerganov/llama.cpp/issues/85#issuecomment-1470814328
    • [ ] Seems there's an issue with the normalization function used. It should be RMSNorm. Would be good to keep an eye on this and simply swap the ggml function once it's implemented on the C++ side :eyes: https://github.com/ggerganov/llama.cpp/issues/173#issuecomment-1470801468
    • [x] It looks like dropping to F16 for the memory_k and memory_v reduces memory usage. It is not known whether this hurts quality, but we should follow the C++ side and add a flag to drop to F16 for the memory. This would also make the cached prompts added as part of #14 take half the size on disk, which is a nice bonus: https://github.com/ggerganov/llama.cpp/pull/154#pullrequestreview-1342214004
    • [x] Looks like the fix from #1 just landed upstream. We should make sure to fix it here too https://github.com/ggerganov/llama.cpp/pull/161
    • [ ] The tokenizer used in llama.cpp has some issues. It would be better to use sentencepiece, which is the one that was used during the original LLaMA training. There seems to be a rust crate for sentencepiece. We should check if a drop-in replacement is possible https://github.com/ggerganov/llama.cpp/issues/167
    enhancement 
    opened by setzer22 11
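
    A sketch of the parallel-loading idea using rayon; load_part and the file names are hypothetical stand-ins for the real loader, so treat this as an illustration of the shape rather than working loader code:

    use rayon::prelude::*;

    fn load_part(path: &str) -> Result<Vec<u8>, std::io::Error> {
        // Placeholder "work": the real loader would parse tensors, not just read bytes.
        std::fs::read(path)
    }

    fn main() -> Result<(), std::io::Error> {
        let part_paths = vec!["model.bin.0", "model.bin.1", "model.bin.2"];

        // Each part is read on a worker thread; results come back in order.
        let parts: Result<Vec<_>, _> = part_paths.par_iter().map(|p| load_part(p)).collect();
        let parts = parts?;

        println!("loaded {} parts", parts.len());
        Ok(())
    }
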
  • Warning: Bad token in vocab at index xxx

    Running cargo run --release -- -m ~/dev/llama.cpp/models/7B/ggml-model-f16.bin -f prompt gives a bunch of "Warning: Bad token in vocab at index..." messages (see attached screenshot).

    The path points to ggml converted llama model, which I have verified that they work with llama.cpp

    bug 
    opened by CheatCod 10
  • Void Linux Build Error

    System Info

    OS: Void Linux (x86_64, glibc), Linux kernel 6.1.21_1
    rustc: rustc 1.68.1 (8460ca823 2023-03-20)
    cargo: cargo 1.68.1 (115f34552 2023-02-26)

    Are the required C compilers present?

    I have clang version 12.0.1, and gcc version 12.2.0 installed.

    What error does the installation give?

    Here's the entire log, uploaded to termbin.com for convenience. This looks like the important bit:

    The following warnings were emitted during compilation:
    
    warning: In file included from /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/immintrin.h:107,
    warning:                  from ggml/ggml.c:155:
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h: In function 'ggml_vec_dot_f16':
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
    warning:    52 | _mm256_cvtph_ps (__m128i __A)
    warning:       | ^~~~~~~~~~~~~~~
    warning: ggml/ggml.c:916:33: note: called from here
    warning:   916 | #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
    warning:       |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
    warning:   926 | #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
    warning:       |                                     ^~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:1279:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
    warning:  1279 |             ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
    warning:       |                     ^~~~~~~~~~~~~~~~~
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
    warning:    52 | _mm256_cvtph_ps (__m128i __A)
    warning:       | ^~~~~~~~~~~~~~~
    warning: ggml/ggml.c:916:33: note: called from here
    warning:   916 | #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
    warning:       |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
    warning:   926 | #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
    warning:       |                                     ^~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:1278:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
    warning:  1278 |             ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
    warning:       |                     ^~~~~~~~~~~~~~~~~
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
    warning:    52 | _mm256_cvtph_ps (__m128i __A)
    warning:       | ^~~~~~~~~~~~~~~
    warning: ggml/ggml.c:916:33: note: called from here
    warning:   916 | #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
    warning:       |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
    warning:   926 | #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
    warning:       |                                     ^~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:1278:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
    warning:  1278 |             ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
    warning:       |                     ^~~~~~~~~~~~~~~~~
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
    warning:    52 | _mm256_cvtph_ps (__m128i __A)
    warning:       | ^~~~~~~~~~~~~~~
    warning: ggml/ggml.c:916:33: note: called from here
    warning:   916 | #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
    warning:       |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
    warning:   926 | #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
    warning:       |                                     ^~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:1279:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
    warning:  1279 |             ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
    warning:       |                     ^~~~~~~~~~~~~~~~~
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
    warning:    52 | _mm256_cvtph_ps (__m128i __A)
    warning:       | ^~~~~~~~~~~~~~~
    warning: ggml/ggml.c:916:33: note: called from here
    warning:   916 | #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
    warning:       |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
    warning:   926 | #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
    warning:       |                                     ^~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:1278:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
    warning:  1278 |             ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
    warning:       |                     ^~~~~~~~~~~~~~~~~
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
    warning:    52 | _mm256_cvtph_ps (__m128i __A)
    warning:       | ^~~~~~~~~~~~~~~
    warning: ggml/ggml.c:916:33: note: called from here
    warning:   916 | #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
    warning:       |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
    warning:   926 | #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
    warning:       |                                     ^~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:1279:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
    warning:  1279 |             ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
    warning:       |                     ^~~~~~~~~~~~~~~~~
    
    error: failed to run custom build command for `ggml-raw v0.1.0 (/home/outsider/.cargo/git/checkouts/llama-rs-962022d29f37c95e/a067431/ggml-raw)`
    

    Should this be reported to the base repository, ggml, as it's a compilation error for that project, or is it still relevant here?

    opened by 13-05 9
  • Support for Dolly models

    There are some pretty fresh models available from Databricks, called Dolly, trained using GPT-J. I've found one converted to GGML and quantized, but it fails to run with llama-rs. It'd be nice to add support for those models, as they seem promising for commercial usage.

    Currently, I'm getting this error using main branch:

    ❯ RUST_BACKTRACE=1 cargo r --bin llama-cli --release -- infer -m ~/dev/llm/int4_fixed_zero.bin -p "Tell me about yourself"
        Finished release [optimized] target(s) in 0.05s
         Running `target/release/llama-cli infer -m /dev/llm/int4_fixed_zero.bin -p 'Tell me about yourself'`
    [2023-04-16T16:03:16Z INFO  llama_cli::cli_args] ggml ctx size = 6316.05 MB
    
    [2023-04-16T16:03:16Z INFO  llama_cli::cli_args] Loading model part 1/1 from '/dev/llm/int4_fixed_zero.bin'
    
    thread 'main' panicked at 'Could not load model: UnknownTensor { tensor_name: "gpt_neox.embed_in.weight", path: "/dev/llm/int4_fixed_zero.bin" }', llama-cli/src/cli_args.rs:311:10
    stack backtrace:
       0: rust_begin_unwind
                 at /rustc/9aa5c24b7d763fb98d998819571128ff2eb8a3ca/library/std/src/panicking.rs:575:5
       1: core::panicking::panic_fmt
                 at /rustc/9aa5c24b7d763fb98d998819571128ff2eb8a3ca/library/core/src/panicking.rs:64:14
       2: core::result::unwrap_failed
                 at /rustc/9aa5c24b7d763fb98d998819571128ff2eb8a3ca/library/core/src/result.rs:1790:5
       3: llama_cli::cli_args::ModelLoad::load
       4: llama_cli::main
    note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
    
    

    Upstream discussion: https://github.com/ggerganov/llama.cpp/discussions/569
    Dolly GGML Q4 model: https://huggingface.co/snphs/dolly-v2-12b-q4
    Databricks announcement: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

    opened by vv9k 1
  • Build and execute our own computation graph

    At present, we are using GGML's computation graph. This works well, but it has a few flaws:

    1. We're reliant on whatever support GGML has for threading; the Rust threading ecosystem is more versatile/OS-agnostic
    2. Adding new operations requires patching GGML
    3. We're coupled pretty tightly to GGML, so switching to an alternate backend would be quite difficult; this will only get worse as we support more models
    4. Abstraction of shared pieces of functionality gets a little finicky with the exposed API

    After reading https://github.com/ggerganov/llama.cpp/discussions/915, I had a flash of inspiration and realised we could address these problems by using our own computation graph.

    The code would be fairly similar to what it is now - but instead of building up a GGML computation graph, we build up our own in Rust code with all of the usual strong-typing guarantees.

    To begin with, this computation graph would then be "compiled" to a GGML computation graph, so that it works identically.

    Once that's done, we would look at reimplementing the actual execution of the graph in Rust code and using GGML's operations to do so (e.g. we use its vec_dot_q4_0, etc).

    This would allow us to decouple from GGML in the future (#3), and gives us freedom to implement new operations that aren't supported by GGML without having to maintain our own patched version.

    Ideally, we would just use burn or something similar directly, but none of the existing libraries are in a position to serve our needs (GGML-like performance with quantization support). This lets us side-step that issue for now, and focus on describing models that could be executed by anything once support is available.


    Constructing our own computation graph and compiling it to GGML should be fairly simple (this could be done with petgraph or our own graph implementation, it's not that difficult).

    The main problem comes in the executor reimplementation - a lot of GGML's more complex operations are coupled to the executor, so we'd have to reimplement them (e.g. all the ggml_compute_forward_... functions). Additionally, a lot of the base operations are static void and not exposed to the outside world, so it's likely we'd have to patch GGML anyway.

    An alternate approach to full graph reimplementation might be to add support for custom elementwise operations once (as @KerfuffleV2 has done in their fork), so that we can polyfill custom operations from our computation graph.

    enhancement maintenance 
    opened by philpax 2
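
    To make the "compile our own graph to GGML" idea above concrete, here is a rough std-only sketch (no petgraph, all names invented for illustration). Ops can only refer to earlier nodes, so insertion order is already a topological order and a backend can walk it directly:

    // A tiny strongly-typed computation graph that a backend could "compile".
    type NodeId = usize;

    #[derive(Clone, Copy, Debug)]
    enum Op {
        Input,
        Add(NodeId, NodeId),
        MatMul(NodeId, NodeId),
    }

    #[derive(Default)]
    struct Graph {
        nodes: Vec<Op>,
    }

    impl Graph {
        fn push(&mut self, op: Op) -> NodeId {
            self.nodes.push(op);
            self.nodes.len() - 1
        }

        // Walk the nodes in insertion (topological) order and hand each op to a backend.
        // A real implementation would emit ggml_add/ggml_mul_mat calls instead of printing.
        fn compile(&self, mut emit: impl FnMut(NodeId, &Op)) {
            for (id, op) in self.nodes.iter().enumerate() {
                emit(id, op);
            }
        }
    }

    fn main() {
        let mut g = Graph::default();
        let x = g.push(Op::Input);
        let w = g.push(Op::Input);
        let h = g.push(Op::MatMul(w, x));
        let y = g.push(Op::Add(h, x));

        g.compile(|id, op| println!("node {id}: {op:?}"));
        let _ = y;
    }
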
  • Rename to `llm`

    With our (hopefully) incoming support for #85 and #75, this crate is growing beyond just LLaMA support. At the same time, llama_rs is a little unwieldy as a crate name.

    To accommodate this, I've taken the liberty of reserving the llm crate name. Just before we release 0.1.0, llama-rs will be renamed to llm, and llama-cli will become llm-cli (but likely have the binary name llm, so you can do llm bloom infer -m bloom_model -p "what's the capital of Virginia?").

    Each model would be under its own feature flag, so you would only pay for what you use.

    enhancement maintenance 
    opened by philpax 5
  • perf: add benchmarks

    https://github.com/rustformers/llama-rs/issues/4 mentions doing benchmarking. We should add some microbenchmarks to show the words per second vs. llama.cpp. In fact, we should use a similar benchmarking methodology to llama.cpp to get an apples-to-apples comparison.

    opened by jon-chuang 0
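
    As a starting point, a microbenchmark along these lines could live under benches/ using the Criterion crate. The dummy_feed_prompt function here is a stand-in for the real llama-rs call, so the numbers it produces mean nothing until it is wired up to actual model evaluation:

    use criterion::{criterion_group, criterion_main, Criterion};

    fn dummy_feed_prompt(prompt: &str) -> usize {
        // Placeholder "work": a real benchmark would feed the prompt to the model here.
        prompt.split_whitespace().count()
    }

    fn bench_feed_prompt(c: &mut Criterion) {
        c.bench_function("feed_prompt (short prompt)", |b| {
            b.iter(|| {
                dummy_feed_prompt(criterion::black_box(
                    "Tell me how cool the Rust programming language is:",
                ))
            })
        });
    }

    criterion_group!(benches, bench_feed_prompt);
    criterion_main!(benches);
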
  • feat: mmapped ggjt loader

    Fixes the issues in https://github.com/rustformers/llama-rs/pull/125

    Improvements:

    • Loading 7B Vicuna (q4):
      • default: warm start: 1785ms, cold start: 2618ms
      • --features="mmap": warm start: 7ms, cold start: 38ms
    • Loading 13B Vicuna (q4):
      • default: warm start: 4833ms, cold start: 5905ms
      • --features="mmap": warm start: 9ms, cold start: 33ms

    So we get a 250X-500X speedup! Higher than the advertised 10-100x :)

    @iacore

    opened by jon-chuang 7
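
    For readers wondering where the speedup reported above comes from: mapping the file defers reading tensor data until it is actually touched, instead of copying everything up front. A minimal sketch using the memmap2 crate (the file name is a placeholder and error handling is kept minimal):

    use std::fs::File;

    use memmap2::Mmap;

    fn main() -> std::io::Result<()> {
        let file = File::open("ggml-model-q4_0.bin")?;
        // Safety: the file must not be truncated by another process while mapped.
        let mmap = unsafe { Mmap::map(&file)? };

        // Tensor data can now be borrowed straight out of the mapping; nothing is
        // copied up front, which is why loading appears near-instant.
        println!("mapped {} bytes, first byte: {:#x}", mmap.len(), mmap[0]);
        Ok(())
    }
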