Run LLaMA inference on CPU, with Rust 🦀🚀🦙

Overview

LLaMA-rs

Do the LLaMA thing, but now in Rust 🦀 🚀 🦙

A llama riding a crab, AI-generated

Image by @darthdeus, using Stable Diffusion

Gif showcasing language generation using llama-rs

LLaMA-rs is a Rust port of the llama.cpp project. This allows running inference for Facebook's LLaMA model on a CPU with good performance using full precision, f16 or 4-bit quantized versions of the model.

Just like its C++ counterpart, it is powered by the ggml tensor library, achieving the same performance as the original code.

Getting started

Make sure you have Rust 1.65.0 or above and a C toolchain¹ set up.

llama-rs is a Rust library, while llama-cli is a CLI application that wraps llama-rs and offers basic inference capabilities.
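
If you want to embed llama-rs directly rather than go through llama-cli, the general shape is sketched below. The names (Model::load, start_session, inference_with_prompt) come up in the issues further down this page, but the exact signatures here are assumptions for illustration, not the crate's documented API:

// Illustrative-only sketch of embedding llama-rs as a library; signatures are assumed.
use std::convert::Infallible;

fn main() {
    // Load a ggml-format model (path and context length are placeholders).
    let model = llama_rs::Model::load("ggml-model-q4_0.bin", 512, |_progress| {})
        .expect("could not load model");

    // Start a session and stream tokens for a prompt as they are generated.
    let mut session = model.start_session(Default::default());
    session
        .inference_with_prompt::<Infallible>(
            &model,
            &Default::default(),
            "Tell me how cool the Rust programming language is:",
            &mut rand::thread_rng(),
            |token| {
                print!("{token}");
                Ok(())
            },
        )
        .expect("inference failed");
}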

The following instructions explain how to build llama-cli.

NOTE: For best results, make sure to build and run in release mode. Debug builds are going to be very slow.

Building using cargo

Run

cargo install --git https://github.com/rustformers/llama-rs llama-cli

to install llama-cli to your Cargo bin directory, which rustup is likely to have added to your PATH.

It can then be run through llama-cli.

Building from repository

Clone the repository, and then build it with

cargo build --release --bin llama-cli

The resulting binary will be at target/release/llama-cli[.exe].

It can also be run directly through Cargo, using

cargo run --release --bin llama-cli -- <ARGS>

This is useful for development.

Getting the weights

In order to run the inference code in llama-rs, a copy of the model's weights is required. Currently, the only legal source to get the weights is this repository. Note that the choice of words may or may not hint at the existence of other kinds of sources.

After acquiring the weights, it is necessary to convert them into a format that is compatible with ggml. To achieve this, follow the steps outlined below:

Warning

To run the Python scripts, a Python version of 3.9 or 3.10 is required. 3.11 is unsupported at the time of writing.

# Convert the model to f16 ggml format
python3 scripts/convert-pth-to-ggml.py /path/to/your/models/7B/ 1

# Quantize the model
# NOTE: quantization currently requires llama.cpp; this will be fixed once #84 is merged.

Note

The llama.cpp repository has additional information on how to obtain and run specific models, with some caveats:

Currently, llama-rs supports both the old (unversioned) and the new (versioned) ggml formats, but not the mmap-ready version that was recently merged.

Support for other open source models is currently planned. For models whose weights can be legally distributed, this section will be updated with scripts to make the install process as user-friendly as possible. Due to the model's licensing requirements, this is currently not possible with LLaMA itself, and a more lengthy setup is required.

Running

For example, try the following prompt:

llama-cli infer -m <path>/ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"

Some additional things to try:

  • Use --help to see a list of available options.

  • If you have the alpaca-lora weights, try repl mode!

    llama-cli repl -m <path>/ggml-alpaca-7b-q4.bin -f examples/alpaca_prompt.txt

    Gif showcasing alpaca repl mode

  • Sessions can be loaded (--load-session) or saved (--save-session) to file. To automatically load and save the same session, use --persist-session. This can be used to cache prompts to reduce load time, too:

    Gif showcasing prompt caching

    (This GIF shows an older version of the flags, but the mechanics are still the same.)

Docker

# To build (This will take some time, go grab some coffee):
docker build -t llama-rs .

# To run with prompt:
docker run --rm --name llama-rs -it -v ${PWD}/data:/data -v ${PWD}/examples:/examples llama-rs infer -m data/gpt4all-lora-quantized-ggml.bin -p "Tell me how cool the Rust programming language is:"

# To run with prompt file and repl (will wait for user input):
docker run --rm --name llama-rs -it -v ${PWD}/data:/data -v ${PWD}/examples:/examples llama-rs repl -m data/gpt4all-lora-quantized-ggml.bin -f examples/alpaca_prompt.txt

Q&A

Why did you do this?

It was not my choice. Ferris appeared to me in my dreams and asked me to rewrite this in the name of the Holy crab.

Seriously now.

Come on! I don't want to get into a flame war. You know how it goes, something something memory something something cargo is nice, don't make me say it, everybody knows this already.

I insist.

Sheesh! Okaaay. After seeing the huge potential of llama.cpp, the first thing I did was to see how hard it would be to turn it into a library to embed in my projects. I started digging into the code, and realized the heavy lifting is done by ggml (a C library, easy to bind to Rust) and the whole project was just around 2k lines of C++ code (not so easy to bind). After a couple of (failed) attempts to build an HTTP server into the tool, I realized I'd be much more productive if I just ported the code to Rust, where I'm more comfortable.

Is this the real reason?

Haha. Of course not. I just like collecting imaginary internet points, in the form of little stars, that people seem to give to me whenever I embark on pointless quests for rewriting X thing, but in Rust.

How is this different from llama.cpp?

This is a reimplementation of llama.cpp that does not share any code with it outside of ggml. This was done for a variety of reasons:

  • llama.cpp requires a C++ compiler, which can cause problems for cross-compilation to more esoteric platforms. An example of such a platform is WebAssembly, which can require a non-standard compiler SDK.
  • Rust is easier to work with from a development and open-source perspective; it offers better tooling for writing "code in the large" with many other authors. Additionally, we can benefit from the larger Rust ecosystem with ease.
  • We would like to make ggml an optional backend (see this issue).

In general, we hope to build a solution for model inferencing that is as easy to use and deploy as any other Rust crate.

Footnotes

  1. A modern-ish C toolchain is required to compile ggml. A C++ toolchain should not be necessary.

Comments
  • Copy v_transposed like llama.cpp

    See https://github.com/ggerganov/llama.cpp/pull/439

    Closes #67

    I'm not necessarily proposing to merge this, just putting it here in case it's useful.


    From my very, very, very unscientific testing, it seems like this does very slightly increase memory usage and also increases token generation time a little bit.

    These notes are super unscientific, but included just in case they're useful. Also note that these tests were run on a machine running other applications like VS Code, web browsers, etc. The tests with the 30B model are close to my machine's memory limit (32GB), so they may have caused some swapping as well.

    The differences are definitely in the margin of error just because the tests weren't very controlled. (It also did seem like it made more of a difference with 12 threads vs 6. My CPU only has 6 physical cores.)

    ==== 7B 12t
    new 
            Maximum resident set size (kbytes): 4261744
    feed_prompt_duration: 5502ms
    prompt_tokens: 18
    predict_duration: 14414ms
    predict_tokens: 50
    per_token_duration: 288.280ms
    
    6t
           Maximum resident set size (kbytes): 4361328
    feed_prompt_duration: 3942ms
    prompt_tokens: 18
    predict_duration: 47036ms
    predict_tokens: 146
    per_token_duration: 322.164ms
    
    old 12t
            Maximum resident set size (kbytes): 4253076
    feed_prompt_duration: 4119ms
    prompt_tokens: 18
    predict_duration: 12705ms
    predict_tokens: 50
    per_token_duration: 254.100ms
    
    old 6t
          Maximum resident set size (kbytes): 4290144
    feed_prompt_duration: 4001ms
    prompt_tokens: 18
    predict_duration: 39464ms
    predict_tokens: 146
    per_token_duration: 270.301ms
    
            
    -------------
    new 13B 12t
            Maximum resident set size (kbytes): 8326708
    feed_prompt_duration: 8033ms
    prompt_tokens: 18
    predict_duration: 83420ms
    predict_tokens: 146
    per_token_duration: 571.370ms
    
    new 13B 6t
        Maximum resident set size (kbytes): 8173012
    feed_prompt_duration: 7985ms
    prompt_tokens: 18
    predict_duration: 42496ms
    predict_tokens: 82
    per_token_duration: 518.244ms
    
    feed_prompt_duration: 8160ms
    prompt_tokens: 18
    predict_duration: 41615ms
    predict_tokens: 82
    per_token_duration: 507.500ms
    
    
    old 13B 12t
            Maximum resident set size (kbytes): 8210536
    feed_prompt_duration: 7813ms
    prompt_tokens: 18
    predict_duration: 71144ms
    predict_tokens: 146
    per_token_duration: 487.288ms
    
    6t
    feed_prompt_duration: 9226ms
    prompt_tokens: 18
    predict_duration: 39793ms
    predict_tokens: 82
    per_token_duration: 485.280ms
    
    ----
    
    new 30B 6t
            Maximum resident set size (kbytes): 20291036
    feed_prompt_duration: 18688ms
    prompt_tokens: 18
    predict_duration: 97053ms
    predict_tokens: 82
    per_token_duration: 1183.573ms
    
    old
            Maximum resident set size (kbytes): 20257344
    feed_prompt_duration: 19693ms
    prompt_tokens: 18
    predict_duration: 93953ms
    predict_tokens: 82
    per_token_duration: 1145.768ms
    
    
    opened by KerfuffleV2 19
  • Partially convert `pth` to `ggml`

    I'm still working on adding the weights to the file; right now it only adds params and tokens (the md5 hash matches the llama.cpp-generated file without the weights). There's also no quantizing yet. I also added generate and convert subcommands to the CLI. Let me know if anything needs changing 🙂

    Partially resolves https://github.com/rustformers/llama-rs/issues/21

    opened by karelnagel 18
  • Embedding extraction

    Implements #56.

    I ported the llama.cpp code to allow extracting word embeddings and logits from a call to evaluate. I validated this using an ad_hoc_test (currently hard-coded in main) and the results seem to make sense: the more similar two words are, the higher the dot product of their embeddings, which is exactly how embeddings should work (see the sketch after this item).

    This serves as a proof of concept, but we need to discuss the API before we can merge. Currently, I added an EvaluateOutputRequest struct so we can expand it in the future, allowing retrieval of other interesting bits of the inference process; these values are not easily obtainable using the regular APIs (i.e. feed_prompt, infer_next_token). I'm not sure if that's a problem: are we OK with users having to drop down to the lower-level evaluate function when they need to retrieve this kind of information?

    On a different note, I would really like for someone with a bit of understanding to validate that the results here are correct. Perhaps @hlhr202 can shed some light there?

    Finally, should we consider exposing this to llama-cli at all?

    opened by setzer22 15
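
    As an illustration of the similarity check described above, here is a self-contained sketch; the vectors are toy values standing in for embeddings returned from evaluate:

    // Toy demonstration: more similar words should give a larger (normalized) dot product.
    fn dot(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }

    fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
        dot(a, b) / (dot(a, a).sqrt() * dot(b, b).sqrt())
    }

    fn main() {
        // Placeholder embeddings; in practice these would come from the model.
        let cat = [0.9_f32, 0.1, 0.3];
        let dog = [0.8_f32, 0.2, 0.25];
        let car = [0.1_f32, 0.9, 0.7];

        println!("cat~dog: {:.3}", cosine_similarity(&cat, &dog)); // expected: high
        println!("cat~car: {:.3}", cosine_similarity(&cat, &car)); // expected: low
    }
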
  • fix(llama): buffer tokens until valid UTF-8

    As discussed on Discord and in #11.

    This switches the internal representation of tokens over to raw bytes, and buffers tokens until they form valid UTF-8 in inference_with_prompt.

    Open questions:

    1. Should we use smallvec or similar for tokens? We're going to be making a lot of unnecessary tiny allocations as-is.
    2. FnMut as a bound is OK, right?
    opened by philpax 12
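
    For reference, the buffering idea described in the previous item reads roughly like this in plain std Rust (a sketch of the approach, not the PR's actual code):

    // Accumulate raw token bytes and only emit the prefix that is already valid UTF-8.
    fn drain_valid_utf8(buffer: &mut Vec<u8>) -> String {
        // Find how much of the buffer is valid UTF-8 so far.
        let valid = match std::str::from_utf8(buffer) {
            Ok(s) => s.len(),
            Err(e) => e.valid_up_to(),
        };
        // Emit the valid prefix; keep any trailing partial code point buffered.
        let out = String::from_utf8(buffer[..valid].to_vec()).expect("prefix is valid UTF-8");
        buffer.drain(..valid);
        out
    }

    fn main() {
        let mut buffer = Vec::new();
        // "é" is 0xC3 0xA9 in UTF-8; pretend the tokenizer splits it across two tokens.
        for token_bytes in [&b"caf"[..], &[0xC3u8][..], &[0xA9u8][..]] {
            buffer.extend_from_slice(token_bytes);
            print!("{}", drain_valid_utf8(&mut buffer));
        }
        println!(); // prints "café" without ever emitting broken UTF-8
    }
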
  • cross build failed since ggml-sys 0.1.0

    Hi, llama-node author here. I'd like to know how to cross-compile ggml-sys. I'm compiling for Apple arm64 on Ubuntu WSL2.

    command is

    cargo build --release --target aarch64-apple-darwin
    

    the printed error

    error: failed to run custom build command for `ggml-sys v0.1.0 (https://github.com/rustformers/llama-rs.git?branch=main#57440bff)`
    
    Caused by:
      process didn't exit successfully: `/home/hlhr202/workspace/llama-node/target/release/build/ggml-sys-fa5fcd66bb36a9eb/build-script-build` (exit status: 1)
      --- stdout
      cargo:rerun-if-changed=ggml
      OPT_LEVEL = Some("3")
      TARGET = Some("aarch64-apple-darwin")
      HOST = Some("x86_64-unknown-linux-gnu")
      cargo:rerun-if-env-changed=CC_aarch64-apple-darwin
      CC_aarch64-apple-darwin = None
      cargo:rerun-if-env-changed=CC_aarch64_apple_darwin
      CC_aarch64_apple_darwin = None
      cargo:rerun-if-env-changed=TARGET_CC
      TARGET_CC = None
      cargo:rerun-if-env-changed=CC
      CC = None
      RUSTC_LINKER = Some("/home/hlhr202/.cache/napi-rs-nodejs/zig-linker-aarch64-apple-darwin.sh")
      cargo:rerun-if-env-changed=CROSS_COMPILE
      CROSS_COMPILE = None
      cargo:rerun-if-env-changed=CFLAGS_aarch64-apple-darwin
      CFLAGS_aarch64-apple-darwin = None
      cargo:rerun-if-env-changed=CFLAGS_aarch64_apple_darwin
      CFLAGS_aarch64_apple_darwin = None
      cargo:rerun-if-env-changed=TARGET_CFLAGS
      TARGET_CFLAGS = None
      cargo:rerun-if-env-changed=CFLAGS
      CFLAGS = None
      cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
      CRATE_CC_NO_DEFAULTS = None
      DEBUG = Some("false")
      CARGO_CFG_TARGET_FEATURE = Some("aes,crc,dit,dotprod,dpb,dpb2,fcma,fhm,flagm,fp16,frintts,jsconv,lor,lse,neon,paca,pacg,pan,pmuv3,ras,rcpc,rcpc2,rdm,sb,sha2,sha3,ssbs,vh")
      cargo:rustc-link-lib=framework=Accelerate
      running: "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-arch" "arm64" "-I" "include" "-mcpu=native" "-pthread" "-DGGML_USE_ACCELERATE" "-DNDEBUG" "-o" "/home/hlhr202/workspace/llama-node/target/aarch64-apple-darwin/release/build/ggml-sys-132533d640c61337/out/ggml/ggml.o" "-c" "ggml/ggml.c"
      cargo:warning=cc: warning: ‘-mcpu=’ is deprecated; use ‘-mtune=’ or ‘-march=’ instead
      cargo:warning=cc: error: unrecognized command-line option ‘-arch’
      exit status: 1
    
      --- stderr
    
    
      error occurred: Command "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-arch" "arm64" "-I" "include" "-mcpu=native" "-pthread" "-DGGML_USE_ACCELERATE" "-DNDEBUG" "-o" "/home/hlhr202/workspace/llama-node/target/aarch64-apple-darwin/release/build/ggml-sys-132533d640c61337/out/ggml/ggml.o" "-c" "ggml/ggml.c" with args "cc" did not execute successfully (status code exit status: 1).
    
    
    warning: build failed, waiting for other jobs to finish...
    You are cross compiling aarch64-apple-darwin on linux host
    Internal Error: Command failed: cargo build --release --target aarch64-apple-darwin
        at checkExecSyncError (node:child_process:828:11)
        at Object.execSync (node:child_process:902:15)
        at BuildCommand.<anonymous> (/home/hlhr202/workspace/llama-node/packages/core/node_modules/@napi-rs/cli/scripts/index.js:11474:30)
        at Generator.next (<anonymous>)
        at fulfilled (/home/hlhr202/workspace/llama-node/packages/core/node_modules/@napi-rs/cli/scripts/index.js:3509:58)
    
    opened by hlhr202 12
  • 30B model doesn't load

    Following the same steps works for the 7B and 13B models; with the 30B model I get

    thread 'main' panicked at 'Could not load model: Tensor tok_embeddings.weight has the wrong size in model file', llama-rs/src/main.rs:39:10
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    
    opened by RCasatta 12
  • Let's collaborate

    [apologies for early send, accidentally hit enter]

    Hey there! Turns out we think on extremely similar wavelengths - I did the exact same thing as you, for the exact same reasons (libraryification), and through the use of similar abstractions: https://github.com/philpax/ggllama

    Couple of differences I spotted on my quick perusal:

    • My version builds on both Windows and Linux, but fails to infer correctly past the first round. Windows performance is also pretty crappy because ggml doesn't support multithreading on Windows.
    • I use PhantomData with the Tensors to prevent them outliving the Context they're spawned from.
    • I vendored llama.cpp in so that I could track it more directly and use its ggml.c/h, and to make it obvious which version I was porting.

    Given yours actually works, I think that it's more promising :p

    What are your immediate plans, and what do you want people to help you out with? My plan was to get it working, then librarify it, make a standalone Discord bot with it as a showcase, and then investigate using a Rust-native solution for the tensor manipulation (burn, ndarray, arrayfire, etc) to free it from the ggml dependency.

    opened by philpax 12
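
    For readers unfamiliar with the PhantomData trick mentioned above, here is a minimal sketch of the pattern (the types are placeholders, not the actual bindings): the marker ties each Tensor's lifetime to the Context it came from, so it cannot outlive the backing allocation.

    use std::marker::PhantomData;

    struct Context {
        // In real bindings this would own the ggml context/allocation.
        _buffer: Vec<u8>,
    }

    struct Tensor<'ctx> {
        // Raw handle into the context's memory (placeholder here).
        _ptr: *mut u8,
        // Zero-sized marker that borrows the context for 'ctx.
        _marker: PhantomData<&'ctx Context>,
    }

    impl Context {
        fn new_tensor(&self) -> Tensor<'_> {
            Tensor {
                _ptr: std::ptr::null_mut(),
                _marker: PhantomData,
            }
        }
    }

    fn main() {
        let ctx = Context { _buffer: vec![0; 16] };
        let tensor = ctx.new_tensor();
        // drop(ctx); // would not compile: `tensor` still borrows `ctx`
        let _ = tensor;
    }
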
  •  llama-cli: Could not load model: InvalidMagic { path: ... }

    The model successfully runs on llama.cpp but not with llama-rs.

    Command:

    cargo run --release -- -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"
    
    PS C:\Users\Usuário\Desktop\llama-rs> cargo run --release -- -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"
        Finished release [optimized] target(s) in 2.83s
         Running `target\release\llama-cli.exe -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"`
    thread 'main' panicked at 'Could not load model: InvalidMagic { path: "C:\\Users\\Usuário\\Downloads\\LLaMA\\7B\\ggml-model-q4_0.bin" }', llama-cli\src\main.rs:147:10
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    error: process didn't exit successfully: `target\release\llama-cli.exe -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"` (exit code: 101)
    
    opened by mguinhos 11
  • Good ideas from llama.cpp

    I've been tracking the llama.cpp repo. I'll use this issue to list any good ideas / things we should be aware of to keep up with in Rust land:

    • [ ] GPTQ quantization :eyes: https://github.com/ggerganov/llama.cpp/issues/9
    • [ ] Not sure how that is even possible (isn't the task I/O bound?), but people are claiming great speedups when loading the model in parallel. This should be pretty easy to implement using rayon (see the sketch after this item). https://github.com/ggerganov/llama.cpp/issues/85#issuecomment-1470814328
    • [ ] Seems there's an issue with the normalization function used. It should be RMSNorm. Would be good to keep an eye on this and simply swap the ggml function once it's implemented on the C++ side :eyes: https://github.com/ggerganov/llama.cpp/issues/173#issuecomment-1470801468
    • [x] It looks like dropping to F16 for the memory_k and memory_v reduces memory usage. It is not known whether this hurts quality, but we should follow the C++ side and add a flag to drop to F16 for the memory. This would also make the cached prompts added as part of #14 take half the size on disk, which is a nice bonus: https://github.com/ggerganov/llama.cpp/pull/154#pullrequestreview-1342214004
    • [x] Looks like the fix from #1 just landed upstream. We should make sure to fix it here too https://github.com/ggerganov/llama.cpp/pull/161
    • [ ] The tokenizer used in llama.cpp has some issues. It would be better to use sentencepiece, which is the one that was used during the original LLaMA training. There seems to be a rust crate for sentencepiece. We should check if a drop-in replacement is possible https://github.com/ggerganov/llama.cpp/issues/167
    enhancement 
    opened by setzer22 11
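
    A sketch of the parallel-loading idea using rayon; load_part and the file names are hypothetical stand-ins for the real loader, so treat this as an illustration of the shape rather than working loader code:

    use rayon::prelude::*;

    fn load_part(path: &str) -> Result<Vec<u8>, std::io::Error> {
        // Placeholder "work": the real loader would parse tensors, not just read bytes.
        std::fs::read(path)
    }

    fn main() -> Result<(), std::io::Error> {
        let part_paths = vec!["model.bin.0", "model.bin.1", "model.bin.2"];

        // Each part is read on a worker thread; results come back in order.
        let parts: Result<Vec<_>, _> = part_paths.par_iter().map(|p| load_part(p)).collect();
        let parts = parts?;

        println!("loaded {} parts", parts.len());
        Ok(())
    }
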
  • Warning: Bad token in vocab at index xxx

    Running cargo run --release -- -m ~/dev/llama.cpp/models/7B/ggml-model-f16.bin -f prompt gives a bunch of "Warning: Bad token in vocab at index..." messages (see attached screenshot).

    The path points to ggml converted llama model, which I have verified that they work with llama.cpp

    bug 
    opened by CheatCod 10
  • Void Linux Build Error

    System Info

    OS: Void Linux (x86_64, glibc), Linux kernel 6.1.21_1
    rustc: rustc 1.68.1 (8460ca823 2023-03-20)
    cargo: cargo 1.68.1 (115f34552 2023-02-26)

    Are the required C compilers present?

    I have clang version 12.0.1, and gcc version 12.2.0 installed.

    What error does the installation give?

    Here's the entire log, uploaded to termbin.com for convenience. This looks like the important bit:

    The following warnings were emitted during compilation:
    
    warning: In file included from /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/immintrin.h:107,
    warning:                  from ggml/ggml.c:155:
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h: In function 'ggml_vec_dot_f16':
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
    warning:    52 | _mm256_cvtph_ps (__m128i __A)
    warning:       | ^~~~~~~~~~~~~~~
    warning: ggml/ggml.c:916:33: note: called from here
    warning:   916 | #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
    warning:       |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
    warning:   926 | #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
    warning:       |                                     ^~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:1279:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
    warning:  1279 |             ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
    warning:       |                     ^~~~~~~~~~~~~~~~~
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
    warning:    52 | _mm256_cvtph_ps (__m128i __A)
    warning:       | ^~~~~~~~~~~~~~~
    warning: ggml/ggml.c:916:33: note: called from here
    warning:   916 | #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
    warning:       |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
    warning:   926 | #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
    warning:       |                                     ^~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:1278:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
    warning:  1278 |             ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
    warning:       |                     ^~~~~~~~~~~~~~~~~
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
    warning:    52 | _mm256_cvtph_ps (__m128i __A)
    warning:       | ^~~~~~~~~~~~~~~
    warning: ggml/ggml.c:916:33: note: called from here
    warning:   916 | #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
    warning:       |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
    warning:   926 | #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
    warning:       |                                     ^~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:1278:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
    warning:  1278 |             ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
    warning:       |                     ^~~~~~~~~~~~~~~~~
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
    warning:    52 | _mm256_cvtph_ps (__m128i __A)
    warning:       | ^~~~~~~~~~~~~~~
    warning: ggml/ggml.c:916:33: note: called from here
    warning:   916 | #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
    warning:       |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
    warning:   926 | #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
    warning:       |                                     ^~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:1279:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
    warning:  1279 |             ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
    warning:       |                     ^~~~~~~~~~~~~~~~~
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
    warning:    52 | _mm256_cvtph_ps (__m128i __A)
    warning:       | ^~~~~~~~~~~~~~~
    warning: ggml/ggml.c:916:33: note: called from here
    warning:   916 | #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
    warning:       |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
    warning:   926 | #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
    warning:       |                                     ^~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:1278:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
    warning:  1278 |             ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);
    warning:       |                     ^~~~~~~~~~~~~~~~~
    warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
    warning:    52 | _mm256_cvtph_ps (__m128i __A)
    warning:       | ^~~~~~~~~~~~~~~
    warning: ggml/ggml.c:916:33: note: called from here
    warning:   916 | #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
    warning:       |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
    warning:   926 | #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
    warning:       |                                     ^~~~~~~~~~~~~~~~
    warning: ggml/ggml.c:1279:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
    warning:  1279 |             ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
    warning:       |                     ^~~~~~~~~~~~~~~~~
    
    error: failed to run custom build command for `ggml-raw v0.1.0 (/home/outsider/.cargo/git/checkouts/llama-rs-962022d29f37c95e/a067431/ggml-raw)`
    

    Should this be reported to the base repository, ggml, as it's a compilation error for that project, or is it still relevant here?

    opened by 13-05 9
  • Support for Dolly models

    There are some pretty fresh models available from Databricks, called Dolly, trained using GPT-J. I've found one converted to GGML and quantized, but it fails to run with llama-rs. It'd be nice to add support for those models, as they seem promising for commercial usage.

    Currently, I'm getting this error using main branch:

    ❯ RUST_BACKTRACE=1 cargo r --bin llama-cli --release -- infer -m ~/dev/llm/int4_fixed_zero.bin -p "Tell me about yourself"
        Finished release [optimized] target(s) in 0.05s
         Running `target/release/llama-cli infer -m /dev/llm/int4_fixed_zero.bin -p 'Tell me about yourself'`
    [2023-04-16T16:03:16Z INFO  llama_cli::cli_args] ggml ctx size = 6316.05 MB
    
    [2023-04-16T16:03:16Z INFO  llama_cli::cli_args] Loading model part 1/1 from '/dev/llm/int4_fixed_zero.bin'
    
    thread 'main' panicked at 'Could not load model: UnknownTensor { tensor_name: "gpt_neox.embed_in.weight", path: "/dev/llm/int4_fixed_zero.bin" }', llama-cli/src/cli_args.rs:311:10
    stack backtrace:
       0: rust_begin_unwind
                 at /rustc/9aa5c24b7d763fb98d998819571128ff2eb8a3ca/library/std/src/panicking.rs:575:5
       1: core::panicking::panic_fmt
                 at /rustc/9aa5c24b7d763fb98d998819571128ff2eb8a3ca/library/core/src/panicking.rs:64:14
       2: core::result::unwrap_failed
                 at /rustc/9aa5c24b7d763fb98d998819571128ff2eb8a3ca/library/core/src/result.rs:1790:5
       3: llama_cli::cli_args::ModelLoad::load
       4: llama_cli::main
    note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
    
    

    Upstream discussion: https://github.com/ggerganov/llama.cpp/discussions/569
    Dolly GGML Q4 model: https://huggingface.co/snphs/dolly-v2-12b-q4
    Databricks announcement: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

    opened by vv9k 1
  • Build and execute our own computation graph

    At present, we are using GGML's computation graph. This works well, but it has a few flaws:

    1. We're reliant on whatever support GGML has for threading; the Rust threading ecosystem is more versatile/OS-agnostic
    2. Adding new operations requires patching GGML
    3. We're coupled pretty tightly to GGML, so switching to an alternate backend would be quite difficult; this will only get worse as we support more models
    4. Abstraction of shared pieces of functionality gets a little finicky with the exposed API

    After reading https://github.com/ggerganov/llama.cpp/discussions/915, I had a flash of inspiration and realised we could address these problems by using our own computation graph.

    The code would be fairly similar to what it is now - but instead of building up a GGML computation graph, we build up our own in Rust code with all of the usual strong-typing guarantees.

    To begin with, this computation graph would then be "compiled" to a GGML computation graph, so that it works identically.

    Once that's done, we would look at reimplementing the actual execution of the graph in Rust code and using GGML's operations to do so (e.g. we use its vec_dot_q4_0, etc).

    This would allow us to decouple from GGML in the future (#3), and gives us freedom to implement new operations that aren't supported by GGML without having to maintain our own patched version.

    Ideally, we would just use burn or something similar directly, but none of the existing libraries are in a position to serve our needs (GGML-like performance with quantization support). This lets us side-step that issue for now, and focus on describing models that could be executed by anything once support is available.


    Constructing our own computation graph and compiling it to GGML should be fairly simple (this could be done with petgraph or our own graph implementation, it's not that difficult).

    The main problem comes in the executor reimplementation - a lot of GGML's more complex operations are coupled to the executor, so we'd have to reimplement them (e.g. all the ggml_compute_forward_... functions). Additionally, a lot of the base operations are static void and not exposed to the outside world, so it's likely we'd have to patch GGML anyway.

    An alternate approach to full graph reimplementation might be to add support for custom elementwise operations once (as @KerfuffleV2 has done in their fork), so that we can polyfill custom operations from our computation graph.

    enhancement maintenance 
    opened by philpax 2
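
    To make the "compile our own graph to GGML" idea above concrete, here is a rough std-only sketch (no petgraph, all names invented for illustration). Ops can only refer to earlier nodes, so insertion order is already a topological order and a backend can walk it directly:

    // A tiny strongly-typed computation graph that a backend could "compile".
    type NodeId = usize;

    #[derive(Clone, Copy, Debug)]
    enum Op {
        Input,
        Add(NodeId, NodeId),
        MatMul(NodeId, NodeId),
    }

    #[derive(Default)]
    struct Graph {
        nodes: Vec<Op>,
    }

    impl Graph {
        fn push(&mut self, op: Op) -> NodeId {
            self.nodes.push(op);
            self.nodes.len() - 1
        }

        // Walk the nodes in insertion (topological) order and hand each op to a backend.
        // A real implementation would emit ggml_add/ggml_mul_mat calls instead of printing.
        fn compile(&self, mut emit: impl FnMut(NodeId, &Op)) {
            for (id, op) in self.nodes.iter().enumerate() {
                emit(id, op);
            }
        }
    }

    fn main() {
        let mut g = Graph::default();
        let x = g.push(Op::Input);
        let w = g.push(Op::Input);
        let h = g.push(Op::MatMul(w, x));
        let y = g.push(Op::Add(h, x));

        g.compile(|id, op| println!("node {id}: {op:?}"));
        let _ = y;
    }
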
  • Rename to `llm`

    With our (hopefully) incoming support for #85 and #75, this crate is growing beyond just LLaMA support. At the same time, llama_rs is a little unwieldy as a crate name.

    To accommodate this, I've taken the liberty of reserving the llm crate name. Just before we release 0.1.0, llama-rs will be renamed to llm, and llama-cli will become llm-cli (but likely have the binary name llm, so you can do llm bloom infer -m bloom_model -p "what's the capital of Virginia?").

    Each model would be under its own feature flag, so you would only pay for what you use.

    enhancement maintenance 
    opened by philpax 5
  • perf: add benchmarks

    https://github.com/rustformers/llama-rs/issues/4 mentions doing benchmarking. We should add some microbenchmarks to show the words per second vs. llama.cpp. In fact, we should use a similar benchmarking methodology to llama.cpp to get an apples-to-apples comparison.

    opened by jon-chuang 0
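
    As a starting point, a microbenchmark along these lines could live under benches/ using the Criterion crate. The dummy_feed_prompt function here is a stand-in for the real llama-rs call, so the numbers it produces mean nothing until it is wired up to actual model evaluation:

    use criterion::{criterion_group, criterion_main, Criterion};

    fn dummy_feed_prompt(prompt: &str) -> usize {
        // Placeholder "work": a real benchmark would feed the prompt to the model here.
        prompt.split_whitespace().count()
    }

    fn bench_feed_prompt(c: &mut Criterion) {
        c.bench_function("feed_prompt (short prompt)", |b| {
            b.iter(|| {
                dummy_feed_prompt(criterion::black_box(
                    "Tell me how cool the Rust programming language is:",
                ))
            })
        });
    }

    criterion_group!(benches, bench_feed_prompt);
    criterion_main!(benches);
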
  • feat: mmapped ggjt loader

    Fixes the issues in https://github.com/rustformers/llama-rs/pull/125

    Improvements:

    • Loading 7B Vicuna (q4):
      • default: warm start: 1785ms, cold start: 2618ms
      • --features="mmap": warm start: 7ms, cold start: 38ms
    • Loading 13B Vicuna (q4):
      • default: warm start: 4833ms, cold start: 5905ms
      • --features="mmap": warm start: 9ms, cold start: 33ms

    So we get a 250X-500X speedup! Higher than the advertised 10-100x :)

    @iacore

    opened by jon-chuang 7
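
    For readers wondering where the speedup reported above comes from: mapping the file defers reading tensor data until it is actually touched, instead of copying everything up front. A minimal sketch using the memmap2 crate (the file name is a placeholder and error handling is kept minimal):

    use std::fs::File;

    use memmap2::Mmap;

    fn main() -> std::io::Result<()> {
        let file = File::open("ggml-model-q4_0.bin")?;
        // Safety: the file must not be truncated by another process while mapped.
        let mmap = unsafe { Mmap::map(&file)? };

        // Tensor data can now be borrowed straight out of the mapping; nothing is
        // copied up front, which is why loading appears near-instant.
        println!("mapped {} bytes, first byte: {:#x}", mmap.len(), mmap[0]);
        Ok(())
    }
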