OpenAI-compatible API for serving a LLaMA-2 model

Overview

Cria - Local llama OpenAI-compatible API

The objective is to serve a local LLaMA-2 model by mimicking the OpenAI API. The model runs on GPU through the ggml-sys crate, built with specific compilation flags.

Quickstart:

  1. Clone the project and its submodules

    git clone [email protected]:AmineDiro/cria.git
    cd cria/
    git submodule update --init --recursive
  2. Build the project (I ❤️ cargo!).

    cargo b --release
    • For cuBLAS (NVIDIA GPU) acceleration, use
      cargo b --release --features cublas
    • For Metal acceleration, use
      cargo b --release --features metal

      ❗ NOTE: If you have issues building for GPU, check out the Building with GPU issues section below.

  3. Download a quantized GGML .bin LLaMA-2 model (for example llama-2-7b)

  4. Run the API; use the --use-gpu flag to offload model layers to your GPU. A quick smoke test is sketched after this list.

    ./target/release/cria llama-2 {MODEL_BIN_PATH} --use-gpu --gpu-layers 32
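
Once the server is running, you can do a quick smoke test with any HTTP client. A minimal sketch using urllib3; it assumes the server listens on localhost:3000 (as in the completion example below) and that the /v1/models route listed in the roadmap further down is available in your build:

    import urllib3

    # Quick smoke test: assumes the server listens on localhost:3000 and that the
    # /v1/models route from the roadmap is available in this build.
    http = urllib3.PoolManager()
    resp = http.request("GET", "http://localhost:3000/v1/models")
    print(resp.status)
    print(resp.data.decode())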

Completion Example

You can use the openai Python client, or call the API directly and stream messages with the sseclient Python library. Here is an example using sseclient and urllib3:
import json
import sys
import time

import sseclient
import urllib3

url = "http://localhost:3000/v1/completions"


http = urllib3.PoolManager()
response = http.request(
    "POST",
    url,
    preload_content=False,
    headers={
        "Content-Type": "application/json",
    },
    body=json.dumps(
        {
            "prompt": "Morocco is a beautiful country situated in north africa.",
            "temperature": 0.1,
        }
    ),
)

client = sseclient.SSEClient(response)

s = time.perf_counter()
for event in client.events():
    chunk = json.loads(event.data)
    sys.stdout.write(chunk["choices"][0]["text"])
    sys.stdout.flush()
e = time.perf_counter()

print(f"\nGeneration from completion took {e-s:.2f} !")

You can clearly see the generation speed when running on my M1 GPU.
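
Since the API mimics OpenAI's, you can also point the official openai Python client at the local server. A minimal sketch, assuming openai>=1.0; the model name and API key below are placeholders (the served model is chosen at server startup, and as far as this README shows, credentials are not checked):

    from openai import OpenAI

    # Point the official client at the local cria server.
    client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")  # dummy key (assumption)

    stream = client.completions.create(
        model="llama-2",  # placeholder; the served model is selected when cria starts
        prompt="Morocco is a beautiful country situated in north africa.",
        temperature=0.1,
        stream=True,
    )
    for chunk in stream:
        print(chunk.choices[0].text, end="", flush=True)
    print()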

Building with GPU issues

I had some issues compiling the llm crate with CUDA support for my RTX 2070 Super (Turing architecture). After some debugging, I found I needed to pass nvcc the correct gpu-architecture version, since for now the ggml-sys crate only generates code for compute_52 and compute_61. Here is the set of changes to build.rs:

diff --git a/crates/ggml/sys/build.rs b/crates/ggml/sys/build.rs
index 3a6e841..ef1e1b0 100644
--- a/crates/ggml/sys/build.rs
+++ b/crates/ggml/sys/build.rs
@@ -330,8 +330,9 @@ fn enable_cublas(build: &mut cc::Build, out_dir: &Path) {
             .arg("--compile")
             .arg("-cudart")
             .arg("static")
-            .arg("--generate-code=arch=compute_52,code=[compute_52,sm_52]")
-            .arg("--generate-code=arch=compute_61,code=[compute_61,sm_61]")
+            .arg("--generate-code=arch=compute_75,code=[compute_75,sm_75]")
             .arg("-D_WINDOWS")
             .arg("-DNDEBUG")
             .arg("-DGGML_USE_CUBLAS")
@@ -361,8 +362,7 @@ fn enable_cublas(build: &mut cc::Build, out_dir: &Path) {
             .arg("-Illama-cpp/include/ggml")
             .arg("-mtune=native")
             .arg("-pthread")
-            .arg("--generate-code=arch=compute_52,code=[compute_52,sm_52]")
-            .arg("--generate-code=arch=compute_61,code=[compute_61,sm_61]")
+            .arg("--generate-code=arch=compute_75,code=[compute_75,sm_75]")
             .arg("-DGGML_USE_CUBLAS")
             .arg("-I/usr/local/cuda/include")
             .arg("-I/opt/cuda/include")

The only thing left to do is to update the Cargo.toml file accordingly.
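
To find the right compute_XX / sm_XX value for your own card, you can ask the driver for the GPU's compute capability. A small sketch (the compute_cap query field needs a reasonably recent NVIDIA driver); for example, an RTX 2070 Super (Turing) reports 7.5, hence compute_75/sm_75 above:

    import subprocess

    # Print the name and CUDA compute capability of each visible GPU,
    # e.g. "NVIDIA GeForce RTX 2070 SUPER, 7.5" -> use compute_75 / sm_75.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    print(out.stdout.strip())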

TODO/ Roadmap:

  • Run Llama.cpp on CPU using llm-chain
  • Run Llama.cpp on GPU using llm-chain
  • Implement /models route
  • Implement basic /completions route
  • Implement streaming completions SSE
  • Clean up cargo features with llm
  • Support macOS Metal
  • Merge completions / completion_streaming routes in same endpoint
  • Implement /embeddings route
  • Setup good tracing
  • Better errors
  • Implement route /chat/completions
  • Implement streaming chat completions SSE
  • Metrics ??
  • Batching requests (à la io_uring), sketched below:
    • For each request, put an entry in a ring-buffer queue: Entry(flume mpsc (resp_rx, resp_tx))
    • Spawn the model in a separate task that reads from the ring buffer, takes an entry, and pushes each generated token into its response channel
    • Construct a stream from the flume resp_rx channel and return SSE(stream) to the user.
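
To make the batching idea above concrete, here is an illustrative Python asyncio sketch of the same pattern (the actual implementation would live in Rust with flume channels and an SSE stream); every name here is hypothetical:

    import asyncio

    def fake_generate(prompt: str):
        # Stand-in for llama inference: just yields the prompt's words as "tokens".
        yield from (prompt.split() or ["<empty>"])

    async def model_worker(requests: asyncio.Queue) -> None:
        # Single task owning the model: drain the queue, push tokens into each
        # request's own response channel, then signal end-of-stream with None.
        while True:
            prompt, resp_tx = await requests.get()
            for token in fake_generate(prompt):
                await resp_tx.put(token)
            await resp_tx.put(None)

    async def handle_request(requests: asyncio.Queue, prompt: str) -> None:
        # Per-request channel, playing the role of the flume (resp_tx, resp_rx) pair.
        resp_rx: asyncio.Queue = asyncio.Queue()
        await requests.put((prompt, resp_rx))
        while (token := await resp_rx.get()) is not None:
            print(token, end=" ", flush=True)  # in the server this would be an SSE event
        print()

    async def main() -> None:
        requests = asyncio.Queue(maxsize=64)  # bounded queue standing in for the ring buffer
        asyncio.create_task(model_worker(requests))
        await handle_request(requests, "Morocco is a beautiful country")

    asyncio.run(main())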

Routes

The server exposes OpenAI-style routes under /v1; the roadmap above targets /models, /completions (with SSE streaming), /chat/completions, and /embeddings.


Comments
  • "git submodule update --init --recursive" requires ssh

    I am trying to get this running in Docker

    Here is my current build

    FROM ubuntu:latest
    
    RUN apt-get update && apt-get install -y \
        sudo 
    
    RUN apt-get update && \
        apt-get install -y \
        curl \
        build-essential \
        libssl-dev \
        pkg-config
    
    
    # Install Rust and Cargo
    RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
    ENV PATH="/root/.cargo/bin:${PATH}"
    
    
    RUN yes | sudo apt install git 
    
    RUN git clone https://github.com/AmineDiro/cria.git
    WORKDIR /cria
    #RUN yes | git submodule update --init --recursive
    
    # Build the project using Cargo in release mode
    #RUN cargo build --release
    
    #COPY ggml-model-q4_0.bin .
    

    Unfortunately, it seems like the git submodule update --init --recursive command is somehow accessing the repo via SSH.

    Have you considered maybe building this in a way that the dependencies can be installed without ssh?


    opened by snakewizardd 5
  • Missing LICENSE

    I see you have no LICENSE file for this project. The default is copyright.

    I would suggest releasing the code under the GPL-3.0-or-later or AGPL-3.0-or-later license so that others are encouraged to contribute changes back to your project.

    opened by TechnologyClassroom 2
  • llama-2 is not a supported architecture

    I'm getting this error when trying to run on MacOS:

    error: invalid value 'llama-2' for '<MODEL_ARCHITECTURE>': llama-2 is not one of supported model architectures: [Bloom, Gpt2, GptJ, GptNeoX, Llama, Mpt]
    

    If I use LLama instead, it crashes (as it probably should)

    GGML_ASSERT: llama-cpp/ggml.c:6192: ggml_nelements(a) == ne0*ne1*ne2
    fish: Job 1, 'target/release/cria Llama ../ll…' terminated by signal SIGABRT (Abort)
    
    bug 
    opened by undefdev 6