candle-vllm

Efficient platform for inference and serving local LLMs including an OpenAI compatible API server.

Features

  • OpenAI compatible API server for serving LLMs.
  • Highly extensible trait-based system enabling rapid implementation of new model pipelines.
  • Streaming support during generation.
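As a hedged illustration of the first feature, the sketch below builds the JSON body of an OpenAI-style chat-completions request. The field names follow OpenAI's convention; the host, port, and endpoint path in the comment are assumptions for a local deployment, not confirmed candle-vllm defaults.

```rust
/// Build an OpenAI-style chat-completions request body by hand.
/// (A real client would use a JSON library; this keeps the sketch std-only.)
fn chat_request_body(model: &str, prompt: &str, n: u32, stream: bool) -> String {
    format!(
        r#"{{"model":"{}","messages":[{{"role":"user","content":"{}"}}],"n":{},"stream":{}}}"#,
        model, prompt, n, stream
    )
}

fn main() {
    // POST this body to a locally served endpoint, e.g.
    // http://localhost:2000/v1/chat/completions (host/port are assumptions).
    let body = chat_request_body("llama-7b", "Hello!", 1, false);
    println!("{body}");
}
```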

Overview

One of the goals of candle-vllm is to serve locally hosted LLMs through an OpenAI compatible API server.

  1. During initial setup, the model, tokenizer and other parameters are loaded.

  2. When a request is received:

     1. Sampling parameters are extracted, including n, the number of choices to generate.

     2. The request is converted to a prompt, which is sent to the model pipeline.

        • If a streaming request is received, token-by-token streaming using SSEs is established (n choices of 1 token each).
        • Otherwise, n choices are generated in full and returned.
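The steps above can be sketched as follows; the type and function names here are hypothetical stand-ins for candle-vllm's real pipeline, not its actual API.

```rust
// Hypothetical sampling parameters extracted from a request.
struct SamplingParams {
    n: usize, // number of choices to generate
}

// Convert a request's chat messages into a single prompt string.
// Real pipelines apply a model-specific chat template here.
fn to_prompt(messages: &[(&str, &str)]) -> String {
    messages
        .iter()
        .map(|(role, content)| format!("{role}: {content}\n"))
        .collect()
}

// Generate `n` choices for the prompt. Placeholder only: a real pipeline
// runs the model (and a streaming request would emit one token per SSE event).
fn generate(prompt: &str, params: &SamplingParams) -> Vec<String> {
    (0..params.n)
        .map(|i| format!("[choice {i}] {prompt}"))
        .collect()
}

fn main() {
    let params = SamplingParams { n: 2 };
    let prompt = to_prompt(&[("user", "Hi")]);
    let choices = generate(&prompt, &params);
    assert_eq!(choices.len(), 2);
}
```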

Contributing

The following features are planned; contributions implementing them are especially welcome:

  • Sampling methods
  • Pipeline batching (#3)
  • KV cache (#3)
  • PagedAttention (#3)
  • More pipelines (from candle-transformers)

Resources


Comments
  • Can the architectural design be improved?

    Thanks for your candle-vllm.

    I have my own LLM and tokenizer with different prompt generation, and I want to serve it through an OpenAI compatible API. I also don't want to modify the code of candle-vllm; I want to just use it as a library.

    But looking at your design now, it seems like you want to do everything yourself: https://github.com/EricLBuehler/candle-vllm/blob/cbfd463d72709ab70a360d07cc17be543e56ebd2/src/openai/conversation.rs#L16-L20

    I wonder if model, tokenizer and prompt generation can be abstracted into traits?
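    A sketch of what that abstraction might look like; the trait and type names here are hypothetical, not candle-vllm's actual API:

    ```rust
    // Hypothetical trait split: each pluggable piece of the pipeline
    // behind its own trait, so a library user can supply their own
    // prompt generation and tokenizer.
    trait PromptBuilder {
        fn build(&self, messages: &[(String, String)]) -> String;
    }

    trait Tokenizer {
        fn encode(&self, text: &str) -> Vec<u32>;
    }

    // A default implementation a library user could replace.
    struct SimplePrompt;

    impl PromptBuilder for SimplePrompt {
        fn build(&self, messages: &[(String, String)]) -> String {
            messages
                .iter()
                .map(|(role, content)| format!("{role}: {content}\n"))
                .collect()
        }
    }

    fn main() {
        let builder = SimplePrompt;
        let prompt = builder.build(&[("user".to_string(), "Hi".to_string())]);
        assert_eq!(prompt, "user: Hi\n");
    }
    ```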

    enhancement 
    opened by mokeyish 9
  • Support streaming of tokens

    I noticed that OpenAI's stream=True could be perhaps implemented here:

    https://github.com/EricLBuehler/candle-vllm/blob/bc31515011147bea52f854f411627f21e6261a1f/src/openai/pipelines/llama.rs#L248-L260
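    A minimal sketch of the SSE framing such streaming would use, following the `data: <payload>\n\n` wire format; the payload here is a bare token rather than OpenAI's full chunk object:

    ```rust
    // Frame one token as a server-sent event.
    fn sse_event(token: &str) -> String {
        format!("data: {token}\n\n")
    }

    // Frame a whole token stream, ending with OpenAI's [DONE] sentinel.
    fn stream_tokens(tokens: &[&str]) -> String {
        let mut out: String = tokens.iter().map(|t| sse_event(t)).collect();
        out.push_str("data: [DONE]\n\n");
        out
    }

    fn main() {
        let s = stream_tokens(&["Hello", "world"]);
        assert!(s.starts_with("data: Hello\n\n"));
        assert!(s.ends_with("data: [DONE]\n\n"));
    }
    ```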

    enhancement 
    opened by michaelfeil 1
  • Batching and VLLM-style kv caching missing

    Your implementation is looking great so far.

    I got a bit confused, as with the name vllm, I would have expected that two features are implemented:

    • [ ] support batching, in the pipeline
    • [ ] support vllm style kv-caching using paged-attention

    Is there a plan to support them?
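    For readers unfamiliar with the request, a toy sketch of what a KV cache does; the names are hypothetical, and real vLLM-style paged attention stores these entries in fixed-size blocks rather than growable vectors:

    ```rust
    // Toy per-layer KV cache: each decode step appends that step's key and
    // value projections, so later steps attend over the whole prefix
    // without recomputing it.
    struct KvCache {
        keys: Vec<Vec<f32>>,
        values: Vec<Vec<f32>>,
    }

    impl KvCache {
        fn new() -> Self {
            Self { keys: Vec::new(), values: Vec::new() }
        }

        fn append(&mut self, k: Vec<f32>, v: Vec<f32>) {
            self.keys.push(k);
            self.values.push(v);
        }

        // Number of cached positions (the sequence length so far).
        fn seq_len(&self) -> usize {
            self.keys.len()
        }
    }

    fn main() {
        let mut cache = KvCache::new();
        cache.append(vec![0.1, 0.2], vec![0.3, 0.4]);
        cache.append(vec![0.5, 0.6], vec![0.7, 0.8]);
        assert_eq!(cache.seq_len(), 2);
    }
    ```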

    enhancement 
    opened by michaelfeil 5