Efficient platform for inference and serving local LLMs including an OpenAI compatible API server.

Eric Buehler

Last update: Nov 15, 2023

Overview

candle-vllm

Efficient platform for inference and serving local LLMs including an OpenAI compatible API server.

Features

OpenAI compatible API server provided for serving LLMs.
Highly extensible trait-based system to allow rapid implementation of new module pipelines.
Streaming support in generation.
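
To illustrate what a trait-based pipeline system can look like, here is a minimal sketch. The names `TextPipeline`, `EchoPipeline` and `serve_one` are hypothetical and are not taken from the candle-vllm source; the point is only that serving code written against a trait does not change when a new pipeline is added.

```rust
/// Hypothetical trait (not candle-vllm's actual API): anything that can
/// turn a prompt into generated text.
trait TextPipeline {
    /// Short name used to identify the pipeline.
    fn name(&self) -> &str;
    /// Generate at most `max_tokens` tokens for `prompt`.
    fn generate(&self, prompt: &str, max_tokens: usize) -> String;
}

/// Toy pipeline standing in for a real model: it echoes back the first
/// `max_tokens` whitespace-separated words of the prompt.
struct EchoPipeline;

impl TextPipeline for EchoPipeline {
    fn name(&self) -> &str {
        "echo"
    }

    fn generate(&self, prompt: &str, max_tokens: usize) -> String {
        prompt
            .split_whitespace()
            .take(max_tokens)
            .collect::<Vec<_>>()
            .join(" ")
    }
}

/// Serving code depends only on the trait, so any new pipeline that
/// implements `TextPipeline` can be passed in unchanged.
fn serve_one(pipeline: &dyn TextPipeline, prompt: &str) -> String {
    let text = pipeline.generate(prompt, 3);
    format!("[{}] {}", pipeline.name(), text)
}
```

Calling `serve_one(&EchoPipeline, "hello from a local model")` returns `"[echo] hello from a"`. Supporting another model would mean adding one more `impl TextPipeline` block.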

Overview

One of the goals of candle-vllm is to interface with locally served LLMs through an OpenAI compatible API server.

During initial setup, the model, tokenizer and other parameters are loaded.
When a request is received:
Sampling parameters are extracted, including n, the number of choices to generate.
The request is converted to a prompt, which is sent to the model pipeline.
- If a streaming request is received, token-by-token streaming using SSEs (server-sent events) is established (n choices of 1 token).
- Otherwise, n choices are generated and returned.
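
The request flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: `Request`, `Response`, `next_token`, `sse_frame` and `handle` are hypothetical names, and `next_token` is a stand-in for the real model pipeline. The only externally grounded detail is the SSE wire format, where each event is a `data:` line followed by a blank line.

```rust
/// Hypothetical, simplified view of an incoming completion request.
struct Request {
    prompt: String,
    /// Number of choices to generate.
    n: usize,
    /// Whether the client asked for a streamed response.
    stream: bool,
}

/// What the handler produces.
enum Response {
    /// Streaming: one SSE frame per step, each carrying n one-token choices.
    Events(Vec<String>),
    /// Non-streaming: n complete choices.
    Choices(Vec<String>),
}

/// Stand-in for the model pipeline (assumption): returns a fake token for
/// the given choice index and generation step.
fn next_token(_prompt: &str, choice: usize, step: usize) -> String {
    format!("c{}t{}", choice, step)
}

/// Frame a payload as a single server-sent event.
fn sse_frame(payload: &str) -> String {
    format!("data: {}\n\n", payload)
}

/// Branch on `stream`, generating `max_tokens` steps either way.
fn handle(req: &Request, max_tokens: usize) -> Response {
    if req.stream {
        // One event per step; each event holds one token for every choice.
        let events: Vec<String> = (0..max_tokens)
            .map(|step| {
                let tokens: Vec<String> = (0..req.n)
                    .map(|choice| next_token(&req.prompt, choice, step))
                    .collect();
                sse_frame(&tokens.join(","))
            })
            .collect();
        Response::Events(events)
    } else {
        // Generate each choice in full, then return all of them together.
        let choices: Vec<String> = (0..req.n)
            .map(|choice| {
                (0..max_tokens)
                    .map(|step| next_token(&req.prompt, choice, step))
                    .collect::<Vec<_>>()
                    .join(" ")
            })
            .collect();
        Response::Choices(choices)
    }
}
```

A real server would serialize each event payload as JSON in the OpenAI response schema rather than as a comma-separated string; the comma-joined payload here only keeps the sketch free of dependencies.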

Contributing

The following features are planned, and contributions are especially welcome:

Sampling methods:
- Beam search (huggingface/candle#1319)
- presence_penalty and frequency_penalty
Pipeline batching (#3)
KV cache (#3)
PagedAttention (#3)
More pipelines (from candle-transformers)
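
For anyone considering the presence_penalty and frequency_penalty item, the sketch below shows the adjustment as the OpenAI API documentation describes it: a token's logit is reduced by frequency_penalty times the number of times it has already been generated, plus presence_penalty if it has appeared at all. The function name `apply_penalties` and its signature are assumptions made for illustration and do not reflect how candle-vllm implements or plans to implement this.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of candle-vllm): adjust `logits` in place
/// based on the token ids generated so far.
///
/// For each token id that appears `count` times in `generated`:
///   logit -= frequency_penalty * count + presence_penalty
/// Tokens that have not been generated are left untouched.
fn apply_penalties(
    logits: &mut [f32],
    generated: &[usize],
    presence_penalty: f32,
    frequency_penalty: f32,
) {
    // Count how often each token id has been generated.
    let mut counts: HashMap<usize, u32> = HashMap::new();
    for &token in generated {
        *counts.entry(token).or_insert(0) += 1;
    }

    for (&token, &count) in &counts {
        // Token ids outside the vocabulary are ignored rather than panicking.
        if let Some(logit) = logits.get_mut(token) {
            *logit -= frequency_penalty * count as f32 + presence_penalty;
        }
    }
}
```

With logits `[1.0, 1.0, 1.0]`, generated tokens `[0, 0, 2]`, a presence penalty of 0.5 and a frequency penalty of 0.25, token 0 is reduced by 0.25 * 2 + 0.5 and token 2 by 0.25 + 0.5, while token 1 is unchanged.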

Resources

Python implementation: vllm-project
vllm paper

Comments

Can the architectural design be improved?

Thanks for your candle-vllm.

I have my own LLM and tokenizer with different prompt generation, and I want to serve it through an OpenAI compatible API. I also don't want to modify the candle-vllm code; I just want to use it as a library.

But now, looking at your design, it seems like you want to do everything by yourself. https://github.com/EricLBuehler/candle-vllm/blob/cbfd463d72709ab70a360d07cc17be543e56ebd2/src/openai/conversation.rs#L16-L20

I wonder if model, tokenizer and prompt generation can be abstracted into traits?
enhancement

opened by mokeyish 9

Support streaming of tokens

I noticed that OpenAI's stream=True could perhaps be implemented here:

https://github.com/EricLBuehler/candle-vllm/blob/bc31515011147bea52f854f411627f21e6261a1f/src/openai/pipelines/llama.rs#L248-L260
enhancement

opened by michaelfeil 1

Batching and VLLM-style kv caching missing

Your implementation is looking great so far.

I got a bit confused, as with the name vllm, I would have expected that two features are implemented:

[ ] support batching, in the pipeline

[ ] support vllm style kv-caching using paged-attention

Is there a plan to support them?
enhancement
opened by michaelfeil 5