A single-binary, GPU-accelerated LLM server (HTTP and WebSocket API) written in Rust

Related tags

Geospatial poly
Overview

Poly

Poly is a versatile LLM serving back-end. What it offers:

  • High-performance, efficient and reliable serving of multiple local LLM models
  • Optional GPU acceleration through either CUDA or Metal
  • Configurable LLM completion tasks (prompts, recall, stop tokens, etc.)
  • Streaming completion responses through HTTP SSE, chat using WebSockets
  • Biased sampling of completion output using JSON schema
  • Memory retrieval using vector databases (either built-in file based, or external such as Qdrant)
  • Accepts and automatically chunks PDF and DOCX files for storage to memory
  • API secured using either static API keys or JWT tokens
  • Simple, single binary + config file server deployment, horizontally scalable

Nice extras:

  • A web client to easily test and fine-tune configuration
  • A single-binary cross platform desktop client for locally running models

Supported models include:

  • LLaMa and derivatives (Alpaca, Vicuna, Guanaco, etc.)
  • LLaMA2
  • RedPajamas
  • MPT
  • Orca-mini
Web client Desktop app
Web client demonstration Desktop client demonstration

Sample of a model+task+memory configuration:

[models.gpt2dutch]
model_path = "./data/gpt2-small-dutch-f16.bin"
architecture = "gpt2"
use_gpu = false

[memories.dutch_qdrant]
store = { qdrant = { url = "http://127.0.0.1:6333", collection = "nl" } }
dimensions = 768
embedding_model = "gpt2dutch"

[tasks.dutch_completion]
model = "gpt2dutch"
prelude = "### System:\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\n"
prefix =  "\n### User:\n"
postfix = "\n### Response:"
memorization = { memory = "dutch_qdrant", retrieve = 2 }

See config.example.toml for more example configurations.

Custom samplers can be configured using a string-based description, see here. Any biaser (i.e. JSON biaser) is injected as first sampler in the chain.

Concepts

In Poly, models are LLM models that support basic text generation and embedding operations. Models can be run on the GPU and have specific context lengths, but are otherwise unconfigurable.

A task uses a model in a specific way (i.e. using specific prompts, stop tokens, sampling, et cetera. Tasks are highly configurable. A model may be shared by multiple tasks.

A memory is a database that stores chunks of text, and allows retrieval of such chunks using vector similarity (where each chunk has a vector calculated as an embedding from an LLM). Memories can be re-used between tasks.

classDiagram
    class Model {
        String name
        Boolean use_gpu
        Int context_length
    }
    class Task {
        String name
        String prompts
        Inference parameters
    }
    class Memory {
        String name
        Chunking parameters
    }
    class Chunk {
        String text
        Vector embedding
    }

    Task "*" --> "1" Model: Generation model
    Task "*" --> "1" Memory
    Memory "*"--> "1" Model: Embedding model
    Memory "1" --> "*" Chunk: Stored chunks

The API exposes models, tasks and memories by their name (which is unique within their category).

Tasks

A task configures the way user input is transformed before it is fed to an LLM and the way the LLM output is transformed before it is returned to the user, in order to perform a specific task. A task can be configured to use (optional) prelude, prefix and postfix prompts. The prelude is fed once to the model for each session. The prefix and postfix are applied to each user input (i.e. each chat message):

sequenceDiagram
    actor User
    Task->>LLM: Prelude
    loop
        User->>+Task: Prompt
		alt When recall is enabled
			Task->>LLM: Recalled memory items (based on user prompt)
		end
        Task->>+LLM: Prefix
        Task->>LLM: Prompt
        Task->>LLM: Postfix

        note over LLM: Generate until stop token occurs or token/context limit reached
        LLM->>-Task: Response
        Task->>-User: Completion
    end

When biasing is enabled, an optional bias prompt can be configured. When configured the model will be asked to generate a response (following the flow as shown above). This response is however not directly returned to the user. Instead, the bias prompt is then fed, after which the biaser is enabled (and the biased response is returned to the user).

sequenceDiagram
    actor User
    Task->>LLM: Prelude
    loop
        User->>+Task: Prompt
		alt When recall is enabled
			Task->>LLM: Recalled memory items (based on user prompt)
		end
        Task->>+LLM: Prefix
        Task->>LLM: Prompt
        Task->>LLM: Postfix

        note over LLM: Generate until stop token occurs or token/context limit reached
        Task->>LLM: Bias prompt
        note over LLM: Generate with biaser enabled

        LLM->>-Task: Response (biased)
        Task->>-User: Completion
    end

Architecture

Poly is divided into separate crates that can be used independently:

  • poly-server: Serve LLMs through HTTP and WebSocket APIs (provides llmd)
  • poly-backend: Back-end implementation of LLM tasks
  • poly-extract: Crate for extracting plaintext from various document types
  • poly-bias: Crate for biasing LLM output to e.g. JSON following a schema
  • poly-ui: Simple desktop UI for local LLMs

Applications that want to employ Poly's functionality should use the HTTP REST API exposed by poly-server. Rust applications looking to integrate Poly's capabilities could also depend on poly-backend directly.

flowchart TD

subgraph Poly
	PW[poly web-client]
	PS[poly-server]
	PB[poly-backend]
	PE[poly-extract]
	PU[poly-ui]
	PBs[poly-bias]

	PW-->|HTTP,WS,SSE|PS
	PB-->PBs
	PS<-.->PE

	PS-->PB
	PU-->PB
end

LLM[rustformers/llm]
PBs<-.->LLM
PB<-.->Qdrant

T[tokenizers]
GGML
Metal
PB-->LLM
LLM<-.->T
LLM-->GGML
GGML-->Metal
GGML-->CUDA

Authors

License

Poly is licensed under the Apache 2.0 license from Git revision f329d35 onwards (only). Kindly note that this license provides no entitlement to support whatsoever. If you have any specific licensing or support requirements, please contact the authors.

We will accept contributions only if accompanied with a written statement that the contributed code is also (at least) Apache 2.0 licensed.

Specific licenses apply to the following files:

You might also like...
Zero-Copy reading and writing of geospatial data.

GeoZero Zero-Copy reading and writing of geospatial data. GeoZero defines an API for reading geospatial data formats without an intermediate represent

OpenStreetMap flatdata format and compiler
OpenStreetMap flatdata format and compiler

osmflat Flat OpenStreetMap (OSM) data format providing an efficient random data access through memory mapped files. The data format is described and i

A traffic simulation game exploring how small changes to roads affect cyclists, transit users, pedestrians, and drivers.
A traffic simulation game exploring how small changes to roads affect cyclists, transit users, pedestrians, and drivers.

A/B Street Ever been stuck in traffic on a bus, wondering why is there legal street parking instead of a dedicated bus lane? A/B Street is a game expl

Calculates a stars position and velocity in the cartesian coordinate system.
Calculates a stars position and velocity in the cartesian coordinate system.

SPV Calculates a stars position and velocity in the cartesian coordinate system. Todo Expand the number of available operation Batch processing by tak

A set of tools for generating isochrones and reverse isochrones from geographic coordinates

This library provides a set of tools for generating isochrones and reverse isochrones from geographic coordinates. It leverages OpenStreetMap data to construct road networks and calculate areas accessible within specified time limits.

Rust crate for performing coordinate transforms
Rust crate for performing coordinate transforms

Synopsis A Rust crate use for performing coordinate transformations. The crate relies on nalgebra vectors to perform the coordinate transformations. C

Rust bindings for GDAL

gdal [] GDAL bindings for Rust. So far, you can: open a raster dataset for reading/writing get size and number of bands get/set projection and geo-tra

Rust bindings for the latest stable release of PROJ

PROJ Coordinate transformation via bindings to the PROJ v7.2.1 API. Two coordinate transformation operations are currently provided: projection (and i

Geohash for Rust

Rust-Geohash Rust-Geohash is a Rust library for Geohash algorithm. Ported from node-geohash module. Documentation Docs Check the API doc at docs.rs Li

Owner
Tommy van der Vorst
Partner at Dialogic, providing insights in the domain of AI/ML and telecommunications, backed by meaningful and innovative data analysis.
Tommy van der Vorst
An fast, offline reverse geocoder (>1,000 HTTP requests per second) in Rust.

Rust Reverse Geocoder A fast reverse geocoder in Rust. Inspired by Python reverse-geocoder. Links Crate 2.0.0 Docs 1.0.1 Docs Description rrgeo takes

Grant Miner 91 Dec 29, 2022
Blazing fast and lightweight PostGIS vector tiles server

Martin Martin is a PostGIS vector tiles server suitable for large databases. Martin is written in Rust using Actix web framework. Requirements Install

Urbica 921 Jan 7, 2023
Fast Geospatial Feature Storage API

Hecate OpenStreetMap Inspired Data Storage Backend Focused on Performance and GeoJSON Interchange Hecate Feature Comparison Feature Hecate ESRI MapSer

Mapbox 243 Dec 19, 2022
Kubernetes leader election using the coordination.k8s.io API.

Kube Coordinate Kubernetes leader election using the coordination.k8s.io API. Kube Coordinate builds upon the kube-rs ecosystem, and implements a stre

Anthony Dodd 4 Dec 14, 2023
Didactic implementation of the type checker described in "Complete and Easy Bidirectional Typechecking for Higher-Rank Polymorphism" written in OCaml

bidi-higher-rank-poly Didactic implementation of the type checker described in "Complete and Easy Bidirectional Typechecking for Higher-Rank Polymorph

Søren Nørbæk 23 Oct 18, 2022
Computational Geometry library written in Rust

RGeometry Computational Geometry in Rust! What is RGeometry? RGeometry is a collection of data types such as points, polygons, lines, and segments, an

null 101 Dec 6, 2022
Optimized geometry primitives for Microsoft platforms with the same memory layout as DirectX and Direct2D and types.

geoms Geometry for Microsoft platforms - a set of geometry primitives with memory layouts optimized for native APIs (Win32, Direct2D, and Direct3D). T

Connor Power 2 Dec 11, 2022
Geospatial primitives and algorithms for Rust

geo Geospatial Primitives, Algorithms, and Utilities The geo crate provides geospatial primitive types such as Point, LineString, and Polygon, and pro

GeoRust 989 Dec 29, 2022
Geospatial primitives and algorithms for Rust

geo Geospatial Primitives, Algorithms, and Utilities The geo crate provides geospatial primitive types such as Point, LineString, and Polygon, and pro

GeoRust 990 Jan 1, 2023
TIFF decoding and encoding library in pure Rust

image-tiff TIFF decoding and encoding library in pure Rust Supported Features Baseline spec (other than formats and tags listed below as not supported

image-rs 66 Dec 30, 2022