llama-node
Large Language Model LLaMA on Node.js
This project is at an early stage; the Node.js API may change in the future, so use it with caution.
Picture generated by Stable Diffusion.
Introduction
This is a Node.js client library for the LLaMA LLM, built on top of llama-rs. It uses napi-rs to pass messages over a channel between Node.js and the LLaMA inference thread.
Currently supported platforms:
- darwin-x64
- darwin-arm64
- linux-x64-gnu
- win32-x64-msvc
I do not have hardware for testing 13B or larger models, but I have verified that the LLaMA 7B model works with both ggml LLaMA and ggml Alpaca weights.
Getting the weights
llama-node uses llama-rs under the hood and accepts the model format derived from llama.cpp. Because the model released by Meta is licensed for research purposes only, this project does not provide model downloads. If you have obtained the original .pth model, please read the Getting the weights document and use the conversion tool provided by llama-rs.
Model versioning
There are currently three model file formats from the llama.cpp community:
- GGML: legacy format, oldest ggml tensor file format
- GGMF: also legacy format, newer than GGML, older than GGJT
- GGJT: mmap-able format
The llama-rs backend currently supports only GGML/GGMF models, and so does llama-node. For GGJT (mmap) model support, please wait for the standalone loader to be merged.
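If you are not sure which format a model file is in, you can check the file magic in its first four bytes. The snippet below is not part of llama-node's API; it is a small standalone sketch that only assumes the llama.cpp magic values (GGML 0x67676d6c, GGMF 0x67676d66, GGJT 0x67676a74) and the standard Node.js fs module.
import fs from "fs";

// Known llama.cpp file magics, read as a little-endian uint32 at offset 0.
const MAGICS: Record<number, string> = {
    0x67676d6c: "GGML (legacy, unversioned)",
    0x67676d66: "GGMF (legacy, versioned)",
    0x67676a74: "GGJT (mmap-able)",
};

// Read only the first four bytes of the model file and report its format.
const detectModelFormat = (modelPath: string): string => {
    const fd = fs.openSync(modelPath, "r");
    const buf = Buffer.alloc(4);
    fs.readSync(fd, buf, 0, 4, 0);
    fs.closeSync(fd);
    return MAGICS[buf.readUInt32LE(0)] ?? "unknown format";
};

console.log(detectModelFormat("./ggml-alpaca-7b-q4.bin"));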
Usage
The current version supports only one inference session per LLama instance at a time.
If you wish to run multiple inference sessions concurrently, create multiple LLama instances, as in the sketch below.
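For illustration, here is a minimal sketch of two concurrent sessions. It reuses the constructor options and completion parameters from the Inference example below; the prompts are arbitrary.
import path from "path";
import { LLamaClient } from "llama-node";

const model = path.resolve(process.cwd(), "./ggml-alpaca-7b-q4.bin");
const config = { path: model, numCtxTokens: 128 };
const params = {
    numPredict: 128,
    temp: 0.2,
    topP: 1,
    topK: 40,
    repeatPenalty: 1,
    repeatLastN: 64,
    seed: 0,
    feedPrompt: true,
};

// One LLamaClient per concurrent inference session.
const llamaA = new LLamaClient(config, true);
const llamaB = new LLamaClient(config, true);

// Both sessions stream to stdout, so their tokens may interleave.
llamaA.createTextCompletion({ ...params, prompt: "Tell me a joke." }, (res) =>
    process.stdout.write(res.token)
);
llamaB.createTextCompletion({ ...params, prompt: "Write a haiku." }, (res) =>
    process.stdout.write(res.token)
);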
Inference
import path from "path";
import { LLamaClient } from "llama-node";

// Point the client at a local quantized ggml model file.
const model = path.resolve(process.cwd(), "./ggml-alpaca-7b-q4.bin");

const llama = new LLamaClient(
    {
        path: model,
        numCtxTokens: 128,
    },
    true
);

// Wrap the user input in an Alpaca-style instruction template.
const template = `how are you`;
const prompt = `Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

${template}

### Response:`;

llama.createTextCompletion(
    {
        prompt,
        numPredict: 128,
        temp: 0.2,
        topP: 1,
        topK: 40,
        repeatPenalty: 1,
        repeatLastN: 64,
        seed: 0,
        feedPrompt: true,
    },
    // Stream generated tokens to stdout as they arrive.
    (response) => {
        process.stdout.write(response.token);
    }
);
Chatting
This works with Alpaca and simply builds a context of Alpaca instructions. Make sure your last message ends with the user role.
import { LLamaClient } from "llama-node";
import path from "path";

const model = path.resolve(process.cwd(), "./ggml-alpaca-7b-q4.bin");

const llama = new LLamaClient(
    {
        path: model,
        numCtxTokens: 128,
    },
    true
);

const content = "how are you?";

llama.createChatCompletion(
    {
        messages: [{ role: "user", content }],
        numPredict: 128,
        temp: 0.2,
        topP: 1,
        topK: 40,
        repeatPenalty: 1,
        repeatLastN: 64,
        seed: 0,
    },
    (response) => {
        // Only print tokens until the stream reports completion.
        if (!response.completed) {
            process.stdout.write(response.token);
        }
    }
);
Tokenize
Get the tokenization result from LLaMA.
import { LLamaClient } from "llama-node";
import path from "path";

const model = path.resolve(process.cwd(), "./ggml-alpaca-7b-q4.bin");

const llama = new LLamaClient(
    {
        path: model,
        numCtxTokens: 128,
    },
    true
);

const content = "how are you?";

llama.tokenize(content).then(console.log);
Embedding
This is a preview; the embedding end token may change in the future. Do not use it in production!
import { LLamaClient } from "llama-node";
import path from "path";

const model = path.resolve(process.cwd(), "./ggml-alpaca-7b-q4.bin");

const llama = new LLamaClient(
    {
        path: model,
        numCtxTokens: 128,
    },
    true
);

const prompt = `how are you`;

llama
    .getEmbedding({
        prompt,
        numPredict: 128,
        temp: 0.2,
        topP: 1,
        topK: 40,
        repeatPenalty: 1,
        repeatLastN: 64,
        seed: 0,
        feedPrompt: true,
    })
    .then(console.log);
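As a sketch of how the resulting vector might be used, here is a plain cosine-similarity helper. It assumes getEmbedding resolves to a flat array of numbers, which may not hold once the embedding API is finalized; embeddingParams in the commented usage is a hypothetical object holding the parameters shown above.
// Cosine similarity between two embedding vectors.
const cosineSimilarity = (a: number[], b: number[]): number => {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

// Hypothetical usage: compare two prompts by their embeddings.
// const [e1, e2] = await Promise.all([
//     llama.getEmbedding({ ...embeddingParams, prompt: "how are you" }),
//     llama.getEmbedding({ ...embeddingParams, prompt: "how do you feel" }),
// ]);
// console.log(cosineSimilarity(e1, e2));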
Performance related
We provide prebuilt binaries for linux-x64, win32-x64, apple-x64 and apple-silicon. For other platforms, please install a Rust environment before installing the npm package so that the binding can be built locally.
Due to the complexity of cross compilation, it is hard to pre-build a binary that delivers the best performance on every platform.
If you run into performance issues, I strongly suggest compiling manually; otherwise you will have to wait for a better pre-compiled native binding. I am investigating how to produce a build matrix covering more platforms.
Manual compilation (from node_modules)
The following steps will allow you to compile the binary with the best quality on your platform:
- Prerequisite: install Rust
- Under the node_modules/@llama-node/core folder, run:
npm run build
Manual compilation (from source)
The following steps will allow you to compile the binary with the best quality on your platform:
- Prerequisite: install Rust
- Under the root folder, run:
npm install && npm run build
- Under the packages/core folder, run:
npm run build
- You can then use the dist folder under the root folder
Install
npm install llama-node
Future plan
- prompt extensions
- more platforms and cross compile (performance related)
- tweak embedding API, make end token configurable
- CLI and interactive mode
- support more open-source models, as planned by llama-rs (rustformers/llama-rs#85, rustformers/llama-rs#75)