12 Repositories
Rust tokenizer Libraries
A tool that allow you to run SQL-like query on local files instead of database files using the GitQL SDK.
FileQL - File Query Language FileQL is a tool that allow you to run SQL-like query on local files instead of database files using the GitQL SDK. Sampl
The Bytepiece Tokenizer Implemented in Rust.
bytepiece Implementation of Su's bytepiece. Bytepiece is a new tokenize method, which uses UTF-8 Byte as unigram to process text. It needs little prep
Viterbi-based accelerated tokenizer (Python wrapper)
π python-vibrato π€ Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm. This is a Python wra
ik-analyzer for rust; chinese tokenizer for tantivy
ik-rs ik-analyzer for Rust support Tantivy Usage Chinese Segment let mut ik = IKSegmenter::new(); let text = "δΈεδΊΊζ°ε ±εε½"; let tokens = ik.to
A type-safe, high speed programming language for scalable systems
A type-safe, high speed programming language for scalable systems! (featuring a cheesy logo!) note: the compiler is unfinished and probably buggy. if
π₯ Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
π python-vaporetto π₯ Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto. Installation
A WHATWG-compliant HTML5 tokenizer and tag soup parser
html5gum html5gum is a WHATWG-compliant HTML tokenizer. use std::fmt::Write; use html5gum::{Tokenizer, Token}; let html = "title hello world/tit
Vaporetto: a fast and lightweight pointwise prediction based tokenizer
π₯ VAporetto: POintwise pREdicTion based TOkenizer Vaporetto is a fast and lightweight pointwise prediction based tokenizer. Overview This repository
SQL / SQLI tokenizer parser analyzer
libinjection SQL / SQLI tokenizer parser analyzer. For C and C++ PHP Python Lua Java (external port) [LuaJIT/FFI]
Rust wrapper for the BlingFire tokenization library
BlingFire in Rust blingfire is a thin Rust wrapper for the BlingFire tokenization library. Add the library to Cargo.toml to get started cargo add blin
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models
rust-tokenizers Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigra
A morphological analysis library.
Lindera A Japanese morphological analysis library in Rust. This project fork from fulmicoton's kuromoji-rs. Lindera aims to build a library which is e