Composable n-gram combinators that are ergonomic and bare-metal fast

Overview

CREATURE FEATUR(ization)

A crate for polymorphic ML & NLP featurization that leverages zero-cost abstraction. It provides composable n-gram combinators that are ergonomic and bare-metal fast. Although created with NLP in mind, it's very general and can be applied in a plethera of domains such as computer vision.

There are many n-gram crates, but the majority force heap allocation or lock you into a concrete type that doesn’t fit your use-case or performance needs. In most benchmarks, creature_feature is anywhere between 4x - 60x faster.

Image

See a live demo

Here is a live demo of creature_feature using WASM

A Swiss Army Knife of Featurization

use creature_feature::traits::Ftzr;
use creature_feature::ftzrs::{bigram, bislice, for_each, whole};
use creature_feature::HashedAs;
use creature_feature::convert::Bag;
use std::collections::{HashMap, HashSet, BTreeMap, LinkedList, BTreeSet, BinaryHeap, VecDeque};

let int_data = &[1, 2, 3, 4, 5];
let str_data = "one fish two fish red fish blue fish";

// notice how the left-hand side remains almost unchanged.

// we're using 'bislice' right now (which is a 2-gram of referenced data), 'ftzrs::bigram' would yield owned data instead of references

let ref_feats: Vec<&str>                  = bislice().featurize(str_data);
let ref_feats: LinkedList<&[u8]>          = bislice().featurize(str_data);
let ref_bag:   Bag<HashMap<&[usize], u8>> = bislice().featurize(int_data);
let ref_trigram_bag:   Bag<BTreeMap<&str, i16>>   = for_each(whole()).featurize(str_data.split_ascii_whitespace());
let hashed_trigrams: BTreeSet<HashedAs<u64>> = trislice().featurize(int_data);

The above five will have the following values, respectively.

["on", "ne", "e ", " f", "fi", "is", "sh", "h ", " t", "tw", "wo", "o ", " f", "fi", "is", "sh", "h ", " r", "re", "ed", "d ", " f", "fi", "is", "sh", "h ", " b", "bl", "lu", "ue", "e ", " f", "fi", "is", "sh"]

[[111, 110], [110, 101], [101, 32], [32, 102], [102, 105], [105, 115], [115, 104], [104, 32], [32, 116], [116, 119], [119, 111], [111, 32], [32, 102], [102, 105], [105, 115], [115, 104], [104, 32], [32, 114], [114, 101], [101, 100], [100, 32], [32, 102], [102, 105], [105, 115], [115, 104], [104, 32], [32, 98], [98, 108], [108, 117], [117, 101], [101, 32], [32, 102], [102, 105], [105, 115], [115, 104]]

Bag({[2, 3, 4]: 1, [3, 4, 5]: 1, [1, 2, 3]: 1})

Bag({"blue": 1, "fish": 4, "one": 1, "red": 1, "two": 1})

{HashedAs(3939941806544028562), HashedAs(7191405660579021101), HashedAs(16403185381100005216)}

Here are more examples of what's possible:

 // let's now switch to 'bigram'
let owned_feats: BTreeSet<[u8; 2]>        = bigram().featurize(str_data);
let owned_feats: Vec<String>              = bigram().featurize(str_data);
let owned_feats: HashSet<Vec<usize>>      = bigram().featurize(int_data);
let owned_bag:   Bag<HashMap<Vec<usize>, u16>>      = bigram().featurize(int_data);

let hashed_feats: BinaryHeap<HashedAs<u32>> = bislice().featurize(str_data);
let hashed_feats: VecDeque<HashedAs<u64>>   =  bigram().featurize(int_data);

let sentence = str_data.split_ascii_whitespace();
let bag_of_words: Bag<HashMap<String, u128>> = for_each(bigram()) .featurize(sentence.clone());
let bag_of_words: Bag<HashMap<&str, u8>>     = for_each(bislice()).featurize(sentence.clone());

 // and many, MANY more posibilities

We can even produce multiple outputs while still only featurizing the input once

let (set, list): (BTreeSet<HashedAs<u64>>, Vec<&str>) = bislice().featurize_x2(str_data);

creature_feature provides three general flavors of featurizers:

  1. NGram<const N: usize> provides n-grams over copied data and produces owned data or multiple [T;N]. Examples include ftzrs::n_gram, ftzrs::bigram and ftzrs::trigram.

  2. SliceGram provides n-grams over referenced data and produces owned data or multiple &[T]. Examples include ftzrs::n_slice, ftzrs::bislice and ftzrs::trislice.

  3. Combinators that compose one or more featurizers and return a new featurizer with different behavior. Examples include ftzrs::for_each, ftzrs::gap_gram, featurizers! and ftzrs::bookends.

WHY POLYMORPHISM == PERFORMANCE

Here is a small quiz to show why polymorphic featurization and FAST featurization go hand-in-hand.

Here are four different ways to featurize a string that are basically equivalent. But, which one of the four is fastest? By how much?

let sentence = "It is a truth universally acknowledged that Jane Austin must be used in nlp examples";

let one:   Vec<String> = trigram().featurize(sentence);
let two:   Vec<[u8;3]> = trigram().featurize(sentence);
let three: Vec<&str>   = trislice().featurize(sentence); // same performance as &[u8]
let four:  Vec<HashedAs<u64>> = trislice().featurize(sentence); // could have used trigram

Trigrams of String, HashedAs<u64>, &str and [u8; 3] each have their place depending on your use-case. But there can be roughly two orders of magnitude of difference in performance between the fastest and the slowest. If you choose the wrong one for your needs (or use a less polymorphic crate), you're losing out on speed!

What type should I use to represent my features?

  • use Collection<[T; N]> via ftzrs::n_gram if both T and N are small. This is most of the time.

  • use Collection<&[T]> (or Collection<&str>) via ftzrs::n_slice if [T; N] would be larger (in bytes) than (usize, usize). This is more common if N is large or you're using char instead of u8. This is also depends on the lifetime of the original data vs the lifetime of the features produced.

  • HashedAs<u64> has the opposite time complexity as &[T], linear creation and O(1) equality. If you're ok with one-in-a-millionish hash collisions, this can be a great compromise.

  • Never use Collection<String> or Collection<Vec<T>> in a performance critical section.

Where does creature_feature fit in with other tokenizers?

creature_feature is very flexible, and traits::Ftzr/traits::IterFtzr can be easily implemented with a newtype for whatever other tokenizer/featurizer you please. Anything could be featurized: images, documents, time-series data, etc.

Example: Featurizing books

Consider a custom struct to represent a book

struct Book {
   author: String,
   genre: Genre,
   sub_genre: SubGenre,
   year: u16,
}

#[derive(Hash)]
enum Genre {
   Fiction,
   NonFiction,
   Religion,
}

#[derive(Hash)]
enum SubGenre {
   Romance,
   History,
   DataScience,
}

impl Book {
   fn decade(&self) -> u8 {
       unimplemented!()
   }
}

We can easily make a custom featurizer for Book by visitation with traits::Ftzr.

use creature_feature::ftzrs::whole;
use creature_feature::traits::*;
use creature_feature::HashedAs;

struct BookFtzr;

impl<'a> Ftzr<&'a Book> for BookFtzr {
   type TokenGroup = HashedAs<u64>;
   fn push_tokens<Push: FnMut(Self::TokenGroup)>(&self, book: &'a Book, push: &mut Push) {
       whole().push_tokens_from(&book.author, push);
       push(FeatureFrom::from(&book.genre));
       push(FeatureFrom::from(&book.sub_genre));
       push(FeatureFrom::from(book.year));
       push(FeatureFrom::from(book.decade()));
   }
}

Now we could easily implement a similarity metric for Book via Vec<HashedAs<u64>>, like cosine or jaccard.

Usage notes

  • No bounds checking is performed. This is the responsibility of the user.
  • To handle unicode, convert to Vec<char>

YOU CAN HELP

I'm actually an English teacher, not a dev. So any PRs, observations or feedback is very welcome. I've done my best to document everything well, but if you have any questions feel free to reach out. Your help with any of the small number of current issues would be VERY much welcome :)

You might also like...
Blazingly fast framework for in-process microservices on top of Tower ecosystem
Blazingly fast framework for in-process microservices on top of Tower ecosystem

norpc = not remote procedure call Motivation Developing an async application is often a very difficult task but building an async application as a set

Ultra-fast, spookily accurate text summarizer that works on any language
Ultra-fast, spookily accurate text summarizer that works on any language

pithy 0.1.0 - an absurdly fast, strangely accurate, summariser Quick example: pithy -f your_file_here.txt --sentences 4 --help: Print this help messa

A lightning-fast Sanskrit toolkit. For Python bindings, see `vidyut-py`.

Vidyut मा भूदेवं क्षणमपि च ते विद्युता विप्रयोगः ॥ Vidyut is a lightning-fast toolkit for processing Sanskrit text. Vidyut aims to provide standard co

Checks all your documentation for spelling and grammar mistakes with hunspell and a nlprule based checker for grammar

cargo-spellcheck Check your spelling with hunspell and/or nlprule. Use Cases Run cargo spellcheck --fix or cargo spellcheck fix to fix all your docume

A Markdown to HTML compiler and Syntax Highlighter, built using Rust's pulldown-cmark and tree-sitter-highlight crates.

A blazingly fast( possibly the fastest) markdown to html parser and syntax highlighter built using Rust's pulldown-cmark and tree-sitter-highlight crate natively for Node's Foreign Function Interface.

Text calculator with support for units and conversion

cpc calculation + conversion cpc parses and evaluates strings of math, with support for units and conversion. 128-bit decimal floating points are used

A command-line tool and library for generating regular expressions from user-provided test cases
A command-line tool and library for generating regular expressions from user-provided test cases

Table of Contents What does this tool do? Do I still need to learn to write regexes then? Current features How to install? 4.1 The command-line tool 4

Find and replace text in source files
Find and replace text in source files

Ruplacer Find and replace text in source files: $ ruplacer old new src/ Patching src/a_dir/sub/foo.txt -- old is everywhere, old is old ++ new is ever

An efficient and powerful Rust library for word wrapping text.

Textwrap Textwrap is a library for wrapping and indenting text. It is most often used by command-line programs to format dynamic output nicely so it l

Comments
  • impl `Ftzr` in terms of `Ftzr`, not `IterFtzr` for `GapGram` & `Bookends`

    impl `Ftzr` in terms of `Ftzr`, not `IterFtzr` for `GapGram` & `Bookends`

    Some featurization combinators implement Ftzr by using a where Self: IterFtzr<Origin>. This slightly limits the usability of user-defined featurizers that don't implement IterFtzr, particularly with GapGram and Bookends. I'd love to see it changed to a clause of something like where A: Ftzr<Origin>, B: Ftzr<Origin>

    I tried to do this and it got hairy quick.

    opened by Lambda-Logan 0
  • Add support for pre-allocation of accumulators

    Add support for pre-allocation of accumulators

    I'm not 100% of the best way to do this with the Accumulates trait. As it stands, I've only benchmarked a 20% speed-up for pre-allocation in only certain situations where TokenGroup is very large. That's substantial-ish, but not urgent.

    opened by Lambda-Logan 0
Owner
Logan Diamond
null
Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance calculations and string search.

triple_accel Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance cal

Daniel Liu 75 Jan 8, 2023
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Main features: Train new vocabularies and tok

Hugging Face 6.2k Jan 5, 2023
A fast, low-resource Natural Language Processing and Text Correction library written in Rust.

nlprule A fast, low-resource Natural Language Processing and Error Correction library written in Rust. nlprule implements a rule- and lookup-based app

Benjamin Minixhofer 496 Jan 8, 2023
Fast and easy random number generation.

alea A zero-dependency crate for fast number generation, with a focus on ease of use (no more passing &mut rng everywhere!). The implementation is bas

Jeff Shen 26 Dec 13, 2022
Vaporetto: a fast and lightweight pointwise prediction based tokenizer

?? VAporetto: POintwise pREdicTion based TOkenizer Vaporetto is a fast and lightweight pointwise prediction based tokenizer. Overview This repository

null 184 Dec 22, 2022
Fast PDF password cracking utility equipped with commonly encountered password format builders and dictionary attacks.

PDFRip Fast PDF password cracking utility equipped with commonly encountered password format builders and dictionary attacks. ?? Table of Contents Int

Mufeed VH 226 Jan 4, 2023
🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.

?? python-vaporetto ?? Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto. Installation

null 17 Dec 22, 2022
Find files (ff) by name, fast!

Find Files (ff) Find Files (ff) utility recursively searches the files whose names match the specified RegExp pattern in the provided directory (defau

Vishal Telangre 310 Dec 29, 2022
Fast suffix arrays for Rust (with Unicode support).

suffix Fast linear time & space suffix arrays for Rust. Supports Unicode! Dual-licensed under MIT or the UNLICENSE. Documentation https://docs.rs/suff

Andrew Gallant 207 Dec 26, 2022
A fast implementation of Aho-Corasick in Rust.

aho-corasick A library for finding occurrences of many patterns at once with SIMD acceleration in some cases. This library provides multiple pattern s

Andrew Gallant 662 Dec 31, 2022