(Read-only) Generate n-grams

Overview

N-grams

Documentation

This crate takes a sequence of tokens and generates the n-grams for it. For more information about n-grams, see Wikipedia: https://en.wikipedia.org/wiki/N-gram

Note: The canonical version of this crate is hosted on GitLab.

Usage

The easiest way to use the crate is probably the iterator adaptor. If your tokens are strings (&str, String, char, or Vec), you only need to produce the token stream:

use ngrams::Ngram;
let grams: Vec<_> = "one two three".split(' ').ngrams(2).collect();
// => vec![
//        vec!["\u{2060}", "one"],
//        vec!["one", "two"],
//        vec!["two", "three"],
//        vec!["three", "\u{2060}"],
//    ]

(re: the "\u{2060}": We use the Unicode WORD JOINER character, U+2060, as padding at the beginning and end of the token stream.)
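To make the padding concrete, here is a minimal standard-library sketch that reproduces the bigram output shown above. It is not the crate's implementation: the function name `padded_ngrams` is hypothetical, and padding each end with n - 1 symbols is an assumption that happens to match the n = 2 example.

```rust
/// Padding symbol: U+2060 WORD JOINER, as in the example above.
const PAD: &str = "\u{2060}";

/// Hypothetical helper (not part of the crate): returns the n-grams of
/// `tokens`, padded with n - 1 copies of `PAD` on each side.
/// Assumption: for n = 1 no padding is added, so each token is its own gram.
fn padded_ngrams<'a>(tokens: &[&'a str], n: usize) -> Vec<Vec<&'a str>> {
    // `windows(0)` would panic, so treat n = 0 as "no grams".
    if n == 0 {
        return Vec::new();
    }
    let pad = n - 1;
    let mut padded: Vec<&'a str> = Vec::with_capacity(tokens.len() + 2 * pad);
    padded.extend(std::iter::repeat(PAD).take(pad));
    padded.extend_from_slice(tokens);
    padded.extend(std::iter::repeat(PAD).take(pad));
    // Every contiguous window of length n is one n-gram.
    padded.windows(n).map(|w| w.to_vec()).collect()
}
```

Collecting `"one two three".split(' ')` into a `Vec<&str>` and calling `padded_ngrams(&tokens, 2)` yields the four bigrams listed above, with the word joiner in the first and last gram.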

If your token type isn't one of the listed types, you can still use the iterator adaptor by implementing the ngrams::Pad trait for your type.
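The idea of a padding trait can be illustrated without the crate. In the sketch below, `PadSymbol`, `Token`, and `ngrams_of` are all hypothetical names; the trait is a local stand-in and may not match the signature of the crate's own Pad trait, so check the crate documentation before implementing the real one.

```rust
/// Hypothetical stand-in for a padding trait; NOT the crate's actual definition.
trait PadSymbol {
    /// The value used to pad both ends of the token stream.
    fn pad_symbol() -> Self;
}

/// A custom, non-string token type (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    /// An index into some vocabulary.
    Word(u32),
    /// Marker used only for padding.
    Boundary,
}

impl PadSymbol for Token {
    fn pad_symbol() -> Self {
        Token::Boundary
    }
}

/// Padded n-grams over any cloneable token type that supplies a pad symbol.
/// Assumption: each end is padded with n - 1 symbols.
fn ngrams_of<T: PadSymbol + Clone>(tokens: &[T], n: usize) -> Vec<Vec<T>> {
    // `windows(0)` would panic, so treat n = 0 as "no grams".
    if n == 0 {
        return Vec::new();
    }
    let pad = n - 1;
    let mut padded: Vec<T> = Vec::with_capacity(tokens.len() + 2 * pad);
    padded.extend((0..pad).map(|_| T::pad_symbol()));
    padded.extend_from_slice(tokens);
    padded.extend((0..pad).map(|_| T::pad_symbol()));
    padded.windows(n).map(|w| w.to_vec()).collect()
}
```

With this sketch, `ngrams_of(&[Token::Word(1), Token::Word(2)], 2)` produces three bigrams whose first and last entries contain `Token::Boundary`. The only thing the token type has to provide is its own padding value.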

Comments
  • Fails when size is 1.

    Hi, I tried to create a monogram with this, and it fails like this:

    let grams: Vec<_> = "one two three".split(' ').ngrams(1).collect();
    println!("{:?}", grams);
    
    [["one"], ["one", "two"], ["two", "three"]]
    

    I expected this to return

    [["one"], ["two"], ["three"]]
    
    opened by reitermarkus 0
Owner
Paul Woolcock
Software Developer who loves writing Rust