Probabilistically split concatenated words using NLP based on English Wikipedia unigram frequencies.

Overview

Untanglr

Untanglr

Untanglr takes in a some mangled words and makes sense out of them so you dont have to. It goes through the input and splits it probabilistically into words. The crate includes both a bin.rs and a lib.rs to facilitate both usage as a command line utility, and as a library that you can use in your code.

Usage

Pass the tangled words as a cli argument:

$ untanglr thequickbrownfoxjumpedoverthelazydog
the quick brown fox jumped over the lazy dog

Or use it in your projects:

extern crate untanglr;

fn main() {
	let lm = untanglr::LanguageModel::new();
	println!("{:?}", lm.untangle("helloworld"));
}

Installation

If you find that untanglr might be useful on your machine you can install it. Just make sure cargo is installed and run:

$ cargo install untanglr

Note: Don't be discouraged if this project hasn't been updated in a while. I will address potential issues but the crate does not need regular updates.

Credits

I have developed this project around Derek Anderson's wordninja python implementation for some exercising in rust while producing something useful.

You might also like...
Neural network transition-based dependency parser (in Rust)

dpar Introduction dpar is a neural network transition-based dependency parser. The original Go version can be found in the oldgo branch. Dependencies

Simple STM32F103 based glitcher FW

Airtag glitcher (Bluepill firmware) Simple glitcher firmware running on an STM32F103 on a bluepill board. See https://github.com/pd0wm/airtag-dump for

Difftastic is an experimental structured diff tool that compares files based on their syntax.
Difftastic is an experimental structured diff tool that compares files based on their syntax.

Difftastic is an experimental structured diff tool that compares files based on their syntax.

Vaporetto: a fast and lightweight pointwise prediction based tokenizer

🛥 VAporetto: POintwise pREdicTion based TOkenizer Vaporetto is a fast and lightweight pointwise prediction based tokenizer. Overview This repository

Checks all your documentation for spelling and grammar mistakes with hunspell and a nlprule based checker for grammar

cargo-spellcheck Check your spelling with hunspell and/or nlprule. Use Cases Run cargo spellcheck --fix or cargo spellcheck fix to fix all your docume

A backend for mdBook written in Rust for generating PDF based on headless chrome and Chrome DevTools Protocol.

A backend for mdBook written in Rust for generating PDF based on headless chrome and Chrome DevTools Protocol.

A small rust library for creating regex-based lexers

A small rust library for creating regex-based lexers

A rule based sentence segmentation library.

cutters A rule based sentence segmentation library. 🚧 This library is experimental. 🚧 Features Full UTF-8 support. Robust parsing. Language specific

🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.

🐍 python-vaporetto 🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto. Installation

Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models

rust-tokenizers Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigra

null 165 Jan 1, 2023
Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...)

rust-bert Rust native Transformer-based models implementation. Port of Hugging Face's Transformers library, using the tch-rs crate and pre-processing

null 1.3k Jan 8, 2023
Simple NLP in Rust with Python bindings

vtext NLP in Rust with Python bindings This package aims to provide a high performance toolkit for ingesting textual data for machine learning applica

Roman Yurchak 133 Jan 3, 2023
An NLP-suite powered by deep learning

DeepFrog - NLP Suite Introduction DeepFrog aims to be a (partial) successor of the Dutch-NLP suite Frog. Whereas the various NLP modules in Frog wre b

Maarten van Gompel 16 Feb 28, 2022
Rust-nlp is a library to use Natural Language Processing algorithm with RUST

nlp Rust-nlp Implemented algorithm Distance Levenshtein (Explanation) Jaro / Jaro-Winkler (Explanation) Phonetics Soundex (Explanation) Metaphone (Exp

Simon Paitrault 34 Dec 20, 2022
Which words can you spell using only element abbreviations from the periodic table?

Periodic Words Have you ever wondered which words you can spell using only element abbreviations from the periodic table? Well thanks to this extremel

J Spencer 11 Apr 26, 2021
Common stop words in a variety of languages

About Stop words are words that don't carry much meaning, and are typically removed as a preprocessing step before text analysis or natural language p

Chris McComb 9 Dec 9, 2022
Predict next words (`・ω・´)

Mocword Predict next words (`・ω・´) Installation Important: You must prepare Mocword dataset in advance. See below (Dataset and Environment Variable).

high-moctane 36 Dec 3, 2022
SHA256 sentence: discover a SHA256 checksum that matches a sentence's description of hex digit words.

SHA256 sentence "The SHA256 for this sentence begins with: one, eight, two, a, seven, c and nine." Inspired by @lauriewired post Inspired by @humbleha

Joel Parker Henderson 16 Oct 9, 2023
A small CLI utility for helping you learn japanese words made in rust 🦀

Memofante (Clique aqui ver em português) Memofante is here, a biiiig help: Do you often forget japanese words you really didn't want to forget? Do you

Tiaguinho 3 Nov 4, 2023