fastest text uwuifier in the west

uwuify

transforms

Hey... I think I really love you. Do you want a headpat?

into

hey... i think i w-weawwy wuv you. (⑅˘꒳˘) d-do you want a headpat?

there's an uwu'd version of this readme

faq

what?

u want large amounts of text uwu'd in a smol amount of time

where?

ur computer, if it has a recent x86 cpu (intel, amd) that supports sse4.1

why?

why not?

how?

tldr: 128-bit simd vectorization plus some big brain algos

after hours of research, i've finally understood the essence of uwu'd text

there are a few transformations:

  1. nya-ify (eg. naruhodo -> nyaruhodo)
  2. replace l and r with w
  3. stutter sometimes (hi -> h-hi)
  4. add a text emoji after punctuation (comma, period, or exclamation mark) sometimes
  5. replace some words (small -> smol, etc.)
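
as a rough illustration of rules 1, 2, and 5, here is a scalar sketch in rust. it is not the crate's simd code: the function names (nyaify, replace_lr, replace_words) are made up for this example, and both the exact nya-ify rule (insert a y after any n that is followed by a vowel) and the one-entry word table are assumptions.

```rust
// Scalar reference sketch only; all names and exact rules are hypothetical.

/// Rule 1 (assumed form): insert `y` after an `n` that is followed by a vowel.
fn nyaify(input: &str) -> String {
    let mut out = String::with_capacity(input.len() + input.len() / 4);
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        out.push(c);
        let next_is_vowel = matches!(
            chars.peek().copied(),
            Some('a' | 'e' | 'i' | 'o' | 'u')
        );
        if (c == 'n' || c == 'N') && next_is_vowel {
            out.push('y');
        }
    }
    out
}

/// Rule 2: replace `l` and `r` with `w`, keeping the case.
fn replace_lr(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            'l' | 'r' => 'w',
            'L' | 'R' => 'W',
            other => other,
        })
        .collect()
}

/// Rule 5 (one assumed table entry): plain substring replacement.
fn replace_words(input: &str) -> String {
    input.replace("small", "smol")
}
```

each function handles one rule on its own, so nyaify("naruhodo") gives "nyaruhodo" and replace_lr("hello world") gives "hewwo wowwd". the order in which the real passes run is not specified here.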

these transformation passes take advantage of sse4.1 vector intrinsics to process 16 bytes at once. for string searching, i'm using a custom simd implementation of the bitap algorithm for matching against multiple strings. for random number generation, i'm using XorShift32. for most character-level detection within simd registers, it's all masking and shifting to simulate basic state machines in parallel
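
for readers who haven't met the two named algorithms, below is a plain scalar sketch of each: XorShift32 with the common 13/17/5 shift triple, and a single-pattern shift-and bitap search. the implementation described above is a custom simd, multi-pattern variant, so treat this only as the textbook idea. the names XorShift32::new, next_u32, and bitap_find are hypothetical, and the shift constants are an assumption about which XorShift32 variant is used.

```rust
/// XorShift32 pseudo-random generator (assumed shift triple: 13, 17, 5).
struct XorShift32 {
    state: u32,
}

impl XorShift32 {
    /// The state must never be zero, so a zero seed is bumped to 1.
    fn new(seed: u32) -> Self {
        Self { state: if seed == 0 { 1 } else { seed } }
    }

    fn next_u32(&mut self) -> u32 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.state = x;
        x
    }
}

/// Shift-and bitap: bit i of `state` is set when pattern[0..=i] matches the
/// text ending at the current byte. Supports patterns of 1 to 32 bytes and
/// returns the start index of the first match.
fn bitap_find(text: &[u8], pattern: &[u8]) -> Option<usize> {
    let m = pattern.len();
    if m == 0 || m > 32 {
        return None;
    }
    // masks[b] has bit i set wherever pattern[i] == b
    let mut masks = [0u32; 256];
    for (i, &b) in pattern.iter().enumerate() {
        masks[b as usize] |= 1 << i;
    }
    let accept = 1u32 << (m - 1);
    let mut state = 0u32;
    for (pos, &b) in text.iter().enumerate() {
        // extend every partial match by one byte and start a new one at bit 0
        state = ((state << 1) | 1) & masks[b as usize];
        if state & accept != 0 {
            return Some(pos + 1 - m);
        }
    }
    None
}
```

a "sometimes" decision can then be as cheap as checking a few low bits of next_u32(), and the bitap state update is only shifts and masks, which is what makes it a good fit for simd registers.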

multithreading is supported, so u can exploit all of ur cpu cores for the noble goal of uwu-ing massive amounts of text
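
one simple way to spread this kind of work over cores is to split the input into chunks and hand each chunk to its own thread. the sketch below does that with std::thread::scope, using an in-place l/r -> w pass as a stand-in transformation. it is an illustration under assumptions (hypothetical function names, output the same size as the input), not how the crate actually schedules its threads.

```rust
use std::thread;

/// Stand-in per-byte transformation: replace l/r with w, keeping the case.
fn replace_lr_bytes(chunk: &mut [u8]) {
    for b in chunk.iter_mut() {
        *b = match *b {
            b'l' | b'r' => b'w',
            b'L' | b'R' => b'W',
            other => other,
        };
    }
}

/// Split `data` into roughly equal chunks and transform each on its own thread.
fn parallel_replace_lr(data: &mut [u8], threads: usize) {
    let threads = threads.max(1);
    // ceiling division; at least 1 so chunks_mut never gets a zero size
    let chunk_len = ((data.len() + threads - 1) / threads).max(1);
    thread::scope(|scope| {
        for chunk in data.chunks_mut(chunk_len) {
            scope.spawn(move || replace_lr_bytes(chunk));
        }
        // the scope joins every spawned thread before returning
    });
}
```

because each thread only sees its own chunk, a pass that needs context across chunk edges (word replacement, for example) would need extra care.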

utf-8 is handled elegantly by simply ignoring non-ascii characters in the input
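
the reason ignoring non-ascii works: every byte of a multi-byte utf-8 sequence has its high bit set (0x80 or above), so a byte-wise pass that only matches ascii letters can never touch part of one. a minimal sketch, with a hypothetical function name:

```rust
/// Byte-wise l/r -> w that only ever matches ASCII letters.
fn replace_lr_ascii_only(input: &str) -> String {
    let bytes: Vec<u8> = input
        .bytes()
        .map(|b| match b {
            b'l' | b'r' => b'w',
            b'L' | b'R' => b'W',
            // bytes >= 0x80 (parts of multi-byte characters) pass through here
            other => other,
        })
        .collect();
    // swapping one ASCII byte for another cannot break UTF-8 validity
    String::from_utf8(bytes).expect("output is still valid UTF-8")
}
```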

unfortunately, due to both simd parallelism and multithreading, some words may not be fully uwu'd if they were lucky enough to cross the boundary of a simd vector or a thread's buffer. they won't escape so easily next time
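
a tiny scalar sketch of that boundary effect, assuming ascii input and a made-up chunked_replace helper that replaces small with smol inside each chunk independently:

```rust
/// Replace `small` with `smol` in each fixed-size chunk on its own,
/// the way independent vectors or thread buffers would see the text.
/// Assumes ASCII input so that chunk edges never split a character.
fn chunked_replace(input: &str, chunk_len: usize) -> String {
    input
        .as_bytes()
        .chunks(chunk_len.max(1))
        .map(|chunk| String::from_utf8_lossy(chunk).replace("small", "smol"))
        .collect()
}
```

with one 16-byte chunk, "a very small cat" becomes "a very smol cat". with 8-byte chunks the text is seen as "a very s" and "mall cat", neither piece contains the whole word, and the output is unchanged.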

ok i want uwu'd text, how do i run this myself?

install command-line tool

  1. install rust: run curl https://sh.rustup.rs -sSf | sh on unix, or go here for more options
  2. run cargo install uwuify
  3. run uwuify which will read from stdin and output to stdout. make sure u press ctrl + d (unix) or ctrl + z and enter (windows) after u type stuff in stdin to send an EOF

if you are having trouble running uwuify, make sure you have ~/.cargo/bin in your $PATH

it is possible to read from and write to files by specifying the input file and output file, in that order. u can use --help for more info. pass in -v for timings

this is published on crates.io as uwuify

include as library

  1. put uwuify = "^0.2" under [dependencies] in your Cargo.toml file
  2. the library is called uwuifier (slightly different from the name of the binary!). use it like so:

        use uwuifier::uwuify_str_sse;
        assert_eq!(uwuify_str_sse("hello world"), "hewwo wowwd");

documentation is here

build from this repo

  1. install rust
  2. run git clone https://github.com/Daniel-Liu-c0deb0t/uwu.git && cd uwu
  3. run cargo run --release

testing

  1. run cargo test

benchmarking

  1. run mkdir test && cd test

warning: large files of 100mb and 1gb, respectively

  2. run curl -OL http://mattmahoney.net/dc/enwik8.zip && unzip enwik8.zip
  3. run curl -OL http://mattmahoney.net/dc/enwik9.zip && unzip enwik9.zip
  4. run cd .. && ./bench.sh

i don't believe that this is fast. i need proof!!1!

tldr: can be almost as fast as simply copying a file

raw numbers from running ./bench.sh on a 2019 macbook pro with an eight-core 2.3 ghz intel i9 cpu and 16 gb of ram are shown below. the dataset used is the first 100mb (enwik8) and first 1gb (enwik9) of english wikipedia. the same dataset is used for the hutter prize for text compression

1 thread uwu enwik8
time taken: 178 ms
input size: 100000000 bytes
output size: 115095591 bytes
throughput: 0.55992 gb/s

2 thread uwu enwik8
time taken: 105 ms
input size: 100000000 bytes
output size: 115095591 bytes
throughput: 0.94701 gb/s

4 thread uwu enwik8
time taken: 60 ms
input size: 100000000 bytes
output size: 115095591 bytes
throughput: 1.64883 gb/s

8 thread uwu enwik8
time taken: 47 ms
input size: 100000000 bytes
output size: 115095591 bytes
throughput: 2.12590 gb/s

copy enwik8

real	0m0.035s
user	0m0.001s
sys	0m0.031s

1 thread uwu enwik9
time taken: 2087 ms
input size: 1000000000 bytes
output size: 1149772651 bytes
throughput: 0.47905 gb/s

2 thread uwu enwik9
time taken: 992 ms
input size: 1000000000 bytes
output size: 1149772651 bytes
throughput: 1.00788 gb/s

4 thread uwu enwik9
time taken: 695 ms
input size: 1000000000 bytes
output size: 1149772651 bytes
throughput: 1.43854 gb/s

8 thread uwu enwik9
time taken: 436 ms
input size: 1000000000 bytes
output size: 1149772651 bytes
throughput: 2.29214 gb/s

copy enwik9

real	0m0.387s
user	0m0.001s
sys	0m0.341s
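
the throughput lines can be roughly reproduced from the printed sizes and times. the sketch below assumes gb means 10^9 bytes and that throughput is measured against the input size; both helper names are hypothetical.

```rust
/// Throughput in gb/s, assuming 1 gb = 10^9 bytes.
fn throughput_gb_per_s(input_bytes: u64, millis: u64) -> f64 {
    (input_bytes as f64 / 1e9) / (millis as f64 / 1e3)
}

/// How many times faster the parallel run is than the baseline run.
fn speedup(baseline_ms: u64, parallel_ms: u64) -> f64 {
    baseline_ms as f64 / parallel_ms as f64
}
```

for the 1 thread enwik8 run, throughput_gb_per_s(100_000_000, 178) comes out at about 0.56, close to the printed 0.55992 (the small gap is presumably because the printed time is rounded to whole milliseconds). speedup(178, 47) is about 3.8 for eight threads on enwik8.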

//TODO: compare with other tools

why isn't this readme uwu'd?

so it's readable

if u happen to find uwu'd text more readable, there's always an uwu'd version

ok but why aren't there any settings i can change?!1?!!1

free will is an illusion

wtf this is so unprofessional how are u gonna get hired at faang now?!

don't worry, i've got u covered

Title: uwu is all you need

Abstract

Recent advances in computing have made strides in parallelization, whether at a fine-grained level with SIMD instructions, or at a high level with multiple CPU cores. Taking advantage of these advances, we explore how the useful task of performing an uwu transformation on plain text can be scaled up to large input datasets. Our contributions in this paper are threefold: first, we present, to our knowledge, the first rigorous definition of uwu'd text. Second, we show our novel algorithms for uwu-ing text, exploiting vectorization and multithreading features that are available on modern CPUs. Finally, we provide rigorous experimental results that show how our implementation could be the "fastest in the west." In our benchmarks, we observe that our implementation was almost as fast as a simple file copy, which is entirely IO-bound. We believe our work has potential applications in various domains, from data augmentation and text preprocessing for natural language processing, to giving authors the ability to convey potentially wholesome or cute meme messages with minimal time and effort.

// TODO: write paper

// TODO: write more about machine learning so i get funding

ok i need to use this for something and i need the license info

mit license

ok but i have an issue with this or a suggestion or a question not answered here

open an issue, be nice

projects using this

  • uwu-tray: a tray icon to uwuify your text
  • uwubot: discord bot for uwuifying text
  • uwupedia: the uwuified encycwopedia
  • discord uwu webhook: automatically uwuifies all sent messages on discord via webhooks
  • twent weznowor: best twitter bot ever
  • let me know if you make a project with uwuify! i appreciate u all!

references

Comments
  • How to use lib with strings?

    Hello!

    I want to use the lib in a bot, to try out, but it seems to be able to only use files/stdin. Is there an easy way to use strings with this lib?

    Thanks in advance!

    opened by ShadowMitia 9
  • nyani??!?!??!?!

    e-ewwor instawwing o-on em1 makbuk p-pwo ('_')

    ```
    error[E0432]: unresolved import uwuifier::uwuify_sse
     --> /Users/myrealnameudontsee/.cargo/registry/src/github.com-1ecc6299db9ec823/uwuify-0.2.2/src/main.rs:1:5
      |
    1 | use uwuifier::uwuify_sse;
      |     ^^^^^^^^^^^^^^^^^^^^ no uwuify_sse in the root

    error: aborting due to previous error

    For more information about this error, try rustc --explain E0432.
    error: failed to compile uwuify v0.2.2, intermediate artifacts can be found at /var/folders/q3/v84w5_7j11g__d2g4b0z5sqh0000gp/T/cargo-install2HXfsc
    ```

    also i uwuified the first sentence by hand lmao
    opened by ghost 5
  • Formatting does not work properly when piping in neofetch

    Title says it all. I piped neofetch in and it broke the formatting. I'm assuming uwuify is breaking the ASCII character return escape codes that neofetch uses.

    Expected

    [screenshot: Screen Shot 2022-07-18 at 00 38 29]

    Actual

    [screenshot: Screen Shot 2022-07-18 at 00 41 33]
    opened by Akari202 2
  • i made a joke pwoject

    my fwiend intwoduced youw fastest uwuifiew in the west to me and i jokingwy said that i shouwd c-cweate a ~~mawwawe~~ p-pwogwam that uwuifies as you type! enjoyed doing it and i'm wwiting this issue nyow with it wunning hehe.

    wink is here

    opened by joshualeejunyi 2
  • Release?

    When you're ready, would you mind creating a release so I can package this?

    when you'we weady, σωσ w-wouwd you mind c-cweating a wewease so i can package t-this?

    opened by zethra 2
  • mowe emoji (ꈍᴗꈍ)

    i added some mowe e-emoji, 😳 these ones a-awe cat themed and awso incwude s-some unicode stuff i hope you w-wike them,,, OwO

    i kept the wut tabwe t-to a powew of 2 because i know computews wike t-that (ꈍᴗꈍ)

    opened by katef 1
  • add verbose argument

    by default, the program shows extra information at the end of the output:

    time taken: 0 ms
    input size: 16 bytes
    output size: 16 bytes
    throughput: 0.00010 gb/s
    

    this pr changes it to only print that information when -v, --verbose is passed by the user.

    opened by cerulis64 1
  • works on winux

    so i did this:

     ~/D/linux-6.0  touch ../uwu.c                                                                         Mon Oct 10 23:34:56 2022
     ~/D/linux-6.0  find . -name "*.c" -exec sh -c 'cat {} >> /home/hunter/Downloads/uwu.c' \;             Mon Oct 10 23:34:59 2022
     ~/D/linux-6.0  uwuify -t 32 ../uwu.c ../uwuwu.c                                               42.3s  Mon Oct 10 23:35:45 2022
     ~/D/linux-6.0  du -sh ../uwu.c                                                                226ms  Mon Oct 10 23:36:18 2022
    579M	../uwu.c
     ~/D/linux-6.0  du -sh ../uwuwu.c                                                                      Mon Oct 10 23:36:31 2022
    662M	../uwuwu.c
     ~/D/linux-6.0  head -n 30 ../uwuwu.c                                                                  Mon Oct 10 23:36:45 2022
    /*
     * the fowwowing pwogwam is used t-to genewate the c-constants fow
     * c-computing sched a-avewages. 🥺
     *
     * ==============================================================
     *		c-c pwogwam (compiwe w-with -wm)
     * ==============================================================
     */
    
    #incwude <math.h>
    #incwude <stdio.h>
    
    #define h-hawfwife 32
    #define s-shift 32
    
    doubwe y;
    
    void cawc_wunnabwe_avg_yn_inv(void)
    {
    	int i;
    	unsigned int x;
    
    	/* t-to siwence -wunused-but-set-vawiabwe wawnings. ^^ */
    	pwintf("static c-const u32 wunnabwe_avg_yn_inv[] __maybe_unused = {");
    	fow (i = 0; i-i < hawfwife; i++) {
    		x = ((1uw<<32)-1)*pow(y, -.- i);
    
    		i-if (i % 6 == 0) pwintf("\n\t");
    		p-pwintf("0x%8x, ^^ ", x-x);
    	}
    

    unfortunatewy, uwuwu.c does not compiwe :c

    opened by cryptoquick 2
Releases(v0.2.2)
Owner
Daniel Liu
1st yr cs boi at @ucla, incoming @10xgenomics, prev @ucsd @openmined