A Quest to Find a Highly Compressed Emoji :shortcode: Lookup Function

Overview

Highly Compressed Emoji Shortcode Mapping

An experiment to find a highly compressed representation of the entire Unicode shortcodes-to-emoji mapping that can be indexed without requiring any dynamic allocation. In other words: what's the smallest amount of static storage (code and data) required to write a function with the following signature:

fn shortcode_to_emoji(input: &str) -> Option<&str>
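
For a sense of what that signature means in practice, here is a naive, uncompressed baseline. It is only a sketch for comparison, not how this project works: the three-entry `TABLE` is a made-up subset, and every shortcode is stored in full, which is exactly the overhead the experiment tries to eliminate.

```rust
/// Tiny illustrative table, sorted by shortcode so it can be binary-searched.
/// (Hypothetical subset; the real data set comes from the emoji database.)
static TABLE: &[(&str, &str)] = &[
    ("crab", "🦀"),
    ("heart", "❤️"),
    ("sushi", "🍣"),
];

/// Naive baseline: binary-search a static, sorted table of full keys.
/// No heap allocation happens, but each key costs its full length in bytes.
fn shortcode_to_emoji(input: &str) -> Option<&str> {
    TABLE
        .binary_search_by(|&(key, _)| key.cmp(input))
        .ok()
        .map(|idx| TABLE[idx].1)
}
```

With this table, `shortcode_to_emoji("sushi")` yields `Some("🍣")` and an unknown shortcode yields `None`.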

Check out the project's writeup for more details on the rationale and idea behind this project.

DISCLAIMER: This is a project that was hacked together over the course of a couple weekends, where iteration speed took priority over code quality/robustness. The implementation is a mess, and the build system is incredibly brittle. If you're even remotely considering building off this repo - please don't. You'd be much better off building off these ideas and writing a new implementation from scratch.

Make sure to try out the online demo!

Building and Running

At the moment, building this library requires a *nix OS with curl installed, since the build process uses curl to download the emoji database. Aside from that, the core maximally-compressed-emoji-shortcodes library is a bog-standard Rust crate which can be built with cargo build.

Given that this is more of a quick-and-dirty experiment than a proper ready-to-use library, there isn't any easy-to-use config file to play around with. If you're seriously interested in playing around with this mess of code, get ready to do some spelunking. Some potentially useful starting points:

  • Changing the size of the hash - ./rust-phf/phf/src/map.rs:18 (the keys: Slice<u16> field) and ./rust-phf/phf_shared/src/lib.rs:43 (the checksum function).
  • Fixup Table search times - build.rs
  • Swapping out the key/value data set - build.rs
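
To make the first bullet more concrete, here is one way to picture what a `u16` per key buys you: keep only a 16-bit checksum of each shortcode instead of the shortcode itself. This is a sketch under assumptions, not the repo's code. `checksum16` is a toy multiply-by-31 hash invented for illustration (the real checksum function lives in phf_shared), and the lookup is a linear scan over a two-entry table rather than a perfect-hash map.

```rust
/// Toy 16-bit checksum (hypothetical; NOT the checksum used in phf_shared).
const fn checksum16(key: &str) -> u16 {
    let bytes = key.as_bytes();
    let mut hash: u16 = 0;
    let mut i = 0;
    while i < bytes.len() {
        // hash = hash * 31 + byte, wrapping at 16 bits.
        hash = hash.wrapping_mul(31).wrapping_add(bytes[i] as u16);
        i += 1;
    }
    hash
}

/// Each entry stores a 2-byte checksum instead of the shortcode text.
/// The checksums are computed at compile time.
static TABLE: &[(u16, &str)] = &[
    (checksum16("crab"), "🦀"),
    (checksum16("sushi"), "🍣"),
];

/// Look up by checksum only. An input whose checksum happens to match a
/// stored one is reported as a hit, so false positives are possible.
fn shortcode_to_emoji(input: &str) -> Option<&str> {
    let wanted = checksum16(input);
    TABLE
        .iter()
        .find(|(sum, _)| *sum == wanted)
        .map(|(_, emoji)| *emoji)
}
```

Shrinking the checksum saves space per entry but raises the odds that an unrelated input collides with a stored checksum, which is the trade-off behind changing the hash size.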

This repo includes several test/example projects that use the maximally-compressed-emoji-shortcodes library:

  • kowalski-analysis: a very messy playground for testing various properties of the library (e.g: false-positive rates, accuracy, compression ratio, etc...).
  • example_no_std: a no_std Rust binary that serves as a rough benchmark for how much space the library will occupy when deployed in an embedded context.
  • shortcode-web: using the magic of wasm-pack, you can play with this project via a neat little online demo! Try it out here!
    • Note: the .wasm size of the shortcode-web demo is not representative of the binary size on a proper embedded platform, since wasm-bindgen introduces almost 15kb of overhead for some reason (i.e: the size measured when the single exported function is replaced with a noop). I could probably slim this down by bypassing wasm-bindgen entirely and figuring out how to accept JavaScript Strings over the FFI, but that's high effort. So yeah, just subtract 15kb from the (uncompressed) .wasm size to get a better idea of the compression factor.

Future Work?

I'm pretty much done with this project for now, but there are still a few ideas that might be worth exploring to compress things down even more:

  • Compressing the emoji UTF-8 strings using some sort of domain specific representation
    • e.g: it seems like most emoji fall under a small subrange of Unicode codepoints, so it might be possible to shave a couple bits of overhead from each emoji mapping by storing each value as an offset from a fixed base and adding the base back on lookup.
  • Using non-standard key hash sizes (i.e: 9-bit, 10-bit, etc...).
    • Follow up: automatically trying out various key hash sizes to minimize space overhead while maintaining a favorable hash-collision rate.
  • Actually cleaning up this abomination of a codebase and releasing a proper library that employs these various techniques
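
The offset idea from the first bullet can be sketched roughly as follows. This is an illustration only: the base `0x1F000` and the 16-bit offset width are assumptions picked for the example, not measured choices, and the scheme only covers single-codepoint emoji (multi-codepoint sequences such as ❤️, which is U+2764 followed by U+FE0F, would need separate handling).

```rust
/// Assumed base codepoint for the example; many emoji sit at or above U+1F000.
const BASE: u32 = 0x1F000;

/// Encode a single-codepoint emoji as a 16-bit offset from BASE.
/// Returns None if the character is below BASE or too far above it.
fn encode(emoji: char) -> Option<u16> {
    let offset = (emoji as u32).checked_sub(BASE)?;
    if offset <= u16::MAX as u32 {
        Some(offset as u16)
    } else {
        None
    }
}

/// Decode an offset back into UTF-8, writing into a caller-supplied buffer
/// so that no heap allocation is needed.
fn decode(offset: u16, buf: &mut [u8; 4]) -> Option<&str> {
    let ch = char::from_u32(BASE + offset as u32)?;
    let s: &str = ch.encode_utf8(buf);
    Some(s)
}
```

For example, 🍣 is U+1F363, so it encodes to the offset `0x363`: two bytes of storage instead of the four bytes its UTF-8 form takes.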

Oh, and of course, I should probably port it to one of my keyboards. After all, that was the whole inspiration for this endeavor!

Acknowledgements

This project wouldn't have been possible without the incredible rust-phf library. The in-tree version of rust-phf is a stripped-down and heavily modified version of the library, optimizing the map's internal representation for this particular use-case.

The initial POC of this project was based off of https://github.com/kornelski/gh-emoji.

The emoji shortcode database is downloaded directly from GitHub's gemoji library.

Special thanks to Matt D'Souza and Ethan Hardy, who I nerd-sniped into helping me with this funky little project.


Comments
  • Missing license

    Hi! This seems like a very cool project, I enjoyed the blog to go with it as well. I actually stumbled over this searching for a library to parse shortcodes with during markdown rendering, but I'm unsure about the license. Neither cargo.toml nor any other files make any mention of a license, save for the phf copy. Is this intentional?

    opened by black-puppydog 3
Owner
Daniel Prilik
Probably hacking away at some silly side-project · Emulation Enthusiast · Rustacean · UWaterloo SE 2020 · Working 9 to 5 on OSS @microsoft (check out @daprilik)