mojibake
Encode and decode arbitrary bytes as a sequence of emoji optimized to produce the smallest number of graphemes.
Description
This is not a space-efficient library.
Services such as Twitter and Mastodon generally restrict the length of a post by its grapheme count, not by its underlying character or byte count. A single emoji grapheme often consists of several characters, each of which is encoded as a multi-byte sequence.
Therefore, if you can encode more data into fewer graphemes, you can transmit more information under the same limit, even though the message takes up far more bytes than it otherwise would.
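To make the difference between bytes, characters, and graphemes concrete, here is a small Python sketch. It is only an illustration and not part of this library. The family emoji used here is built from three emoji joined by two zero-width joiners, and it renders as one grapheme on platforms that support the sequence. Python's standard library has no grapheme counter, so the sketch only measures characters (code points) and UTF-8 bytes.

```python
# One visible emoji: MAN + ZWJ + WOMAN + ZWJ + GIRL.
# Written with escapes so the source file stays pure ASCII.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# len() on a str counts Unicode code points, not graphemes.
code_points = len(family)

# Each of the three emoji takes 4 bytes in UTF-8 and each joiner takes 3.
utf8_bytes = len(family.encode("utf-8"))

print(code_points)  # 5
print(utf8_bytes)   # 18
```

A service that counts graphemes would charge this 18-byte string as a single unit, which is the gap this library exploits.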
There are at least 2048 (2^11) unique emoji graphemes in the Unicode specification. Therefore, an emoji is actually just an 11-bit unsigned integer with extra steps.
This library packs bytes into 11-bit unsigned integers, each of which is then mapped to a sequence of Unicode characters that displays as a single grapheme.
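The packing step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the library's actual implementation: the function names `pack_11bit` and `unpack_11bit` are hypothetical, the bit order (most significant bit first) and the zero-padding of the final group are assumptions, and the emoji lookup table is omitted because the library's real alphabet is not shown here.

```python
def pack_11bit(data: bytes) -> list[int]:
    """Split bytes into 11-bit unsigned integers (hypothetical helper).

    Assumes MSB-first bit order and zero-pads the final group.
    """
    acc = 0    # bit accumulator
    nbits = 0  # number of unconsumed bits currently held in acc
    out = []
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 11:
            nbits -= 11
            out.append((acc >> nbits) & 0x7FF)
        acc &= (1 << nbits) - 1  # drop the bits already emitted
    if nbits:
        # Left-align the leftover bits inside one last 11-bit value.
        out.append((acc << (11 - nbits)) & 0x7FF)
    return out


def unpack_11bit(values: list[int], length: int) -> bytes:
    """Inverse of pack_11bit (hypothetical helper).

    `length` is the original byte count; it is needed because the
    padding bits can otherwise decode into a spurious trailing byte.
    """
    acc = 0
    nbits = 0
    out = bytearray()
    for value in values:
        acc = (acc << 11) | value
        nbits += 11
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1
    return bytes(out[:length])


message = b"Shrek 2 was the greatest film ever made!!"
indices = pack_11bit(message)
print(len(message), len(indices))  # 41 30
```

Each value in `indices` falls in the range 0 to 2047 and would serve as an index into a table of 2048 emoji. A 41-byte message carries 328 bits, which rounds up to 30 groups of 11 bits, so it needs 30 graphemes instead of 41. How the real library records the original length or handles padding is not specified here.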
Example
Original Text:
Value: Shrek 2 was the greatest film ever made!!
Bytes: 41,
Characters: 41,
Graphemes: 41
Mojibake Encoded:
Value: ๐ป๐ณ๐??๐ช๐ถ๐ซณ๐ฟ๐ง๐ป๐ผ๐บ๐พ๐ค๐ป๐ฆบ๐คต๐ฝ๐ฆ๐ผ๐๏ธ๐๐ฟโ๏ธโ๏ธ2๏ธโฃ๐งฅ๐คต๐ป๐ค๐๐ซ๐ช๐๐ฆ๐ช๐ซณ๐ฝ๐ธ๐ฒ๐น๐ด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ๐๐ป
Bytes: 210,
Characters: 55,
Graphemes: 30
Decoded Text:
Value: Shrek 2 was the greatest film ever made!!
Bytes: 41,
Characters: 41,
Graphemes: 41