a universal meta-transliterator that can decipher arbitrary encoding schemas, built in pure Rust

Overview

transliterati



what does it do?

You give it this:

Барлығына сенімді және тиімді бағдарламалық жасақтаманы құруға мүмкіндік беретін тіл. Ол өте жылдам және жадты үнемдейді: жұмыс уақыты немесе қоқыс жинағышсыз ол өнімділігі маңызды қызметтерді қуаттай алады, ендірілген құрылғыларда жұмыс істей алады және басқа тілдермен оңай біріктіре алады. тың тамаша құжаттамалары, пайдалы қате туралы хабарлары бар ыңғайлы компиляторы және жоғары деңгейлі құралдары — біріктірілген пакет менеджері және құрастыру құралы, автоматты аяқтау және типті тексерулері бар смарт мульти-редакторды қолдау, автоматты пішімдеу және т.б. бар.

And this:

Barlığına senimdi jäne tïimdi bağdarlamalıq jasaqtamanı qurwğa mümkindik beretin til. Ol öte jıldam jäne jadtı ünemdeydi: jumıs waqıtı nemese qoqıs jïnağışsız ol önimdiligi mañızdı qızmetterdi qwattay aladı, endirilgen qurılğılarda jumıs istey aladı jäne basqa tildermen oñay biriktire aladı. tıñ tamaşa qujattamaları, paydalı qate twralı xabarları bar ıñğaylı kompïlyatorı jäne joğarı deñgeyli quraldarı — biriktirilgen paket menedjeri jäne qurastırw quralı, avtomattı ayaqtaw jäne tïpti tekserwleri bar smart mwltï-redaktordı qoldaw, avtomattı pişimdew

And it gives you this:

{
  etc...
  "ал": "al",
  "ар": "ar",
  "б": "e",
  "в": "ü",
  "г": "g",
  "д": "d",
  "ді": "di",
  ...etc
}

Except it works for any transliteration schema in any language. Here I just used a single paragraph, but the longer, the better.
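Once you have a table like the one above, applying it to new text is straightforward. The following is only a sketch of one way to do that, not part of transliterati itself: `apply_table` is a hypothetical helper, and it assumes the table maps source graphemes to target graphemes and that longer keys (such as "ді") should win over shorter ones (such as "д").

```rust
use std::collections::HashMap;

/// Hypothetical helper: transliterate `input` with a grapheme table,
/// preferring the longest matching key at each position.
/// Characters with no entry in the table are copied through unchanged.
fn apply_table(table: &HashMap<&str, &str>, input: &str) -> String {
    // The longest key, measured in chars, bounds how far ahead we look.
    let max_len = table.keys().map(|k| k.chars().count()).max().unwrap_or(0);
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::new();
    let mut i = 0;
    while i < chars.len() {
        let mut matched = false;
        // Try the longest candidate first, then shrink it one char at a time.
        let longest = max_len.min(chars.len() - i);
        for len in (1..=longest).rev() {
            let candidate: String = chars[i..i + len].iter().collect();
            if let Some(target) = table.get(candidate.as_str()) {
                out.push_str(target);
                i += len;
                matched = true;
                break;
            }
        }
        if !matched {
            // No table entry starts here: keep the original character.
            out.push(chars[i]);
            i += 1;
        }
    }
    out
}
```

Trying the longest key first matters because the generated table mixes single letters with multi-letter entries, so a purely character-by-character lookup would never use the longer ones.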

how fast is it?

Every newline-separated paragraph is processed in parallel, so the longest paragraph determines the total run time. Generally it takes between 15ms and 600ms.
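To illustrate why the longest paragraph dominates, here is a minimal sketch of per-paragraph parallelism using scoped threads from the standard library. It is not the tool's actual implementation: `process_paragraph` is a hypothetical placeholder that only counts characters, and spawning one thread per line is an assumption made to keep the example short.

```rust
use std::thread;

/// Hypothetical stand-in for the real per-paragraph work.
/// Here it only counts the characters in the paragraph.
fn process_paragraph(paragraph: &str) -> usize {
    paragraph.chars().count()
}

/// Process every newline-separated paragraph on its own thread
/// and return the results in the original paragraph order.
fn process_all(text: &str) -> Vec<usize> {
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = text
            .lines()
            .map(|line| scope.spawn(move || process_paragraph(line)))
            .collect();
        // Joining waits for each worker; the slowest one sets the total time.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Because the scope only returns after every worker has finished, the wall-clock time is roughly that of the slowest paragraph rather than the sum of all of them.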

how accurate is it?

It seems to be a matter of:

  • How much data do you have? The more the better.
  • Is the orthography of the two scripts a 1:1 match? Russian is close to perfect with as few as 14 words, whereas Japanese is only 75% accurate even with 1000 because of its mix of writing systems.
  • Are they completely different writing systems? If you pair a logographic language like Chinese with phonetic pinyin, you will need a godawful amount of data.

That's pretty much it.

how do I use it?

transliterati file1.txt file2.txt 200

Here 200 is the minimum vocab size; only set it if you're really sure you know what you're doing. You will probably have to clone the repository and build it from source, since I learned Rust only a week ago and I'm not confident enough with cargo yet.

Tips:

  • If you have a long text, chunk it evenly into pieces if you know where the boundaries are. The longer the chunks are, the longer it will take. The number of chunks doesn't really matter. Make sure there aren't any blank lines.
  • Play around with the vocab size if you're getting weird results
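
The blank-line tip can be automated before running the tool. This is a rough sketch rather than anything shipped with transliterati: both function names are hypothetical, and the line-count check rests on the assumption that chunk N of the first file corresponds to chunk N of the second.

```rust
/// Hypothetical helper: trim each line and drop the ones that are
/// empty or whitespace-only, so no blank lines remain.
fn strip_blank_lines(text: &str) -> String {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}

/// Assumption: the two inputs are aligned chunk by chunk,
/// so they should contain the same number of lines.
fn chunk_counts_match(a: &str, b: &str) -> bool {
    a.lines().count() == b.lines().count()
}
```

Running both files through `strip_blank_lines` and then checking `chunk_counts_match` is a cheap way to catch a misaligned pair before blaming the vocab size.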
Owner
Catherine Koshka