A lightning-fast Sanskrit toolkit. For Python bindings, see `vidyut-py`.

Overview

Vidyut

मा भूदेवं क्षणमपि च ते विद्युता विप्रयोगः ॥
("And may you never, even for a moment, be parted like this from your lightning.")

Vidyut is a lightning-fast toolkit for processing Sanskrit text. Vidyut aims to provide standard components that are fast, memory-efficient, and competitive with the state of the art.

Vidyut compiles to native code and can be bound to your language of choice. As part of our work on Ambuda, we provide first-class support for Python bindings through vidyut-py.

Vidyut is currently experimental code, and its API is not stable. If you wish to use Vidyut for your production use case, please file an issue first.

Components

Vidyut currently contains two major components.

Lexicon

Lexicon maps Sanskrit words to their semantics with high performance and minimal memory usage. In one recent test, we stored 29.5 million inflected Sanskrit words in 31 megabytes of disk space, for a total cost of around 1 byte per word. We retrieved these words at around 820 ns/word, compared to 530 ns/word for a standard in-memory hash map.
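
To make the trade-off concrete, the figures quoted above work out to roughly 1.05 bytes per word on disk and a lookup that is roughly 1.55 times slower than the hash map. The short sketch below only redoes that arithmetic. The function names are illustrative and not part of Vidyut, and "megabyte" is assumed to mean 10^6 bytes.

```rust
/// Average on-disk cost per stored word, in bytes.
fn bytes_per_word(total_bytes: f64, num_words: f64) -> f64 {
    total_bytes / num_words
}

/// How many times slower lookup `a` is than lookup `b`.
fn slowdown(ns_a: f64, ns_b: f64) -> f64 {
    ns_a / ns_b
}

fn main() {
    // Figures from the test described above.
    // Assumption: 1 megabyte = 1_000_000 bytes.
    let cost = bytes_per_word(31.0e6, 29.5e6);
    let ratio = slowdown(820.0, 530.0);
    println!("{cost:.2} bytes/word, {ratio:.2}x the hash map lookup time");
}
```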

Lexicon's underlying data structure is a finite-state transducer, as implemented in the fst crate. The one downside to an FST is that we must construct it ahead of time and cannot add new keys to it once it has been created. But since the Sanskrit word list is largely stable, this is a minor concern.
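
The build-once constraint can be illustrated with a small stand-in. The sketch below is not an FST and not Vidyut's API: it is a hypothetical `FrozenLexicon` backed by a sorted vector, chosen only because it has the same contract (all keys are supplied up front, and the finished structure exposes lookups but no insertion). The example words and their numeric ids are made up for illustration.

```rust
/// A read-only lexicon built once from a complete word list.
/// NOTE: a sorted-vector stand-in for illustration, not an FST.
struct FrozenLexicon {
    entries: Vec<(String, u64)>, // kept sorted by key
}

impl FrozenLexicon {
    /// Builds the lexicon from all entries at once.
    /// If a key repeats, the first value supplied for it is kept.
    fn build(mut entries: Vec<(String, u64)>) -> Self {
        entries.sort_by(|a, b| a.0.cmp(&b.0)); // stable sort
        entries.dedup_by(|later, earlier| later.0 == earlier.0);
        FrozenLexicon { entries }
    }

    /// Looks up a word by binary search. There is deliberately no
    /// `insert`: adding a word means rebuilding the whole lexicon.
    fn get(&self, key: &str) -> Option<u64> {
        self.entries
            .binary_search_by(|(k, _)| k.as_str().cmp(key))
            .ok()
            .map(|i| self.entries[i].1)
    }

    /// Number of distinct words stored.
    fn len(&self) -> usize {
        self.entries.len()
    }
}

fn main() {
    // Hypothetical word-to-id pairs; the ids stand in for semantics.
    let lexicon = FrozenLexicon::build(vec![
        ("gacchati".to_string(), 1),
        ("gacchanti".to_string(), 2),
        ("gacchasi".to_string(), 3),
    ]);
    println!("{} words", lexicon.len());
    println!("gacchati -> {:?}", lexicon.get("gacchati"));
    println!("gacchami -> {:?}", lexicon.get("gacchami"));
}
```

A real FST additionally shares common prefixes and suffixes between keys, which is what makes the per-word storage cost so low for heavily inflected word lists. The stand-in above does not attempt that.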

Segmenter

Segmenter performs a padaccheda on a Sanskrit phrase and annotates each pada with its basic morphological information.

Segmenter is not yet competitive with other options, but we are optimistic that we can improve it over time. Its main strength today is speed: Segmenter can process a shloka in under 10 milliseconds, and we expect it to become even faster in the future.

Usage

As mentioned above, Vidyut is currently experimental code, and its API is not stable. If you wish to use Vidyut for your production use case, please file an issue first.

In addition, we encourage you to join the #nlp channel on Ambuda's Discord server, where you can chat directly with the development team and get fast answers to your questions.

Occasional discussion related to Vidyut might also appear on ambuda-discuss or on standard mailing lists like sanskrit-programmers.

Development

Build the code and fetch our linguistic data:

make install

Run a simple evaluation script:

make eval

Run unit tests:

make test

Profile overall runtime and memory usage:

make profile-general

Profile runtime per function:

make target=time profile-target-osx

Profile memory allocations:

make target=alloc profile-target-osx
Owner
Ambuda
Ambuda, a breakthrough Sanskrit library. This repository contains our code and data.