A lightning-fast Sanskrit toolkit. For Python bindings, see `vidyut-py`.

Ambuda

Last update: Dec 30, 2022

Overview

Vidyut

मा भूदेवं क्षणमपि च ते विद्युता विप्रयोगः ॥

Vidyut is a lightning-fast toolkit for processing Sanskrit text. Vidyut aims to provide standard components that are fast, memory-efficient, and competitive with the state of the art.

Vidyut compiles to native code and can be bound to your language of choice. As part of our work on Ambuda, we provide first-class support for Python bindings through vidyut-py.

Vidyut is currently experimental code, and its API is not stable. If you wish to use Vidyut for your production use case, please file an issue first.

Components

Vidyut currently contains two major components.

Lexicon

Lexicon maps Sanskrit words to their semantics with high performance and minimal memory usage. In one recent test, we stored 29.5 million inflected Sanskrit words in 31 megabytes of disk space, or around 1 byte per word. Retrieval took around 820 ns/word, as compared to 530 ns/word for a standard in-memory hash map.

Lexicon's underlying data structure is a finite-state transducer, as implemented in the fst crate. The one downside to an FST is that we must construct it ahead of time and cannot add new keys to it once it has been created. But since the Sanskrit word list is largely stable, this is a minor concern.
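The build-once constraint described above can be illustrated with a minimal sketch. This is not an FST and not Vidyut's code: it swaps in a sorted vector with binary search, using only the standard library, and the `FrozenLexicon` type, its methods, and the sample words and values are all hypothetical. What it shares with an FST is the shape of the API: every key is supplied at construction time, and nothing can be inserted afterward.

```rust
/// A build-once, read-only word table.
/// Hypothetical stand-in for an FST: a sorted Vec with binary search.
struct FrozenLexicon {
    // Entries sorted by key so that lookups can use binary search.
    entries: Vec<(String, u64)>,
}

impl FrozenLexicon {
    /// Builds the table from all entries at once. There is deliberately no
    /// `insert` method: the table is fixed once it has been constructed.
    fn build(mut entries: Vec<(String, u64)>) -> Self {
        // Stable sort by key, then keep the first entry for a duplicated key.
        entries.sort_by(|a, b| a.0.cmp(&b.0));
        entries.dedup_by(|a, b| a.0 == b.0);
        FrozenLexicon { entries }
    }

    /// Returns the value stored for `key`, if any.
    fn get(&self, key: &str) -> Option<u64> {
        self.entries
            .binary_search_by(|(k, _)| k.as_str().cmp(key))
            .ok()
            .map(|i| self.entries[i].1)
    }
}

fn main() {
    // Sample words (SLP1 transliteration) and values are made up.
    let lexicon = FrozenLexicon::build(vec![
        ("gacCati".to_string(), 2),
        ("devaH".to_string(), 1),
    ]);
    println!("{:?}", lexicon.get("devaH"));
    println!("{:?}", lexicon.get("nara"));
}
```

Adding a word to such a table means rebuilding it from the full word list, which is the trade-off the paragraph above accepts.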

Segmenter

Segmenter performs a padaccheda on a Sanskrit phrase and annotates each pada with its basic morphological information.

Segmenter is not yet competitive with other options, but we are optimistic that we can improve it over time. What is quite special, however, is its sheer speed: Segmenter can process a shloka in under 10 milliseconds, and we expect it to become even faster in the future.

Usage

As mentioned above, Vidyut is currently experimental code, and its API is not stable. If you wish to use Vidyut for your production use case, please file an issue first.

In addition, we encourage you to join the #nlp channel on Ambuda's Discord server, where you can chat directly with the development team and get fast answers to your questions.

Occasional discussion related to Vidyut might also appear on ambuda-discuss or on standard mailing lists like sanskrit-programmers.

Development

Build the code and fetch our linguistic data:

make install

Run a simple evaluation script:

make eval

Run unit tests:

make test

Profile overall runtime and memory usage:

make profile-general

Profile runtime per function:

make target=time profile-target-osx

Profile memory allocations:

make target=alloc profile-target-osx

Comments

[prakriya] Handling rule conflicts

We currently have reasonable support for karmani prayoga, and I'll also add support for sanAdi pratyayas by the end of the year. We have experimental support for various krdantas and basic support for subantas.

Currently, how are rule conflicts handled in prakriyA simulation? The regular interpretation of विप्रतिषेधे परं कार्यम्, augmented by a web of paribhAShA-s?

Would it be simple to implement an option to resolve such rule conflicts by means of the simpler framework described in Rishi Rajpopat's thesis, which recently entered the news and fascinated / surprised many? This would be enormously valuable in validating the claims made therein, and would likely advance our understanding of what pANini intended and of any drawbacks therein.

opened by vvasuki 2
[prakriya] Add a better test suite for krdantas

Online data for tinantas and subantas is quite reasonable. But as far as I'm aware, there are no high-quality lists of krdantas. I have started a basic test suite in basic_krdantas, but we should add more cases here.

This task requires some knowledge of व्याकरण or else a willingness to go through various grammar books, etc. to determine which forms we should expect.
vyakarana

opened by akprasad 0
[prakriya] Optimize the `tripadi` module
Profiling indicates that the tripadi module is slow.

Many of the rules in the tripadi need to iterate over every character in the string so that they can apply various sandhi changes. Currently, we create a new CompactString for each of these rules. My rough guess is that we create a dozen such strings for each word we derive, even if none of the rules have scope to apply. CompactString shouldn't heap allocate in most cases, but the copy work required here is still slow.

Once we confirm that this is a problem with profiling, we should avoid the extra copies here. Two approaches that come to mind:

1. Instead of creating a new string, iterate over the Term strings and manage indices carefully.

2. Store one copy of the string and rebuild it only if a rule applies. The code would follow the basic pattern of ItPrakriya, e.g., by extending the Prakriya struct with new data and helper methods.

I think (2) is generally cleaner, and it has the side effect of improving our APIs.
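A minimal sketch of the second approach might look like the following. It assumes nothing about the real Prakriya or ItPrakriya types: `Derivation`, `set_term`, and `final_s_to_visarga` are hypothetical names, `String` stands in for CompactString so the example needs only the standard library, and the single rule is a toy, not a faithful sutra. The point is only that rules scan a cached string by reference, and the cache is rebuilt when a rule actually changes a term.

```rust
/// Hypothetical, simplified stand-in for a derivation state.
struct Derivation {
    terms: Vec<String>,
    // Cached concatenation of `terms`; rules scan this instead of copying.
    text: String,
    // Number of times `text` was rebuilt (kept for illustration only).
    rebuilds: usize,
}

impl Derivation {
    fn new(terms: &[&str]) -> Self {
        let terms: Vec<String> = terms.iter().map(|t| t.to_string()).collect();
        let text = terms.concat();
        Derivation { terms, text, rebuilds: 0 }
    }

    /// Read-only view of the full string. No allocation happens here.
    fn text(&self) -> &str {
        &self.text
    }

    /// Replaces term `i` and rebuilds the cached string.
    fn set_term(&mut self, i: usize, new: &str) {
        self.terms[i] = new.to_string();
        self.text = self.terms.concat();
        self.rebuilds += 1;
    }
}

/// Toy rule: a final "s" becomes "H" (visarga in SLP1).
/// Returns true if the rule applied.
fn final_s_to_visarga(d: &mut Derivation) -> bool {
    if !d.text().ends_with('s') {
        return false; // Out of scope: no copy and no rebuild.
    }
    // The last non-empty term is the one that holds the final "s".
    if let Some(i) = d.terms.iter().rposition(|t| !t.is_empty()) {
        let mut term = d.terms[i].clone();
        term.pop();
        term.push('H');
        d.set_term(i, &term);
        return true;
    }
    false
}

fn main() {
    let mut applies = Derivation::new(&["rAma", "s"]);
    let mut skipped = Derivation::new(&["Bava", "ti"]);
    final_s_to_visarga(&mut applies);
    final_s_to_visarga(&mut skipped);
    println!("{} (rebuilds: {})", applies.text(), applies.rebuilds);
    println!("{} (rebuilds: {})", skipped.text(), skipped.rebuilds);
}
```

In this sketch a rule that is out of scope costs one string scan and nothing else, which is the case the issue expects to be most common.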
performance
opened by akprasad 0
Update public APIs according to best practices

https://rust-lang.github.io/api-guidelines/about.html

When we publish on cargo, we should aim for a minimal API that is maximally permissive. This is a great first issue if you know some Rust already.
good first issue

opened by akprasad 0
[kosha] Explore different bitfield orderings

In packing, I chose a bitfield ordering more or less on a hunch, but I don't think our current ordering works very well because our modular_bitfield crate uses an endianness that's different from what I expected.

I think a better ordering or approach here could potentially decrease the size of the FST. My guess is that we might save up to 10% on size, which means more of the FST can be kept in the processor cache.

A good PR here should quantify the size decrease when using a different bitfield ordering.
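To make "bitfield ordering" concrete, here is a small sketch that packs the same fields into an integer under two different layouts. It uses plain shifts and masks rather than the modular_bitfield crate, and the field names, bit widths, and layouts are assumptions chosen for illustration; they are not the fields Vidyut actually packs. Whether one layout yields a smaller FST than another would still have to be measured on real data.

```rust
/// Hypothetical morphological fields. Names and bit widths are assumptions.
#[derive(Clone, Copy)]
struct Fields {
    linga: u32,    // gender, 2 bits
    vibhakti: u32, // case, 3 bits
    vacana: u32,   // number, 2 bits
}

/// Layout A: linga in the lowest bits, then vibhakti, then vacana.
fn pack_a(f: Fields) -> u32 {
    f.linga | (f.vibhakti << 2) | (f.vacana << 5)
}

/// Layout B: vacana in the lowest bits, then vibhakti, then linga.
fn pack_b(f: Fields) -> u32 {
    f.vacana | (f.vibhakti << 2) | (f.linga << 5)
}

/// Inverse of `pack_a`: masks each field back out of the packed integer.
fn unpack_a(bits: u32) -> Fields {
    Fields {
        linga: bits & 0b11,
        vibhakti: (bits >> 2) & 0b111,
        vacana: (bits >> 5) & 0b11,
    }
}

fn main() {
    let f = Fields { linga: 1, vibhakti: 6, vacana: 2 };
    // The same fields produce different integers under each layout.
    println!("layout A: {:#09b}", pack_a(f));
    println!("layout B: {:#09b}", pack_b(f));
    let g = unpack_a(pack_a(f));
    println!("round trip: {} {} {}", g.linga, g.vibhakti, g.vacana);
}
```

A comparison for this issue could build the FST once per layout from the same word list and report the resulting file sizes side by side.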
performance

opened by akprasad 0

Owner

Ambuda

Ambuda, a breakthrough Sanskrit library. This repository contains our code and data.

🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.

python-vaporetto Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto. Installation

17 Dec 22, 2022

lingua-rs Python binding. An accurate natural language detection library, suitable for long and short text alike.

lingua-py lingua-rs Python binding. An accurate natural language detection library, suitable for long and short text alike. Installation pip install l

7 Dec 30, 2022

Viterbi-based accelerated tokenizer (Python wrapper)

python-vibrato Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm. This is a Python wra

20 Dec 29, 2022

Quickner is a new tool to quickly annotate texts for NER (Named Entity Recognition). It is written in Rust and accessible through a Python API.

Quickner ⚡ A simple, fast, and easy to use NER annotator for Python Quickner is a new tool to quickly annotate texts for NER (Named Entity Recognition

7 Mar 3, 2023

A crate using DeepSpeech bindings to convert mic audio from speech to text

DS-TRANSCRIBER Need an Offline Speech To Text converter? Records your mic, and returns a String containing what was said. Features Begins transcriptio

32 Oct 8, 2022

Google CP-SAT solver Rust bindings

Google CP-SAT solver Rust bindings Rust bindings to the Google CP-SAT constraint programming solver. To use this library, you need a C++ compiler and

11 Nov 16, 2022

Find files (ff) by name, fast!

Find Files (ff) Find Files (ff) utility recursively searches the files whose names match the specified RegExp pattern in the provided directory (defau

310 Dec 29, 2022

Fast suffix arrays for Rust (with Unicode support).

suffix Fast linear time & space suffix arrays for Rust. Supports Unicode! Dual-licensed under MIT or the UNLICENSE. Documentation https://docs.rs/suff

207 Dec 26, 2022

Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance calculations and string search.

triple_accel Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance cal