A lightning-fast Sanskrit toolkit. For Python bindings, see `vidyut-py`.

Overview

Vidyut

मा भूदेवं क्षणमपि च ते विद्युता विप्रयोगः ॥

("May you never be parted from lightning (vidyut), even for a moment." — Meghadūta)

Vidyut is a lightning-fast toolkit for processing Sanskrit text. Vidyut aims to provide standard components that are fast, memory-efficient, and competitive with the state of the art.

Vidyut compiles to native code and can be bound to your language of choice. As part of our work on Ambuda, we provide first-class support for Python bindings through vidyut-py.

Vidyut is currently experimental code, and its API is not stable. If you wish to use Vidyut for your production use case, please file an issue first.

Components

Vidyut currently contains two major components.

Lexicon

Lexicon maps Sanskrit words to their semantics with high performance and minimal memory usage. In one recent test, we stored 29.5 million inflected Sanskrit words in 31 megabytes of disk space, a total cost of around 1 byte per word, and retrieved these words at around 820 ns/word, compared to 530 ns/word for a standard in-memory hash map.

Lexicon's underlying data structure is a finite-state transducer, as implemented in the fst crate. The one downside to an FST is that we must construct it ahead of time and cannot add new keys to it once it has been created. But since the Sanskrit word list is largely stable, this is a minor concern.

Segmenter

Segmenter performs a padaccheda on a Sanskrit phrase and annotates each pada with its basic morphological information.

Segmenter's accuracy is not yet competitive with other options, but we are optimistic that we can improve it over time. What is already special, however, is its sheer speed: Segmenter can process a shloka in under 10 milliseconds, and we expect it to become even faster in the future.
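To make the task concrete, here is a toy greedy dictionary segmenter in plain Rust. None of these names come from Vidyut's API, and it ignores sandhi entirely, which a real padaccheda must undo:

```rust
use std::collections::HashSet;

/// Toy greedy padaccheda: split `text` into words found in `lexicon`,
/// preferring the longest matching prefix at each step. A real
/// segmenter must also undo sandhi and score competing analyses.
fn segment(text: &str, lexicon: &HashSet<&str>) -> Option<Vec<String>> {
    if text.is_empty() {
        return Some(Vec::new());
    }
    for end in (1..=text.len()).rev() {
        if !text.is_char_boundary(end) {
            continue;
        }
        let (head, tail) = text.split_at(end);
        if lexicon.contains(head) {
            // Recurse on the remainder; backtrack if it fails.
            if let Some(mut rest) = segment(tail, lexicon) {
                rest.insert(0, head.to_string());
                return Some(rest);
            }
        }
    }
    None
}

fn main() {
    let lexicon: HashSet<&str> = ["dharma", "kshetre", "kuru"].into_iter().collect();
    assert_eq!(
        segment("dharmakshetrekuru", &lexicon),
        Some(vec!["dharma".into(), "kshetre".into(), "kuru".into()])
    );
}
```

The backtracking makes this exponential in the worst case; a practical segmenter scores candidate splits with a statistical model instead of taking the first greedy match.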

Usage

As mentioned above, Vidyut is currently experimental code, and its API is not stable. If you wish to use Vidyut for your production use case, please file an issue first.

In addition, we encourage you to join the #nlp channel on Ambuda's Discord server, where you can chat directly with the development team and get fast answers to your questions.

Occasional discussion related to Vidyut might also appear on ambuda-discuss or on standard mailing lists like sanskrit-programmers.

Development

Build the code and fetch our linguistic data:

make install

Run a simple evaluation script:

make eval

Run unit tests:

make test

Profile overall runtime and memory usage:

make profile-general

Profile runtime per function:

make target=time profile-target-osx

Profile memory allocations:

make target=alloc profile-target-osx
Comments
  • [prakriya] Handling rule conflicts

    We currently have reasonable support for karmani prayoga, and I'll also add support for sanAdi pratyayas by the end of the year. We have experimental support for various krdantas and basic support for subantas.

    Currently, how are rule conflicts handled in prakriyA simulation? The regular interpretation of विप्रतिषेधे परं कार्यम्, augmented by a web of paribhAShA-s?

    Would it be simple to implement an option to resolve such rule conflicts by means of the simpler framework described in Rishi Rajpopat's thesis, which recently entered the news and fascinated and surprised many? This would be enormously valuable for validating the claims made therein, and would likely lead to advances in our understanding of what pANini intended, along with its drawbacks.

    opened by vvasuki 2
  • [prakriya] Add a better test suite for krdantas

    Online data for tinantas and subantas is quite reasonable. But as far as I'm aware, there are no high-quality lists of krdantas. I have started a basic test suite in basic_krdantas, but we should add more cases here.

    This task requires some knowledge of व्याकरण, or else a willingness to go through various grammar books to determine which forms we should expect.

    vyakarana 
    opened by akprasad 0
  • [prakriya] Optimize the `tripadi` module

    Profiling indicates that the tripadi module is slow.

    Many of the rules in the tripadi need to iterate over every character in the string so that they can apply various sandhi changes. Currently, we create a new CompactString for each of these rules. My rough guess is that we create a dozen such strings for each word we derive, even if none of the rules has scope to apply. CompactString should avoid heap allocation in most cases, but the copying required here is still slow.

    Once profiling confirms that these copies are the problem, we should avoid them. Two approaches come to mind:

    1. Instead of creating a new string, iterate over the Term strings and manage indices carefully.

    2. Store one copy of the string and rebuild it only if a rule applies. The code would follow the basic pattern of ItPrakriya, e.g., by extending the Prakriya struct with new data and helper methods.

    I think (2) is generally cleaner, and it has the side effect of improving our APIs.
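    A minimal std-only sketch of approach (2), with invented names (the actual Prakriya and ItPrakriya APIs differ):

```rust
/// Hypothetical sketch: cache the derivation's full text and rebuild it
/// only when a term actually changes, instead of reallocating per rule.
struct Prakriya {
    terms: Vec<String>,
    text: String,
    dirty: bool,
}

impl Prakriya {
    fn new(terms: Vec<String>) -> Self {
        Prakriya { text: terms.concat(), terms, dirty: false }
    }

    /// Read access: rebuild the cached string only if a rule ran.
    fn text(&mut self) -> &str {
        if self.dirty {
            self.text = self.terms.concat();
            self.dirty = false;
        }
        &self.text
    }

    /// Rules mutate terms through this helper, which marks the cache stale.
    fn set_term(&mut self, i: usize, s: &str) {
        self.terms[i] = s.to_string();
        self.dirty = true;
    }
}

fn main() {
    let mut p = Prakriya::new(vec!["gam".into(), "ti".into()]);
    p.set_term(0, "gaccha");
    assert_eq!(p.text(), "gacchati");
}
```

    With this shape, a rule that has no scope to apply never triggers a rebuild, so the dozen-copies-per-word cost disappears in the common case.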

    performance 
    opened by akprasad 0
  • Update public APIs according to best practices

    https://rust-lang.github.io/api-guidelines/about.html

    When we publish on cargo, we should aim for a minimal API that is maximally permissive. This is a great first issue if you know some Rust already.

    good first issue 
    opened by akprasad 0
  • [kosha] Explore different bitfield orderings

    In `packing`, I chose a bitfield ordering more or less on a hunch, but I don't think our current ordering works very well, because the modular_bitfield crate uses an endianness that differs from what I expected.

    I think a better ordering or approach here could potentially decrease the size of the FST. My guess is that we might save up to 10% on size, which means more of the FST can be kept in the processor cache.

    A good PR here should quantify the size decrease when using a different bitfield ordering.
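    As a toy illustration of why ordering matters (the field names and widths below are invented, not the kosha's actual layout), packing the same fields in two different orders gives very different value distributions:

```rust
// Hypothetical layout: a 20-bit word id plus 2-bit purusha and 2-bit
// vacana fields, packed into a u32 in two different orders.
fn pack_id_high(id: u32, purusha: u32, vacana: u32) -> u32 {
    (id << 4) | (purusha << 2) | vacana
}

fn pack_id_low(id: u32, purusha: u32, vacana: u32) -> u32 {
    (purusha << 22) | (vacana << 20) | id
}

fn main() {
    // Two forms of the same lemma that differ only in purusha/vacana.
    let (a, b) = (pack_id_high(42, 0, 0), pack_id_high(42, 2, 1));
    let (c, d) = (pack_id_low(42, 0, 0), pack_id_low(42, 2, 1));
    // With the id in the high bits, related forms yield numerically
    // close values; with it in the low bits, they are far apart.
    assert_eq!((a, b), (0x2a0, 0x2a9));
    assert_eq!((c, d), (0x2a, 0x90_002a));
}
```

    Which distribution the FST's output representation compresses best is an empirical question, which is exactly why a PR here should measure the resulting file size.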

    performance 
    opened by akprasad 0