🌾 High-performance text processing library for the Thai language, built with Rust and exposed as a Python package.

Overview

Thongna 🌾

Thongna (ท้องนา) is a high-performance text processing library for the Thai language, built with Rust and exposed as a Python package. Inspired by PyThaiNLP, Thongna aims to provide efficient Thai language processing tools with the speed and reliability of Rust.

Features

  • Efficient Thai word segmentation: Break Thai text into meaningful tokens using the NewMM algorithm.
  • Fast and reliable: Built with Rust, Thongna offers the performance you need for large-scale text processing.
  • Python integration: Easily use Thongna in your Python projects with its simple and intuitive API.
  • Custom dictionary support: Load and use custom dictionaries for specialized segmentation tasks.
  • Text normalization: Standardize Thai text by handling common inconsistencies and variations.
  • Parallel processing: Utilize multi-core processors for faster processing of large texts.
  • Safe mode: Prevent infinite loops in tokenization for extra reliability.
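
To give a feel for what dictionary-based segmentation with a custom dictionary involves, here is a minimal pure-Python sketch. It is not Thongna's API and not the NewMM algorithm itself: it uses a simpler greedy longest-match strategy, and the function name `longest_match_tokenize` and the sample dictionary are assumptions made up for illustration. Falling back to a single character guarantees the loop always advances, which is the kind of guarantee a safe mode is meant to provide.

```python
from typing import Iterable, List


def longest_match_tokenize(text: str, dictionary: Iterable[str]) -> List[str]:
    """Greedy longest-match segmentation.

    Hypothetical helper for illustration only; not part of Thongna.
    """
    words = set(dictionary)
    max_len = max((len(w) for w in words), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in words:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character so the loop always advances.
            tokens.append(text[i])
            i += 1
    return tokens


# Illustrative custom dictionary (assumed entries, not a shipped word list).
custom_dictionary = ["ท้องนา", "ท้อง", "นา", "ภาษา", "ไทย", "ภาษาไทย"]
tokens = longest_match_tokenize("ภาษาไทยท้องนา", custom_dictionary)
print(tokens)
```

Because the longest candidate is tried first, "ภาษาไทย" is preferred over the shorter "ภาษา" + "ไทย" when all three are in the dictionary. A production tokenizer would typically also respect Thai character cluster boundaries and choose among competing segmentations, which this sketch does not attempt.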

Project Details

  • Version: 0.2.2 (as of the latest release)
  • License: Apache-2.0
  • Supported Python versions: 3.8+
  • Rust edition: 2021
  • Key dependencies:
    • PyO3 for Rust-Python interoperability
    • Rayon for parallel processing
    • Regex for text manipulation
  • CI/CD: GitHub Actions runs automated tests and builds on Linux, macOS, and Windows
  • Package distribution: Available on PyPI, with pre-built wheels for various platforms and architectures

Installation

To install Thongna, ensure you have Python 3.8+ installed, then install the package from PyPI with pip.

Why Thongna? 🌾

The name "Thongna" (ท้องนา) means "rice field" in Thai, symbolizing growth, nourishment, and the foundations of life. Just as a rice field sustains life, Thongna provides the essential tools for working with Thai text, so that your applications can grow and thrive.

Contributing

We welcome contributions from the community! If you’d like to contribute to Thongna, please follow these steps:

  • Fork the repository.
  • Create a new branch for your feature or bugfix.
  • Submit a pull request with a clear explanation of your changes.

License

Thongna is licensed under the Apache License 2.0. See the LICENSE file for more details.

Contact

For any questions, suggestions, or issues, feel free to open an issue or contact the maintainers directly.

Releases: v0.2.4
Owner: fr4nk