A modern high-performance open source file analysis library for automating localization tasks

Overview


🧛 Filecount

Filecount is a modern, high-performance, open source file analysis library for automating localization tasks. It lets you add file analysis functionality to your projects while keeping that functionality customizable and extensible. Its hashment algorithm (hashed segmentation) is designed to keep analysis fast.

Counting words is a notoriously difficult problem because it is hard to define rules that give an "accurate" word count for every language. As a result, many text editors and CAT tools give different word counts for the same text! Filecount's philosophy is to be fast and accurate enough: for the purpose of a fast file analysis, it is often fine to be close to an accurate count.
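To illustrate why word counts differ between tools, here is a minimal sketch of two deliberately naive counting rules applied to the same sentence. Both functions are hypothetical and only use the Rust standard library; neither is the rule set Filecount itself applies.

```rust
// Two deliberately naive word-counting rules (illustrative only,
// not the rules Filecount uses).

/// Rule A: every whitespace-separated token counts as one word.
fn count_by_whitespace(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Rule B: every maximal run of alphanumeric characters counts as one word,
/// so hyphens and apostrophes split a token into several words.
fn count_by_alphanumeric_runs(text: &str) -> usize {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|token| !token.is_empty())
        .count()
}

fn main() {
    let text = "It's a state-of-the-art tool.";
    println!("whitespace rule:   {}", count_by_whitespace(text)); // 4
    println!("alphanumeric rule: {}", count_by_alphanumeric_runs(text)); // 8
}
```

Both rules are defensible, yet one reports twice as many words as the other for this sentence.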

If you want to see Filecount in action then visit the website: Filecount.io

Documentation

View the documentation on docs.rs

Example

use std::env;
use std::fs::File;
use std::io::Read;
use std::process;

use filecount::analysis::analyze;
use filecount::extract::{extract, ExtractionRules};
use filecount::memory::HashedMemory;
use filecount::segmentation::hashment_many;
use filecount::unicode::UnicodeRules;

fn main() {
    let args: Vec<String> = env::args().collect();

    if args.len() < 2 {
        eprintln!("Not enough arguments passed. Please provide a path to a file");
        process::exit(1);
    }

    let path = args[1].clone();

    // Read the translation memory (TMX) into a byte buffer.
    let mut memfile = File::open("memory.tmx").unwrap();
    let mut memfile_buffer = Vec::new();
    memfile.read_to_end(&mut memfile_buffer).unwrap();

    // Build the hashed memory used for TM match lookups.
    let memory = HashedMemory::from_tmx(&memfile_buffer).unwrap();

    // Read the file to analyze.
    let mut file = File::open(&path).unwrap();
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer).unwrap();

    // extract -> hashment -> analyze
    let texts = extract(buffer, &path, ExtractionRules::default()).unwrap();
    let hashments = hashment_many(texts, &UnicodeRules);
    let analysis = analyze(hashments, &memory);
    println!("{:?}", analysis);
}

Usage

Filecount is built around three basic steps, each represented by its own function:

  • extract
  • hashment
  • analyze

The extract function extracts textual elements from files supported by injected extraction rules. A set of default extraction rules for common file types is included.

The hashment function converts these extracted sections into hashed segments (hence hashment) with word and character counts given injected segmentation rules (Unicode Standard Annex #29 supported by default).

The analyze function analyzes these hashments given an (optional) translation memory in order to get the total word and character counts, repetitions and TM matches.

Filecount deliberately keeps these steps separate so that you have full control over how each function is used.
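The idea behind a hashed segment can be sketched with the standard library alone. The `HashedSegment` type, the `hash_segments` function and the punctuation-based splitting below are all assumptions made for illustration; Filecount's own types and its UAX #29 based segmentation will differ.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a hashed segment; not Filecount's actual type.
#[derive(Debug, PartialEq)]
struct HashedSegment {
    hash: u64,
    words: usize,
    chars: usize,
}

/// Naive segmentation: split on sentence-ending punctuation, then store
/// a hash plus word and character counts instead of the text itself.
/// (Assumption: real segmentation rules are far more careful than this.)
fn hash_segments(text: &str) -> Vec<HashedSegment> {
    text.split(|c: char| matches!(c, '.' | '!' | '?'))
        .map(str::trim)
        .filter(|segment| !segment.is_empty())
        .map(|segment| {
            let mut hasher = DefaultHasher::new();
            segment.hash(&mut hasher);
            HashedSegment {
                hash: hasher.finish(),
                words: segment.split_whitespace().count(),
                chars: segment.chars().count(),
            }
        })
        .collect()
}

fn main() {
    let segments = hash_segments("Save the file. Close the window. Save the file.");
    for segment in &segments {
        println!("{:?}", segment);
    }
    // Identical segments produce identical hashes, which is what makes
    // repetitions and exact TM matches cheap to detect later on.
    assert_eq!(segments[0], segments[2]);
}
```

Because only the hash and the counts are kept, comparing two segments is a single integer comparison rather than a string comparison.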

Theoretical specifications

By storing segments in hashed format (see hashment in the documentation) in a binary tree, lookups have a complexity of O(log M), where M is the size of the memory. Each of the N segments in a file needs one such lookup, so a full file analysis can be performed in O(N log N) when the file and the memory are of comparable size. Filecount deliberately doesn't calculate fuzzy matches (50% TM match, 80% TM match, etc.), as these matches usually have less value to the file processor and skipping them keeps the operation fast.
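A minimal sketch of such a tree-based analysis, assuming segments have already been reduced to `u64` hashes. It uses the standard library's `BTreeSet` (a B-tree rather than a binary tree, but with the same logarithmic lookup cost). The `Counts` type, the `analyze_hashes` function and the rule that a TM match takes precedence over a repetition are assumptions for illustration, not Filecount's actual behavior.

```rust
use std::collections::BTreeSet;

/// Hypothetical result type; the field names are illustrative.
#[derive(Debug, Default, PartialEq)]
struct Counts {
    tm_matches: usize,
    repetitions: usize,
    new_segments: usize,
}

/// Classifies each segment hash as an exact TM match, a repetition or new.
/// Both sets are ordered trees, so `contains` and `insert` are logarithmic.
fn analyze_hashes(segments: &[u64], memory: &BTreeSet<u64>) -> Counts {
    let mut seen = BTreeSet::new();
    let mut counts = Counts::default();
    for &hash in segments {
        if memory.contains(&hash) {
            // Exact match against the translation memory.
            counts.tm_matches += 1;
        } else if !seen.insert(hash) {
            // `insert` returns false when the hash was already in the file.
            counts.repetitions += 1;
        } else {
            counts.new_segments += 1;
        }
    }
    counts
}

fn main() {
    let memory = BTreeSet::from([10u64, 20, 30]);
    let segments = [10u64, 42, 42, 99, 30, 42];
    // Counts { tm_matches: 2, repetitions: 2, new_segments: 2 }
    println!("{:?}", analyze_hashes(&segments, &memory));
}
```

Only exact hash equality is checked, which is why fuzzy matches have no place in this scheme: two nearly identical segments hash to unrelated values.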

Installation

Use this package in your project by adding the following to your Cargo.toml:

[dependencies]
filecount = "0.1.0"

Supported file formats

  • docx
  • pptx
  • xlsx
  • json
  • xml
  • txt
  • xliff
  • md
  • html(x)

Planned features

  • Supporting many more default filetypes (including srt, doc, pdf, po, etc.) (all pull requests are welcome)
  • In-context matches (although different CAT tools use different definitions of 'in context')
  • Adding seconds and minutes to analysis outputs for audiovisual files (relevant for subtitling-related tasks)
  • .srx based default segmentation support
  • .xliff based .tmx and hashed memory management (using .xliff files to populate .tmx)
  • Any file to .xliff conversion based on segmentation rules
  • Reconverting translated .xliff files to their original filetypes