A modern high-performance open source file analysis library for automating localization tasks

Overview

crates.io

🧛 Filecount

Filecount is a modern high-performance open source file analysis library for automating localization tasks. It lets you add file analysis functionality to your projects while keeping the extraction and segmentation rules customizable and extensible. The hashment algorithm is designed to keep the analysis fast.

Counting words is a notoriously difficult problem, because it is hard to define rules that give an "accurate" word count for every language. As a result, different text editors and CAT tools often report different word counts for the same text. Filecount's philosophy is to be fast and accurate enough: for the purpose of a fast file analysis, it is often fine to be close to an accurate count.
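As a small illustration of why word counts differ between tools, the sketch below applies two plausible counting rules to the same text. Both functions are hypothetical and use only the Rust standard library; neither is the rule Filecount itself applies.

```rust
/// Counts words as runs of non-whitespace characters.
fn count_whitespace_words(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Counts words as runs of alphanumeric characters, so hyphens and
/// other punctuation act as word boundaries.
fn count_alphanumeric_words(text: &str) -> usize {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|word| !word.is_empty())
        .count()
}

fn main() {
    let text = "A state-of-the-art e-mail client";
    // The same text yields 4 words under the first rule and 8 under the second.
    println!("whitespace rule:   {}", count_whitespace_words(text));
    println!("alphanumeric rule: {}", count_alphanumeric_words(text));
}
```

Neither result is wrong; the two rules simply disagree on whether a hyphenated compound is one word or several.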

To see Filecount in action, visit the website: Filecount.io

Documentation

View the documentation on doc.rs

Example

use std::env;
use std::fs::File;
use std::io::Read;
use std::process;

use filecount::analysis::analyze;
use filecount::extract::{extract, ExtractionRules};
use filecount::memory::HashedMemory;
use filecount::segmentation::hashment_many;
use filecount::unicode::UnicodeRules;

fn main() {
    let args: Vec<String> = env::args().collect();

    if args.len() < 2 {
        eprintln!("Not enough arguments passed. Please provide a path to a file or folder");
        process::exit(1);
    }

    let path = args[1].clone();

    // Read the translation memory (.tmx) and build a hashed memory from it.
    let mut memfile = File::open("memory.tmx").unwrap();
    let mut memfile_buffer = Vec::new();
    memfile.read_to_end(&mut memfile_buffer).unwrap();

    let memory = HashedMemory::from_tmx(&memfile_buffer).unwrap();

    // Read the file that should be analyzed.
    let mut file = File::open(&path).unwrap();
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer).unwrap();

    // Extract the text, convert it to hashments and analyze them against the memory.
    let texts = extract(buffer, &path, ExtractionRules::default()).unwrap();
    let hashments = hashment_many(texts, &UnicodeRules);
    let analysis = analyze(hashments, &memory);
    println!("{:?}", analysis);
}

Usage

Filecount is built around three basic steps, each represented by its own function:

  • extract
  • hashment
  • analyze

The extract function extracts textual elements from files supported by injected extraction rules. A set of default extraction rules for common file types is included.

The hashment function converts these extracted sections into hashed segments (hence hashment) with word and character counts given injected segmentation rules (Unicode Standard Annex #29 supported by default).
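The idea of a hashed segment can be sketched with the standard library alone. This is a minimal illustration, not Filecount's implementation: the `HashedSegment` type and `hash_segments` function are hypothetical names, the splitting rule is a naive "break after `.`, `!` or `?`" instead of Unicode Standard Annex #29, and the hash comes from the standard library's `DefaultHasher`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a hashed segment; not Filecount's own type.
#[derive(Debug, Clone, PartialEq)]
struct HashedSegment {
    hash: u64,
    words: usize,
    chars: usize,
}

/// Splits text after '.', '!' or '?' (a naive rule, not UAX #29) and
/// stores a hash of each trimmed segment together with its counts.
fn hash_segments(text: &str) -> Vec<HashedSegment> {
    text.split_inclusive(|c: char| matches!(c, '.' | '!' | '?'))
        .map(str::trim)
        .filter(|segment| !segment.is_empty())
        .map(|segment| {
            let mut hasher = DefaultHasher::new();
            segment.hash(&mut hasher);
            HashedSegment {
                hash: hasher.finish(),
                words: segment.split_whitespace().count(),
                chars: segment.chars().count(),
            }
        })
        .collect()
}

fn main() {
    // The first two segments are identical, so they share the same hash.
    for segment in hash_segments("Save the file. Save the file. Close it!") {
        println!("{:?}", segment);
    }
}
```

Because identical segments produce identical hashes, later steps can compare fixed-size integers instead of full strings.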

The analyze function analyzes these hashments given an (optional) translation memory in order to get the total word and character counts, repetitions and TM matches.

Filecount deliberately splits this functionality into separate functions to give users full control over how each one is used.

Theoretical specifications

By storing segments in hashed format (see hashment in the documentation) in a binary tree, lookups have a complexity of O(log M), where M is the number of segments in the memory. A full analysis of a file with N segments therefore needs N such lookups, which amounts to O(N log N) when the memory and the file are of similar size. Filecount deliberately doesn't calculate fuzzy matches (50% TM match, 80% TM match, etc.), as these matches usually have less value to the file processor and leaving them out keeps the operation fast.
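The lookup scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than Filecount's code: `Summary`, `hash_of` and `summarize` are hypothetical names, the standard library's `BTreeSet` (a B-tree, which also offers logarithmic lookups) stands in for the binary tree, and the choice to check the memory before checking for repetitions is an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Hypothetical summary type; Filecount's real analysis output may differ.
#[derive(Debug, Default, PartialEq)]
struct Summary {
    new_segments: usize,
    repetitions: usize,
    tm_matches: usize,
}

/// Hashes a segment with the standard library's default hasher.
fn hash_of(segment: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    segment.hash(&mut hasher);
    hasher.finish()
}

/// Classifies each segment using exact, logarithmic-time tree lookups only.
/// No fuzzy matching is attempted.
fn summarize(segments: &[&str], memory: &BTreeSet<u64>) -> Summary {
    let mut seen = BTreeSet::new();
    let mut summary = Summary::default();
    for segment in segments {
        let hash = hash_of(segment);
        if memory.contains(&hash) {
            // Exact match against the translation memory.
            summary.tm_matches += 1;
        } else if !seen.insert(hash) {
            // `insert` returns false when the hash was already present.
            summary.repetitions += 1;
        } else {
            summary.new_segments += 1;
        }
    }
    summary
}

fn main() {
    let memory: BTreeSet<u64> = ["Save the file."].iter().map(|s| hash_of(s)).collect();
    let segments = ["Save the file.", "Close it!", "Close it!", "Open it."];
    println!("{:?}", summarize(&segments, &memory));
}
```

Each segment costs a constant number of tree operations, so the total work grows with the number of segments times the logarithm of the tree sizes.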

Installation

Use this package in your project by adding the following to your Cargo.toml:

[dependencies]
filecount = "0.1.0"

Supported file formats

  • docx
  • pptx
  • xlsx
  • json
  • xml
  • txt
  • xliff
  • md
  • html(x)

Planned features

  • Supporting many more default filetypes (including srt, doc, pdf, po, etc.) (all pull requests are welcome)
  • In-context matches (although different CAT tools use different definitions of 'in context')
  • Adding seconds and minutes to analysis outputs for audiovisual files (relevant for subtitling related tasks)
  • .srx based default segmentation support
  • .xliff based .tmx and hashed memory management (using .xliff files to populate .tmx)
  • Any file to .xliff conversion based on segmentation rules
  • Reconverting translated .xliff files to their original filetypes