uwc

Like wc, but Unicode-aware, and with line mode.

uwc can count:

  • Lines
  • Words
  • Bytes
  • Grapheme clusters
  • Unicode code points

Additionally, it can operate in line mode, which reports the counts for each line separately.
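To illustrate why bytes and code points are separate counters, here is a minimal sketch using only the Rust standard library. The `BasicCounts` struct and `basic_counts` function are hypothetical names for this example, not part of uwc. Grapheme clusters are left out because the standard library cannot count them.

```rust
/// Counters the standard library can produce on its own.
/// (Hypothetical type for illustration; not uwc's internal representation.)
#[derive(Debug, PartialEq)]
struct BasicCounts {
    bytes: usize,
    code_points: usize,
}

/// Count UTF-8 bytes and Unicode code points in `text`.
fn basic_counts(text: &str) -> BasicCounts {
    BasicCounts {
        // `len` on a &str is the length in bytes of its UTF-8 encoding.
        bytes: text.len(),
        // `chars` iterates over Unicode scalar values (code points).
        code_points: text.chars().count(),
    }
}

fn main() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: a reader sees one
    // character, but it is two code points and three bytes.
    let counts = basic_counts("e\u{301}");
    println!("bytes={} codepoints={}", counts.bytes, counts.code_points);
}
```

The same string is a single grapheme cluster, which is the count that needs a segmentation library.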

Usage example

By default, uwc will count lines, words, and bytes. You can specify the counters you'd like, or ask for all counters with the -a flag.

$ uwc tests/fixtures/**/input
lines  words  bytes  filename
8      5      29     tests/fixtures/all_newlines/input
0      0      0      tests/fixtures/empty/input
0      0      0      tests/fixtures/empty_line_mode/input
1      9      97     tests/fixtures/flags_bp/input
1      9      97     tests/fixtures/flags_cl/input
1      9      97     tests/fixtures/flags_w/input
0      1      5      tests/fixtures/hello/input
1      9      97     tests/fixtures/i_can_eat_glass/input
8      8      29     tests/fixtures/line_mode/input
7      8      28     tests/fixtures/line_mode_no_trailing_newline/input
7      8      28     tests/fixtures/line_mode_no_trailing_newline_count_newlines/input
34     66     507    total

$ uwc -a tests/fixtures/**/input
lines  words  bytes  graphemes  codepoints  filename
8      5      29     23         24          tests/fixtures/all_newlines/input
0      0      0      0          0           tests/fixtures/empty/input
0      0      0      0          0           tests/fixtures/empty_line_mode/input
1      9      97     51         51          tests/fixtures/flags_bp/input
1      9      97     51         51          tests/fixtures/flags_cl/input
1      9      97     51         51          tests/fixtures/flags_w/input
0      1      5      5          5           tests/fixtures/hello/input
1      9      97     51         51          tests/fixtures/i_can_eat_glass/input
8      8      29     28         28          tests/fixtures/line_mode/input
7      8      28     27         27          tests/fixtures/line_mode_no_trailing_newline/input
7      8      28     27         27          tests/fixtures/line_mode_no_trailing_newline_count_newlines/input
34     66     507    314        315         total

You can also switch into line mode with the --mode flag:

$ uwc -a --mode line tests/fixtures/line_mode/input
lines  words  bytes  graphemes  codepoints  filename
0      1      1      1          1           tests/fixtures/line_mode/input:1
0      1      2      2          2           tests/fixtures/line_mode/input:2
0      1      3      3          3           tests/fixtures/line_mode/input:3
0      1      5      4          4           tests/fixtures/line_mode/input:4
0      1      1      1          1           tests/fixtures/line_mode/input:5
0      1      4      4          4           tests/fixtures/line_mode/input:6
0      1      2      2          2           tests/fixtures/line_mode/input:7
0      1      3      3          3           tests/fixtures/line_mode/input:8
0      8      21     20         20          tests/fixtures/line_mode/input:total
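The idea behind line mode can be sketched with the standard library alone. This is a simplified approximation, not uwc's implementation: `LineCounts` and `count_per_line` are hypothetical names, `str::lines` only splits on LF and CRLF, and `split_whitespace` stands in for the Unicode word boundary rules.

```rust
/// One row of per-line output (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
struct LineCounts {
    line_number: usize,
    words: usize,
    bytes: usize,
    code_points: usize,
}

/// Produce one set of counts per line, excluding the line terminator.
fn count_per_line(text: &str) -> Vec<LineCounts> {
    text.lines()
        .enumerate()
        .map(|(index, line)| LineCounts {
            line_number: index + 1, // rows are numbered from 1
            // Assumption: whitespace splitting approximates word counting.
            words: line.split_whitespace().count(),
            bytes: line.len(),
            code_points: line.chars().count(),
        })
        .collect()
}

fn main() {
    for row in count_per_line("a\nbb cc\nnaïve\n") {
        println!(
            "line {}: words={} bytes={} codepoints={}",
            row.line_number, row.words, row.bytes, row.code_points
        );
    }
}
```

In this sketch, as in the output above, the terminator is not counted as part of the line, so per-line byte totals come out smaller than the whole-file byte count.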

Why?

The goal of this project is to apply Unicode rules correctly when counting things. Specifically, it should:

  • Count all newline characters correctly. This includes lesser-known line breaks, like NEL (U+0085), FF (U+000C), LS (U+2028), and PS (U+2029).
  • Count all words using the Unicode standard's word boundary rules.
  • Count all complete grapheme clusters correctly, so that even edge cases like Z҉͈͓͈͎a̘͈̠̭l̨̯g̶̬͇̭o̝̹̗͎̙ ͟t͖̙̟̹͇̥̝͡e̥͘x͚̺̭̻͘t͉͔̩̲̘, for example, are counted correctly.
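As a rough illustration of the first point, the sketch below counts line breaks beyond LF using only the standard library. It is not uwc's code. The function name `count_line_breaks` is hypothetical, and the exact set of terminators (including VT and treating CR LF as a single break) is an assumption based on the Unicode Standard's list of newline functions.

```rust
/// Count line breaks, treating CR LF as a single break.
/// Assumed terminator set: LF, VT, FF, CR, CR LF, NEL, LS, PS.
fn count_line_breaks(text: &str) -> usize {
    let mut count = 0;
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Consume a following LF so that CR LF counts once.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                count += 1;
            }
            '\n'            // LF
            | '\u{000B}'    // VT
            | '\u{000C}'    // FF
            | '\u{0085}'    // NEL
            | '\u{2028}'    // LS
            | '\u{2029}' => count += 1, // PS
            _ => {}
        }
    }
    count
}

fn main() {
    let text = "a\nb\r\nc\u{85}d\u{2028}e\u{2029}f\u{0C}g";
    // A byte-oriented count of '\n' sees only two of these breaks.
    let lf_only = text.bytes().filter(|&b| b == b'\n').count();
    println!("unicode-aware={} lf-only={}", count_line_breaks(text), lf_only);
}
```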

It does not aim to implement these Unicode algorithms itself, however, so it relies on the unicode-segmentation library for most of the heavy lifting. Unicode support in the Rust ecosystem is not yet fully mature, which has some consequences for this project. See the caveats below.

It is primarily a fun side project for me, and an excuse to learn more about Rust and Unicode.

Installation

It is published on crates.io, so you can install it with cargo:

$ cargo install uwc

Caveats

UTF-8

It only supports UTF-8 files. UTF-16 can go on my to-do list if there is demand. For now, you can use iconv to convert non-UTF-8 files first.
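If iconv is not at hand, the conversion can also be sketched in Rust with the standard library. This is only an illustration under stated assumptions: the input is UTF-16 little-endian without a byte order mark, and `utf16le_to_string` is a hypothetical helper, not something uwc provides.

```rust
/// Decode UTF-16LE bytes into a Rust `String` (which is always UTF-8).
/// Assumes little-endian input with no byte order mark.
fn utf16le_to_string(bytes: &[u8]) -> Result<String, String> {
    if bytes.len() % 2 != 0 {
        return Err("UTF-16 input must have an even number of bytes".to_string());
    }
    // Reassemble each pair of bytes into one 16-bit code unit.
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    // `from_utf16` rejects unpaired surrogates.
    String::from_utf16(&units).map_err(|e| e.to_string())
}

fn main() {
    // "hi" encoded as UTF-16LE.
    let input = [0x68, 0x00, 0x69, 0x00];
    match utf16le_to_string(&input) {
        Ok(text) => println!("{}", text),
        Err(err) => eprintln!("conversion failed: {}", err),
    }
}
```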

Speed

It is slower than wc. My analysis hasn't been extensive, but as far as I can tell, the reasons are:

  • It uses Unicode algorithms, which are inherently slower than ASCII-only processing.
  • I am not that experienced with Rust, so it's quite possible that some of the code is less efficient than it could be.
  • My free time is limited, and I am prioritizing correctness over speed (though speed is good).

With that said, parallelization helps. In testing on my laptop with larger data sets, the speed is within an order of magnitude of wc. I measured uwc as 3x slower than wc on a collection of text files totaling 18 MiB.
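One way parallelization can help is to split the input lines into chunks and count each chunk on its own thread. The sketch below shows that idea with `std::thread::scope` from the standard library rather than a thread pool crate. The function name `count_words_in_chunks` and the `chunk_size` parameter are hypothetical, and whitespace splitting stands in for real word segmentation.

```rust
use std::thread;

/// Count words across `lines`, processing `chunk_size` lines per thread.
/// Spawns one thread per chunk, which is fine for a sketch but wasteful
/// for very many chunks; a thread pool would bound the thread count.
fn count_words_in_chunks(lines: &[&str], chunk_size: usize) -> usize {
    assert!(chunk_size > 0, "chunk_size must be positive");
    thread::scope(|scope| {
        // Start one worker per chunk; each returns its partial count.
        let handles: Vec<_> = lines
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|line| line.split_whitespace().count())
                        .sum::<usize>()
                })
            })
            .collect();
        // Combine the partial counts once every worker has finished.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}

fn main() {
    let lines = ["a b", "c", "d e f", ""];
    println!("words={}", count_words_in_chunks(&lines, 2));
}
```

Because each chunk is counted independently and the partial results are simply summed, the total does not depend on the chunk size; only throughput and latency do.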

Localization

Rust, as yet, has no localization libraries, and this has some consequences. Some counts will simply be wrong, such as those for hyphenated words, whose handling is locale-specific and requires language dictionary lookups to be correct. Also, some languages, such as Japanese, have no syntactic word separators, so e.g.

ガラス食べられます

should be 5 words, but without localization, we cannot determine that.

Comments
  • Parallelize

    This adds parallelization with rayon. It does this by chunking up the lines it reads and processing those chunks in parallel. Local testing found 10,000 lines per chunk to be optimal, so that is the default. A consequence of this behavior is that if the input is slow, it will seem like it is doing nothing, because it waits for a complete chunk before doing any counting. The option is given for this situation.

    It also adds some negative tests, i.e., tests that verify behavior in the event of errors. For now, this just means verifying a non-zero exit code and substring matching on stderr.

    opened by dead10ck
Owner
Skyler Hawthorne