Command line utility to remove duplicates from the given input.

Related tags

Command-line huniq
Overview

huniq version 2

Command line utility to remove duplicates from the given input. Note that huniq does not sort the input, it just removes duplicates.

SYNOPSIS: huniq -h # Shows help
SYNOPSIS: huniq [-c|--count] [-0|--null|-d DELIM|--delim DELIM]
$ echo -e "foo\nbar\nfoo\nbaz" | huniq
foo
bar
baz

$ echo -e "foo\nbar\nfoo\nbaz" | huniq -c
1 baz
2 foo
1 bar

huniq replaces sort | uniq (or sort -u with gnu sort) and huniq -c replaces sort | uniq -c, assuming the data is sorted just so it can be passed to uniq. If having sorted output is desired, sort | uniq should still be used.

The order of the output is stable when in normal mode, but it is not stable when in -c/count mode.

Installation

$ cargo install huniq

Motivation

Sorting is slow. By using hash tables/hash sets instead of sorting the input huniq is generally faster than sort -u or sort | uniq -c when testing with gnu sort/gnu uniq.

Version History

Version 1 can be found here.

Changes made in version 2:

  • The -d/-0 flags where added so you can specify custom delimiters
  • Completely rewritten in rust.
  • Version two is (depending on which benchmark you look at below) between 3.5x and 6.5x faster than version 1

Build

cargo build --release

To run the tests execute:

bash ./test.sh

Benchmark

You can use bash ./benchmark.sh to execute the benchmarks. They will execute until you manually abort them (e.g. by pressing Ctrl-C).

The benchmarks work by repeatedly feeding the implementations with data from /usr/share/dict/* and measuring memory usage and time needed to process the data with the unix time tool.

For the uniq algorithm, the results are posted below: We can see that the rust implementation blows pretty much anything else out of the water in terms of performance. Use sort only if you really need a coffee break, because you won't get it with huniq! It beats the C++ implementation by a factor of between 6.5 (for very few duplicates) and 3.5 (around 98% duplicates). Compared to sort -u: huniq is around 30 times faster.

If memory efficiency is what you are looking for, use datamash which is not as fast as huniq but uses the least memory (by a factor of around 3); failing that use sort|uniq which is a lot slower but uses just very slightly more memory than datamash.

repetitions  implementation  seconds  memory/kb
1            huniq2-rust        0.26      29524
1            huniq1-c++         1.67      26188
1            awk                1.63     321936
1            datamash           1.78       9644
1            shell              7.30       9736
2            huniq2-rust        0.84      29592
2            huniq1-c++         3.28      26180
2            awk                3.71     322012
2            datamash           4.60       9636
2            shell             16.68       9740
5            huniq2-rust        2.02      29648
5            huniq1-c++         6.21      26184
5            awk                7.69     322012
5            datamash           9.10       9992
5            shell             44.71      10184
10           huniq2-rust        3.40      29676
10           huniq1-c++        12.84      26172
10           awk               16.73     321940
10           datamash          24.44      10032
10           shell             93.75      10036
50           huniq2-rust       14.68      29612
50           huniq1-c++        55.32      26200
50           awk               74.91     321940
50           datamash         103.54      10936
50           shell            453.94      10956
100          huniq2-rust       43.65      29492
100          huniq1-c++       154.99      26180
100          awk              239.66     321956
100          datamash         285.94      12148
100          shell           1062.07      12208

For the counting huniq -c implementation, the speed advantage was less pronounced: Here the rust implementation is between 25% and 50% faster than the C++ implementation and between 5x and 10x faster than sort | uniq -c.

The increased memory usage of the rust implementation is much worse though: The rust implementation needs about 2.2x more memory than the C++ implementation and between 10x and 12x more memory than sort | uniq.

repetitions  implemetation  seconds  memory/kb
1            huniq2-rust       1.47     132096
1            huniq1-c++        1.85      60196
1            awk               2.79     362940
1            datamash          2.28       9636
1            shell             7.71      11716
2            huniq2-rust       2.32     132052
2            huniq1-c++        2.98      60156
2            awk               4.65     363016
2            datamash          5.27       9732
2            shell            16.37      11680
5            huniq2-rust       4.98     132092
5            huniq1-c++        7.54      60128
5            awk               9.37     363016
5            datamash         11.22       9964
5            shell            44.77      11948
10           huniq2-rust       8.81     132048
10           huniq1-c++       13.55      60196
10           awk              16.19     363032
10           datamash         25.12       9908
10           shell            90.01      11976
50           huniq2-rust      45.89     132092
50           huniq1-c++       74.04      60104
50           awk              85.43     362956
50           datamash        141.90      10996
50           shell           454.42      12876
100          huniq2-rust      90.80     132080
100          huniq1-c++      150.41      60196
100          awk             163.13     363008
100          datamash        322.70      12212
100          shell           933.67      14100

Future direction

Feature wise huniq is pretty much complete, but the performance and memory usage should be improved in the future.

This first of all involves a better benchmarking setup which will probably consist of an extra rust application that will use RNGs to generate test data for huniq and take parameters like the number of elements to create, the rate of duplicates (0-1) the length of strings to output and so on…

Then based on the improved benchmarking capabilities, some optimizations should be tried like short string optimization, arena allocation, different hash functions, using memory optimized hash tables, using an identity function for the uniq function (we already feed it with hashes, so a second round of hashing is not necessary).

License

Copyright © (C) 2020, Karolin Varner. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the Karolin Varner nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL Softwear, BV BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Comments
  • Rewrite split_read_zerocopy with combinators

    Rewrite split_read_zerocopy with combinators

    Since slice::split gives an iterator of slices, I wanted to see if I could write zero copy with it. No degrading performance in the current benchmark.

    Because slice::split gives slices without the delimiter I moved the delimiter printing into the closure handler.

    Just wanna show a fellow Rustacean another style. Happy coding! 👍

    opened by louy2 8
  • Suggestion: Use the Rust implementation of xxHash

    Suggestion: Use the Rust implementation of xxHash

    Using a C library complicates the compilation, since it requires that the Clang compiler be present.

    xxHash has already been implemented in Rust. If there is a good reason to not use this crate, then it would be great to provide precompiled binaries of huniq.

    opened by mardukbp 6
  • confusion about the description

    confusion about the description

    Hi, everywhere in the README it says and shows how huniq sorts too. But in the very first paragraph of the README file, right after the heading, there it says "Note that huniq does not sort the input, it just removes duplicates." Is this maybe a left-over sentence from an older version? thx

    opened by marcotrosi 5
  • Exit without freeing memory on success

    Exit without freeing memory on success

    fixes #17

    I haven't done the benchmarking yet. The speedups should show especially in repeated runs with -c for inputs with large amounts of unique lines.

    Also includes small cleanups for Cargo.lock, and uses std::hash::BuildHasherDefault. If I misunderstood the purpose of the custom implementation of BuildDefaultHasher I can drop the commit, it just looked identical to me.

    opened by alexmaco 5
  •  Remove extra BufReader, BufWriter, type fish

    Remove extra BufReader, BufWriter, type fish

    The title is what I set out to do but I set up my editor so it auto-formats the source and before I realized I made all my fixes based on rustfmt formatted code 😞

    I decided to at least separate out the formatting change to its own commit. If you don't like it let me know.

    opened by louy2 5
  • Work incorrectly

    Work incorrectly

    The command fc-list : family | huniq could not produce the expected result as fc-list : family | sort -u, and I can confirm that the piped lines are terminated by 0x0a, as the example of echo -e "foo\nbar\nfoo\nbaz" | huniq.

    opened by Alsan 4
  • Use bstr::for_byte_record for reading

    Use bstr::for_byte_record for reading

    This replaces the custom split_read_zerocopy with bstr's own minimally-copying record reader, which simplifies the code somewhat without sacrificing performance in any of my measurements (98% - 115% observed performance against baseline)

    This also removes the sysconf dependency, which caused a compile error here on FreeBSD.

    This also adds a --no-trailing-delimiter flag requested in #14, since this code needed to be moved anyway.

    Also switch use of write with write_all, since the former can succeed with partial writes.

    opened by Freaky 3
  • Building problems on macOS

    Building problems on macOS

    Documenting some problems I've had building this. In order of appearance.

    OS: macOS Catalina 10.15.3

    zsh: command not found: llvm-config
    

    Fix: brew install llvm then export PATH="/usr/local/opt/llvm/bin:$PATH"

    ./vendor/xxhash/xxhash.h:663:10: fatal error: 'stdlib.h' file not found
    

    Fix: Running xcode-select --install

    benchmark.sh: line 86: declare: -A: invalid option
    declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
    

    Fix: Use zsh ./benchmark.sh instead of bash ./benchmark.sh

    uniq 2 huniq2-rust time: illegal option -- f
    uniq 2 huniq2-rust usage: time [-lp] command.
    

    Infinite loop. Fix: Run brew install gnu-time then export PATH="/usr/local/opt/gnu-time/libexec/gnubin:$PATH"

    Benchmark Result

    uniq 1 huniq2-rust 0.09 10324
    uniq 1 huniq1-c++  0.76 10824
    uniq 1 awk-sys     0.66 23216
    uniq 1 shell       1.37 52812
    
    uniq 2 huniq2-rust 0.07 10324
    uniq 2 huniq1-c++  1.06 10196
    uniq 2 awk-sys     1.19 23224
    uniq 2 shell       3.29 101620
    
    uniq 5 huniq2-rust 0.25 10324
    uniq 5 huniq1-c++  2.69 11992
    uniq 5 awk-sys     3.12 23252
    uniq 5 shell       9.75 247944
    
    uniq 10 huniq2-rust 0.28 10344
    uniq 10 huniq1-c++  5.13 11200
    uniq 10 awk-sys     6.02 23212
    uniq 10 shell       20.89 491792
    
    uniq 50 huniq2-rust 1.34 10324
    uniq 50 huniq1-c++  24.28 11044
    uniq 50 awk-sys     29.55 23216
    uniq 50 shell       120.39 2411632
    

    Didn't wanna wait for the rest of it.

    opened by louy2 3
  • Use github actions to automatically create github release

    Use github actions to automatically create github release

    Added an example GitHub action that will build and upload the compiled binary as a GitHub release - as seen here: https://github.com/Phasip/huniq/releases/tag/latest

    The action executes on every push and can also be manually launched.

    opened by Phasip 1
  • Don't output trailing delimiter if the input doesn't contain one

    Don't output trailing delimiter if the input doesn't contain one

    When I run printf "1" | huniq -d ",", it prints 1,. When the input doesn't even contain the delimiter, I'd expect huniq to not change the input. But when the input is something like printf "1," | huniq -d ",", then printing 1, makes sense. I don't imagine it'd be that hard to fix this, but it would be really useful. A flag like --no-trailing-delimiter would also work as well.

    opened by kkysen 1
  • Search delimiter with memchr

    Search delimiter with memchr

    “The memchr crate provides heavily optimized routines for searching bytes” (quoting their README).

    On my machine it improves performance of ~15%. Also it's a minimal change and IMHO it actually simplifies the code.

    opened by dippi 1
  • Add an option for huniq -c to indent numbers like uniq -c

    Add an option for huniq -c to indent numbers like uniq -c

    In other words, use %7d format for numbers.

    That is, instead of printing

    1 world
    12345 hello
    

    print

    ******1 hello
    **12345 world
    

    (I've replaced spaces with *s for clarity).

    opened by opennota 1
  • Not much quicker than awk one-liner with numeric keys

    Not much quicker than awk one-liner with numeric keys

    I've compared speed with AWK, and it's not so much quicker.

    huniq

    $ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | huniq | pv -l | wc -l
    20.0M 0:00:30 [ 664k/s]
    20007130
    

    awk

    $ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | awk '!a[$0]++{print}' | pv -l | wc -l
    20.0M 0:00:32 [ 608k/s]
    20008984
    

    unique

    $ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | unique | pv -l | wc -l
    20.0M 0:00:38 [ 524k/s]
    20006340
    

    quniq

    $ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | quniq -max-workers 10 | pv -l | wc -l
    20.0M 0:00:45 [ 436k/s]
    20013324
    

    runiq

    $ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | runiq - | pv -l | wc -l
    20.0M 0:00:35 [ 571k/s]
    20007072
    

    perl

    $ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}' | pv -l | wc -l
    20.0M 0:00:55 [ 363k/s]
    20008104
    

    It's only 2 seconds quicker than awk when processing 20 millions lines.

    Is there any way to process the lines even quicker?

    opened by kenorb 4
  • Rust based benchmarks & Tests

    Rust based benchmarks & Tests

    Requirments

    • The benchmarks should be entirely written in rust.

    • The benchmarks should be portable and not rely on the presence of platform defined dictionary files.

    • The benchmarks should have the ability to be run with specific parameters

      1. Number of input lines
      2. Fraction of duplicates
      3. Distribution of input line length
      4. Char set (binary/text)
    • The benchmarks should still be able to run against all the preexisting commands (sort|uniq).

    Design

    A CLI application should be written that produces a set of random tokens according to the parameters specified on the CLI:

    genbench --charset ascii/binary --delim CHAR --number NUM --duplicates PERCENTAGE --short LEN --long LEN
    

    The short/long parameters each indicate the 90% percentile of string lengths, using a gaussian distribution.

    For the actual benchmark we should write a benchmark executor that runs each of the implementations with a variety of parameters handed to genbench.

    Tests

    We can reuse the same strategy for testing by generating test data with genbench and then comparing the output of the full huniq and a super naive, unoptimized huniq implementation. We should specifically make sure, that buffer growing is tested (supply some very long, >20kb strings).

    opened by koraa 0
Releases(latest)
Owner
Karolin Varner
https://twitter.com/dakoraa
Karolin Varner
A tool for adding new lines to files, skipping duplicates and write in Rust!

anew A tool for adding new lines to files written in Rust. The tool aids in appending lines from stdin to a file, but only if they don't already appea

z3r0yu 4 Nov 16, 2023
Checkline: checkbox line picker for stdin line input

checkline is a Unix command line interface (CLI) terminal user interface (TUI) that prompts you to check each line of stdin, to pick each line to output to stdout

SixArm 4 Dec 4, 2022
omekasy is a command line application that converts alphanumeric characters in your input to various styles defined in Unicode.

omekasy is a command line application that converts alphanumeric characters in your input to various styles defined in Unicode. omekasy means "dress up" in Japanese.

null 105 Nov 16, 2022
Small command-line tool to switch monitor inputs from command line

swmon Small command-line tool to switch monitor inputs from command line Installation git clone https://github.com/cr1901/swmon cargo install --path .

William D. Jones 5 Aug 20, 2022
Remove files or directories.

Wrm - Remove files or directories Installation Run the following Cargo command: cargo install wrm Usage To move files to trash($HOME/.local/share/wrm

null 41 Mar 4, 2024
fixred is a command line utility to fix outdated links in files with redirect URLs.

fixred fixred is a command line utility to fix outdated links in files with redirect URLs. Installation fixred is installed via cargo package manager.

Linda_pp 35 Aug 6, 2022
Rust command line utility to quickly display useful secrets in a Kubernetes namespace

kube-secrets This is a command line utility for quickly looking at secrets in a Kubernetes namespace that are typically looked at by humans. It specif

Frank Wiles 8 Feb 10, 2022
A command-line utility that creates project structure.

petridish A command-line utility that creates project structure. If you have heard of the cookiecutter project, petridish is a rust implementation of

null 11 Dec 29, 2022
Command line utility for controlling LIFX smart lights

lifxc is a command line utility for controlling LIFX smart lights. Currently, communication over the LIFX LAN protocol is supported.

Harrison Rigg 1 Nov 17, 2021
Gecko's trusty package manager and command-line utility.

Geckos have super-powers, ya'know? Grip is Gecko's trusty package manager and command-line utility. USAGE: grip [FLAGS] [file] [SUBCOMMAND] FLAGS

Gecko 2 Jan 3, 2022
A simplified recreation of the command-line utility grep written in Rust.

smolgrep A simplified recreation of the command-line utility grep written in Rust. Download and run Download Rust On Mac/Linux Open a terminal and ent

Thi Dinh 0 Dec 27, 2021
A utility for managing cargo dependencies from the command line.

cargo edit This tool extends Cargo to allow you to add, remove, and upgrade dependencies by modifying your Cargo.toml file from the command line. Curr

Pascal Hertleif 2.7k Jan 6, 2023
A command line utility to easily make dank memes

meme-cli A command line utility to easily make dank memes. Yes, really. Installation cargo install meme-cli Alternatively, install from source using g

null 196 Dec 26, 2022
🧠 A command-line utility for switching git branches more easily. Switch branches interactively or use a fuzzy search to find that long-forgotten branch name.

git-smart-checkout A git command extension for switching git branches more efficiently. About Interactively switch branches or fuzzy search for that f

Cezar Craciun 51 Dec 29, 2022
A small command-line utility for encoding and decoding bech32 strings

A small command-line utility for encoding and decoding bech32 strings.

Charlie Moog 5 Dec 26, 2022
A lightweight command line utility with some small functions for CTFs.

Ice Ice is a lightweight command line utility to help with simple problems encountered while playing CTFs. Extracted from graveyard NOTE: Most of the

Aquib 12 Dec 19, 2022
A command-line utility which aligns a block of text within the terminal (or a specified number of columns), written in Rust.

align: a command line utility for aligning text. ⭐ Overview Aligns text within the terminal (or a specified number of columns). The text is treated as

Khalil Ouali 6 Aug 11, 2023
This is a command line utility written in Rust that helps you plan factories in Satisfactory.

Satisfactory Factory Planning Utility This is a command line utility written in Rust that helps you plan factories in Satisfactory. Tell it what you w

Maurdekye 4 Nov 29, 2023
Command-line HTTP client for sending a POST request to specified URI on each stdin line.

line2httppost Simple tool to read lines from stdin and post each line as separate POST request to a specified URL (TCP connection is reused though). G

Vitaly Shukela 3 Jan 3, 2023