An efficient way to filter duplicate lines from input, à la uniq.

Overview

runiq

This project offers an efficient way (in both time and space) to filter duplicate entries (lines) from textual input. It grew out of neek, but is optimized for both speed and memory. Several filtering options are supported, depending on your data and the tradeoffs you wish to make between speed and memory usage. For a more detailed explanation, see the relevant blog post.

Installation

This tool is available via Crates.io, so you can install it directly with cargo:

$ cargo install runiq

If you'd rather just grab a pre-built binary, you might be able to download the correct binary for your architecture directly from the latest release on GitHub here. The list of binaries may not be complete, so please file an issue if your setup is missing (bonus points if you attach the appropriate binary).

Examples

$ cat << EOF >> input.txt
> this is a unique line
> this is a duplicate line
> this is another unique line
> this is a duplicate line
> this is a duplicate line
> EOF

$ cat input.txt
this is a unique line
this is a duplicate line
this is another unique line
this is a duplicate line
this is a duplicate line

$ runiq input.txt
this is a unique line
this is a duplicate line
this is another unique line
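
As a rough illustration of what the example above demonstrates, the following Rust sketch keeps the first occurrence of each line and drops later repeats. It is not runiq's actual implementation; `unique_lines` is a hypothetical helper that simply holds every distinct line in a `HashSet`.

```rust
use std::collections::HashSet;

/// Return the lines of `input` in first-seen order, dropping any repeats.
/// `unique_lines` is a hypothetical helper, not part of runiq.
fn unique_lines(input: &str) -> Vec<&str> {
    let mut seen = HashSet::new();
    // `insert` returns false when the value is already in the set,
    // so only the first occurrence of each line passes the filter.
    input.lines().filter(|line| seen.insert(*line)).collect()
}

fn main() {
    let input = "this is a unique line\n\
                 this is a duplicate line\n\
                 this is another unique line\n\
                 this is a duplicate line\n\
                 this is a duplicate line\n";
    for line in unique_lines(input) {
        println!("{line}");
    }
}
```

Unlike `uniq`, this approach does not need sorted input, because duplicates are detected anywhere in the stream rather than only on adjacent lines.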

Comparisons

Here are some comparisons of runiq against other methods of filtering uniques:

Tool    Flags       Time Taken   Peak Memory
neek    N/A         55.8s        313MB
sort    -u          595s         9.07GB
uq      N/A         32.3s        1.66GB
runiq   -f digest   17.8s        64.6MB
runiq   -f naive    26.3s        1.62GB
runiq   -f bloom    36.8s        13MB

The numbers above are based on filtering unique values out of the following file:

File size:     3,290,971,321 (~3.29GB)
Line count:        5,784,383
Unique count:      2,715,727
Duplicates:        3,068,656
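
The gap in peak memory between the filters can be illustrated with a small sketch. One way to save memory is to remember a fixed-size digest of each line instead of the line itself. Whether this matches what `-f digest` does internally is an assumption, as is the use of the standard library's `DefaultHasher` (a comment further down reports that runiq depends on xxHash). The names `digest` and `unique_by_digest` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Reduce a line to a 64-bit digest.
/// Assumption: DefaultHasher stands in for whatever hash runiq really uses.
fn digest(line: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    line.hash(&mut hasher);
    hasher.finish()
}

/// Keep the first occurrence of each line, remembering only 8 bytes per
/// distinct line rather than the line's full contents.
fn unique_by_digest<'a>(lines: &[&'a str]) -> Vec<&'a str> {
    let mut seen: HashSet<u64> = HashSet::new();
    lines
        .iter()
        .copied()
        .filter(|line| seen.insert(digest(line.as_bytes())))
        .collect()
}

fn main() {
    let lines = ["alpha", "beta", "alpha", "gamma", "beta"];
    for line in unique_by_digest(&lines) {
        println!("{line}");
    }
}
```

For the file above, 2,715,727 unique lines at 8 bytes per digest come to roughly 21.7MB of raw keys (before any set overhead), whereas storing the lines themselves costs their full length, and the lines in this file average about 569 bytes. The price is that two different lines with the same digest would be treated as duplicates, which is very unlikely with 64 bits but not impossible.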

Comments

  • Special characters crash the program

    thread 'main' panicked at 'called Result::unwrap() on an Err value: Custom { kind: InvalidData, error: StringError("stream did not contain valid UTF-8") }', libcore\result.rs:945:5
    note: Run with RUST_BACKTRACE=1 for a backtrace.

    caused by a "ü" character on the next line

    Labels: bug, enhancement
    opened by kasem123 6
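
The panic above comes from treating input as UTF-8 text. One possible way around it, sketched here under the assumption that lines only need to be compared byte for byte, is to read with `BufRead::read_until`, which performs no UTF-8 validation. `dedup_bytes` is a hypothetical function and not taken from runiq's source.

```rust
use std::collections::HashSet;
use std::io::{self, BufRead, Write};

/// Copy the first occurrence of each line from `reader` to `writer`,
/// comparing raw bytes so that invalid UTF-8 cannot cause an error.
fn dedup_bytes<R: BufRead, W: Write>(mut reader: R, mut writer: W) -> io::Result<()> {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut buf = Vec::new();
    loop {
        buf.clear();
        // read_until keeps the delimiter and returns 0 at end of input.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break;
        }
        // Compare without the trailing newline so that an unterminated
        // final line still matches earlier copies of itself.
        let line = buf.strip_suffix(b"\n").unwrap_or(&buf[..]);
        if seen.insert(line.to_vec()) {
            writer.write_all(line)?;
            writer.write_all(b"\n")?;
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // 0xFC is "ü" in Latin-1 and is not valid UTF-8 on its own.
    let input: &[u8] = b"gr\xFCn\nblau\ngr\xFCn\n";
    let mut out = Vec::new();
    dedup_bytes(input, &mut out)?;
    assert_eq!(out, b"gr\xFCn\nblau\n");
    Ok(())
}
```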
  • Add a -c (count) flag?

    Would it be possible to add a -c flag to output a count of each unique line, like uniq has? A significant proportion of my usage of uniq is in the form of sort | uniq -c | sort -n, and being able to use runiq to replace that initial pair of commands would be really nice.

    opened by pfmoore 5
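
To make the request concrete, here is a minimal sketch of how per-line counts could be collected without sorting first. It is only an illustration: `count_lines` is a hypothetical helper, and the output format merely resembles that of `uniq -c`.

```rust
use std::collections::HashMap;

/// Count how often each line occurs, keeping first-seen order.
/// `count_lines` is a hypothetical helper, not an existing runiq API.
fn count_lines(input: &str) -> Vec<(usize, &str)> {
    // Maps each distinct line to its position in `counts`.
    let mut index: HashMap<&str, usize> = HashMap::new();
    let mut counts: Vec<(usize, &str)> = Vec::new();
    for line in input.lines() {
        let found = index.get(line).copied();
        match found {
            Some(i) => counts[i].0 += 1,
            None => {
                index.insert(line, counts.len());
                counts.push((1, line));
            }
        }
    }
    counts
}

fn main() {
    let input = "apple\nbanana\napple\ncherry\napple\n";
    for (count, line) in count_lines(input) {
        // Right-align the count, similar in spirit to `uniq -c`.
        println!("{count:>7} {line}");
    }
}
```

A count cannot be final until the whole input has been read, so a filter like this has to buffer its results and print them at the end rather than streaming lines as they arrive.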
  • Index out of bounds

    Hello,

    I downloaded runiq's code and compiled myself for mac, and I'm getting the following panic:

    thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 18446744073709551615', /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/libcore/slice/mod.rs:2715:10
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
    

    When running runiq on the following text:

    Montrer les messages depuis: Tous les messages1 Jour7 Jours2 Semaines1 Mois3 Mois6 Mois1 An Le plus ancien en premierLe plus récent en premier
    Sauter vers: Sélectionner un forum----------------==Réglement== Règlement du forum==Général== Présentez-Vous Demandes de skin Administratif Demande de Partenariat Boite a Idées pour le Forum==téléchargements GTS== camions scania vovlo Daf Man Renault Mercedes-benz Iveco Camions tandem Autres camion Remorques Frigo Container Plateau bachée Citerne Benne autres remorque Maps Mods aide==Partie photo== Photo de ETS Photo de GTS Photo de ET Photo de UK Photo d'autre jeux==Discussion== Discussion de ETS/GTS... Espace Blabla==Aide== Demande d'aide Demande de TutoriauxPartie GTA San Andreas GTA Modding Les Projets de Nos Membres Constructions BrianK Disscussions Diverses sur Tous Les Grand Theft Auto Off-Topic--Autres
    Index | forum gratuit | Forum gratuit d’entraide | Annuaire des forums gratuits | Signaler une violation | Conditions générales d'utilisation
    
    Les lois concernant l'utilisation d'un logiciel varient d'un pays à l'autre. Nous n'encourageons pas l'utilisation de ce logiciel s'il est en violation avec l'une de ces lois.
    
    Blog hébergé par CanalBlog | Plan du site | Blog Cuisine et Gastronomie créé le 07/05/2006 | Contacter l'auteur | Signaler un abus
    
    telecharger cv en ligne gratuit a word mettre mon gratuitement telechargement,telecharger modele cv en ligne mettre mon pole emploi gratuitement,format word comment mettre mon cv en ligne sur pole emploi indeed,faire un cv en ligne et le telecharger gratuitement gratuit mettre mon pole emploi,mettre mon cv en ligne sur pole emploi word indeed ou,mettre mon cv en ligne sur indeed comment pole emploi telecharger word,a ement telecharger cv en ligne gratuit telechargement mettre mon pole emploi,telecharger mon cv en ligne word curriculum vitae mettre pole emploi modele,mettre mon cv en ligne sur pole emploi telecharger comment a format word,mettre mon cv en ligne gratuitement web comment sur pole emploi.
    
    Vous êtes à : Accueil > Label > Galerie des sites Web labellisés - Aucune correspondance pour ces critères
    

    The problem disappears if I remove the newlines. The problem is also not present if I use a runiq installed via cargo install runiq.

    I want to download the code, because I want to introduce a simple modification to ignore empty lines, so that I can preserve new lines between paragraphs. Thus the vanilla version of runiq won't work for me.

    Thank you in advance for your help.

    opened by pjox 5
  • Suggestion: Use a Rust implementation of xxHash

    Depending on a C library complicates the compilation, since it requires that the Clang compiler be present.

    xxHash has already been implemented in Rust in the twox-hash crate.

    Using it would probably make it easy to publish pre-built binaries for macOS, Linux and Windows, which would improve the usability of runiq :)

    opened by mardukbp 1
  • Failed to compile xxhash-sys. Windows x64 MSVC 2019

    error: failed to run custom build command for xxhash-sys v0.1.0

    Caused by: process didn't exit successfully: C:\Users\Admin\AppData\Local\Temp\cargo-installfAw9KD\release\build\xxhash-sys-696d500ada48d2a1\build-script-build (exit code: 101)
    --- stdout
    running: "cmake" "C:\Users\Admin\.cargo\registry\src\github.com-1ecc6299db9ec823\xxhash-sys-0.1.0\xxHash/cmake_unofficial" "-G" "Visual Studio 16 2019" "-Thost=x64" "-Ax64" "-DCMAKE_INSTALL_PREFIX=C:\Users\Admin\AppData\Local\Temp\cargo-installfAw9KD\release\build\xxhash-sys-fa8f6d7a1b4e633c\out" "-DCMAKE_C_FLAGS= -DXXH_NAMESPACE=_rust_xxhash_sys -nologo -MD -Brepro" "-DCMAKE_C_FLAGS_RELEASE= -DXXH_NAMESPACE=_rust_xxhash_sys -nologo -MD -Brepro" "-DCMAKE_CXX_FLAGS= -nologo -MD -Brepro" "-DCMAKE_CXX_FLAGS_RELEASE= -nologo -MD -Brepro" "-DCMAKE_ASM_FLAGS= -nologo -MD -Brepro" "-DCMAKE_ASM_FLAGS_RELEASE= -nologo -MD -Brepro" "-DCMAKE_BUILD_TYPE=Release"

    --- stderr
    thread 'main' panicked at '
    failed to execute command: The system cannot find the file specified. (os error 2)
    is cmake not installed?

    build script failed, must exit now', C:\Users\Admin\.cargo\registry\src\github.com-1ecc6299db9ec823\cmake-0.1.44\src\lib.rs:885:5
    note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

    opened by anotatta 0
  • handle broken pipe

    runiq currently exits with an error when the downstream pipe is broken.

    For example, I use a command:

    alias bstack='f() { git reflog | grep checkout | cut -d " " -f 8 | runiq - | head ${1} | cat -n };f'
    

    which exits with:

    Error: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }
    ...
    

    This PR checks for a Broken Pipe error and exits the program with success code if found.

    Unfortunately, it uses a macro: there are four different writes in a loop, and I didn't want to have to match on the write result each time. If there's a better way, let me know.

    opened by hwchen 0
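
One alternative to a macro, offered only as a sketch and not as what the pull request does, is a small helper function that turns a broken pipe into a "stop writing" signal. `write_or_stop` and the `ClosedPipe` test writer are hypothetical names introduced for this illustration.

```rust
use std::io::{self, ErrorKind, Write};

/// Write `bytes`, returning Ok(false) if the reader has gone away.
/// Any other I/O error is passed through unchanged.
fn write_or_stop<W: Write>(writer: &mut W, bytes: &[u8]) -> io::Result<bool> {
    match writer.write_all(bytes) {
        Ok(()) => Ok(true),
        Err(e) if e.kind() == ErrorKind::BrokenPipe => Ok(false),
        Err(e) => Err(e),
    }
}

/// A writer that behaves like a pipe whose reading end has been closed.
struct ClosedPipe;

impl Write for ClosedPipe {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::from(ErrorKind::BrokenPipe))
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let lines = ["first", "second", "third"];

    // A healthy writer receives every line.
    let mut sink = Vec::new();
    for line in lines {
        if !write_or_stop(&mut sink, format!("{line}\n").as_bytes())? {
            break;
        }
    }
    assert_eq!(sink, b"first\nsecond\nthird\n");

    // A closed pipe ends the loop quietly instead of surfacing an error.
    let mut written = 0;
    for line in lines {
        if !write_or_stop(&mut ClosedPipe, line.as_bytes())? {
            break;
        }
        written += 1;
    }
    assert_eq!(written, 0);
    Ok(())
}
```

Each write site then becomes a single `if !write_or_stop(...)? { break; }`, so the loop can end normally and the program can exit with a success code once a consumer such as `head` stops reading.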
  • runiq as a library?

    Hi,

    I've been using runiq as a library to be able to provide multiple filter implementations depending on context, and while forking the repository I noticed the following lines in the main.rs file:

    //! Runiq is only built as a command line tool, although it may be
    //! distributed as a core crate if the backing implementation becomes
    //! interesting for other use cases.
    

    While I'm currently happy with my solution, I wonder if it would benefit people to integrate changes in the project so that runiq could be used both as a deduplication tool and library.

    opened by Uinelj 4
  • No list of available filters

    Apart from looking at the code or at the Comparisons section in the README, there is no way to find out which filters are available and which one is the default. I would expect this to be shown by runiq --help.

    opened by yangsheng6810 0
  • Feature Request: More unique uniqueness flag

    As it stands, both runiq and runiq --invert always include a single instance of each value that exists more than once within the inputs. There is not, however, an option to completely omit values that occur more than once. I would like to see some sort of '--no-duped' flag (the name is open for debate), probably mutually exclusive with --invert, that filters out all occurrences of data with duplicates, rather than the current default behavior of leaving a single instance.

    Example:

    $ cat fileA
    a1
    b7
    c1
    d3
    $ cat fileB
    a7
    b3
    d8
    c1
    d3
    

    With the current behavior, runiq fileA fileB would produce:

    a1
    b7
    c1
    d3
    a7
    b3
    d8
    

    runiq --no-duped fileA fileB would then produce:

    a1
    b7
    a7
    b3
    d8
    
    opened by StaticPH 2
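
The requested behavior can be sketched as a two-pass filter: count every line first, then emit only lines whose count is exactly one. This is an illustration of the idea and not an existing runiq feature; `only_singletons` is a hypothetical name, and `--no-duped` remains the proposal's placeholder.

```rust
use std::collections::HashMap;

/// Keep only lines that occur exactly once across the whole input,
/// preserving their original order.
fn only_singletons<'a>(lines: &[&'a str]) -> Vec<&'a str> {
    // First pass: count occurrences of every line.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for line in lines {
        *counts.entry(*line).or_insert(0) += 1;
    }
    // Second pass: drop every line that was seen more than once.
    lines
        .iter()
        .copied()
        .filter(|line| counts[line] == 1)
        .collect()
}

fn main() {
    // Contents of fileA followed by fileB from the example above.
    let input = ["a1", "b7", "c1", "d3", "a7", "b3", "d8", "c1", "d3"];
    let result = only_singletons(&input);
    assert_eq!(result, vec!["a1", "b7", "a7", "b3", "d8"]);
    for line in result {
        println!("{line}");
    }
}
```

Because a line can only be declared unique after all input has been seen, this mode cannot stream output the way the default filter can.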
  • Support piped stdin

    It is very common to pipe the output of cat or find to a filtering program. It would be great if runiq could be used as a cross-platform drop-in replacement for uniq.

    opened by mardukbp 2
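
A common convention for this kind of request is to read from stdin when no path is given or when the path is `-` (the pipeline in the "handle broken pipe" comment above passes `-` to runiq). The sketch below shows one way to select the input source in Rust. `open_input` is a hypothetical helper, and nothing here is taken from runiq's source.

```rust
use std::collections::HashSet;
use std::fs::File;
use std::io::{self, BufRead, BufReader};

/// Open the named file, or stdin when the path is absent or "-".
fn open_input(path: Option<&str>) -> io::Result<Box<dyn BufRead>> {
    match path {
        None | Some("-") => Ok(Box::new(BufReader::new(io::stdin()))),
        Some(p) => Ok(Box::new(BufReader::new(File::open(p)?))),
    }
}

fn main() -> io::Result<()> {
    // The first command-line argument, if any, names the input.
    let arg = std::env::args().nth(1);
    let reader = open_input(arg.as_deref())?;

    let mut seen = HashSet::new();
    for line in reader.lines() {
        let line = line?;
        // Print each line the first time it appears.
        if seen.insert(line.clone()) {
            println!("{line}");
        }
    }
    Ok(())
}
```

With a binary built from this sketch, something like `find . -type f | ./sketch -` would filter the piped lines. The filtering loop never needs to know where its input came from, since both sources are exposed through the same `BufRead` interface.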
Releases (v1.2.2)
Owner
Isaac Whitfield
Fan of all things automated. OSS when applicable. Author of Cachex for Elixir. Senior Software Engineer at Axway. Intelligence wanes without practice.