An efficient way to filter duplicate lines from input, à la uniq.

Isaac Whitfield

Last update: Dec 24, 2022

Overview

runiq

This project offers an efficient way (in both time and space) to filter duplicate entries (lines) from textual input. It was born from neek, but is optimized for both speed and memory. Several filtering options are supported, depending on your data and the tradeoffs you wish to make between speed and memory usage. For a more detailed explanation, see the relevant blog post.

Installation

This tool is available via Crates.io, so you can install it directly with cargo:

$ cargo install runiq

If you'd rather just grab a pre-built binary, you might be able to download the correct binary for your architecture directly from the latest release on GitHub here. The list of binaries may not be complete, so please file an issue if your setup is missing (bonus points if you attach the appropriate binary).

Examples

$ cat << EOF >> input.txt
> this is a unique line
> this is a duplicate line
> this is another unique line
> this is a duplicate line
> this is a duplicate line
> EOF

$ cat input.txt
this is a unique line
this is a duplicate line
this is another unique line
this is a duplicate line
this is a duplicate line

$ runiq input.txt
this is a unique line
this is a duplicate line
this is another unique line

Comparisons

Here are some comparisons of runiq against other methods of filtering uniques:

Tool	Flags	Time Taken	Peak Memory
neek	N/A	55.8s	313MB
sort	-u	595s	9.07GB
uq	N/A	32.3s	1.66GB
runiq	-f digest	17.8s	64.6MB
runiq	-f naive	26.3s	1.62GB
runiq	-f bloom	36.8s	13MB

The numbers above are based on filtering unique values out of the following file:

File size:     3,290,971,321 (~3.29GB)
Line count:        5,784,383
Unique count:      2,715,727
Duplicates:        3,068,656
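
The gap in peak memory between the naive and digest rows is easier to see with a small sketch. The code below is an illustration only, not runiq's implementation: it assumes a digest-style filter keeps one 64-bit hash per unique line instead of the line itself, and `DigestFilter`, `is_new`, and the use of the standard library's `DefaultHasher` are all hypothetical choices. Under that assumption, the 2,715,727 unique lines above need about 21.7MB of raw digests before any hash-set overhead, regardless of how long the lines are.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::Hasher;

/// Hypothetical digest-style filter: remembers one 64-bit hash per unique line.
#[derive(Default)]
pub struct DigestFilter {
    seen: HashSet<u64>,
}

impl DigestFilter {
    /// Returns true the first time a line's digest is seen, false afterwards.
    /// Two different lines whose hashes collide would be treated as duplicates.
    pub fn is_new(&mut self, line: &[u8]) -> bool {
        let mut hasher = DefaultHasher::new();
        hasher.write(line);
        self.seen.insert(hasher.finish())
    }
}
```

A naive filter would store the full line in the set instead, which avoids collisions at the cost of holding every unique line in memory.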

Comments

Special characters crash the program

thread 'main' panicked at 'called Result::unwrap() on an Err value: Custom { kind: InvalidData, error: StringError("stream did not contain valid UTF-8") }', libcore\result.rs:945:5 note: Run with RUST_BACKTRACE=1 for a backtrace.

caused by a "ü" character on the next line
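
For context, "stream did not contain valid UTF-8" is the error the Rust standard library returns from its string-based readers such as `BufRead::lines`. One way to avoid it is to split the input on raw bytes, so that lines are never validated as UTF-8. The sketch below shows that approach; `for_each_line_bytes` is a hypothetical helper, not a function from runiq.

```rust
use std::io::{self, BufRead};

/// Calls `f` with each line of `input` as raw bytes, without the trailing newline.
/// No UTF-8 validation takes place, so any byte sequence is accepted.
pub fn for_each_line_bytes<R: BufRead>(
    mut input: R,
    mut f: impl FnMut(&[u8]),
) -> io::Result<()> {
    let mut buf = Vec::new();
    loop {
        buf.clear();
        // read_until returns 0 only at end of input
        if input.read_until(b'\n', &mut buf)? == 0 {
            return Ok(());
        }
        if buf.last() == Some(&b'\n') {
            buf.pop();
        }
        f(&buf);
    }
}
```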
bug enhancement

opened by kasem123 6

Add a -c (count) flag?

Would it be possible to add a -c flag to output a count of each unique line, like uniq has? A significant proportion of my usage of uniq is in the form of sort | uniq -c | sort -n, and being able to use runiq to replace that initial pair of commands would be really nice.
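
To make the request concrete, here is a rough sketch of the counting behaviour being asked for. It is not an existing runiq feature; `count_lines` is a hypothetical name. It reports each distinct line with its number of occurrences, in first-seen order, so no prior sort is needed.

```rust
use std::collections::HashMap;

/// Returns each distinct line with its number of occurrences, in first-seen order.
pub fn count_lines<'a>(lines: impl IntoIterator<Item = &'a str>) -> Vec<(&'a str, usize)> {
    // Maps a line to its position in `counts`.
    let mut index: HashMap<&'a str, usize> = HashMap::new();
    let mut counts: Vec<(&'a str, usize)> = Vec::new();
    for line in lines {
        let next = counts.len();
        let i = *index.entry(line).or_insert(next);
        if i == next {
            counts.push((line, 1)); // first time this line is seen
        } else {
            counts[i].1 += 1;
        }
    }
    counts
}
```

Note that counts can only be printed once the input ends, and every unique line has to be kept in memory so it can be printed alongside its count.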

opened by pfmoore 5

Index out of bounds

Hello,

I downloaded runiq's code and compiled myself for mac, and I'm getting the following panic:

thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 18446744073709551615', /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/libcore/slice/mod.rs:2715:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

When running runiq on the following text:

Montrer les messages depuis: Tous les messages1 Jour7 Jours2 Semaines1 Mois3 Mois6 Mois1 An Le plus ancien en premierLe plus récent en premier
Sauter vers: Sélectionner un forum----------------==Réglement== Règlement du forum==Général== Présentez-Vous Demandes de skin Administratif Demande de Partenariat Boite a Idées pour le Forum==téléchargements GTS== camions scania vovlo Daf Man Renault Mercedes-benz Iveco Camions tandem Autres camion Remorques Frigo Container Plateau bachée Citerne Benne autres remorque Maps Mods aide==Partie photo== Photo de ETS Photo de GTS Photo de ET Photo de UK Photo d'autre jeux==Discussion== Discussion de ETS/GTS... Espace Blabla==Aide== Demande d'aide Demande de TutoriauxPartie GTA San Andreas GTA Modding Les Projets de Nos Membres Constructions BrianK Disscussions Diverses sur Tous Les Grand Theft Auto Off-Topic--Autres
Index | forum gratuit | Forum gratuit d’entraide | Annuaire des forums gratuits | Signaler une violation | Conditions générales d'utilisation

Les lois concernant l'utilisation d'un logiciel varient d'un pays à l'autre. Nous n'encourageons pas l'utilisation de ce logiciel s'il est en violation avec l'une de ces lois.

Blog hébergé par CanalBlog | Plan du site | Blog Cuisine et Gastronomie créé le 07/05/2006 | Contacter l'auteur | Signaler un abus

telecharger cv en ligne gratuit a word mettre mon gratuitement telechargement,telecharger modele cv en ligne mettre mon pole emploi gratuitement,format word comment mettre mon cv en ligne sur pole emploi indeed,faire un cv en ligne et le telecharger gratuitement gratuit mettre mon pole emploi,mettre mon cv en ligne sur pole emploi word indeed ou,mettre mon cv en ligne sur indeed comment pole emploi telecharger word,a ement telecharger cv en ligne gratuit telechargement mettre mon pole emploi,telecharger mon cv en ligne word curriculum vitae mettre pole emploi modele,mettre mon cv en ligne sur pole emploi telecharger comment a format word,mettre mon cv en ligne gratuitement web comment sur pole emploi.

Vous êtes à : Accueil > Label > Galerie des sites Web labellisés - Aucune correspondance pour ces critères

The problem disappears if I remove the newlines. Also the problem is not present if I use runiq via cargo install runiq.

I want to download the code, because I want to introduce a simple modification to ignore empty lines, so that I can preserve new lines between paragraphs. Thus the vanilla version of runiq won't work for me.
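
The modification described here could look roughly like the sketch below. It is an assumption about the intended behaviour, not code from runiq: `dedup_keep_blank` is a hypothetical function that lets every empty line through and deduplicates only non-empty lines.

```rust
use std::collections::HashSet;

/// Keeps every empty line, and the first occurrence of each non-empty line.
pub fn dedup_keep_blank<'a>(lines: impl IntoIterator<Item = &'a str>) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    lines
        .into_iter()
        // Empty lines bypass the set, so paragraph breaks are preserved.
        .filter(|line| line.is_empty() || seen.insert(*line))
        .collect()
}
```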

Thank you in advance for your help.

opened by pjox 5

Suggestion: Use a Rust implementation of xxHash

Depending on a C library complicates the compilation, since it requires that the Clang compiler be present.

xxHash has already been implemented in Rust in the twox-hash crate.

Using it would probably make it easy to publish pre-built binaries for macOS, Linux and Windows, which would improve the usability of runiq :)

opened by mardukbp 1

Failed to compile xxhash-sys. Windows x64 MSVC 2019

error: failed to run custom build command for xxhash-sys v0.1.0

Caused by: process didn't exit successfully: C:\Users\Admin\AppData\Local\Temp\cargo-installfAw9KD\release\build\xxhash-sys-696d500ada48d2a1\build-script-build (exit code: 101) --- stdout running: "cmake" "C:\Users\Admin\.cargo\registry\src\github.com-1ecc6299db9ec823\xxhash-sys-0.1.0\xxHash/cmake_unofficial" "-G" "Visual Studio 16 2019" "-Thost=x64" "-Ax64" "-DCMAKE_INSTALL_PREFIX=C:\Users\Admin\AppData\Local\Temp\cargo-installfAw9KD\release\build\xxhash-sys-fa8f6d7a1b4e633c\out" "-DCMAKE_C_FLAGS= -DXXH_NAMESPACE=_rust_xxhash_sys -nologo -MD -Brepro" "-DCMAKE_C_FLAGS_RELEASE= -DXXH_NAMESPACE=_rust_xxhash_sys -nologo -MD -Brepro" "-DCMAKE_CXX_FLAGS= -nologo -MD -Brepro" "-DCMAKE_CXX_FLAGS_RELEASE= -nologo -MD -Brepro" "-DCMAKE_ASM_FLAGS= -nologo -MD -Brepro" "-DCMAKE_ASM_FLAGS_RELEASE= -nologo -MD -Brepro" "-DCMAKE_BUILD_TYPE=Release"

--- stderr thread 'main' panicked at ' failed to execute command: The system cannot find the file specified. (os error 2) is cmake not installed?

build script failed, must exit now', C:\Users\Admin\.cargo\registry\src\github.com-1ecc6299db9ec823\cmake-0.1.44\src\lib.rs:885:5 note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

opened by anotatta 0

handle broken pipe

runiq currently returns with an error when the downstream pipe is broken.

For example, I use a command:

alias bstack='f() { git reflog | grep checkout | cut -d " " -f 8 | runiq - | head ${1} | cat -n };f'

which exits with:

Error: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" } ...

This PR checks for a Broken Pipe error and exits the program with success code if found.

Unfortunately, it uses a macro; there are 4 different writes in a loop, and I didn't want to have to match on the write result each time. If there's a better way, let me know.
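
One macro-free alternative, sketched here under assumptions rather than taken from the PR: funnel all writes through a single function that uses `?`, then inspect the result once at the call site. `try_write` and `write_lines` are hypothetical names.

```rust
use std::io::{self, ErrorKind, Write};

/// Performs all writes, propagating the first error with `?`.
fn try_write<W: Write>(out: &mut W, lines: &[&[u8]]) -> io::Result<()> {
    for line in lines {
        out.write_all(line)?;
        out.write_all(b"\n")?;
    }
    out.flush()
}

/// Writes every line, treating a closed downstream pipe as a normal end of output.
/// Any other I/O error is still returned to the caller.
pub fn write_lines<W: Write>(out: &mut W, lines: &[&[u8]]) -> io::Result<()> {
    match try_write(out, lines) {
        Err(e) if e.kind() == ErrorKind::BrokenPipe => Ok(()),
        other => other,
    }
}
```

With this shape the broken-pipe check lives in one place, however many writes the loop performs.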
opened by hwchen 0

runiq as a library?

Hi,

I've been using runiq as a library to be able to provide multiple filter implementations depending on context, and while forking the repository I noticed the following lines in the main.rs file:

//! Runiq is only built as a command line tool, although it may be
//! distributed as a core crate if the backing implementation becomes
//! interesting for other use cases.

While I'm currently happy with my solution, I wonder if it would benefit people to integrate changes in the project so that runiq could be used both as a deduplication tool and library.
opened by Uinelj 4

No list of available filters

Apart from looking at the code or at the Comparison section in README, there is no way to find out the list of filters available, and what is the default filter. I would expect it to show with runiq --help.

opened by yangsheng6810 0

Feature Request: More unique uniqueness flag

As it stands, both runiq and runiq --invert always include a single instance of each value that exists more than once within the inputs. There is not, however, an option to completely omit values that occur more than once. I would like to see some sort of '--no-duped' flag (the name is open for debate), probably mutually exclusive with --invert, that filters out all occurrences of data with duplicates, rather than the current default behavior of leaving a single instance. Example:

$ cat fileA
a1
b7
c1
d3

$ cat fileB
a7
b3
d8
c1
d3

With the current behavior, runiq fileA fileB would produce:

a1
b7
c1
d3
a7
b3
d8

runiq --no-duped fileA fileB would then produce:

a1
b7
a7
b3
d8
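
A minimal sketch of the proposed semantics, assuming the flag behaves exactly as described above (`only_singletons` is a hypothetical name, and nothing here exists in runiq):

```rust
use std::collections::HashMap;

/// Returns only the lines that occur exactly once across the whole input, in input order.
pub fn only_singletons<'a>(lines: &[&'a str]) -> Vec<&'a str> {
    // First pass: count every line.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for line in lines {
        *counts.entry(*line).or_insert(0) += 1;
    }
    // Second pass: keep lines whose total count is 1.
    lines
        .iter()
        .copied()
        .filter(|line| counts[*line] == 1)
        .collect()
}
```

Unlike the default filter, this cannot stream its output: a line can only be emitted once the whole input has been read, because a later duplicate would disqualify it.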
opened by StaticPH 2

Support piped stdin

It is very common to pipe the output of cat or find to a filtering program. It would be great if runiq could be used as a cross-platform drop-in replacement for uniq.

opened by mardukbp 2

Releases (v1.2.2)

v1.2.2 (Apr 4, 2022)

v1.2.0 (Sep 26, 2020)

v1.1.4 (Oct 10, 2019)

v1.1.3 (Jan 4, 2019)

v1.1.2 (Jan 4, 2019)

v1.1.1 (Dec 14, 2018)

v1.1.0 (Oct 31, 2018)

Performance improvements and removed the requirement on UTF-8 based inputs.

v1.0.0 (May 7, 2018)

Owner

Isaac Whitfield

Fan of all things automated. OSS when applicable. Author of Cachex for Elixir. Senior Software Engineer at Axway. Intelligence wanes without practice.

⏮ ⏯ ⏭ A Rust library to easily read forwards, backwards or randomly through the lines of huge files.

EasyReader The main goal of this library is to allow long navigations through the lines of large files, freely moving forwards and backwards or gettin

81 Dec 6, 2022

The fastest way to identify any mysterious text or analyze strings from a file, just ask `lemmeknow` !

The fastest way to identify anything lemmeknow ⚡ Identify any mysterious text or analyze strings from a file, just ask lemmeknow. lemmeknow can be use

594 Dec 30, 2022

A quick way to decode a contract's transaction data with only the contract address and abi.

tx-decoder A quick way to decode a contract's transaction data with only the contract address and abi. E.g, let tx_data = "0xe70dd2fc00000000000000000

15 Feb 13, 2023

An efficient and powerful Rust library for word wrapping text.

Textwrap Textwrap is a library for wrapping and indenting text. It is most often used by command-line programs to format dynamic output nicely so it l

322 Dec 26, 2022

Cloc - cloc counts blank lines, comment lines, and physical lines of source code in many programming languages.

cloc Count Lines of Code cloc counts blank lines, comment lines, and physical lines of source code in many programming languages. Latest release: v1.9

15.3k Jan 8, 2023

Filter, Sort & Delete Duplicate Files Recursively

Deduplicator Find, Sort, Filter & Delete duplicate files Usage Usage: deduplicator [OPTIONS] [scan_dir_path] Arguments: [scan_dir_path] Run Dedupl

108 Jan 27, 2023

A version of `sort | uniq -c` with output that updates in real-time as each line is parsed

uniqtoo A version of sort | uniq -c with output that updates in real-time as each line is parsed. Usage Pipe something line-based into uniqtoo the sam

62 Oct 14, 2022

Read input lines as byte slices for high efficiency

bytelines This library provides an easy way to read in input lines as byte slices for high efficiency. It's basically lines from the standard library,

53 Sep 24, 2022

An easy and declarative way to test JSON input in Rust.

assert_json An easy and declarative way to test JSON input in Rust. assert_json is a Rust macro heavily inspired by serde json macro. Instead of creati

8 Dec 5, 2022

A fast duplicate file finder

The Directory Differential hTool DDH traverses input directories and their subdirectories. It also hashes files as needed and reports findings. The H

384 Dec 24, 2022

a rust library to find near-duplicate video files

Video Duplicate Finder vid_dup_finder finds near-duplicate video files on disk. It detects videos whose frames look similar, and where the videos are

12 Oct 28, 2022

A (aspiring to be fast) tool to find duplicate files.

find-duplicates A decently fast tool to find duplicate files. Handles symbolic and hard links and treats them separately to duplicates. Quickstart Inst

1 Jan 21, 2022

CLI tool to find duplicate files based on their hashes.

Dupper Dupper is a CLI tool that helps you identify duplicate files based on their hashes (using the Seahash hashing algorithm). Installation You can

4 Dec 27, 2022

A crate to convert bytes to something more useable and the other way around in a way Compatible with the Confluent Schema Registry. Supporting Avro, Protobuf, Json schema, and both async and blocking.

#schema_registry_converter This library provides a way of using the Confluent Schema Registry in a way that is compliant with the Java client. The rel

69 Dec 13, 2022
