
Overview

🪓 hck

A sharp cut(1) clone.

hck is a shortening of hack, a rougher form of cut.

A close-to-drop-in replacement for cut that can use a regex delimiter instead of a fixed string. Additionally, it lets you specify the order of the output columns using the same column selection syntax as cut (see below for examples).

No single feature of hck on its own makes it stand out over awk, cut, xsv or other such tools. Where hck excels is making common things easy, such as reordering output fields, or splitting records on a weird delimiter. It is meant to be simple and easy to use while exploring datasets.

Features

  • Reordering of output columns! e.g. if you use -f4,2,8 the output columns will appear in the order 4, 2, 8
  • Delimiter treated as a regex by default (use -L for a literal delimiter), i.e. you can split on multiple spaces without an extra pipe to tr!
  • Specification of output delimiter
  • Selection of columns by header string literal with the -F option, or by regex by setting the -r flag
  • Input files will be automatically decompressed if their file extension is recognizable and a local binary exists to perform the decompression (similar to ripgrep). See Decompression.
  • Speed

Non-goals

  • hck does not aim to be a complete CSV / TSV parser à la xsv, which respects quoting rules. It acts like cut in that it splits on the delimiter no matter where in the line it appears (see the sketch after this list).
  • Delimiters cannot contain newlines... well, they can, but they will just never be seen. hck will always be a line-by-line tool where newlines are the standard \n or \r\n.
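
For example, a quoted CSV field containing the delimiter is still split at every comma, since hck applies no quoting rules (a minimal illustration; the tab-separated output shown is what the default output delimiter produces):

# quotes are not respected; every comma is a field boundary
❯ printf 'a,"b,c",d\n' | hck -d, -f2,3
"b      c"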

Install

  • Homebrew / Linuxbrew
brew tap sstadick/hck
brew install hck

* Built with profile guided optimizations

  • MacPorts
sudo port selfupdate
sudo port install hck
  • Debian (Ubuntu)
curl -LO https://github.com/sstadick/hck/releases/download/<latest>/hck-linux-amd64.deb
sudo dpkg -i hck-linux-amd64.deb

* Built with profile guided optimizations

  • With the Rust toolchain:
export RUSTFLAGS='-C target-cpu=native'
cargo install hck
  • From the releases page (the binaries have been built with profile guided optimizations)

  • Or, if you want the absolute fastest possible build that makes use of profile guided optimizations AND native cpu features:

# Assumes you are on stable rust
# NOTE: this won't work on windows, see CI for linked issue
rustup component add llvm-tools-preview
git clone https://github.com/sstadick/hck
cd hck
bash pgo_local.sh
cp ./target/release/hck ~/.cargo/bin/hck
  • PRs are both welcome and encouraged for adding more packaging options and build types! I'd especially welcome PRs for the Windows family of package managers and, in general, for making sure things are Windows friendly.

Examples

Splitting with a string literal

">
❯ hck -Ld' ' -f1-3,5- ./README.md | head -n4
#       🪓      hck

<p      align="center">
                <a      src="https://github.com/sstadick/hck/workflows/Check/badge.svg" alt="Build      Status">>

Splitting with a regex delimiter

# note, '\s+' is the default
❯ ps aux | hck -f1-3,5- | head -n4
USER    PID     %CPU    VSZ     RSS     TTY     STAT    START   TIME    COMMAND
root    1       0.0     169452  13472   ?       Ss      Jun21   0:19    /sbin/init      splash
root    2       0.0     0       0       ?       S       Jun21   0:00    [kthreadd]
root    3       0.0     0       0       ?       I<      Jun21   0:00    [rcu_gp]

Reordering output columns

❯ ps aux | hck -f2,1,3- | head -n4
PID     USER    %CPU    %MEM    VSZ     RSS     TTY     STAT    START   TIME    COMMAND
1       root    0.0     0.0     169452  13472   ?       Ss      Jun21   0:19    /sbin/init      splash
2       root    0.0     0.0     0       0       ?       S       Jun21   0:00    [kthreadd]
3       root    0.0     0.0     0       0       ?       I<      Jun21   0:00    [rcu_gp]

Changing the output record separator

❯ ps aux | hck -D'___' -f2,1,3 | head -n4
PID___USER___%CPU
1___root___0.0
2___root___0.0
3___root___0.0

Select columns with regex

# Note the output column order matches the order of the -F args
❯ ps aux | hck -r -F '^ST.*' -F '^USER$' | head -n4
STAT    START   USER
Ss      Jun21   root
S       Jun21   root
I<      Jun21   root
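
The same selection works with plain header literals when -r is not set (a minimal sketch; the rows shown are illustrative and follow the ps aux output above):

# -F values are matched as exact header strings when -r is not set
❯ ps aux | hck -F USER -F PID | head -n3
USER    PID
root    1
root    2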

Automagic decompression

">
❯ gzip ./README.md
❯ hck -Ld' ' -f1-3,5- -z ./README.md.gz | head -n4
#       🪓      hck

<p      align="center">
                <a      src="https://github.com/sstadick/hck/workflows/Check/badge.svg" alt="Build      Status">>

Splitting on multiple characters

# with a string literal
❯ printf 'this$;$is$;$a$;$test\na$;$b$;$3$;$four\n' > test.txt
❯ hck -Ld'$;$' -f3,4 ./test.txt
a       test
3       four
# with an interesting regex
❯ printf 'this123__is456--a789-test\na129_-b849-_3109_-four\n' > test.txt
❯ hck -d'\d{3}[-_]+' -f3,4 ./test.txt
a       test
3       four

Benchmarks

This set of benchmarks is simply meant to show that hck is in the same ballpark as other tools. These are meant to capture real-world usage of the tools, so in the multi-space delimiter benchmark for cut, for example, we use tr to convert the runs of spaces to a single space and then pipe to cut.

Note this is not meant to be an authoritative set of benchmarks, it is just meant to give a relative sense of performance of different ways of accomplishing the same tasks.

Hardware

Ubuntu 20, AMD Ryzen 9 3950X 16-core processor, 64 GB DDR4 memory, 1 TB NVMe drive

Data

The all_train.csv data is used.

This is a CSV dataset with 7 million lines. We test it both using , as the delimiter, and also using a run of three spaces (\s\s\s) as the delimiter.
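
The exact preprocessing of the benchmark inputs isn't reproduced here; as an assumption, the multi-space file could be derived from the CSV along these lines (hyper_data.txt and hyper_data_multichar.txt are the file names used in the tables below):

# hypothetical setup: keep the CSV as-is and make a copy with each comma replaced by three spaces
❯ cp ./all_train.csv ./hyper_data.txt
❯ sed 's/,/   /g' ./all_train.csv > ./hyper_data_multichar.txt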

PRs are welcome for benchmarks with more tools, or improved (but still realistic) pipelines for commands.

Tools

  • cut
  • mawk
  • xsv
  • tsv-utils
  • choose

Single character delimiter benchmark

Command Mean [s] Min [s] Max [s] Relative
hck -Ld, -f1,8,19 ./hyper_data.txt > /dev/null 1.387 ± 0.019 1.369 1.407 1.00
hck -Ld, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null 1.549 ± 0.002 1.547 1.552 1.12 ± 0.02
hck -d, -f1,8,19 ./hyper_data.txt > /dev/null 1.437 ± 0.001 1.436 1.438 1.04 ± 0.01
hck -d, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null 1.706 ± 0.017 1.694 1.735 1.23 ± 0.02
choose -f , -i ./hyper_data.txt 0 7 18 > /dev/null 4.333 ± 0.063 4.254 4.384 3.12 ± 0.06
tsv-select -d, -f 1,8,19 ./hyper_data.txt > /dev/null 1.708 ± 0.002 1.705 1.712 1.23 ± 0.02
xsv select -d, 1,8,19 ./hyper_data.txt > /dev/null 5.600 ± 0.010 5.589 5.615 4.04 ± 0.06
awk -F, '{print $1, $8, $19}' ./hyper_data.txt > /dev/null 4.933 ± 0.059 4.901 5.038 3.56 ± 0.06
cut -d, -f1,8,19 ./hyper_data.txt > /dev/null 7.421 ± 1.302 6.797 9.749 5.35 ± 0.94

Multi-character delimiter benchmark

Command Mean [s] Min [s] Max [s] Relative
hck -Ld' ' -f1,8,19 ./hyper_data_multichar.txt > /dev/null (note, that's three spaces) 1.827 ± 0.009 1.818 1.842 1.00
hck -Ld' ' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null (note, that's three spaces) 2.123 ± 0.013 2.105 2.133 1.16 ± 0.01
hck -d'[[:space:]]+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null 9.366 ± 0.202 9.009 9.506 5.13 ± 0.11
hck -d'[[:space:]]+' --no-mmap -f1,8,19 ./hyper_data_multichar.txt > /dev/null 9.636 ± 0.030 9.588 9.666 5.27 ± 0.03
hck -d'\s+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null 10.038 ± 0.005 10.036 10.047 5.49 ± 0.03
hck -d'\s+' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null 9.913 ± 0.113 9.725 9.997 5.43 ± 0.07
choose -f ' ' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null 6.600 ± 0.071 6.555 6.723 3.61 ± 0.04
choose -f '[[:space:]]' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null 10.764 ± 0.041 10.703 10.809 5.89 ± 0.04
choose -f '\s' -i ./hyper_data_multichar.txt 0 7 18 > /dev/null 36.866 ± 0.144 36.682 37.076 20.18 ± 0.13
awk -F' ' '{print $1, $8 $19}' ./hyper_data_multichar.txt > /dev/null 6.602 ± 0.024 6.568 6.631 3.61 ± 0.02
awk -F' ' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null 5.894 ± 0.052 5.850 5.983 3.23 ± 0.03
awk -F'[:space:]+' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null 10.962 ± 0.190 10.733 11.145 6.00 ± 0.11
< ./hyper_data_multichar.txt tr -s ' ' | cut -d ' ' -f1,8,19 > /dev/null 7.604 ± 0.096 7.521 7.730 4.16 ± 0.06
< ./hyper_data_multichar.txt tr -s ' ' | tail -n+2 | xsv select -d ' ' 1,8,19 --no-headers > /dev/null 6.757 ± 0.191 6.447 6.943 3.70 ± 0.11
< ./hyper_data_multichar.txt tr -s ' ' | hck -Ld' ' -f1,8,19 > /dev/null 6.313 ± 0.040 6.269 6.365 3.45 ± 0.03
< ./hyper_data_multichar.txt tr -s ' ' | tsv-select -d ' ' -f 1,8,19 > /dev/null 6.278 ± 0.036 6.238 6.328 3.44 ± 0.03

Decompression

The following table indicates the file extension / binary pairs that are used to try to decompress a file when the -z option is specified:

Extension | Binary | Type
*.gz | gzip -d -c | gzip
*.tgz | gzip -d -c | gzip
*.bz2 | bzip2 -d -c | bzip2
*.tbz2 | bzip2 -d -c | bzip2
*.xz | xz -d -c | xz
*.txz | xz -d -c | xz
*.lz4 | lz4 -d -c | lz4
*.lzma | xz --format=lzma -d -c | lzma
*.br | brotli -d -c | brotli
*.zst | zstd -d -c | zstd
*.zstd | zstd -q -d -c | zstd
*.Z | uncompress -c | uncompress

When a file with one of the extensions above is found, hck will open a subprocess running the corresponding decompression tool and read from that tool's output. If the binary can't be found, hck will try to read the compressed file as-is. See grep_cli for the source code. The end goal is to add a preprocessor option similar to ripgrep's.
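
Functionally this is much like piping the decompressor's output into hck yourself; a rough equivalent of the earlier gzip example (not the actual implementation) is:

# roughly what `hck -Ld' ' -f1-3,5- -z ./README.md.gz` does under the hood
❯ gzip -d -c ./README.md.gz | hck -Ld' ' -f1-3,5- | head -n4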

Profile Guided Optimization

See the pgo*.sh scripts for how to build this with optimizations. You will need to install the llvm tools via rustup component add llvm-tools-preview for this to work. Building with PGO seems to improve performance anywhere from 5-30% depending on the platform and codepath; e.g. it seems to have a larger effect on macOS, and a greater effect on the regex codepath.

TODO

  • Add complement argument
  • Don't reparse fields / headers for each new file
  • Figure out how to better reuse / share a vec
  • Support indexing from the end (unlikely though)
  • Bake in grep / filtering somehow (this will not be done at the expense of the primary utility of hck)
  • Move tests from main to core
  • Add more tests all around
  • Add pigz support
  • Add a greedy/non-greedy option that will ignore blank fields, i.e. split.filter(|s| !s.is_empty() || config.opt.non_greedy)
  • Experiment with a parallel parser as described here. This should be very doable given we don't care about escaping quotes and such.

More packages and builds

https://github.com/sharkdp/bat/blob/master/.github/workflows/CICD.yml

Comments
  • Automatic gzip decompression seems not to work, unless pigz is in the path

    hck compiled with

    just install-native
    

    This results in garbled characters:

    # gzip in PATH: This results in garbled characters.
    hck -z -f 1 test.gz
    
    # pigz in PATH: Works.
    hck -z -f 1 test.gz
    

    BTW for the fastest gzip decompression igzip of https://github.com/intel/isa-l can be used.

    See: https://github.com/zlib-ng/zlib-ng/issues/986 for an example decompression (gzip, pigz with zlib-ng as zlib library and igzip).

    opened by ghuls 10
  • Not compiling with LTO

    Hi,

    If I enable LTO in makepkg on Arch to build the package, it throws this:

    hck-0.7.0-1: parsing pkg list...
    ==> Making package: hck 0.7.0-1 (Mon 22 Nov 2021 09:45:12 AM CET)
    ==> Checking runtime dependencies...
    ==> Checking buildtime dependencies...
    ==> WARNING: Using existing $srcdir/ tree
    ==> Starting build()...
      Downloaded built v0.5.1
      Downloaded pin-project-internal v1.0.8
      Downloaded memmap2 v0.2.3
      Downloaded thiserror-impl v1.0.26
      Downloaded tinyvec v1.2.0
      Downloaded unicode-segmentation v1.7.1
      Downloaded syn v1.0.73
      Downloaded anyhow v1.0.41
      Downloaded thiserror v1.0.26
      Downloaded fnv v1.0.7
      Downloaded semver v1.0.3
      Downloaded flate2 v1.0.20
      Downloaded unicode-width v0.1.8
      Downloaded proc-macro2 v1.0.27
      Downloaded core_affinity v0.5.10
      Downloaded ripline v0.1.0
      Downloaded spin v0.9.2
      Downloaded futures-sink v0.3.16
      Downloaded jobserver v0.1.22
      Downloaded cc v1.0.68
      Downloaded nanorand v0.6.1
      Downloaded libdeflater v0.7.3
      Downloaded flume v0.10.9
      Downloaded bytecount v0.6.2
      Downloaded libdeflate-sys v0.7.3
      Downloaded futures-core v0.3.16
      Downloaded cmake v0.1.45
      Downloaded pkg-config v0.3.19
      Downloaded serde_derive v1.0.126
      Downloaded quote v1.0.9
      Downloaded memchr v2.4.0
      Downloaded globset v0.4.8
      Downloaded pin-project v1.0.8
      Downloaded grep-cli v0.1.6
      Downloaded matches v0.1.8
      Downloaded git2 v0.13.20
      Downloaded serde v1.0.126
      Downloaded gzp v0.9.2
      Downloaded libgit2-sys v0.12.21+1.1.0
      Downloaded cargo-lock v7.0.0
      Downloaded bstr v0.2.16
      Downloaded structopt v0.3.25
      Downloaded libc v0.2.97
      Downloaded structopt-derive v0.4.18
      Downloaded 44 crates (6.1 MB) in 0.62s (largest was `gzp` at 2.0 MB)
       Compiling libc v0.2.97
       Compiling proc-macro2 v1.0.27
       Compiling unicode-xid v0.2.2
       Compiling syn v1.0.73
       Compiling pkg-config v0.3.19
       Compiling log v0.4.14
       Compiling cfg-if v1.0.0
       Compiling bitflags v1.2.1
       Compiling matches v0.1.8
       Compiling serde_derive v1.0.126
       Compiling tinyvec_macros v0.1.0
       Compiling memchr v2.4.0
       Compiling serde v1.0.126
       Compiling version_check v0.9.3
       Compiling percent-encoding v2.1.0
       Compiling semver v1.0.3
       Compiling lazy_static v1.4.0
       Compiling scopeguard v1.1.0
       Compiling futures-core v0.3.16
       Compiling regex-automata v0.1.10
       Compiling regex-syntax v0.6.25
       Compiling crc32fast v1.2.1
       Compiling unicode-segmentation v1.7.1
       Compiling unicode-width v0.1.8
       Compiling anyhow v1.0.41
       Compiling fnv v1.0.7
       Compiling ansi_term v0.11.0
       Compiling futures-sink v0.3.16
       Compiling termcolor v1.1.2
       Compiling strsim v0.8.0
       Compiling vec_map v0.8.2
       Compiling humantime v2.1.0
       Compiling byteorder v1.4.3
       Compiling same-file v1.0.6
       Compiling bytecount v0.6.2
       Compiling bytes v1.1.0
       Compiling tinyvec v1.2.0
       Compiling unicode-bidi v0.3.5
       Compiling form_urlencoded v1.0.1
       Compiling lock_api v0.4.5
       Compiling textwrap v0.11.0
       Compiling heck v0.3.3
       Compiling spin v0.9.2
       Compiling proc-macro-error-attr v1.0.4
       Compiling proc-macro-error v1.0.4
       Compiling aho-corasick v0.7.18
       Compiling bstr v0.2.16
       Compiling unicode-normalization v0.1.19
       Compiling quote v1.0.9
       Compiling jobserver v0.1.22
       Compiling atty v0.2.14
       Compiling getrandom v0.2.3
       Compiling num_cpus v1.13.0
       Compiling memmap2 v0.2.3
       Compiling cc v1.0.68
       Compiling clap v2.33.3
       Compiling core_affinity v0.5.10
       Compiling nanorand v0.6.1
       Compiling ripline v0.1.0
       Compiling regex v1.5.4
       Compiling cmake v0.1.45
       Compiling idna v0.2.3
       Compiling url v2.2.2
       Compiling globset v0.4.8
       Compiling env_logger v0.9.0
       Compiling grep-cli v0.1.6
       Compiling libz-sys v1.1.3
       Compiling libgit2-sys v0.12.21+1.1.0
       Compiling libdeflate-sys v0.7.3
       Compiling libdeflater v0.7.3
       Compiling pin-project-internal v1.0.8
       Compiling thiserror-impl v1.0.26
       Compiling structopt-derive v0.4.18
       Compiling flate2 v1.0.20
       Compiling git2 v0.13.20
       Compiling thiserror v1.0.26
       Compiling structopt v0.3.25
       Compiling pin-project v1.0.8
       Compiling flume v0.10.9
       Compiling gzp v0.9.2
       Compiling toml v0.5.8
       Compiling cargo-lock v7.0.0
       Compiling built v0.5.1
       Compiling hck v0.7.0 (/home/roland/.cache/paru/clone/hck/src/hck-0.7.0)
    error: linking with `cc` failed: exit status: 1
      |
      = note: "cc" "-m64" "/home/roland/.cache/paru/clone/hck/src/hck-0.7.0/target/release/deps/hck-2fb13a3aa70533e1.hck.3cad90ea-cgu.0.rcgu.o" "-Wl,--as-needed" "-L" "/home/roland/.cache/paru/clone/hck/src/hck-0.7.0/target/release/deps" "-L" "/home/roland/.cache/paru/clone/hck/src/hck-0.7.0/target/release/build/libz-sys-f72b9bca0fe9a64d/out/lib" "-L" "/home/roland/.cache/paru/clone/hck/src/hck-0.7.0/target/release/build/libdeflate-sys-8a0eed30405bcf8f/out/lib" "-L" "/home/roland/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-Wl,-Bstatic" "/tmp/rustcQmlgX7/liblibdeflate_sys-85dff7a5270dc40e.rlib" "/tmp/rustcQmlgX7/liblibz_sys-dbc238c55867ed1d.rlib" "-Wl,--start-group" "-Wl,--end-group" "/home/roland/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcompiler_builtins-02203e01b7df4fdd.rlib" "-Wl,-Bdynamic" "-lgcc_s" "-lutil" "-lrt" "-lpthread" "-lm" "-ldl" "-lc" "-Wl,--eh-frame-hdr" "-Wl,-znoexecstack" "-L" "/home/roland/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-o" "/home/roland/.cache/paru/clone/hck/src/hck-0.7.0/target/release/deps/hck-2fb13a3aa70533e1" "-Wl,--gc-sections" "-pie" "-Wl,-zrelro" "-Wl,-znow" "-Wl,-O1" "-nodefaultlibs"
      = note: /usr/bin/ld: /tmp/rustcQmlgX7/liblibdeflate_sys-85dff7a5270dc40e.rlib: error adding symbols: file format not recognized
              collect2: error: ld returned 1 exit status
    
    
    error: could not compile `hck` due to previous error
    ==> ERROR: A failure occurred in build().
        Aborting...
    error: failed to build 'hck-0.7.0-1':
    
    opened by roland-rollo 9
  • Output delimiter should be same as input

    cut automatically uses the input delimiter as the output delimiter. I find it quite annoying to have to specify the delimiter twice. I wonder if you would consider changing the default behaviour to do so? I appreciate this is a breaking change, but might be better to do this before hitting v1?

    opened by mbhall88 8
  • regex parser is faster than the literal one.

    Not a big issue, but for a test 10G TSV file, the "regex" parser is faster than the literal one. (hck with PGO)

    # 10G TSV file.
    
    $ wc -l e.sorted.tsv
    71741554 e.sorted.tsv
    
    $ timeit hck -d '\t' -f 1,2 /dev/shm/e.sorted.tsv > /dev/null
    
    Time output:
    ------------
    
      * Command: hck -d \t -f 1,2 /dev/shm/e.sorted.tsv
      * Elapsed wall time: 0:12.01 = 12.01 seconds
      * Elapsed CPU time:
         - User: 8.82
         - Sys: 3.12
      * CPU usage: 99%
      * Context switching:
         - Voluntarily (e.g.: waiting for I/O operation): 34
         - Involuntarily (time slice expired): 14
      * Maximum resident set size (RSS: memory) (kiB): 9563064
      * Number of times the process was swapped out of main memory: 0
      * Filesystem:
         - # of inputs: 19840
         - # of outputs: 0
      * Exit status: 0
    
    $ timeit hck -d $'\t' -f 1,2 /dev/shm/e.sorted.tsv > /dev/null
    
    Time output:
    ------------
    
      * Command: hck -d 	 -f 1,2 /dev/shm/e.sorted.tsv
      * Elapsed wall time: 0:11.45 = 11.45 seconds
      * Elapsed CPU time:
         - User: 8.39
         - Sys: 3.00
      * CPU usage: 99%
      * Context switching:
         - Voluntarily (e.g.: waiting for I/O operation): 36
         - Involuntarily (time slice expired): 21
      * Maximum resident set size (RSS: memory) (kiB): 9563016
      * Number of times the process was swapped out of main memory: 0
      * Filesystem:
         - # of inputs: 21840
         - # of outputs: 0
      * Exit status: 0
    
    $ timeit hck -L -d $'\t' -f 1,2 /dev/shm/e.sorted.tsv > /dev/null
    
    Time output:
    ------------
    
      * Command: hck -L -d 	 -f 1,2 /dev/shm/e.sorted.tsv
      * Elapsed wall time: 0:13.61 = 13.61 seconds
      * Elapsed CPU time:
         - User: 10.56
         - Sys: 2.98
      * CPU usage: 99%
      * Context switching:
         - Voluntarily (e.g.: waiting for I/O operation): 40
         - Involuntarily (time slice expired): 19
      * Maximum resident set size (RSS: memory) (kiB): 9563132
      * Number of times the process was swapped out of main memory: 0
      * Filesystem:
         - # of inputs: 26352
         - # of outputs: 0
      * Exit status: 0
    
    
    $ timeit hck  -d 072989423308200b88dd3ce688a7dcff -f 1,2 /dev/shm/e.sorted.tsv > /dev/null
    
    Time output:
    ------------
    
      * Command: hck -d 072989423308200b88dd3ce688a7dcff -f 1,2 /dev/shm/e.sorted.tsv
      * Elapsed wall time: 0:13.28 = 13.28 seconds
      * Elapsed CPU time:
         - User: 9.80
         - Sys: 3.40
      * CPU usage: 99%
      * Context switching:
         - Voluntarily (e.g.: waiting for I/O operation): 46
         - Involuntarily (time slice expired): 20
      * Maximum resident set size (RSS: memory) (kiB): 9563176
      * Number of times the process was swapped out of main memory: 0
      * Filesystem:
         - # of inputs: 26304
         - # of outputs: 0
      * Exit status: 0
    
    $ timeit hck -L -d 072989423308200b88dd3ce688a7dcff -f 1,2 /dev/shm/e.sorted.tsv > /dev/null
    
    Time output:
    ------------
    
      * Command: hck -L -d 072989423308200b88dd3ce688a7dcff -f 1,2 /dev/shm/e.sorted.tsv
      * Elapsed wall time: 0:20.61 = 20.61 seconds
      * Elapsed CPU time:
         - User: 17.04
         - Sys: 3.44
      * CPU usage: 99%
      * Context switching:
         - Voluntarily (e.g.: waiting for I/O operation): 93
         - Involuntarily (time slice expired): 30
      * Maximum resident set size (RSS: memory) (kiB): 9563024
      * Number of times the process was swapped out of main memory: 0
      * Filesystem:
         - # of inputs: 53440
         - # of outputs: 0
      * Exit status: 0
    
    
    opened by ghuls 8
  • Crash when first line is empty and -L is specified with a one character separator.

    # File with empty first line.
    $ printf '\n1\t2\t3\n4\t5\t6\n'
    
    1       2       3
    4       5       6
    
    # Get first 2 columns with regex "\t".
    $ printf '\n1\t2\t3\n4\t5\t6\n' | hck -d '\t' -f 1,2 -
    
    1       2
    4       5
    
    # Get first 2 columns with regex code using a real TAB character (created by bash).
    $ printf '\n1\t2\t3\n4\t5\t6\n' | hck -d $'\t' -f 1,2 -
    
    1       2
    4       5
    
    
    # Get first 2 columns with literal separator option using a real TAB character (created by bash).
    $ printf '\n1\t2\t3\n4\t5\t6\n' | hck -L -d $'\t' -f 1,2 -
    thread 'main' panicked at 'attempted to index slice up to maximum usize', /software/hck/src/lib/core.rs:528:59
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    
    # Will print all columns as "\t" is not treated as a regex.
    $ printf '\n1\t2\t3\n4\t5\t6\n' | hck -L -d '\t' -f 1,2 -
    
    1       2       3
    4       5       6
    
    # With one space as delimiter, we get a crash too when using the literal separator option.
    $ printf '\n1\t2\t3\n4\t5\t6\n' | hck -L -d ' ' -f 1,2 -
    thread 'main' panicked at 'attempted to index slice up to maximum usize', /software/hck/src/lib/core.rs:528:59
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    
    # With two spaces as delimiter, it works.
    $ printf '\n1\t2\t3\n4\t5\t6\n' | hck -L -d '  ' -f 1,2 -
    
    1       2       3
    4       5       6
    
    
    opened by ghuls 5
  • Using `+` with `choose` field separator considered harmful

    Hey there, I just discovered this project from following the breadcrumbs from your issue over on choose! I noticed you had some great benchmark reports (really impressive performance btw!). It looks like you are testing a couple choice inputs to choose using regexes with a + at the end. By default, field separators in choose are greedy, and so using + is not necessary, and actually hurts performance (my understanding is that it is typical for regex engine's performance to be hurt by any repetition).

    So where you have choose -f '[[:space:]]+' -i ./hyper_data.txt 0 7 18 > /dev/null, I recommend you instead use choose -f '[[:space:]]' -i ./hyper_data.txt 0 7 18 > /dev/null (note the lack of +). It looks like there are a few other spots as well unnecessarily using +.

    Admittedly, this is something of a foot gun, so if you felt that it would be good to demonstrate both, that would also be fair. I am thinking of adding some docs explaining that this should be avoided, and possibly stripping repeat operators in code as well.

    For reference, with a quick test using time, it looks like adding the + makes choose take very roughly twice as long:

    ➜  choose git:(master) ✗ time choose -i input.txt -f '[[:space:]]' 3:5 > /dev/null
    choose -i test/long_long_long_long_long.txt -f '[[:space:]]' 3:5 > /dev/null  4.26s user 0.13s system 99% cpu 4.415 total
    ➜  choose git:(master) ✗ time choose -i input.txt -f '[[:space:]]+' 3:5 > /dev/null
    choose -i test/long_long_long_long_long.txt -f '[[:space:]]+' 3:5 > /dev/null  8.31s user 0.17s system 99% cpu 8.507 total
    
    opened by theryangeary 4
  • Values for parameters which start with "-" give problems, as they are parsed as parameters first.

    ❯  printf '1- 2\n3- 4\n'
    1- 2
    3- 4
    
    ❯  printf '1- 2\n3- 4\n' | hck -f 2 -d '- '
    error: Found argument '- ' which wasn't expected, or isn't valid in this context
    
    USAGE:
        hck --delimiter <delimiter> --fields <fields>
    
    For more information try --help
    
    ❯  printf '1- 2\n3- 4\n' | hck -f 2 -d -- '- '
    [2021-11-15T09:40:14Z ERROR hck] No such file or directory (os error 2)
    
    ❯  printf '1- 2\n3- 4\n' | hck -f 2 -d '\- '
    2
    4
    
    # Try to print first 2 columns:
    ❯  printf '1- 2- 5\n3- 4- 9\n' | hck -f '-2' -d '\- '
    error: Found argument '-2' which wasn't expected, or isn't valid in this context
    
    USAGE:
        hck --fields <fields>
    
    For more information try --help
    
    ❯  printf '1- 2- 5\n3- 4- 9\n' | hck -f '\-2' -d '\- '
    [2021-11-15T09:50:00Z ERROR hck] Failed to parse field: \-2
    
    
    opened by ghuls 3
  • Issues with header based column selection

    1. Giving a filename isn't working. Passing stdin works.
    $ printf 'a,b,c\n1,2,3\n' > test.csv
    
    # hangs
    $ hck -d, -F "a" test.csv
    ^C
    
    # stdin works
    $ printf 'a,b,c\n1,2,3\n' | hck -d, -F "a"
    a
    1
    
    2. If -F isn't the last option when giving a filename, it works, but the header gets printed twice.
    $ hck -F "a" -F "b" -d, test.csv
    a	b
    a	b
    1	2
    
    # stdin works as expected again
    $ printf 'a,b,c\n1,2,3\n' | hck -F "a" -F "b" -d,
    a	b
    1	2
    
    3. Last field header isn't being recognized. Works if the input is a single line without a newline.
    $ printf 'a,b,c\n1,2,3\n' | hck -d, -F "c"
    [2021-07-15T10:39:57Z ERROR hck] No headers matched
    
    $ printf 'a,b,c' | hck -d, -F "c"
    c
    
    4. There's no error if there are unmatched headers as long as at least one header matched. If this is preferred as the default behavior, adding an option to get an error would be useful.
    $ printf 'a,b,c\n1,2,3\n' | hck -d, -F "a" -F "xyz"
    a
    1
    
    opened by learnbyexample 3
  • Why restrict on single character delimiter (in non regex mode)

    (apologies for edit, fat thumbs)

    While cut doesn't support it, I don't see why multichar delimiters should not be allowed; yes it's available with regex (but needs escaping sometimes) but why have the restriction in the first place?

    It doesn't even really matter for cut compat, given that cut doesn't support them; all it would do is not error for cut-invalid values — presumably not something a script would rely on.

    opened by passcod 3
  • Windows released binary should have the extension ".exe"

    Please change i.e. https://github.com/sstadick/hck/releases/download/v0.7.1/hck-windows-amd64 to https://github.com/sstadick/hck/releases/download/v0.7.1/hck-windows-amd64.exe ;)

    Thanks :)

    enhancement good first issue 
    opened by AntonOks 2
  • Limit default number of threads for bgzip compression to max 4 instead of max CPUs.

    Limit default number of threads for bgzip compression to max 4 instead of max CPUs. 4 threads for compression give a quite good compression speed vs CPU usage ratio. Most servers nowadays have e.g. 40 or more CPU threads. Spawning that many threads by default is not great for performance.

    opened by ghuls 2
  • [feature request] select rightmost/trailing indices

    This might not be feasible, but I'd really like to be able to select rightmost/trailing fields from each line. My use-case is for flattening an arbitrarily nested tree of files, where I only care about the parent directory and the filename of the files.

    EX: Imagine this is the file structure

    a
    |-b
    | |-c.jpg
    | \-d
    |   |-e.jpg
    |   \-f.jpg
    |-g
    | |-h
    | | \-i.jpg
    | \-j.jpg
    \-k.jpg
    

    If I ran find ./a -type f -regex '.*\.jpg' in the parent of a/, it would produce this output:

    ./a/b/c.jpg
    ./a/b/d/e.jpg
    ./a/b/d/f.jpg
    ./a/g/h/i.jpg
    ./a/g/j.jpg
    ./a/k.jpg
    

    I want to generate the following output by selecting the 2 rightmost fields from this input stream when splitting on / and joining on _ (i.e. equivalent to hck -d/ -D_ ...)

    b_c.jpg
    d_e.jpg
    d_f.jpg
    h_i.jpg
    g_j.jpg
    a_k.jpg
    

    Would this be feasible? I can think of a couple signifiers for the potential option flag if it is, but my preference would be +f 2,1 (i.e. an inverted -f).

    enhancement help wanted 
    opened by jskrzypek 2
  • Add how to install HCK in Windows

    For now one will have to add my SCOOP repo to install HCK via SCOOP. Once #52 is fixed, I will send a pull request to https://github.com/ScoopInstaller/Extras to make HCK available with the default SCOOP installation

    opened by AntonOks 3
  • how things are decompressed should be explained

    "Input files will be automatically decompressed if their file extension is recognizable and a local binary exists to perform the decompression (similar to ripgrep)"

    This is very vague for new users and annoying to cross-check how ripgrep does it. Please provide a more accurate description and/or link the functionality description.

    likely related https://github.com/BurntSushi/ripgrep/issues/539

    documentation 
    opened by matu3ba 5