qsv: Ultra-fast CSV data-wrangling toolkit

Overview

qsv is a command line program for indexing, slicing, analyzing, splitting, enriching, validating & joining CSV files. Commands are simple, fast & composable.

  Table of Contents

* Available Commands
* Installation
* Whirlwind Tour
* Cookbook
* FAQ
* Changelog
* Performance Tuning
* Benchmarks
* NYC School of Data 2022 slides
* Sponsor

ℹ️ NOTE: qsv is a fork of the popular xsv utility, merging several pending PRs since xsv 0.13.0's May 2018 release. It also has numerous new features & 53 additional commands/subcommands/operations (for a total of 73). See FAQ for more details.

Available commands

Command Description
apply12 Apply a series of string, date, currency & geocoding transformations to a CSV column. It also has some basic NLP functions (similarity, sentiment analysis, profanity, eudex & language detection).
behead Drop headers from a CSV.
cat Concatenate CSV files by row or by column.
count3 Count the rows in a CSV file. (Instantaneous with an index.)
dedup42 Remove redundant rows.
enum Add a new column enumerating rows with incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value.
excel Exports a specified Excel/ODS sheet to a CSV file.
exclude3 Removes a set of CSV data from another set based on the specified columns.
explode Explode rows into multiple ones by splitting a column value based on the given separator.
extsort2 Sort an arbitrarily large CSV/text file using a multithreaded external merge sort algorithm.
fetch Fetches HTML/data from web pages or web services for every row. Comes with jql JSON query language support, dynamic throttling (RateLimit) & caching with optional Redis support for persistent caching.
fill Fill empty values.
fixlengths Force a CSV to have same-length records by either padding or truncating them.
flatten A flattened view of CSV records. Useful for viewing one record at a time.
e.g. qsv slice -i 5 data.csv | qsv flatten.
fmt Reformat a CSV with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.)
foreach1 Loop over a CSV to execute bash commands. (not available on Windows)
frequency35 Build frequency tables of each column. (Uses multithreading to go faster if an index is present.)
generate1 Generate test data by profiling a CSV using Markov decision process machine learning.
headers Show the headers of a CSV. Or show the intersection of all headers between many CSV files.
index Create an index for a CSV. This is very quick & provides constant time indexing into the CSV file. Also enables multithreading for frequency, split, stats and schema commands.
input3 Read CSV data with special quoting, trimming, line-skipping and UTF-8 transcoding rules. Typically used to "normalize" a CSV for further processing with other qsv commands.
join3 Inner, outer, cross, anti & semi joins. Uses a simple hash index to make it fast.
jsonl Convert newline-delimited JSON (JSONL/NDJSON) to CSV.
lua1 Execute a Lua 5.4.4 script over CSV lines to transform, aggregate or filter them.
partition Partition a CSV based on a column value.
pseudo Pseudonymise the values of the given column by replacing them with incremental identifiers.
py1 Evaluate a Python expression over CSV lines to transform, aggregate or filter them. Python's f-strings are particularly useful for extended formatting (Python 3.8+ required).
rename Rename the columns of a CSV efficiently.
replace Replace CSV data using a regex.
reverse4 Reverse order of rows in a CSV. Unlike the sort --reverse command, it preserves the order of rows with the same key.
sample3 Randomly draw rows (with optional seed) from a CSV using reservoir sampling (i.e., use memory proportional to the size of the sample).
schema5 Infer schema from CSV data and output in JSON Schema format. Uses multithreading to go faster if an index is present. See validate command.
search Run a regex over a CSV. Applies the regex to each field individually & shows only matching rows.
searchset Run multiple regexes over a CSV in a single pass. Applies the regexes to each field individually & shows only matching rows.
select Select, re-order, duplicate or drop columns.
slice34 Slice rows from any part of a CSV. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice).
sniff3 Quickly sniff CSV metadata (delimiter, header row, preamble rows, quote character, flexible, is_utf8, number of records, number of fields, field names & data types).
sort2 Sorts CSV data in alphabetical, numerical, reverse or random (with optional seed) order.
split35 Split one CSV file into many CSV files of N chunks. (Uses multithreading to go faster if an index is present.)
stats345 Infer data type & compute descriptive statistics for each column in a CSV (sum, min/max, min/max length, mean, stddev, variance, nullcount, quartiles, IQR, lower/upper fences, skewness, median, mode & cardinality). Uses multithreading to go faster if an index is present.
table4 Show aligned output of a CSV using elastic tabstops.
transpose4 Transpose rows/columns of a CSV.
validate32 Validate CSV data with JSON Schema (see schema command). If no JSON Schema file is provided, validates whether a CSV conforms to the RFC 4180 standard.
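
Commands read CSV from files or stdin and write CSV to stdout, so they compose naturally in pipelines. An illustrative example (assuming a data.csv with City and Population columns):

# build frequency tables for every column and view them aligned with elastic tabstops
qsv frequency data.csv | qsv table

# select two columns, drop duplicate rows, and keep the first 10
qsv select City,Population data.csv | qsv dedup | qsv slice --len 10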

Installation

Pre-built binaries for Windows, Linux and macOS are available from GitHub, including binaries compiled with Rust Nightly/Unstable (click here for more info).

There are three versions of qsv. qsv enables all features valid for the target platform6; qsvlite has all features disabled (half the size of qsv); qsvdp is optimized for use with DataPusher+, with only DataPusher+ relevant commands and the self-update engine removed (a sixth of the size of qsv).

Alternatively, you can compile from source by installing Cargo (Rust's package manager) and installing qsv using Cargo:

cargo install qsv --features full

If you encounter compilation errors, ensure you're using the exact version of the dependencies qsv was built with by issuing:

cargo install qsv --locked --features full

Compiling from this repository also works similarly:

git clone [email protected]:jqnatividad/qsv.git
cd qsv
cargo build --release --features full
# or if you encounter compilation errors
cargo build --release --locked --features full

The compiled binary will end up in ./target/release/.
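
You can then verify the build by invoking the binary directly, e.g.:

./target/release/qsv --version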

To enable optional features, use cargo --features (see Feature Flags for more info):

cargo install qsv --features apply,generate,lua,fetch,foreach,python,full
# or to build qsvlite
cargo install qsv --features lite
# or to build qsvdp
cargo install qsv --features datapusher_plus

# or when compiling from a local repo
cargo build --release --features apply,generate,lua,fetch,foreach,python,full
# for qsvlite
cargo build --release --features lite
# for qsvdp
cargo build --release --features datapusher_plus

Minimum Supported Rust Version

Building qsv requires Rust stable - currently version 1.62.0.

Tab Completion

qsv's command-line options are quite extensive. Thankfully, since it uses docopt for CLI processing, we can take advantage of docopt.rs' tab completion support to make it easier to use qsv at the command-line (currently, only bash shell is supported):

# install docopt-wordlist
cargo install docopt

# IMPORTANT: run these commands from the root directory of your qsv git repository
# to setup bash qsv tab completion
echo "DOCOPT_WORDLIST_BIN=\"$(which docopt-wordlist)"\" >> $HOME/.bash_completion
echo "source \"$(pwd)/scripts/docopt-wordlist.bash\"" >> $HOME/.bash_completion
echo "complete -F _docopt_wordlist_commands qsv" >> $HOME/.bash_completion

File formats

qsv recognizes UTF-8/ASCII encoded CSV (.csv) and TSV (.tsv and .tab) files. CSV files are assumed to have "," (comma) as a delimiter, and TSV files "\t" (tab). The delimiter is a single ASCII character that can be set with the --delimiter command-line option or the QSV_DEFAULT_DELIMITER environment variable, or automatically detected when QSV_SNIFF_DELIMITER is set.
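
For example, the three mechanisms look like this (illustrative, assuming a semicolon-delimited data.csv):

# set the delimiter explicitly for one invocation
qsv headers --delimiter ";" data.csv

# set a default delimiter for the session
export QSV_DEFAULT_DELIMITER=";"
qsv headers data.csv

# let qsv detect the delimiter automatically
export QSV_SNIFF_DELIMITER=1
qsv headers data.csv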

When using the --output option, note that qsv will UTF-8 encode the file and automatically change the delimiter used in the generated file based on the file extension - i.e. comma for .csv, tab for .tsv and .tab files.

JSONL/NDJSON files are also recognized and converted to CSV with the jsonl command.

The fetch command also produces JSONL files when it's invoked without the --new-column option.

The sniff and validate commands produce JSON files with their --json and --pretty-json options.

The schema command produces JSON Schema Validation (Draft 7) files with the ".schema.json" file extension, which can be used with the validate command.
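
Put together, a typical schema/validate round trip looks like this (illustrative, assuming a file named data.csv):

qsv schema data.csv                          # infers and writes data.csv.schema.json
qsv validate data.csv data.csv.schema.json   # validate the file against the generated schema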

The excel command recognizes Excel and Open Document Spreadsheet (ODS) files (.xls, .xlsx, .xlsm, .xlsb and .ods files).

RFC 4180

qsv validates against the RFC 4180 CSV standard. In practice, however, CSV formats vary significantly, so qsv is not strictly compliant with the specification, allowing it to process "real-world" CSV files. qsv leverages the awesome Rust CSV library, which in turn is built on top of the csv-core library to read CSV files.

Click here to find out more about how qsv conforms to the standard with csv-core.

UTF-8 Encoding

qsv requires UTF-8 encoded (of which ASCII is a subset) input files. On startup, it scans the input to check whether it's UTF-8 encoded (for files, the first 8k; for stdin, the entire buffer), and aborts if it's not, unless QSV_SKIPUTF8_CHECK is set. On Linux and macOS, UTF-8 encoding is the default.

Previously, several qsv commands worked with non-UTF-8 encoded files. But to increase performance, qsv versions since 0.38 make extensive use of from_utf8_unchecked so as not to pay the repetitive UTF-8 validation penalty, no matter how small, even for files that are already UTF-8 encoded.

Only two commands still work with non-UTF-8 encoded files - sniff and input. sniff is meant to screen files and detect whether a file is UTF-8 encoded; input, to prepare files for further qsv processing.

Should you need to re-encode CSV/TSV files, you can use the input command to transcode to UTF-8. It will replace all invalid UTF-8 sequences with the Unicode replacement character (�). Alternatively, there are several utilities you can use to do so on Linux/macOS and Windows.
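
For instance, a non-UTF-8 file can be transcoded first and then processed normally (illustrative, assuming a Latin-1 encoded latin1.csv):

qsv input latin1.csv --output clean.csv
qsv stats clean.csv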

Windows Usage Note

Unlike other modern operating systems, Windows' default encoding is UTF16-LE. This causes problems when redirecting qsv's output to a CSV file and opening it with Excel, which then ignores the comma delimiter and puts everything in the first column:

qsv stats wcp.csv > wcpstats.csv

Which is weird, since you would think Microsoft Excel would properly recognize UTF16-LE encoded CSV files. Regardless, to create a properly UTF-8 encoded file, use the --output option instead:

qsv stats wcp.csv --output wcpstats.csv

Environment Variables

Variable Description
QSV_DEFAULT_DELIMITER single ASCII character to use as delimiter. Overrides the --delimiter option. Defaults to "," (comma) for CSV files and "\t" (tab) for TSV files when not set. Note that this will also set the delimiter for qsv's output to stdout.
However, using the --output option, regardless of this environment variable, will automatically change the delimiter used in the generated file based on the file extension - i.e. comma for .csv, tab for .tsv and .tab files.
QSV_SNIFF_DELIMITER if set, the delimiter is automatically detected. Overrides QSV_DEFAULT_DELIMITER and --delimiter option. Note that this does not work with stdin.
QSV_NO_HEADERS if set, the first row will NOT be interpreted as headers. Supersedes QSV_TOGGLE_HEADERS.
QSV_TOGGLE_HEADERS if set to 1, toggles header setting - i.e. inverts qsv header behavior, with no headers being the default, and setting --no-headers will actually mean headers will not be ignored.
QSV_AUTOINDEX if set, automatically create an index when none is detected. Also automatically update stale indices.
QSV_COMMENT_CHAR set to an ASCII character. If set, any lines (including the header) that start with this character are ignored.
QSV_MAX_JOBS number of jobs to use for multithreaded commands (currently apply, dedup, extsort, frequency, schema, sort, split, stats and validate). If not set, max_jobs is set to the detected number of logical processors. See Multithreading for more info.
QSV_NO_UPDATE if set, prohibit self-update version check for the latest qsv release published on GitHub.
QSV_PREFER_DMY if set, date parsing will use DMY format. Otherwise, use MDY format (used with apply datefmt, schema, sniff & stats commands).
QSV_REGEX_UNICODE if set, makes the search, searchset and replace commands unicode-aware. For increased performance, these commands are not unicode-aware by default: they will ignore unicode values when matching and will panic when unicode characters are used in the regex.
QSV_SKIPUTF8_CHECK if set, skip UTF-8 encoding check. Otherwise, qsv scans the first 8k of files. For stdin, it scans the entire buffer.
QSV_RDR_BUFFER_CAPACITY reader buffer size (default (bytes): 16384)
QSV_WTR_BUFFER_CAPACITY writer buffer size (default (bytes): 65536)
QSV_LOG_LEVEL desired level (default - off; error, warn, info, trace, debug).
QSV_LOG_DIR when logging is enabled, the directory where the log files will be stored. If the specified directory does not exist, qsv will attempt to create it. If not set, the log files are created in the directory where qsv was started. See Logging for more info.
QSV_REDIS_CONNECTION_STRING the fetch command can use Redis to cache responses. Set to connect to the desired Redis instance. (default: redis://127.0.0.1:6379)
QSV_REDIS_MAX_POOL_SIZE the maximum Redis connection pool size. (default: 20).
QSV_REDIS_TTL_SECONDS set time-to-live of Redis cached values (default (seconds): 2419200 (28 days)).
QSV_REDIS_TTL_REFRESH if set, enables cache hits to refresh TTL of cached values.
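
For example, to enable debug logging and cap multithreaded commands at 4 jobs for a session (illustrative):

export QSV_LOG_LEVEL=debug
export QSV_MAX_JOBS=4
qsv stats data.csv --output stats.csv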

Several dependencies also have environment variables that influence qsv's performance & behavior.

ℹ️ NOTE: To get a list of all active qsv-relevant environment variables, run qsv --envlist.

Feature Flags

qsv has several features:

  • mimalloc (default) - use the mimalloc allocator (see Memory Allocator for more info).
  • apply - enable apply command. This swiss-army knife of CSV transformations is very powerful, but it has a lot of dependencies that increase both compile time and binary size.
  • fetch - enable fetch command.
  • generate - enable generate command.
  • full - enable to build qsv.
  • lite - enable to build qsvlite.
  • datapusher_plus - enable to build qsvdp - the DataPusher+ optimized qsv binary.
  • nightly - enable to turn on nightly/unstable features in the rand and regex crates when building with Rust nightly/unstable.

The following "power-user" commands can be abused and present "foot-shooting" scenarios.

  • lua - enable lua command. Embeds a Lua 5.4.4 interpreter into qsv.
  • foreach - enable foreach command (not valid for Windows).
  • python - enable py command (requires Python 3.8+). Note that qsv will automatically use the currently activated python version when running in a virtual environment.

ℹ️ NOTE: qsvlite, as the name implies, always has non-default features disabled. qsv can be built with any combination of the above features using the cargo --features & --no-default-features flags. The pre-built qsv binaries have all applicable features valid for the target platform6.
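
For example, to build a custom feature combination without the default mimalloc allocator (a sketch; the exact feature set to use depends on your qsv version):

cargo install qsv --no-default-features --features apply,fetch,full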

License

Dual-licensed under MIT or the UNLICENSE.

Sponsor

qsv was made possible by
datHere Logo
Standards-based, best-of-breed, open source solutions
to make your Data Useful, Usable & Used.

Naming Collision

This project is unrelated to Intel's Quick Sync Video.

Footnotes

  1. enabled by optional feature flag. Not available on qsvlite.

  2. multithreaded even without an index.

  3. uses an index when available.

  4. loads the entire CSV into memory. Note that dedup, stats & transpose have modes that do not load the entire CSV into memory.

  5. multithreaded when an index is available.

  6. The foreach feature is not available on Windows. The python feature is not enabled on cross-compiled pre-built binaries as we don't have access to a native python interpreter for those platforms (aarch64, i686, and arm) on GitHub's action runners. Compile natively on those platforms with Python 3.8+ installed, if you want to enable the python feature.

Comments
  • Create `schema` command

    stats does a great job of not only computing descriptive statistics for a CSV, it also infers each column's data type. frequency compiles a frequency table.

    The schema command will use the output of the stats, and optionally frequency (to specify the valid range of a field), to create a json schema file that can be used with the validate command (#46) to validate a CSV against the generated schema.

    With the combo addition of schema and validate, qsv can be used in a more bullet-proof automated data pipeline that can fail gracefully when there are data quality issues:

    • use schema to create a json schema from a representative CSV file for a feed
    • adjust the schema to fine-tune the validation rules
    • use validate at the beginning of a data pipeline and fail gracefully when validate fails
    • for extra large files, use sample to validate against a sample
    • or alternatively, partition the CSV to break down the pipeline into smaller jobs
    enhancement 
    opened by jqnatividad 38
  • `fetch` command

    fetch will allow qsv to fetch HTML or data from web pages or services, to enrich a CSV (e.g. geocoding, wikidata api, etc.)

    It will support authentication, concurrent requests, thresholds, etc.

    Reminiscent of OpenRefine's fetch url... (https://docs.openrefine.org/manual/columnediting#add-column-by-fetching-urls) , but optimized for the command line.

    enhancement 
    opened by jqnatividad 20
  • Auto-detect delimiter

    Is your feature request related to a problem? Please describe.

    In my daily work, a .csv file is comma-separated 50% of the time, and semi-colon-separated the other 50%. I use this command-line tool hundreds of times a day. It's incredibly frustrating to have to first figure out which separator character is being used, and then adjust my command appropriately.

    Describe the solution you'd like

    Ideally qsv would auto-detect the delimiter by default. However, this would break backwards compatibility, so I suggest having an environment variable to turn this on.

    $ export QSV_AUTO_DETECT_DELIMITER=1
    $ qsv table my_file.csv
    

    When this environment variable is set, any value of QSV_DELIMITER would be ignored.

    Describe alternatives you've considered

    I tried to achieve this using a bash wrapper, but it was a bit fiddly because I need to do different things depending on whether qsv is being passed a file (in which case I sniff the file and then pass the delimiter to the qsv command) or a stream (in which case I sniff the stream, and then pass the amount I've already sniffed plus the rest of the stream to qsv).

    Additional context

    Duplicate of https://github.com/BurntSushi/xsv/issues/294

    opened by harrybiddle 16
  • Feature Request: deduplicate columns/extract unique columns

    Cross reference https://github.com/BurntSushi/xsv/issues/283

    We can use qsv dedup or the Unix command line tools sort and uniq to remove duplicate rows in plain text table, but I find myself wanting to do something similar with duplicated columns.

    For example, after doing qsv join ... there will be at least one pair of duplicated columns (the values used for the join).

    I am hoping for something like a column based version of the row based qsv dedup command (see #26).

    I suspect I could workaround this via the qsv transpose command (see #3).

    no-issue-activity 
    opened by peterjc 16
  • Add a subrule to field inferencing

    Hi, I work daily with data where some columns contain only numerical values, but are actually string fields.

    A typical case is that of administrative codes for regions, provinces and cities. In Italy (for example, in the file below) these are some official region codes: Lombardy has the code 03, and obviously it's a string. But if you run stats on it, the state_code field is wrongly mapped as an integer.

    | field | type | sum | min | max | min_length | max_length | mean | stddev | variance | nullcount |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | state_code | Integer | 71 | 2 | 20 | 2 | 2 | 10.1429 | 7.7354 | 59.8367 | 0 |
    | label | String | | Aosta Valley | Veneto | 6 | 29 | | | | 0 |

    I know, octal numbers exist, but in 99% of cases (I'm talking about my experience), if I have fields with cells starting with zero they are not octal numbers, but string codes.

    If it were possible, I would add a subrule for all fields mapped as numbers: if there are cells that start with a zero not followed by a , or ., set that inferred field as a string and not as a number.

    Thank you.

    Example file

    state_code,label
    19,Sicily
    18,Calabria
    20,Sardinia
    02,Aosta Valley
    03,Lombardy
    04,Trentino-Alto Adige/Südtirol
    05,Veneto
    
    enhancement 
    opened by aborruso 13
  • qsv excel only works with Little Endian?

    I have a number of spreadsheets written by the Perl module Spreadsheet::WriteExcel.

    When I extract with qsv excel I get something like ^@ for every character and further processing such as search becomes impossible.

    When I look at the spreadsheets with file I get something like

    a.xls: CDFV2 Microsoft Excel

    whereas the "normal" files (which extract properly) look like

    b.xls: Composite Document File V2 Document, Little Endian...

    Desktop (please complete the following information):

    • OS: Mac Ventura
    • qsv 0.72.0-standard-apply;fetch;foreach;generate;luau;self_update-8-8 (x86_64-apple-darwin compiled with Rust 1.65) prebuilt
    opened by ondohotola 13
  • Glibc & Python library not found

    Glibc & Python library not found

    Bug qsv binaries downloaded from release page of the repo are throwing the error as below. For linux-64bit-glibc

    [~]$ qsv
    qsv: error while loading shared libraries: libpython3.8.so.1.0: cannot open shared object file: No such file or directory
    [~]$ qsvlite
    qsvlite: /lib64/libc.so.6: version `GLIBC_2.25' not found (required by qsvlite)
    qsvlite: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by qsvlite)
    qsvlite: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by qsvlite)
    qsvlite: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by qsvlite)
    [~]$ qsvdp
    qsvdp: /lib64/libc.so.6: version `GLIBC_2.25' not found (required by qsvdp)
    qsvdp: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by qsvdp)
    qsvdp: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by qsvdp)
    

    For linux-64bit-musl

    [~]$ qsv
    Segmentation fault (core dumped)
    

    To Reproduce Download latest (0.61.1) linux 64bit binaries.

    Expected behavior Should run

    Desktop:

    • OS: CentOS 7 x86_64
    • qsv Version: 0.61.1

    Additional context I believe the shared libraries aren't bundled with this version's release. I have tried a few older versions too, but they are not working either.

    bug 
    opened by useraccessdenied 13
  • Add the UNIX epoch timestamp to the list of time formats that qsv-dateparser can parse.

    Please add the UNIX epoch timestamp format (seconds since the UNIX epoch) to the list of time formats that qsv-dateparser can parse, including timestamps having fractional seconds to nanosecond or microsecond precision. As an alternative to fractional seconds, the timestamp might instead be an integer representing the number of nanoseconds or microseconds since the epoch.

    I would like to transform columns in CSV files that contain UNIX epoch timestamps to ISO 8601 format. Many of these timestamps include fractional seconds (milliseconds, microseconds, or nanoseconds).

    Alternatives:

    csvq

    $ csvq --format csv "SELECT UTC(DATETIME(1671673426.123456789)) AS timestamp;"
    timestamp
    2022-12-22T01:43:46.1234567Z
    

    DuckDB

    $ duckdb -csv -c "SELECT epoch_ms('1671673426.123456789'::BIGINT * 1000) + INTERVAL (string_split('1671673426.123456789', '.')[2]::BIGINT/1000) MICROSECONDS AS timestamp"
    timestamp
    "2022-12-22 01:43:46.123456"
    

    goawk and GNU date

    $ printf "timestamp\n1671673426.123456789\n" | goawk -H -i csv -o csv 'BEGIN { print "timestamp" } { system("date --date=@" @"timestamp" " --utc +%FT%T.%N") }'
    timestamp
    2022-12-22T01:43:46.123456789
    

    Miller

    $ echo timestamp=1671673426.123456789 | mlr put '$timestamp = strftime($timestamp, "%FT%H:%M:%3S");'
    timestamp=2022-12-22T01:43:46.123
    $ echo timestamp=1671673426.123456789 | mlr put '$timestamp = strftime($timestamp, "%FT%H:%M:%6S");'
    timestamp=2022-12-22T01:43:46.123456
    $ echo timestamp=1671673426.123456789 | mlr put '$timestamp = strftime($timestamp, "%FT%H:%M:%9S");'
    timestamp=2022-12-22T01:43:46.123456716
    

    Notice in the last example using format string %9S that function strftime in Miller suffers from a precision problem.

    SQLite

    $ echo | sqlite3 -csv -header -cmd "SELECT strftime('%Y-%m-%dT%H:%M:%f', 1671673426.123456789, 'unixepoch') AS timestamp;"
    timestamp
    2022-12-22T01:43:46.123
    

    Notice that function strftime in SQLite prints seconds to at most millisecond precision.

    enhancement 
    opened by derekmahar 12
  • BUG Python library not found on WSL2 Pengwin

    I get this error on qsv from this zip file - qsv-0.64.0-x86_64-unknown-linux-gnu.zip

    qsv: error while loading shared libraries: libpython3.8.so.1.0: cannot open shared object file: No such file or directory
    
    $ python3 --version
    Python 3.9.2
    

    libpython3.9 is installed. qsvnp works fine.

    Desktop (please complete the following information):

    • OS: Pengwin on WSL2 (based on Debian 11)
    • qsv Version 0.64
    opened by eggbean 12
  • Validate: improve performance

    Validate currently runs single-threaded and could become a bottleneck for data validation pipelines.

    • Would like to improve performance through higher concurrency
    • ~~Concurrency should be controlled via --jobs option or QSV_MAX_JOBS env var~~
    • ~~Concurrency should not exceed CPU count~~
    • ~~When jobs is set to 0, apply same rules as stats to calculate optimal concurrency~~
    • Use Rayon to automatically control concurrency
    • Include validate performance numbers in performance suite
    $ head -50000 NYC_311_SR_2010-2020-sample-1M.csv > NYC-short.csv
    $ qsvlite index NYC-short.csv
    $ time qsvlite schema NYC-short.csv --value-constraints --enum-threshold=25
    Schema written to NYC-short.csv.schema.json
    
    real	0m6.941s
    user	0m12.050s
    sys	0m3.960s
    $ time qsvlite validate NYC-short.csv NYC-short.csv.schema.json 
    [00:00:08] [==================== 100% validated 49,999 records.] (6,015/sec)
    0 out of 49,999 records invalid.
    
    real	0m8.424s
    user	0m8.202s
    sys	0m0.128s
    
    opened by mhuang74 12
  • filter feature

    Is your feature request related to a problem? Please describe. Filtering is part of the data wrangling process and usually one of its steps.

    Describe the solution you'd like qsv is very fast and capable, but it lacks (arithmetic) filtering. String filtering can be achieved with the search command, but simple filtering such as Population<100000 or Age>50 is not possible (as far as I know). So, qsv filter -f 'Age>50' should be a possibility.

    Describe alternatives you've considered

    • Obvious example is awk. awk '$3 > 50' will quickly filter according to third column values.
    • csvtk has both filter and filter2 (latter is awk-like filtering).
    • Finally, tsv-utils has tsv-filter, which has comprehensive filtering (checking null, checking is.numeric, etc.). I've added part of its help section as an example:
    Operators:
    * Test if a field is empty (no characters) or blank (empty or whitespace only).
      Syntax:  --empty|not-empty|blank|not-blank  FIELD
      Example: --empty name               # True if the 'name' field is empty
    
    * Test if a field is numeric, finite, NaN, or infinity
      Syntax:  --is-numeric|is-finite|is-nan|is-infinity FIELD
      Example: --is-numeric 5 --gt 5:100  # Ensure field 5 is numeric before --gt test.
    
    * Compare a field to a number (integer or float)
      Syntax:  --eq|ne|lt|le|gt|ge  FIELD:NUM
      Example: --lt size:1000 --gt weight:0.5  # ('size' < 1000) and ('weight' > 0.5)
    
    * Compare a field to a string
      Syntax:  --str-eq|str-ne|istr-eq|istr-ne  FIELD:STR
      Example: --str-eq color:red         # True if 'color' field is "red"
    
    * Test if a field contains a string (substring search)
      Syntax:  --str-in-fld|str-not-in-fld|istr-in-fld|istr-not-in-fld  FIELD:STR
      Example: --str-in-fld color:dark    # True if 'color field contains "dark"
    
    * Test if a field matches a regular expression.
      Syntax:  --regex|iregex|not-regex|not-iregex  FIELD:REGEX
      Example: --regex '3:ab*c'     # True if field 3 contains "ac", "abc", "abbc", etc.
    
    * Test a field's character or byte length
      Syntax:  --char-len-[le|lt|ge|gt|eq|ne] FIELD:NUM
               --byte-len-[le|lt|ge|gt|eq|ne] FIELD:NUM
      Example: --char-len-lt 2:10   # True if field 2 is less than 10 characters long.
               --byte-len-gt 2:10   # True if field 2 is greater than 10 bytes long.
    
    * Field to field comparisons - Similar to field vs literal comparisons, but field vs field.
      Syntax:  --ff-eq|ff-ne|ff-lt|ff-le|ff-gt|ff-ge  FIELD1:FIELD2
               --ff-str-eq|ff-str-ne|ff-istr-eq|ff-istr-ne  FIELD1:FIELD2
      Example: --ff-eq 2:4          # True if fields 2 and 4 are numerically equivalent
               --ff-str-eq 2:4      # True if fields 2 and 4 are the same strings
    
    * Field to field difference comparisons - Absolute and relative difference
      Syntax:  --ff-absdiff-le|ff-absdiff-gt FIELD1:FIELD2:NUM
               --ff-reldiff-le|ff-reldiff-gt FIELD1:FIELD2:NUM
      Example: --ff-absdiff-lt 1:3:0.25   # True if abs(field1 - field2) < 0.25
    
    opened by alperyilmaz 9
  • `tojsonl` crash when converting a CSV that contains both boolean and null

    To Reproduce

    echo "a,b\nfalse,1\n,2" | qsv tojsonl
    

    crash with error

    thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/cmd/tojsonl.rs:124:28
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    
    opened by binhunstatic 0
  • Auto Inferencing date(time) columns

    Hi, if I import in duckdb this input CSV file

    field1,date1
    4500233821,2018-11-29 12:15:06.353
    4500245641,2018-11-29 12:15:06.353
    4500242807,2018-11-29 12:55:03.097
    4500233852,2018-11-29 12:55:03.097
    4500234511,2018-11-29 13:09:25.697
    4500242786,2018-11-29 13:17:48.977
    4500242786,2018-11-29 13:17:48.977
    4500253356,2018-11-29 13:28:45.723
    

    I get this schema, by automatic inferencing:

    CREATE TABLE test(field1 BIGINT, date1 TIMESTAMP);
    

    And if I apply datefmt to this input CSV I have a right datetime field

    field1,date1
    4500233821,2018-11-29T12:15:06.353+00:00
    4500245641,2018-11-29T12:15:06.353+00:00
    4500242807,2018-11-29T12:55:03.097+00:00
    4500233852,2018-11-29T12:55:03.097+00:00
    4500234511,2018-11-29T13:09:25.697+00:00
    4500242786,2018-11-29T13:17:48.977+00:00
    4500242786,2018-11-29T13:17:48.977+00:00
    4500253356,2018-11-29T13:28:45.723+00:00
    

    But if I use schema or to (to create a parquet file) to the original file, the date1 field is mapped as string field.

    It would be great to have auto inferencing of datetime fields, also when they are not written perfectly (in my input I do not have the T and I have a space).

    Thank you

    opened by aborruso 3
  • currencytonum/numtocurrency still returns wrong results

    Describe the bug #660 still has a corner case :-)-O

    To Reproduce

    cat  > test.csv<<EOF
    CUR,RATE
    EUR,1
    GBP,1.14088829562698
    NAD,18.3529
    EOF
    
    qsv table test.csv
    

    returns

    CUR  RATE
    EUR  1
    GBP  1.14088829562698
    NAD  18.3529
    
    qsv apply operations currencytonum RATE test.csv | qsv table
    

    returns

    CUR  RATE
    EUR  1.00
    GBP  319.79
    NAD  18.35
    

    as does

    qsv apply operations numtocurrency RATE test.csv | qsv table
    

    by the way

    Expected behavior

    The result should be something like

    CUR  RATE
    EUR  1
    GBP  1.14
    NAD  18.35
    

    Screenshots/Backtrace/Sample Data N/A

    Desktop (please complete the following information): macOS 13.1 qsv 0.80.0-standard-apply;fetch;foreach;generate;luau;to;self_update-8-8 (x86_64-apple-darwin compiled with Rust 1.66) prebuilt

    Additional context

    This behavior starts when there are more than 11 decimals :-)-O

    enhancement 
    opened by ondohotola 9
  • to and stats command: field inferencing not coherent

    Hi, if I transform this CSV into parquet using to, the luogo_esecuzione_istat field is mapped as an integer in the output parquet file.

    This is wrong, it's a string, and it's properly mapped as a string using stats.

    I think the to command should use the same field inferencing as stats.

    Thank you

    opened by aborruso 2
  • Add MySQL support and generate SQL script instead of/in addition to connecting

    Is your feature request related to a problem? Please describe. I like the new PostgreSQL and SQLite support, so of course one would like MySQL/MariaDB as well.

    I would (sometimes) like to be able to read the SQL script produced before connecting and adding.

    Describe the solution you'd like As above.

    Describe alternatives you've considered Nil

    Additional context Nil.

    enhancement 
    opened by ondohotola 3
  • add index support to `luau` and `py` commands

    This will allow random access of CSV files from luau and python scripts.

    This will be done by using a special variable _index that can be accessed from the scripts to get/set the underlying index position.

    If an index is not available or is stale, and "_index" is referred to in the scripts, the index will be created/updated before executing the scripts proper, and _rowcount will be set to the number of records.

    If _index is set to a negative value, it starts from the last record.

    enhancement 
    opened by jqnatividad 0