๐
ripline
This is not the greatest line reader in the world, this is just a tribute.
Fast line based iteration almost entirely lifted from ripgrep's grep_searcher.
All credit to Andrew Gallant and the ripgrep contributors.
Why?
- Doesn't rely on a clousre like the
bstr::for_line*
methods (useful in some awkward lifetime scenarios). - No silently capped line lengths unlike
rust-linereader
- Brings the
LineIter
with for working withmemmap
files
Not all of this functionality was exposed in the grep_searcher
crate, and rightly so as a lot of it had grep
specific configurations embeded into the logic (i.e. binary detection).
What have I changed?
Not much. I took out some of the ripgrep specific logic such as the binary detection, some search related configs, and consolidated a few of the helper stucts from the other grep_*
crates.
Example
See examples
for more.
use grep_cli::stdout;
use ripline::{
line_buffer::{LineBufferBuilder, LineBufferReader},
lines::LineIter,
LineTerminator,
};
use std::{env, error::Error, fs::File, io::Write, path::PathBuf};
use termcolor::ColorChoice;
fn main() -> Result<(), Box<dyn Error>> {
let path = PathBuf::from(env::args().nth(1).expect("Failed to provide input file"));
let mut out = stdout(ColorChoice::Never);
let reader = File::open(&path)?;
let terminator = LineTerminator::byte(b'\n');
let mut line_buffer = LineBufferBuilder::new().build();
let mut lb_reader = LineBufferReader::new(reader, &mut line_buffer);
while lb_reader.fill()? {
let lines = LineIter::new(terminator.as_byte(), lb_reader.buffer());
for line in lines {
out.write_all(line)?;
}
lb_reader.consume_all();
}
Ok(())
}
Crude and untrustworthy benchmarks
From examples/ripline_benchmarks.rs
. Initial benchmark script take from rust-linereader, which is also included in the benchmarks as LR:*
.
The input used was all_train.csv, unzipped can catted together five times createing a ~25G file.
Method | Time | Lines/sec | Bandwidth |
---|---|---|---|
read() | 2.01s | 17439155/s | 12303.42 MB/s |
LR::next_batch() | 2.11s | 16576174/s | 11694.59 MB/s |
LR::next_line() | 2.65s | 13196734/s | 9310.37 MB/s |
ripline_line_buffer() | 2.64s | 13277194/s | 9367.14 MB/s |
ripline_mmap() | 2.16s | 16183503/s | 11417.55 MB/s |
bstr_for_line() | 2.47s | 14174502/s | 10000.19 MB/s |
read_until() | 2.86s | 12230594/s | 8628.75 MB/s |
read_line() | 4.16s | 8417415/s | 5938.53 MB/s |
lines() | 5.05s | 6930901/s | 4889.79 MB/s |
Note that read
and next_batch
are not counting lines. read_until()
doesn't seem to perform as well in real-life scenarios as it does on this benchmark and I'm not sure why.
Hardware: Ubuntu 20 AMD Ryzen 9 3950X 16-Core Processor w/ 64 GB DDR4 memory and 1TB NVMe Drive