⏮ ⏯ ⏭ A Rust library to easily read forwards, backwards or randomly through the lines of huge files.

Overview

EasyReader


The main goal of this library is to allow long navigations through the lines of large files, freely moving forwards and backwards or getting random lines without having to consume an iterator.

Currently, Rust's standard library only lets you read a file line by line through Lines (https://doc.rust-lang.org/std/io/trait.BufRead.html#method.lines), which makes it impossible (or very expensive) to read backwards or to get random lines. Also, because Lines is an iterator, every line that has already been read is consumed: to get back to the same line you need to reinstantiate the reader and consume all the lines up to the desired one (e.g. all of them, in the case of the last line).

Notes:

By default, EasyReader does not generate an index; it just searches for line terminators as needed. This allows it to be used with very large files without "startup" times or excessive RAM consumption. However, the lack of an index makes reading slower and does not allow random lines to be taken with a perfect distribution. For these reasons there is a method to generate the index: the start time will be slower, but all subsequent reads will use the index and will therefore be faster (excluding the index build time, reading times are a bit longer than, but still comparable to, those of a sequential forward read through Lines), and random lines will be taken with a perfect distribution. Note that generating the index is not advisable for very large files, as it could consume an excessive amount of RAM.
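To make the trade-off concrete, here is a minimal sketch of what a line index can look like, using only the standard library. It is not EasyReader's actual implementation: the helper names `build_line_index` and `read_line_at` are hypothetical and exist only for this illustration. The index is a vector holding the byte offset of the start of every line, so it costs one full pass over the file plus 8 bytes of RAM per line, which is where the slower start and the memory warning above come from.

```rust
use std::io::{self, BufRead, BufReader, Read, Seek, SeekFrom};

/// Hypothetical helper (not part of EasyReader): scans the whole source once
/// and returns the byte offset at which each line starts.
fn build_line_index<R: Read>(source: R) -> io::Result<Vec<u64>> {
    let mut reader = BufReader::new(source);
    let mut offsets = Vec::new();
    let mut pos: u64 = 0;
    let mut buf = Vec::new();
    loop {
        buf.clear();
        // `read_until` returns the number of bytes read, delimiter included.
        let n = reader.read_until(b'\n', &mut buf)?;
        if n == 0 {
            break; // end of input
        }
        offsets.push(pos);
        pos += n as u64;
    }
    Ok(offsets)
}

/// Hypothetical helper: reads line `n` by seeking straight to its offset.
/// Returns `Ok(None)` when `n` is out of range.
fn read_line_at<R: Read + Seek>(
    source: &mut R,
    offsets: &[u64],
    n: usize,
) -> io::Result<Option<String>> {
    let start = match offsets.get(n) {
        Some(&start) => start,
        None => return Ok(None),
    };
    source.seek(SeekFrom::Start(start))?;
    let mut line = String::new();
    BufReader::new(&mut *source).read_line(&mut line)?;
    // Drop the trailing line terminator ("\n" or "\r\n").
    let kept = line.trim_end_matches(|c| c == '\n' || c == '\r').len();
    line.truncate(kept);
    Ok(Some(line))
}
```

With such an index, choosing `n` uniformly in `0..offsets.len()` gives every line the same probability regardless of its length, and reading line `n - 1` after line `n` is a single seek rather than a backward scan for terminators.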

Example: basic usage

use easy_reader::EasyReader;
use std::{
    fs::File,
    io::Error
};

fn easy() -> Result<(), Error> {
    let file = File::open("resources/test-file-lf")?;
    let mut reader = EasyReader::new(file)?;

    // Generate index (optional)
    reader.build_index();

    // Move through the lines
    println!("First line: {}", reader.next_line()?.unwrap());
    println!("Second line: {}", reader.next_line()?.unwrap());
    println!("First line: {}", reader.prev_line()?.unwrap());
    println!("Random line: {}", reader.random_line()?.unwrap());

    // Iteration through the entire file (reverse)
    reader.eof();
    while let Some(line) = reader.prev_line()? {
        println!("{}", line);
    }

    // You can always start/restart reading from the end of file (EOF)
    reader.eof();
    println!("Last line: {}", reader.prev_line()?.unwrap());
    // Or the begin of file (BOF)
    reader.bof();
    println!("First line: {}", reader.next_line()?.unwrap());

    Ok(())
}

Example: read random lines endlessly

use easy_reader::EasyReader;
use std::{
    fs::File,
    io::Error
};

fn easy() -> Result<(), Error> {
    let file = File::open("resources/test-file-lf")?;
    let mut reader = EasyReader::new(file)?;

    // Generate index (optional)
    reader.build_index();

    loop {
        println!("{}", reader.random_line()?.unwrap());
    }
}

Comments
  • random_line is affected by the length of lines

    In a file with lines of differing length, lines which are twice as long are twice as likely to be chosen.

    I don't think there is any way to fix this with reasonable efficiency (I sometimes have to pick a random line, and to do that I build a massive array with the position of the start of every line), but I think it should be mentioned in the docs fairly clearly, as I imagine most people would expect random_line to pick every line with equal chance.

    opened by ChrisJefferson 2
  • Subtract with overflow when first line is empty

    Given the following simple test program:

    use std::fs::File;
    
    fn main() {
        let mut reader = easy_reader::EasyReader::new(File::open("test.txt").unwrap()).unwrap();
        reader.eof();
    
        while let Some(line) = reader.prev_line().unwrap() {
            println!("{}", line);
        }
    }
    

    with the following test file test.txt:

    
    Blank line above!
    

    (Notice that the first line is blank!)

    The program panics in debug mode with a subtraction overflow error:

    thread 'main' panicked at 'attempt to subtract with overflow', ~/.cargo/registry/src/github.com-1ecc6299db9ec823/easy_reader-0.5.1/src/lib.rs:392:57
    stack backtrace:
       0: rust_begin_unwind
                 at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panicking.rs:517:5
       1: core::panicking::panic_fmt
                 at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/panicking.rs:100:14
       2: core::panicking::panic
                 at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/panicking.rs:50:5
       3: easy_reader::EasyReader<R>::find_end_line
                 at ~/.cargo/registry/src/github.com-1ecc6299db9ec823/easy_reader-0.5.1/src/lib.rs:392:57
       4: easy_reader::EasyReader<R>::read_line
                 at ~/.cargo/registry/src/github.com-1ecc6299db9ec823/easy_reader-0.5.1/src/lib.rs:258:44
       5: easy_reader::EasyReader<R>::prev_line
                 at ~/.cargo/registry/src/github.com-1ecc6299db9ec823/easy_reader-0.5.1/src/lib.rs:179:9
       6: easy_reader_empty_start_line::main
                 at ./src/main.rs:7:28
       7: core::ops::function::FnOnce::call_once
                 at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/ops/function.rs:227:5
    

    (In release mode, it will instead panic with thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 22, kind: InvalidInput, message: "Invalid argument" }', since overflow checks are disabled.)

    Tested with easy_reader 0.5.1

    opened by nthuemmel-scontain 0
  • Feature request: Use pre-computed offsets to build index

    Hey, I have to work with large files (~100GB) and, to make my life easier, I write out the file offsets of new lines to a separate file while generating the corpus. Is there a way to use this list to build an index? This is for a CLI app, and building the index every time is very painful.

    opened by harshasrisri 0
Owner
Michele Federici