rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

rga is a line-oriented search tool that allows you to look for a regex in a multitude of file types. rga wraps the awesome ripgrep and enables it to search in pdf, docx, sqlite, jpg, movie subtitles (mkv, mp4), etc.

For more detail, see this introductory blogpost: https://phiresky.github.io/blog/2019/rga--ripgrep-for-zip-targz-docx-odt-epub-jpg/

rga will recursively descend into archives and match text in every file type it knows.

Here is an example directory with different file types:

demo/
├── greeting.mkv
├── hello.odt
├── hello.sqlite3
└── somearchive.zip
    ├── dir
    │   ├── greeting.docx
    │   └── inner.tar.gz
    │       └── greeting.pdf
    └── greeting.epub

[Image: rga output]

Integration with fzf

You can use rga interactively via fzf. Add the following to your ~/.{bash,zsh}rc:

rga-fzf() {
	RG_PREFIX="rga --files-with-matches"
	local file
	file="$(
		FZF_DEFAULT_COMMAND="$RG_PREFIX '$1'" \
			fzf --sort --preview="[[ ! -z {} ]] && rga --pretty --context 5 {q} {}" \
				--phony -q "$1" \
				--bind "change:reload:$RG_PREFIX {q}" \
				--preview-window="70%:wrap"
	)" &&
	echo "opening $file" &&
	xdg-open "$file"
}

INSTALLATION

Linux x64, macOS and Windows binaries are available in GitHub Releases.

Linux

Arch Linux

Simply install from the AUR: yay -S ripgrep-all.

Nix

nix-env -iA nixpkgs.ripgrep-all

Debian-based

Download the rga binary and install the dependencies like this:

apt install ripgrep pandoc poppler-utils ffmpeg

If ripgrep is not included in your package sources, get it from here.

rga searches for all binaries it calls in $PATH and in the directory the rga binary itself is in.

Windows

Install ripgrep-all via Chocolatey:

choco install ripgrep-all

Note that installing via chocolatey or scoop is the only supported download method. If you download the binary from releases manually, you will not get the dependencies (for example pdftotext from poppler).

If you get an error like VCRUNTIME140.DLL could not be found, you need to install vc_redist.x64.exe.

Homebrew/Linuxbrew

rga can be installed with Homebrew:

brew install rga

To install the dependencies, none of which is strictly necessary but all of which are very useful:

brew install pandoc poppler tesseract ffmpeg

Compile from source

rga should compile with stable Rust (v1.36.0+, check with rustc --version). To build it, run the following (or the equivalent in your OS):

   ~$ apt install build-essential pandoc poppler-utils ffmpeg ripgrep cargo
   ~$ cargo install ripgrep_all
   ~$ rga --version    # this should work now

Available Adapters

rga --rga-list-adapters

Adapters:

  • ffmpeg Uses ffmpeg to extract video metadata/chapters and subtitles
    Extensions: .mkv, .mp4, .avi
  • pandoc Uses pandoc to convert binary/unreadable text documents to plain markdown-like text
    Extensions: .epub, .odt, .docx, .fb2, .ipynb
  • poppler Uses pdftotext (from poppler-utils) to extract plain text from PDF files
    Extensions: .pdf
    Mime Types: application/pdf

  • zip Reads a zip file as a stream and recurses down into its contents
    Extensions: .zip
    Mime Types: application/zip

  • decompress Reads compressed file as a stream and runs a different extractor on the contents.
    Extensions: .tgz, .tbz, .tbz2, .gz, .bz2, .xz, .zst
    Mime Types: application/gzip, application/x-bzip, application/x-xz, application/zstd

  • tar Reads a tar file as a stream and recurses down into its contents
    Extensions: .tar

  • sqlite Uses sqlite bindings to convert sqlite databases into a simple plain text format
    Extensions: .db, .db3, .sqlite, .sqlite3
    Mime Types: application/x-sqlite3

The following adapters are disabled by default, and can be enabled using '--rga-adapters=+pdfpages,tesseract':

  • pdfpages Converts a pdf to its individual pages as png files. Only useful in combination with tesseract
    Extensions: .pdf
    Mime Types: application/pdf

  • tesseract Uses tesseract to run OCR on images to make them searchable. May need -j1 to prevent overloading the system. Make sure you have tesseract installed.
    Extensions: .jpg, .png
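
As a rough illustration of enabling the two disabled adapters, the sketch below wraps the documented '--rga-adapters=+pdfpages,tesseract' option together with -j1 (ripgrep's thread-count option, suggested in the tesseract note above). The function name rga_ocr is hypothetical and not part of rga.

```shell
# Hypothetical wrapper (not part of rga): search with the OCR adapters enabled.
# -j1 limits ripgrep to one thread so tesseract does not overload the system.
rga_ocr() {
	rga --rga-adapters=+pdfpages,tesseract -j1 "$@"
}

# Usage: rga_ocr "invoice" ~/scans/
```

All arguments are passed through unchanged, so any other rga or rg option can still be added on the command line.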

USAGE:

rga [RGA OPTIONS] [RG OPTIONS] PATTERN [PATH ...]

FLAGS:

--rga-accurate

Use more accurate but slower matching by mime type

By default, rga will match files using file extensions. Some programs, such as sqlite3, don't care about the file extension at all, so users sometimes use any or no extension at all. With this flag, rga will try to detect the mime type of input files using the magic bytes (similar to the `file` utility), and use that to choose the adapter. Detection is only done on the first 8KiB of the file, since we can't always seek on the input (in archives).
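
To make the idea of magic-byte detection concrete, here is a minimal shell sketch. It is not rga's implementation: the function name guess_mime is hypothetical, it inspects only the first 16 bytes rather than 8KiB, and it knows just four signatures.

```shell
# Hypothetical helper: guess a mime type from a file's leading bytes,
# ignoring the file extension entirely.
guess_mime() {
	local hex
	# Dump the first 16 bytes as one lowercase hex string, e.g. "25504446...".
	hex="$(head -c 16 "$1" | od -An -v -tx1 | tr -d ' \n')"
	case "$hex" in
		25504446*) echo "application/pdf" ;;   # "%PDF"
		504b0304*) echo "application/zip" ;;   # "PK\x03\x04"
		1f8b*) echo "application/gzip" ;;      # gzip magic number
		53514c69746520666f726d6174203300*)     # "SQLite format 3\0"
			echo "application/x-sqlite3" ;;
		*) echo "unknown" ;;
	esac
}

# Usage: guess_mime notes.db
```

Because only a prefix of the file is read, the same check also works on streamed input that cannot be rewound, such as a file inside an archive.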

-h, --help

Prints help information

--rga-list-adapters

List all known adapters

--rga-no-cache

Disable caching of results

By default, rga caches the extracted text, if it is small enough, to a database in ~/.cache/rga on Linux, ~/Library/Caches/rga on macOS, or C:\Users\username\AppData\Local\rga on Windows. This way, repeated searches on the same set of files will be much faster. If you pass this flag, all caching will be disabled.
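
When inspecting or clearing the cache from a script, a small helper can pick the right directory. The sketch below assumes the default Linux and macOS locations listed above; the function name rga_cache_dir is hypothetical, and Windows as well as any custom cache location are not handled.

```shell
# Hypothetical helper: print the default rga cache directory for this OS.
rga_cache_dir() {
	case "$(uname -s)" in
		Darwin) echo "$HOME/Library/Caches/rga" ;;
		*) echo "$HOME/.cache/rga" ;;   # Linux and other Unixes
	esac
}

# Usage: show how much disk space the cache uses, if it exists.
# du -sh "$(rga_cache_dir)" 2>/dev/null
```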

--rg-help

Show help for ripgrep itself

--rg-version

Show version of ripgrep itself

-V, --version

Prints version information

OPTIONS:

--rga-adapters=<adapters>...

Change which adapters to use and in which priority order (descending)

"foo,bar" means use only adapters foo and bar. "-bar,baz" means use all default adapters except for bar and baz. "+bar,baz" means use all default adapters and also bar and baz.
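
The three forms can be sketched as a small shell function. This is only an illustration of the rules as described, not rga's own code: the name resolve_adapters is hypothetical, the default list is copied from the adapter listing above, and appending "+" adapters after the defaults is an assumption about ordering.

```shell
# Default-enabled adapters, in the order printed by `rga --rga-list-adapters`.
DEFAULT_ADAPTERS="ffmpeg pandoc poppler zip decompress tar sqlite"

# Hypothetical helper: expand an --rga-adapters value into an adapter list.
resolve_adapters() {
	local spec="$1" mode="only" result="" name names
	case "$spec" in
		-*) mode="exclude"; spec="${spec#-}" ;;
		+*) mode="add"; spec="${spec#+}" ;;
	esac
	# Turn the comma-separated names into space-separated words.
	names="$(printf '%s' "$spec" | tr ',' ' ')"
	case "$mode" in
		only) result="$names" ;;
		add) result="$DEFAULT_ADAPTERS $names" ;;   # assumption: appended last
		exclude)
			for name in $DEFAULT_ADAPTERS; do
				case " $names " in
					*" $name "*) ;;   # listed after "-": drop it
					*) result="${result:+$result }$name" ;;
				esac
			done ;;
	esac
	echo "$result"
}

# Usage: resolve_adapters "-zip,tar"
```

Note that the leading "-" or "+" applies to the whole list, so "-bar,baz" removes both bar and baz.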

--rga-cache-compression-level=<cache-compression-level>

ZSTD compression level to apply to adapter outputs before storing in cache db

Ranges from 1 - 22 [default: 12]

--rga-cache-max-blob-len=<cache-max-blob-len>

Max compressed size to cache

Longest byte length (after compression) to store in cache. Longer adapter outputs will not be cached and will be recomputed every time. Allowed suffixes: k M G [default: 2000000]

--rga-max-archive-recursion=<max-archive-recursion>

Maximum nestedness of archives to recurse into [default: 4]

-h shows a concise overview, --help shows more detail and advanced options.

All other options not shown here are passed directly to rg, especially [PATTERN] and [PATH ...]

Development

To enable debug logging:

export RUST_LOG=debug
export RUST_BACKTRACE=1

Also remember to disable caching with --rga-no-cache or clear the cache (~/Library/Caches/rga on macOS, ~/.cache/rga on other Unixes, or C:\Users\username\AppData\Local\rga on Windows) to debug the adapters.
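
For one-off debugging runs, the two environment variables and the --rga-no-cache flag can be combined in a wrapper so that nothing is exported into the rest of the shell session. This is a convenience sketch; the function name rga_debug is hypothetical.

```shell
# Hypothetical wrapper: run a single search with debug logging and
# backtraces enabled, bypassing the cache so the adapters actually run.
rga_debug() {
	RUST_LOG=debug RUST_BACKTRACE=1 rga --rga-no-cache "$@"
}

# Usage: rga_debug "hello" demo/
```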

Comments
  • Can't install ripgrep_all with cargo from crates

    I tried to update ripgrep_all, but it failed on what appears to be a dependency issue. If this has been resolved on master, it’s probably worth cutting a new version to crates.io that fixes this.

    $ cargo install ripgrep_all
        Updating crates.io index
      Installing ripgrep_all v0.9.6
    error: failed to compile `ripgrep_all v0.9.6`, intermediate artifacts can be found at `/var/folders/6t/2xp3hp2j1vd5yyklq0b6_lbw0000gn/T/cargo-install3jDwt5`
    
    Caused by:
      failed to select a version for the requirement `cachedir = "^0.1.1"`
      candidate versions found which didn't match: 0.2.0
      location searched: crates.io index
    required by package `ripgrep_all v0.9.6`
    
    $ rustc --version
    rustc 1.44.1 (c7087fe00 2020-06-17)
    $ cargo --version
    cargo 1.44.1 (88ba85757 2020-06-11)
    
    opened by halostatue 12
  • error: failed to compile `ripgrep_all v0.9.6`

    Is it necessary to use cachedir = "^0.1.1"?

    $ cargo install ripgrep_all
        Updating crates.io index
      Downloaded ripgrep_all v0.9.6
      Downloaded 1 crate (113.1 KB) in 0.62s
      Installing ripgrep_all v0.9.6
    error: failed to compile `ripgrep_all v0.9.6`, intermediate artifacts can be found at `/var/folders/_p/36jhfrg52ld1643fftf08x740000gp/T/cargo-installuJ8iui`
    
    Caused by:
      failed to select a version for the requirement `cachedir = "^0.1.1"`
      candidate versions found which didn't match: 0.3.0, 0.2.0
      location searched: crates.io index
      required by package `ripgrep_all v0.9.6`
    
    $ rustup -V
    rustup 1.23.1 (3df2264a9 2020-11-30)
    info: This is the version for the rustup toolchain manager, not the rustc compiler.
    info: The currently active `rustc` version is `rustc 1.48.0 (7eac88abb 2020-11-16)`
    
    $ cargo -V
    cargo 1.48.0 (65cbdd2dc 2020-10-14)
    
    opened by slowkow 11
  • `Ripgrep-all` runs very slowly for the first time after the computer starts.

    On Ubuntu 20.04.3 LTS, I'm using a self-compiled git master version of ripgrep-all. I noticed that it runs very slowly the first time after the computer starts, so I think it must rely on a caching mechanism. The question is how to preserve the cache across restarts to maximize efficiency.

    Any hints for this problem will be highly appreciated.

    Regards, HZ

    opened by hongyi-zhao 8
  • Fix installation and CI

    • Fixes installation with the stable toolchain. Essentially it's just cargo update
    • Fixes the push pipeline, now it fails on a test
    • Fixes the release pipeline
    opened by TriplEight 8
  • error running pdf search on windows 10 - 64bit

    I tried running the pdf search with the adapter "poppler" on both version 0.9.2 and 0.9.3 and I get the following error message. What am I missing here?

    Reference.pdf: preprocessor command failed: '"rga-preproc" "Reference.pdf"':
    -------------------------------------------------------------------------------
    adapter: poppler
    pdftotext version 4.00
    Copyright 1996-2017 Glyph & Cog, LLC
    Usage: pdftotext [options] <PDF-file> [<text-file>]
      -f <int>             : first page to convert
      -l <int>             : last page to convert
      -layout              : maintain original physical layout
      -simple              : simple one-column page layout
      -table               : similar to -layout, but optimized for tables
      -lineprinter         : use strict fixed-pitch/height layout
      -raw                 : keep strings in content stream order
      -fixed <number>      : assume fixed-pitch (or tabular) text
      -linespacing <number>: fixed line spacing for LinePrinter mode
      -clip                : separate clipped text
      -nodiag              : discard diagonal text
      -enc <string>        : output text encoding name
      -eol <string>        : output end-of-line convention (unix, dos, or mac)
      -nopgbrk             : don't insert page breaks between pages
      -bom                 : insert a Unicode BOM at the start of the text file
      -opw <string>        : owner password (for encrypted files)
      -upw <string>        : user password (for encrypted files)
      -q                   : don't print any messages or errors
      -cfg <string>        : configuration file to use in place of .xpdfrc
      -v                   : print copyright and version info
      -h                   : print usage information
      -help                : print usage information
      --help               : print usage information
      -?                   : print usage information
    Error: The pipe has been ended. (os error 109)
    
    opened by neelabalan 8
  • preprocessor command failed: '"rga-preproc" "/Users/user/Desktop/test/test.pdf.zip"

    I am getting this error while executing:

    rga "hello" ~/Desktop/test/

    where I have a zip file. I don't understand from the documentation whether ZIP files need an extra argument or not. Thanks in advance.

    opened by AtomicNess123 7
  • brew troubles?

    I did an install on macOS Catalina (10.15.7) using brew install rga and for all files I test this on I get this error when I try rga testing:

    -------------------------------------------------------------------------------
    ./some file.pdf: preprocessor command failed: '"/usr/local/bin/rga-preproc" "./some file.pdf"':
    -------------------------------------------------------------------------------
    adapter: poppler
    Error: Couldn't open file '-'
    Error: Broken pipe (os error 32)

    So then I went ahead and installed all the additional libraries mentioned in the readme with brew install pandoc poppler tesseract ffmpeg but this didn't seem to help at all. Even tried reinstalling rga after that.

    opened by hyperjeff 6
  • Respect .rgignore

    Hi, first of all thanks for this - this is incredibly useful for me. Caching makes all the difference comparing to the slow pdfgrep!

    It seems rga doesn't respect .rgignore file? Would it be possible to add it please?

    opened by rsuhada 6
  • Build fails with "unstable feature" error in rkv dependency

    I tried doing cargo build (of master at commit ef2e4ebf28f) and got this error:

      $ cargo build 
          Updating crates.io index
       Downloading crates ...
        Downloaded chrono v0.4.6
        Downloaded encoding_rs v0.8.17
        [...]
         Compiling zip v0.5.2
         Compiling serde_json v1.0.39
         Compiling rkv v0.9.6
      error[E0658]: use of unstable library feature 'try_from' (see issue #33417)
         --> /home/kfogel/.cargo/registry/src/github.com-1ecc6299db9ec823/rkv-0.9.6/src/error.rs:166:11
          |
      166 | impl From<::std::num::TryFromIntError> for MigrateError {
          |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      
      error[E0658]: use of unstable library feature 'try_from' (see issue #33417)
        --> /home/kfogel/.cargo/registry/src/github.com-1ecc6299db9ec823/rkv-0.9.6/src/migrate.rs:78:5
         |
      78 |     convert::TryFrom,
         |     ^^^^^^^^^^^^^^^^
      
      [...many more similar error lines...]
      
      error: aborting due to 12 previous errors
      
      For more information about this error, try `rustc --explain E0658`.
      error: Could not compile `rkv`.
      warning: build failed, waiting for other jobs to finish...
      error: build failed
      $ 
    

    I don't know much Rust, but it looks like rkv is using an unstable feature (rust bug 33417 has more about it), and since rga depends on rkv, this affects the rga build too. I ran rustc --explain E0658 and got some information about how to solve the problem. Presumably those solutions would have to be implemented upstream in rkv if we wanted to solve this for everyone; otherwise I'd have to either build a modified rkv locally or get the nightly version of rustc to do the build I just tried to do.

    I'm not sure what ways might be available to solve this within rga. Ideas welcome; like I said, I don't know Rust that well.

    Anyway, this was all along the way to submitting a PR for README.md to add installation instructions. I'll submit that PR, and then in its commentary mention this issue.

    opened by kfogel 6
  • Opening a xml file and ran code inside when it wasn't supposed to (security?)

    I used rga-fzf to search for an XML file. That file had a PowerShell script in it. When I pressed Enter to open the file, the PowerShell script was executed, which was not intended, as it was malicious 😅

    I am using Manjaro (Arch Linux) with zsh and powershell+wine installed.

    Has anyone else observed this?

    opened by evilcel3ri 5
  • feature_request(books): detect incorrect and poor quality text

    1. Summary

    It would be nice if ripgrep-all showed a warning when the text in a book is incorrect or of poor quality.

    2. Problem

    2.1. Summary

    Some books have a bad OCR layer, which makes it impossible to search for normal words in them. It would be nice if ripgrep-all detected these books.

    2.2. Details

    Books may have poor-quality searchable text. Reasons:

    1. The user who added the OCR layer chose the wrong OCR language. For example, an English OCR layer may have been added to Russian text, as in example 4.2.
    2. Poor quality of the scanned book and/or of the tool used to add the OCR layer. See example 4.3.

    I couldn't find a way to automatically detect these books in my book list. Currently, I need to manually check the OCR layer quality of every book, which takes a lot of time.

    3. Compact Language Detector

    Possibly, Compact Language Detector can solve this problem.

    I installed cld2-cffi (CLD3 exists, but I had problems installing it on Windows) and ran this code in my Python interpreter:

    >>> import cld2
    >>> isReliable, textBytesFound, details = cld2.detect("Here text from examples 4.1—4.3")
    >>> print('  details: %s' % str(details))
    

    Similar behavior may be possible with Rust tools; for example, see Whatlang and CLD3 langdetect.

    4. Example texts

    4.1. Normal Russian text

    Например, название Полтавы связано с названием речки Лтавы (так раньше называлась Ворскла) и означает, соответственно, «город на Лтаве». Название города Ужгород также образовано от названия реки Уж. Винница обязана своим названием речке Винничке, которая протекает через город. Название реки, в свою очередь, происходит от слова «венок»: когда-то молодые девушки собирались на ее берегу и пускали на воду венки, чтобы узнать о своем будущем. Луганск назван в честь речки Луганки.
    
    • cld2-cffi output:
    details: (Detection(language_name='RUSSIAN', language_code='ru', percent=99, score=709.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))
    

    4.2. English OCR language for Russian text

    IIepBhle HeCKOJI:bKO COT MHJIJIHOHOB JIeT 6h1JIH nOHCTHHe KOWMapHhlMH ,n;JIH nJIaHeThI: OHa HenpephlBHO COTpHC8.JIaC:b no,n; y,n;apaMH KpynHhlx MeTeopHTOB, ChlnaBWHXCH Ha Hee H3 KOCMoca. IIoBepxHOCT:b COBpeMeHHOH JIYHhI, nOKpLITaSi MeTeopHTHhlMH KpaTepaMH, n03BOJIHeT HaM npe,n;CTaBHT:b, KaK MOrJIa BhlrJIH,n;eT:b 3eMJIH npHMepHO 4 MJIp,n; JIeT Ha3. OqeH:b CKOpO BHyrpH HaweH nJIaHeThl3apa60T8.JI tTenJIOBOH ABHraTeJI:b., rOplOqHM ,n;JIH KOToporo CJIymHJI pacn pHoaKTHBHhlX SJIeMeHTOB. B He,n;pax 3eMJIH HaqaJIOCh Me,n;JIeHHOe ,n;BHmeHHe BeeCTBa, HarpeThle CTPYH KOToporo nOAHHM8.JIHC:b BBepx, a XOJIO.D;Hhle onYCK8.JIHCh BHH3. IIJIaHeTa CT8.JIa noxoma Ha CneJIhlH nepCHK.
    
    details: (Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))
    

    Note: I removed the Information Separator One gremlin characters from this text before passing it to cld2-cffi; otherwise I get this traceback:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python38\lib\site-packages\cld2\__init__.py", line 393, in detect
        raise ValueError("input contains invalid UTF-8 around byte " +
    ValueError: input contains invalid UTF-8 around byte 348 (of 792534779)
    

    4.3. Bad OCR

    1(этрин ска3ала' что знакома с книгой его )кены 3леоноры 8ирек по лекарстве[1пым расте|{ия!| аляски. ёа мой в3дох по поводу тогц что у нас в библиотеке тодько од'!а книга на эту фамилию, (этрин пообещала прислать книгу о лекарстве!тных расте]|иях аляски. !! действительно' не прошло и месяца' как у меня на столе появилась небольшая по объему эффектвого дизайва книга <а|-а5'(а'5 ш||овпшп$5 мвр1с]ш85> с изображе1|и₠ м ца обложке такого 3вакомого ка}<дому х{ителю нашей о6ласти ольховника. правда, в книге он значился 11од другим видовым названием' чем у ;1ас
    
    details: (Detection(language_name='RUSSIAN', language_code='ru', percent=59, score=503.0), Detection(language_name='SERBIAN', language_code='sr', percent=40, score=468.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))
    

    5. Example of expected behavior

    1. ripgrep-all adapters extract text from books.

    2. CLD (or a similar tool) checks 2 (4 may be better) random pages of every book.

    3. If the percent value is 95 or more (another threshold may be better; this needs practical testing), do nothing. If it is below 95, the ripgrep-all user gets a warning. Example warning text:

      WARNING! File {Filename} may contain text that is not written in a natural language. The reason may be an incorrect or poor-quality OCR layer. Please check {Filename}.
      

    6. Note

    Some language-recognition tools may not solve this problem, because they do not detect that text is not written in a real natural language.

    For example, I tried the langdetect, TextBlob, guess_language and langid examples from this Stack Overflow answer; they report that my 4.2 and 4.3 examples are written in real natural languages.

    Thanks.

    opened by Kristinita 5
  • Ignore Feature

    I've noticed that when I'm running RGA on a folder it's hitting the lock files created by Word when a file is open and the preprocessor fails.

    Chapter 4/~$cture4.docx: preprocessor command failed: '"/usr/local/bin/rga-preproc" "Chapter 4/~$cture4.docx"':
    -------------------------------------------------------------------------------
    adapter: pandoc
    [WARNING] Deprecated: --atx-headers. Use --markdown-headings=atx instead.
    couldn't unpack docx container: Did not find end of central directory signature
    Error: subprocess failed: ExitStatus(unix_wait_status(16128))
    -------------------------------------------------------------------------------
    
    opened by pu-238 0
  • Can't find `rga-preproc` under GitBash (Windows)

    rga works fine under PowerShell and WSL (Ubuntu under Windows), but when I run rga from Git Bash it fails with preprocessor command failed: '"C:\\MyPrograms\\bin\\rga-preproc" "file.pdf" followed by

    adapter: poppler
    pdftotext version 4.00
    Copyright 1996-2017 Glyph & Cog, LLC
    Usage: pdftotext [options] <PDF-file> [<text-file>]
      -f <int>             : first page to convert
      -l <int>             : last page to convert
      -layout              : maintain original physical layout
      -simple              : simple one-column page layout
      -table               : similar to -layout, but optimized for tables
      -lineprinter         : use strict fixed-pitch/height layout
      -raw                 : keep strings in content stream order
      -fixed <number>      : assume fixed-pitch (or tabular) text
      -linespacing <number>: fixed line spacing for LinePrinter mode
      -clip                : separate clipped text
      -nodiag              : discard diagonal text
      -enc <string>        : output text encoding name
      -eol <string>        : output end-of-line convention (unix, dos, or mac)
      -nopgbrk             : don't insert page breaks between pages
      -bom                 : insert a Unicode BOM at the start of the text file
      -opw <string>        : owner password (for encrypted files)
      -upw <string>        : user password (for encrypted files)
      -q                   : don't print any messages or errors
      -cfg <string>        : configuration file to use in place of .xpdfrc
      -v                   : print copyright and version info
      -h                   : print usage information
      -help                : print usage information
      --help               : print usage information
      -?                   : print usage information
    Error: The pipe has been ended. (os error 109)
    
    opened by kpym 0
  • Project state and Help Wanted: rga 1.0 with configurable external adapters and async rust

    The current version of rga is 0.9.6, released in 2020.

    This is a small side project for me, so I've spent very little time on it, even though I regularly use the tool myself.

    For the next version the focus is on being able to configure custom preprocessors in addition to the internal ones.

    For example, the integrated PDF adapter is rewritten and would look pretty much like this in ~/.config/ripgrep-all/config.jsonc:

    {
        "custom_adapters": [
            {
                "name": "poppler",
                "version": 1,
                "description": "Uses pdftotext (from poppler-utils) to extract plain text from PDF files",
    
                "extensions": ["pdf"],
                "mimetypes": ["application/pdf"],
    
                "binary": "pdftotext",
                "args": ["-", "-"],
                "disabled_by_default": false,
                "match_only_by_mime": false,
                "postprocessors": [{"name": "add_page_numbers_by_pagebreaks"}]
            }
        ]
    }
    

    While implementing this, I hit some threading issues that exceeded my Rust knowledge, so I stopped working on it for a while.

    More recently, I converted the core of the code to async rust (now passing around Box<dyn AsyncRead + Send>).

    The following work still needs to be done:

    • Fixing / Converting the postprocessors to async. Specifically postproc_encoding and postproc_pagebreaks in https://github.com/phiresky/ripgrep-all/blob/master/src/adapters/postproc.rs . postproc_prefix already works with async.
    • Reenabling and converting the other internal adapters to async (https://github.com/phiresky/ripgrep-all/blob/54799f14528c35f6b2bfe5f05cb2e05e58ff5d10/src/adapters.rs#L120-L126)
    • Fixing all the failing tests and possibly adding new ones.
    • Making sure recursion into archives works with any combination of adapters

    I'll implement these myself at some point, but only at a trickle, so the next release may take a long time. PRs that help are welcome.

    opened by phiresky 1
  • --auto-hybrid-regex bad offset into UTF string

    --auto-hybrid-regex should produce the same output as --pcre2 if one passes in a PCRE2 pattern. However, it throws an error on some files.

    I've included the following examples and files.

    • The offending PDF file that reproduces the error can be downloaded here: https://onlinelibrary.wiley.com/doi/10.1111/j.1472-4642.2008.00521.x
    • Another file which does not present this problem and shows identical output: https://onlinelibrary.wiley.com/doi/10.1111/j.1365-2486.2011.02549.x

    With --auto-hybrid-regex:

    rga '(?=.*biotic interaction)(?=.*plant)' 'Diversity and Distributions - 2008 - Catford - Reducing redundancy in invasion ecology by integrating hypotheses into a.pdf' --auto-hybrid-regex
    

    Faulty output:

    Diversity and Distributions - 2008 - Catford - Reducing redundancy in invasion ecology by integrating hypotheses into a.pdf: preprocessor command failed: '"/home/aj/.cargo/bin/rga-preproc" "Diversity and Distributions - 2008 - Catford - Reducing redundancy in invasion ecology by integrating hypotheses into a.pdf"': PCRE2: error matching: bad offset into UTF string
    

    With --pcre2:

    rga '(?=.*biotic interaction)(?=.*plant)' 'Diversity and Distributions - 2008 - Catford - Reducing redundancy in invasion ecology by integrating hypotheses into a.pdf' --pcre2
    

    Correct output:

    Page 18: Vázquez, D.P. (2006) Biotic interactions and plant invasions.
    
    opened by InvisOn 0
  • Searching encrypted files

    I have some password protected PDF files that I'd like to search with rga. How do I specify the password?

    More generally, is it possible to pass options to the underlying adapters on invocation?

    opened by dideler 0