rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

Last update: Jan 2, 2023

Overview

rga is a line-oriented search tool that allows you to look for a regex in a multitude of file types. rga wraps the awesome ripgrep and enables it to search in pdf, docx, sqlite, jpg, movie subtitles (mkv, mp4), etc.

For more detail, see this introductory blogpost: https://phiresky.github.io/blog/2019/rga--ripgrep-for-zip-targz-docx-odt-epub-jpg/

rga will recursively descend into archives and match text in every file type it knows.

Here is an example directory with different file types:

demo/
├── greeting.mkv
├── hello.odt
├── hello.sqlite3
└── somearchive.zip
    ├── dir
    │   ├── greeting.docx
    │   └── inner.tar.gz
    │       └── greeting.pdf
    └── greeting.epub

Integration with fzf

You can use rga interactively via fzf. Add the following to your ~/.{bash,zsh}rc:

rga-fzf() {
	RG_PREFIX="rga --files-with-matches"
	local file
	file="$(
		FZF_DEFAULT_COMMAND="$RG_PREFIX '$1'" \
			fzf --sort --preview="[[ ! -z {} ]] && rga --pretty --context 5 {q} {}" \
				--phony -q "$1" \
				--bind "change:reload:$RG_PREFIX {q}" \
				--preview-window="70%:wrap"
	)" &&
	echo "opening $file" &&
	xdg-open "$file"
}

INSTALLATION

Linux x64, macOS and Windows binaries are available in GitHub Releases.

Linux

Arch Linux

Install from the AUR: yay -S ripgrep-all.

Nix

nix-env -iA nixpkgs.ripgrep-all

Debian-based

Download the rga binary and install the dependencies like this:

apt install ripgrep pandoc poppler-utils ffmpeg

If ripgrep is not included in your package sources, get it from here.

rga searches for all binaries it calls in $PATH and in the directory the rga binary itself is in.
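
A quick way to confirm that the external tools are reachable is to probe $PATH for each of them. The following is a minimal sketch and not part of rga: the function name rga_missing_deps is hypothetical, the default tool list is taken from the apt line above (with pdftotext standing in for poppler-utils), and it only checks $PATH, not the directory rga is in.

```shell
# Hypothetical helper, not part of rga: print each tool from the argument
# list that cannot be found in $PATH. Without arguments it checks the tools
# from the apt line above (pdftotext is the binary shipped by poppler-utils).
rga_missing_deps() {
	local tool
	if [ "$#" -eq 0 ]; then
		set -- rg pandoc pdftotext ffmpeg
	fi
	for tool in "$@"; do
		command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
	done
}

# Report the result; no missing tools means every helper was found.
missing="$(rga_missing_deps)"
if [ -n "$missing" ]; then
	printf '%s\n' "$missing" | sed 's/^/not found in PATH: /'
else
	echo "all rga helper tools found in PATH"
fi
```

Passing explicit names, for example rga_missing_deps tesseract, checks the optional tools as well.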

Windows

Install ripgrep-all via Chocolatey:

choco install ripgrep-all

Note that installing via chocolatey or scoop is the only supported download method. If you download the binary from releases manually, you will not get the dependencies (for example pdftotext from poppler).

If you get an error like VCRUNTIME140.DLL could not be found, you need to install vc_redist.x64.exe.

Homebrew/Linuxbrew

rga can be installed with Homebrew:

brew install rga

The following dependencies are not strictly necessary, but very useful. To install them:

brew install pandoc poppler tesseract ffmpeg

Compile from source

rga should compile with stable Rust (v1.36.0+, check with rustc --version). To build it, run the following (or the equivalent in your OS):

   ~$ apt install build-essential pandoc poppler-utils ffmpeg ripgrep cargo
   ~$ cargo install ripgrep_all
   ~$ rga --version    # this should work now

Available Adapters

rga --rga-list-adapters

Adapters:

ffmpeg Uses ffmpeg to extract video metadata/chapters and subtitles
Extensions: .mkv, .mp4, .avi

pandoc Uses pandoc to convert binary/unreadable text documents to plain markdown-like text
Extensions: .epub, .odt, .docx, .fb2, .ipynb

poppler Uses pdftotext (from poppler-utils) to extract plain text from PDF files
Extensions: .pdf
Mime Types: application/pdf

zip Reads a zip file as a stream and recurses down into its contents
Extensions: .zip
Mime Types: application/zip

decompress Reads compressed file as a stream and runs a different extractor on the contents.
Extensions: .tgz, .tbz, .tbz2, .gz, .bz2, .xz, .zst
Mime Types: application/gzip, application/x-bzip, application/x-xz, application/zstd

tar Reads a tar file as a stream and recurses down into its contents
Extensions: .tar

sqlite Uses sqlite bindings to convert sqlite databases into a simple plain text format
Extensions: .db, .db3, .sqlite, .sqlite3
Mime Types: application/x-sqlite3

The following adapters are disabled by default, and can be enabled using '--rga-adapters=+pdfpages,tesseract':

pdfpages Converts a pdf to its individual pages as png files. Only useful in combination with tesseract
Extensions: .pdf
Mime Types: application/pdf

tesseract Uses tesseract to run OCR on images to make them searchable. May need -j1 to prevent overloading the system. Make sure you have tesseract installed.
Extensions: .jpg, .png
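
Because the OCR adapters are opt-in, it can be convenient to keep the flags in a small shell function. This is only a sketch: the name rga_ocr is hypothetical, while the flags are the ones quoted above (--rga-adapters=+pdfpages,tesseract and -j1).

```shell
# Hypothetical wrapper, not part of rga: search with the two opt-in OCR
# adapters enabled and a single thread (-j1), as the tesseract entry suggests.
rga_ocr() {
	rga --rga-adapters=+pdfpages,tesseract -j1 "$@"
}
```

It is called like rga itself, for example rga_ocr "invoice" ./scans, where the pattern and the path are placeholders.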

USAGE:

rga [RGA OPTIONS] [RG OPTIONS] PATTERN [PATH ...]

FLAGS:

--rga-accurate

Use more accurate but slower matching by mime type

By default, rga will match files using file extensions. Some programs, such as sqlite3, don't care about the file extension at all, so users sometimes use any or no extension at all. With this flag, rga will try to detect the mime type of input files using the magic bytes (similar to the `file` utility), and use that to choose the adapter. Detection is only done on the first 8KiB of the file, since we can't always seek on the input (in archives).
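
To make the idea of magic-byte detection concrete, here is a small illustration in shell. It is not rga's detection code: the function name sniff_mime is hypothetical, and it only knows three well-known file signatures ("%PDF" for PDF, "PK\003\004" for zip, "SQLite format 3" for sqlite), mapped to the mime types listed for the adapters above.

```shell
# Illustration only, not rga's implementation: classify a file by its
# leading bytes instead of its name.
sniff_mime() {
	local hex
	# Hex dump of the first 16 bytes with all whitespace removed.
	hex="$(od -An -tx1 -N16 "$1" | tr -d '[:space:]')"
	case "$hex" in
		25504446*) echo "application/pdf" ;;                              # "%PDF"
		504b0304*) echo "application/zip" ;;                              # "PK\003\004"
		53514c69746520666f726d61742033*) echo "application/x-sqlite3" ;;  # "SQLite format 3"
		*) echo "unknown" ;;
	esac
}

# A PDF header in a file without any extension is still recognised.
tmp="$(mktemp)"
printf '%%PDF-1.4\n' > "$tmp"
sniff_mime "$tmp"   # prints application/pdf
rm -f "$tmp"
```

An extension-based lookup would have no way to classify the temporary file, which is the case --rga-accurate is meant for.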

-h, --help

Prints help information

--rga-list-adapters

List all known adapters

--rga-no-cache

Disable caching of results

By default, rga caches the extracted text, if it is small enough, to a database in ~/.cache/rga on Linux, ~/Library/Caches/rga on macOS, or C:\Users\username\AppData\Local\rga on Windows. This way, repeated searches on the same set of files will be much faster. If you pass this flag, all caching will be disabled.
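
The cache location can be resolved in a script to inspect or clear it. The sketch below uses only the Linux and macOS paths named above; the function name rga_cache_dir is hypothetical, and it assumes the cache has not been moved to another place.

```shell
# Hypothetical helper, not part of rga: print the cache directory documented
# above for the given platform (default: the output of `uname -s`).
rga_cache_dir() {
	case "${1:-$(uname -s)}" in
		Darwin) printf '%s\n' "$HOME/Library/Caches/rga" ;;
		Linux)  printf '%s\n' "$HOME/.cache/rga" ;;
		*)      echo "unknown platform, see the paths above" >&2; return 1 ;;
	esac
}

# Show how much space the cache currently uses, if it exists.
if dir="$(rga_cache_dir)" && [ -d "$dir" ]; then
	du -sh "$dir"
fi
```

Removing that directory clears the cache; --rga-no-cache bypasses it for a single run.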

--rg-help

Show help for ripgrep itself

--rg-version

Show version of ripgrep itself

-V, --version

Prints version information

OPTIONS:

--rga-adapters=<adapters>...

Change which adapters to use and in which priority order (descending)

"foo,bar" means use only adapters foo and bar. "-bar,baz" means use all default adapters except for bar and baz. "+bar,baz" means use all default adapters and also bar and baz.
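
The three forms can be modelled with a few lines of shell. This is a sketch of the semantics described above, not rga's code: resolve_adapters is a hypothetical name, the default list is copied from the adapter listing in this document, and appending the "+" adapters after the defaults is an assumption about their priority.

```shell
# Default adapters, as listed under "Available Adapters" in this document.
DEFAULT_ADAPTERS="ffmpeg pandoc poppler zip decompress tar sqlite"

# Hypothetical model of the --rga-adapters value: print the resulting
# adapter list for a spec such as "foo,bar", "-bar,baz" or "+bar,baz".
resolve_adapters() {
	local spec="$1" names out a
	# Drop a leading + or - and turn the comma list into words.
	names="$(printf '%s' "${spec#[+-]}" | tr ',' ' ')"
	case "$spec" in
		-*) # all defaults except the listed names
			out=""
			for a in $DEFAULT_ADAPTERS; do
				case " $names " in
					*" $a "*) ;;
					*) out="$out $a" ;;
				esac
			done ;;
		+*) out="$DEFAULT_ADAPTERS $names" ;;  # defaults plus the listed names (order assumed)
		*)  out="$names" ;;                    # only the listed names
	esac
	# Word-split on purpose, then join with commas.
	printf '%s\n' $out | paste -sd, -
}

resolve_adapters "poppler,zip"
resolve_adapters "-ffmpeg,sqlite"
resolve_adapters "+pdfpages,tesseract"
```

Note that the prefix applies to the whole list, so "-ffmpeg,sqlite" removes both ffmpeg and sqlite.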

--rga-cache-compression-level=<cache-compression-level>

ZSTD compression level to apply to adapter outputs before storing in cache db

Ranges from 1 - 22 [default: 12]

--rga-cache-max-blob-len=<cache-max-blob-len>

Max compressed size to cache

Longest byte length (after compression) to store in cache. Longer adapter outputs will not be cached and will be recomputed every time. Allowed suffixes: k M G [default: 2000000]

--rga-max-archive-recursion=<max-archive-recursion>

Maximum nestedness of archives to recurse into [default: 4]

-h shows a concise overview, --help shows more detail and advanced options.

All other options not shown here are passed directly to rg, especially [PATTERN] and [PATH ...]

Development

To enable debug logging:

export RUST_LOG=debug
export RUST_BACKTRACE=1

Also remember to disable caching with --rga-no-cache or clear the cache (~/Library/Caches/rga on macOS, ~/.cache/rga on other Unixes, or C:\Users\username\AppData\Local\rga on Windows) to debug the adapters.
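
The two environment variables and the cache flag can be combined for a single run without exporting anything. A minimal sketch, where the function name rga_debug is hypothetical:

```shell
# Hypothetical wrapper, not part of rga: one debug run with logging and
# backtraces enabled and the cache bypassed, without touching the shell's
# exported environment.
rga_debug() {
	RUST_LOG=debug RUST_BACKTRACE=1 rga --rga-no-cache "$@"
}
```

Call it with the usual arguments, for example rga_debug "hello" demo/.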

Comments

Can't install ripgrep_all with cargo from crates

I tried to update ripgrep_all, but it failed on what appears to be a dependency issue. If this has been resolved on master, it’s probably worth cutting a new version to crates.io that fixes this.

$ cargo install ripgrep_all
    Updating crates.io index
  Installing ripgrep_all v0.9.6
error: failed to compile `ripgrep_all v0.9.6`, intermediate artifacts can be found at `/var/folders/6t/2xp3hp2j1vd5yyklq0b6_lbw0000gn/T/cargo-install3jDwt5`

Caused by:
  failed to select a version for the requirement `cachedir = "^0.1.1"`
  candidate versions found which didn't match: 0.2.0
  location searched: crates.io index
required by package `ripgrep_all v0.9.6`

$ rustc --version
rustc 1.44.1 (c7087fe00 2020-06-17)
$ cargo --version
cargo 1.44.1 (88ba85757 2020-06-11)

opened by halostatue 12

error: failed to compile `ripgrep_all v0.9.6`

Is it necessary to use cachedir = "^0.1.1"?

$ cargo install ripgrep_all
    Updating crates.io index
  Downloaded ripgrep_all v0.9.6
  Downloaded 1 crate (113.1 KB) in 0.62s
  Installing ripgrep_all v0.9.6
error: failed to compile `ripgrep_all v0.9.6`, intermediate artifacts can be found at `/var/folders/_p/36jhfrg52ld1643fftf08x740000gp/T/cargo-installuJ8iui`

Caused by:
  failed to select a version for the requirement `cachedir = "^0.1.1"`
  candidate versions found which didn't match: 0.3.0, 0.2.0
  location searched: crates.io index
  required by package `ripgrep_all v0.9.6`

$ rustup -V
rustup 1.23.1 (3df2264a9 2020-11-30)
info: This is the version for the rustup toolchain manager, not the rustc compiler.
info: The currently active `rustc` version is `rustc 1.48.0 (7eac88abb 2020-11-16)`

$ cargo -V
cargo 1.48.0 (65cbdd2dc 2020-10-14)

opened by slowkow 11

`Ripgrep-all` runs very slowly for the first time after the computer starts.

On Ubuntu 20.04.3 LTS, I'm using the self-compiled git master version of ripgrep-all. I noticed that it runs very slowly for the first time after the computer starts. Therefore, I think it must work based on caching mechanism. The problem is how to maintain the cache after the computer is restarted to maximize the operation efficiency.

Any hints for this problem will be highly appreciated.

Regards, HZ

opened by hongyi-zhao 8

Fix installation and CI

Fixes installation with the stable toolchain. Essentially it's just cargo update

Fixes the push pipeline, now it fails on a test

Fixes the release pipeline

opened by TriplEight 8

error running pdf search on windows 10 - 64bit

I tried running the pdf search with the adapter "poppler" on both version 0.9.2 and 0.9.3 and I get the following error message. What am I missing here?

Reference.pdf: preprocessor command failed: '"rga-preproc" "Reference.pdf"':
-------------------------------------------------------------------------------
adapter: poppler
pdftotext version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>             : first page to convert
  -l <int>             : last page to convert
  -layout              : maintain original physical layout
  -simple              : simple one-column page layout
  -table               : similar to -layout, but optimized for tables
  -lineprinter         : use strict fixed-pitch/height layout
  -raw                 : keep strings in content stream order
  -fixed <number>      : assume fixed-pitch (or tabular) text
  -linespacing <number>: fixed line spacing for LinePrinter mode
  -clip                : separate clipped text
  -nodiag              : discard diagonal text
  -enc <string>        : output text encoding name
  -eol <string>        : output end-of-line convention (unix, dos, or mac)
  -nopgbrk             : don't insert page breaks between pages
  -bom                 : insert a Unicode BOM at the start of the text file
  -opw <string>        : owner password (for encrypted files)
  -upw <string>        : user password (for encrypted files)
  -q                   : don't print any messages or errors
  -cfg <string>        : configuration file to use in place of .xpdfrc
  -v                   : print copyright and version info
  -h                   : print usage information
  -help                : print usage information
  --help               : print usage information
  -?                   : print usage information
Error: The pipe has been ended. (os error 109)

opened by neelabalan 8

preprocessor command failed: '"rga-preproc" "/Users/user/Desktop/test/test.pdf.zip"

I am getting this error while executing:

rga "hello" ~/Desktop/test/

where I have a zip file. I don't understand from the documentation whether ZIP files need an extra argument or not. Thanks in advance.

opened by AtomicNess123 7

brew troubles?

I did an install on macOS Catalina (10.15.7) using brew install rga and for all files I test this on I get this error when I try rga testing:

-------------------------------------------------------------------------------
./some file.pdf: preprocessor command failed: '"/usr/local/bin/rga-preproc" "./some file.pdf"':
-------------------------------------------------------------------------------
adapter: poppler
Error: Couldn't open file '-'
Error: Broken pipe (os error 32)

So then I went ahead and installed all the additional libraries mentioned in the readme with brew install pandoc poppler tesseract ffmpeg but this didn't seem to help at all. Even tried reinstalling rga after that.

opened by hyperjeff 6

Respect .rgignore

Hi, first of all thanks for this - this is incredibly useful for me. Caching makes all the difference comparing to the slow pdfgrep!

It seems rga doesn't respect .rgignore file? Would it be possible to add it please?

opened by rsuhada 6

Build fails with "unstable feature" error in rkv dependency

I tried doing cargo build (of master at commit ef2e4ebf28f) and got this error:

  $ cargo build 
      Updating crates.io index
   Downloading crates ...
    Downloaded chrono v0.4.6
    Downloaded encoding_rs v0.8.17
    [...]
     Compiling zip v0.5.2
     Compiling serde_json v1.0.39
     Compiling rkv v0.9.6
  error[E0658]: use of unstable library feature 'try_from' (see issue #33417)
     --> /home/kfogel/.cargo/registry/src/github.com-1ecc6299db9ec823/rkv-0.9.6/src/error.rs:166:11
      |
  166 | impl From<::std::num::TryFromIntError> for MigrateError {
      |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  
  error[E0658]: use of unstable library feature 'try_from' (see issue #33417)
    --> /home/kfogel/.cargo/registry/src/github.com-1ecc6299db9ec823/rkv-0.9.6/src/migrate.rs:78:5
     |
  78 |     convert::TryFrom,
     |     ^^^^^^^^^^^^^^^^
  
  [...many more similar error lines...]
  
  error: aborting due to 12 previous errors
  
  For more information about this error, try `rustc --explain E0658`.
  error: Could not compile `rkv`.
  warning: build failed, waiting for other jobs to finish...
  error: build failed
  $

I don't know much Rust, but it looks like rkv is using an unstable feature (rust bug 33417 has more about it), and that since rga depends on rkv, this affects the rga build too. I ran rustc --explain E0658 and got some information about how to solve the problem -- presumably those solutions would have to be implemented upstream in rkv, if we wanted to solve this for everyone, or else I'd have either build a modified rkv locally or get the nightly version of rustc to do the build I just tried to do.

I'm not sure what ways might be available to solve this within rga. Ideas welcome; like I said, I don't know Rust that well.

Anyway, this was all along the way to submitting a PR for README.md to add installation instructions. I'll submit that PR, and then in its commentary mention this issue.

opened by kfogel 6

Opening a xml file and ran code inside when it wasn't supposed to (security?)

I used rga-fzf to search for a xml file. That file had a powershell script in it. When clicking on enter to open the file, the powershell script got executed which wasn't intended as it was malicious 😅

I am using Manjaro (Arch Linux) with zsh and powershell+wine installed.

Did anyone else observed that?

opened by evilcel3ri 5

feature_request(books): detect incorrect and poor quality text

1. Summary

It would be nice if ripgrep-all showed a warning when the text in a book is incorrect or of bad quality.

2. Problem

2.1. Summary

Some books have bad OCR layer. It is impossible to search for normal words in them. It would be nice, if ripgrep-all will detect these books.

2.2. Details

Books may have bad quality of searchable text. Reasons:

The user who added the OCR layer to the book chose the wrong OCR language. For example, an English OCR layer may have been added to a Russian text, as in my example 4.2.

The scanned book and/or the tool used to add the OCR layer was of bad quality. See my example 4.3.

I couldn't find, how I can automatically detect these books in my books list. Currently, I need manually check OCR layer quality for every book. It takes a lot of time.

3. Compact Language Detector

Possibly, Compact Language Detector can solve this problem.

I installed cld2-cffi (yes, CLD3 exists, but I have problems in its installation on my Windows) → I ran this code in my Python interpreter:

>>> import cld2
>>> isReliable, textBytesFound, details = cld2.detect("Here text from examples 4.1—4.3")
>>> print('  details: %s' % str(details))

It may be possible to get similar behavior using Rust tools. For example, see Whatlang and CLD3 langdetect.

4. Example texts

4.1. Normal Russian text

Например, название Полтавы связано с названием речки Лтавы (так раньше называлась Ворскла) и означает, соответственно, «город на Лтаве». Название города Ужгород также образовано от названия реки Уж. Винница обязана своим названием речке Винничке, которая протекает через город. Название реки, в свою очередь, происходит от слова «венок»: когда-то молодые девушки собирались на ее берегу и пускали на воду венки, чтобы узнать о своем будущем. Луганск назван в честь речки Луганки.

cld2-cffi output:

details: (Detection(language_name='RUSSIAN', language_code='ru', percent=99, score=709.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))

4.2. English OCR language for Russian text

IIepBhle HeCKOJI:bKO COT MHJIJIHOHOB JIeT 6h1JIH nOHCTHHe KOWMapHhlMH ,n;JIH nJIaHeThI: OHa HenpephlBHO COTpHC8.JIaC:b no,n; y,n;apaMH KpynHhlx MeTeopHTOB, ChlnaBWHXCH Ha Hee H3 KOCMoca. IIoBepxHOCT:b COBpeMeHHOH JIYHhI, nOKpLITaSi MeTeopHTHhlMH KpaTepaMH, n03BOJIHeT HaM npe,n;CTaBHT:b, KaK MOrJIa BhlrJIH,n;eT:b 3eMJIH npHMepHO 4 MJIp,n; JIeT Ha3. OqeH:b CKOpO BHyrpH HaweH nJIaHeThl3apa60T8.JI tTenJIOBOH ABHraTeJI:b., rOplOqHM ,n;JIH KOToporo CJIymHJI pacn pHoaKTHBHhlX SJIeMeHTOB. B He,n;pax 3eMJIH HaqaJIOCh Me,n;JIeHHOe ,n;BHmeHHe BeeCTBa, HarpeThle CTPYH KOToporo nOAHHM8.JIHC:b BBepx, a XOJIO.D;Hhle onYCK8.JIHCh BHH3. IIJIaHeTa CT8.JIa noxoma Ha CneJIhlH nepCHK.

details: (Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))

Note: I remove Information Separator One gremlin characters from this text for cld2-cffi, otherwise I get traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python38\lib\site-packages\cld2\__init__.py", line 393, in detect
    raise ValueError("input contains invalid UTF-8 around byte " +
ValueError: input contains invalid UTF-8 around byte 348 (of 792534779)

4.3. Bad OCR

1(этрин ска3ала' что знакома с книгой его )кены 3леоноры 8ирек по лекарстве[1пым расте|{ия!| аляски. ёа мой в3дох по поводу тогц что у нас в библиотеке тодько од'!а книга на эту фамилию, (этрин пообещала прислать книгу о лекарстве!тных расте]|иях аляски. !! действительно' не прошло и месяца' как у меня на столе появилась небольшая по объему эффектвого дизайва книга <а|-а5'(а'5 ш||овпшп$5 мвр1с]ш85> с изображе1|и₠ м ца обложке такого 3вакомого ка}<дому х{ителю нашей о6ласти ольховника. правда, в книге он значился 11од другим видовым названием' чем у ;1ас

details: (Detection(language_name='RUSSIAN', language_code='ru', percent=59, score=503.0), Detection(language_name='SERBIAN', language_code='sr', percent=40, score=468.0), Detection(language_name='Unknown', language_code='un', percent=0, score=0.0))

5. Example of expected behavior

ripgrep-all adapters extract text from books.
A CLD (or similar) tool checks 2 (4 may be better) random pages of every book.

If the percent value is 95 or more (another threshold may be better; practical tests are needed) → do nothing. If it is below 95 → the ripgrep-all user gets a warning. Example warning text:

WARNING! Possibly, file {Filename} have a text written not in natural language. The reason for this may be incorrect or poor quality OCR layer. Please, check your {Filename}.

6. Note

Some language-recognition tools may not solve this problem, because they don't detect that a text is not written in a real natural language.

For example, I tried the langdetect, TextBlob, guess_language and langid examples from this Stack Overflow answer → they report that my examples 4.2 and 4.3 are written in real natural languages.

Thanks.

opened by Kristinita 5

Ignore Feature

I've noticed that when I'm running RGA on a folder it's hitting the lock files created by Word when a file is open and the preprocessor fails.

Chapter 4/~$cture4.docx: preprocessor command failed: '"/usr/local/bin/rga-preproc" "Chapter 4/~$cture4.docx"':
-------------------------------------------------------------------------------
adapter: pandoc
[WARNING] Deprecated: --atx-headers. Use --markdown-headings=atx instead.
couldn't unpack docx container: Did not find end of central directory signature
Error: subprocess failed: ExitStatus(unix_wait_status(16128))
-------------------------------------------------------------------------------

opened by pu-238 0

Can't find `rga-preproc` under GitBash (Windows)

rga works fine under PowerShell and WSL (Ubuntu under Windows), but when I run rga from Git Bash it fails with preprocessor command failed: '"C:\\MyPrograms\\bin\\rga-preproc" "file.pdf"' followed by

adapter: poppler
pdftotext version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>             : first page to convert
  -l <int>             : last page to convert
  -layout              : maintain original physical layout
  -simple              : simple one-column page layout
  -table               : similar to -layout, but optimized for tables
  -lineprinter         : use strict fixed-pitch/height layout
  -raw                 : keep strings in content stream order
  -fixed <number>      : assume fixed-pitch (or tabular) text
  -linespacing <number>: fixed line spacing for LinePrinter mode
  -clip                : separate clipped text
  -nodiag              : discard diagonal text
  -enc <string>        : output text encoding name
  -eol <string>        : output end-of-line convention (unix, dos, or mac)
  -nopgbrk             : don't insert page breaks between pages
  -bom                 : insert a Unicode BOM at the start of the text file
  -opw <string>        : owner password (for encrypted files)
  -upw <string>        : user password (for encrypted files)
  -q                   : don't print any messages or errors
  -cfg <string>        : configuration file to use in place of .xpdfrc
  -v                   : print copyright and version info
  -h                   : print usage information
  -help                : print usage information
  --help               : print usage information
  -?                   : print usage information
Error: The pipe has been ended. (os error 109)

opened by kpym 0

Project state and Help Wanted: rga 1.0 with configurable external adapters and async rust

The current version of rga is 0.9.6, released in 2020.

This is a small side project for me, so I've only spent very little time on this project even though I've regularly been using this tool myself.

For the next version the focus is on being able to configure custom preprocessors in addition to the internal ones.

For example, the integrated PDF adapter is rewritten and would look pretty much like this in ~/.config/ripgrep-all/config.jsonc:

{
  "custom_adapters": [
    {
      "name": "poppler",
      "version": 1,
      "description": "Uses pdftotext (from poppler-utils) to extract plain text from PDF files",
      "extensions": ["pdf"],
      "mimetypes": ["application/pdf"],
      "binary": "pdftotext",
      "args": ["-", "-"],
      "disabled_by_default": false,
      "match_only_by_mime": false,
      "postprocessors": [{"name": "add_page_numbers_by_pagebreaks"}]
    }
  ]
}

While implementing this, I hit some threading issues that exceeded my Rust knowledge, so I stopped working on it for a while.

More recently, I converted the core of the code to async rust (now passing around Box<dyn AsyncRead + Send>).

The following work still needs to be done:

Fixing / Converting the postprocessors to async. Specifically postproc_encoding and postproc_pagebreaks in https://github.com/phiresky/ripgrep-all/blob/master/src/adapters/postproc.rs . postproc_prefix already works with async.

Reenabling and converting the other internal adapters to async (https://github.com/phiresky/ripgrep-all/blob/54799f14528c35f6b2bfe5f05cb2e05e58ff5d10/src/adapters.rs#L120-L126)

Fixing all the failing tests and possibly adding new ones.

Making sure recursion into archives works with any combination of adapters

I'll implement these myself at some point, but at a trickling rate that may take a long time until the next release. So I'm happy for PRs that help.

opened by phiresky 1

--auto-hybrid-regex bad offset into UTF string

--auto-hybrid-regex should produce the same output as --pcre2 if one passes in a PCRE2 pattern. However, it throws an error on some files.

I've included the following examples and files.

The offending PDF file to reproduce the error can be downloaded here: https://onlinelibrary.wiley.com/doi/10.1111/j.1472-4642.2008.00521.x
Another file which does not present this problem and shows identical output: https://onlinelibrary.wiley.com/doi/10.1111/j.1365-2486.2011.02549.x

With --auto-hybrid-regex:

rga '(?=.*biotic interaction)(?=.*plant)' 'Diversity and Distributions - 2008 - Catford - Reducing redundancy in invasion ecology by integrating hypotheses into a.pdf' --auto-hybrid-regex

Faulty output:

Diversity and Distributions - 2008 - Catford - Reducing redundancy in invasion ecology by integrating hypotheses into a.pdf: preprocessor command failed: '"/home/aj/.cargo/bin/rga-preproc" "Diversity and Distributions - 2008 - Catford - Reducing redundancy in invasion ecology by integrating hypotheses into a.pdf"': PCRE2: error matching: bad offset into UTF string

With --pcre2:

rga '(?=.*biotic interaction)(?=.*plant)' 'Diversity and Distributions - 2008 - Catford - Reducing redundancy in invasion ecology by integrating hypotheses into a.pdf' --pcre2

Correct output:

Page 18: Vázquez, D.P. (2006) Biotic interactions and plant invasions.

opened by InvisOn 0

Searching encrypted files

I have some password protected PDF files that I'd like to search with rga. How do I specify the password?

More generally, is it possible to pass options to the underlying adapters on invocation?

opened by dideler 0

Releases(v0.9.6)

v0.9.6 (May 19, 2020)

Fix windows builds

Case insensitive file extension matching

Move to Github Actions instead of Travis

Fix searching for words that are hyphenated in PDFs (#44)

Always load rga-preproc binary from location where rga is

Source code(tar.gz)
Source code(zip)
ripgrep_all-v0.9.6-arm-unknown-linux-gnueabihf.tar.gz(4.37 MB)
ripgrep_all-v0.9.6-x86_64-apple-darwin.tar.gz(4.94 MB)
ripgrep_all-v0.9.6-x86_64-pc-windows-msvc.zip(4.34 MB)
ripgrep_all-v0.9.6-x86_64-unknown-linux-musl.tar.gz(4.83 MB)

v0.9.5 (Apr 8, 2020)

Allow search in pdf files without extension (https://github.com/phiresky/ripgrep-all/issues/39)

Prefer shipped binaries to system-installed ones (https://github.com/phiresky/ripgrep-all/issues/32)

Upgrade dependencies

If you use Windows, use version 0.9.3 (see #41)
Source code(tar.gz)
Source code(zip)
ripgrep_all-v0.9.5-x86_64-apple-darwin.tar.gz(5.00 MB)
ripgrep_all-v0.9.5-x86_64-unknown-linux-musl.tar.gz(6.46 MB)

Owner

CS Student. ML Researcher. Fan of FOSS.
