Lens crawler & cacher

Overview

netrunner

netrunner is a tool to help build, validate, & create archives for Spyglass lenses.

Lenses are a simple set of rules that tell a crawler which URLs it should crawl. Combined w/ data from sitemaps and/or the Internet Archive, netrunner can crawl and create an archive of the pages represented by the lens.

In Spyglass, this is used to create a personalized search engine that only crawls & indexes sites/pages/data that you're interested in.
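
For reference, a lens file is a small, declarative config. The sketch below is illustrative only; the field names (version, name, domains, urls, rules) and the RON-style syntax are assumptions based on typical Spyglass lenses rather than a definitive schema, so check the Spyglass documentation for the exact format.

    // Hypothetical example lens; see the Spyglass docs for the real schema.
    (
        version: "1",
        name: "rustlang",
        domains: ["doc.rust-lang.org"],
        urls: ["https://doc.rust-lang.org/book/"],
        rules: [],
    )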

Installing the CLI

From crates.io:

cargo install spyglass-netrunner

or from source:

cargo install --path .

Running the CLI

netrunner 0.1.0
Andrew Huynh <[email protected]>
A small CLI utility to help build lenses for spyglass

USAGE:
    netrunner [OPTIONS] <SUBCOMMAND>

OPTIONS:
    -h, --help                Print help information
    -l, --lens-file <FILE>    Lens file
    -V, --version             Print version information

SUBCOMMANDS:
    check-urls    Grabs all the URLs represented by <lens-file> for review
    clean         Removes temporary directories/files
    crawl         Crawls & creates a web archive for the pages represented by <lens-file>
    help          Print this message or the help of the given subcommand(s)
    validate      Validate the lens file and, if available, the cached web archive for
                      <lens-file>

Commands in depth

check-urls

This command will grab all the URLs represented by the lens. The list is gathered in one of two ways:

  1. From the sitemap(s) of the domain(s) represented. If sitemaps are available, the tool will prioritize them; or

  2. Using data from the Internet Archive to determine which URLs are represented by the rules in the lens.

The list will then be sorted alphabetically and written to stdout. This is a great way to check whether the URLs that will be crawled & indexed are what you're expecting for a lens.
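
For example, assuming a lens file named wikipedia.ron (a placeholder name), you could review the full URL list and save it for inspection like so:

netrunner --lens-file wikipedia.ron check-urls > urls.txt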

crawl

This will use the rules defined in the lens to crawl & archive the pages within.

This is primarily used as a way to create cached web archives that can be distributed w/ community-created lenses so that others don't have to crawl the same pages.
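
A typical (assumed) invocation, using the same placeholder lens file as above:

netrunner --lens-file wikipedia.ron crawl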

validate

This will validate the rules inside a lens file and, if previously crawled, the cached web archive for this lens.
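
Likewise, validation can be run against the same placeholder lens file:

netrunner --lens-file wikipedia.ron validate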
