Rust crate for configurable parallel web crawling, designed to crawl for content

Pop!_OS

Last update: Aug 22, 2021

Related tags

Network programming url-crawler

Overview

url-crawler

A configurable parallel web crawler, designed to crawl a website for content.

Example

extern crate url_crawler;
use std::sync::Arc;
use url_crawler::*;

/// Function for filtering content in the crawler before a HEAD request.
///
/// Only allow directory entries, and files that have the `deb` extension.
fn apt_filter(url: &Url) -> bool {
    let url = url.as_str();
    url.ends_with("/") || url.ends_with(".deb")
}

pub fn main() {
    // Create a crawler designed to crawl the given website.
    let crawler = Crawler::new("http://apt.pop-os.org/".to_owned())
        // Use four threads for fetching
        .threads(4)
        // Check if a URL matches this filter before performing a HEAD request on it.
        .pre_fetch(Arc::new(apt_filter))
        // Initialize the crawler and begin crawling. This returns immediately.
        .crawl();

    // Process url entries as they become available
    for file in crawler {
        println!("{:#?}", file);
    }
}
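The filtering rule in `apt_filter` can be checked on its own, without a network connection. The following is a minimal sketch that assumes a string-based variant of the filter; `apt_filter_str` and `accepted` are hypothetical helpers written for illustration and are not part of the crate.

```rust
/// Hypothetical string-based variant of `apt_filter`, so the rule can be
/// checked without constructing a `Url` or touching the network.
fn apt_filter_str(url: &str) -> bool {
    // Directory listings end with a slash; packages end with `.deb`.
    url.ends_with('/') || url.ends_with(".deb")
}

/// Applies the filter to a list of candidate URLs and keeps the accepted ones,
/// preserving their original order.
fn accepted<'a>(candidates: &[&'a str]) -> Vec<&'a str> {
    candidates
        .iter()
        .copied()
        .filter(|u| apt_filter_str(u))
        .collect()
}
```

Under this rule a directory URL such as `http://apt.pop-os.org/proprietary/` passes, while a file with any extension other than `deb` is rejected before a request is made for it.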

Output

The following shows two snippets from the combined output.

...
Html {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/source/s/system76-cudnn-9.2/"
}
Html {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/source/t/tensorflow-1.9-cuda-9.2/"
}
Html {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/source/t/tensorflow-1.9-cpu/"
}
...
File {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/binary-amd64/a/atom/atom_1.30.0_amd64.deb",
    content_type: "application/octet-stream",
    length: 87689398,
    modified: Some(
        2018-09-25T17:54:39+00:00
    )
}
File {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/binary-amd64/a/atom/atom_1.31.1_amd64.deb",
    content_type: "application/octet-stream",
    length: 90108020,
    modified: Some(
        2018-10-03T22:29:15+00:00
    )
}
...
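Entries like these can be summarized as they arrive, for example by counting the `.deb` files and adding up their sizes. The sketch below uses a local stand-in type, `Entry`, whose field names are taken from the output above. Its field types are assumptions, the `modified` field is left out to avoid an extra dependency, and the crate's real entry type may differ. `deb_totals` is likewise a hypothetical helper.

```rust
/// Local stand-in for the entries printed above. Field names follow the
/// output; the types are assumptions and `modified` is omitted.
#[derive(Debug)]
enum Entry {
    Html { url: String },
    File { url: String, content_type: String, length: u64 },
}

/// Returns the number of `.deb` files and their combined size in bytes.
/// `Html` entries and files with other extensions are ignored.
fn deb_totals(entries: &[Entry]) -> (usize, u64) {
    entries.iter().fold((0, 0), |(count, bytes), entry| match entry {
        Entry::File { url, length, .. } if url.ends_with(".deb") => {
            (count + 1, bytes + length)
        }
        _ => (count, bytes),
    })
}
```

With the real crate, the same match would presumably sit inside the `for file in crawler` loop from the example, accumulating totals instead of printing each entry.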

An efficient web server for TiddlyWikis.

Tiddlywiki Server This is a web backend for TiddlyWiki. It uses TiddlyWiki's web server API to save tiddlers in a [SQLite database]. It should come wi

17 Nov 19, 2022

A simple web server(and library) to display server stats over HTTP and Websockets/SSE or stream it to other systems.

x-server-stats A simple web server(and library) to display server stats over HTTP and Websockets/SSE or stream it to other systems. x-server(in x-serv

11 Oct 17, 2022

The open source distributed web search engine that searches by meaning.

DawnSearch DawnSearch is an open source distributed web search engine that searches by meaning. It uses semantic search (searching on meaning), using

4 Aug 8, 2023

Rust crate for scraping URLs from HTML pages

url-scraper Rust crate for scraping URLs from HTML pages. Example extern crate url_scraper; use url_scraper::UrlScraper; fn main() { let director

35 Aug 18, 2022

Safe Rust crate for creating socket servers and clients with ease.

bitsock Safe Rust crate for creating socket servers and clients with ease. Description This crate can be used for Client -- Server applications of e

3 Nov 25, 2021

Dav-server-rs - Rust WebDAV server library. A fork of the webdav-handler crate.

dav-server-rs A fork of the webdav-handler-rs project. Generic async HTTP/Webdav handler Webdav (RFC4918) is defined as HTTP (GET/HEAD/PUT/DELETE) plu

30 Dec 29, 2022

The netns-rs crate provides an ultra-simple interface for handling network namespaces in Rust.

netns-rs The netns-rs crate provides an ultra-simple interface for handling network namespaces in Rust. Changing namespaces requires elevated privileg

7 Dec 15, 2022

Rust utility crate for parsing, encoding and generating x25519 keys used by WireGuard

WireGuard Keys This is a utility crate for parsing, encoding and generating x25519 keys that are used by WireGuard. It exports custom types that can b

3 Aug 9, 2022

Rust crate providing a variety of automotive related libraries, such as communicating with CAN interfaces and diagnostic APIs

The Automotive Crate Welcome to the automotive crate documentation. The purpose of this crate is to help you with all things automotive related. Most

29 Mar 11, 2024

You might also like...

A Multitask Parallel Concurrent Executor for ns-3 (network simulator)

Tachyon is a performant and highly parallel reliable udp library that uses a nack based model

Hopper - Fast, configurable, lightweight Reverse Proxy for Minecraft

📡 Rust mDNS library designed with user interfaces in mind

axum-server is a hyper server implementation designed to be used with axum framework.

A multiplayer web based roguelike built on Rust and WebRTC

Multithreaded Web Server Made with Rust

MASQ combines the benefits of VPN and Tor technology to create a superior next-generation privacy software, where users are rewarded for supporting an uncensored global web. Users gain privacy and anonymity online, while helping promote Internet Freedom.

Crusty - polite && scalable broad web crawler

Getting the token's holder info and pushing to a web server.