url-crawler

Overview

A configurable parallel web crawler, designed to crawl a website for content.

Example

extern crate url_crawler;
use std::sync::Arc;
use url_crawler::*;

/// Function for filtering content in the crawler before a HEAD request.
///
/// Only allow directory entries, and files that have the `deb` extension.
fn apt_filter(url: &Url) -> bool {
    let url = url.as_str();
    url.ends_with("/") || url.ends_with(".deb")
}

pub fn main() {
    // Create a crawler designed to crawl the given website.
    let crawler = Crawler::new("http://apt.pop-os.org/".to_owned())
        // Use four threads for fetching
        .threads(4)
        // Check if a URL matches this filter before performing a HEAD request on it.
        .pre_fetch(Arc::new(apt_filter))
        // Initialize the crawler and begin crawling. This returns immediately.
        .crawl();

    // Process url entries as they become available
    for file in crawler {
        println!("{:#?}", file);
    }
}
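The pre-fetch filter is an ordinary function, so its logic can be checked in isolation. A minimal standalone sketch, using plain &str in place of url_crawler's Url type (apt_filter_str is a hypothetical stand-in, not part of the crate):

```rust
// Stand-in for apt_filter above, operating on &str instead of &Url.
// Accepts directory listings (trailing "/") and .deb package files.
fn apt_filter_str(url: &str) -> bool {
    url.ends_with("/") || url.ends_with(".deb")
}

fn main() {
    // Directory entries and .deb files pass the filter...
    assert!(apt_filter_str("http://apt.pop-os.org/proprietary/pool/"));
    assert!(apt_filter_str("http://apt.pop-os.org/pool/atom_1.30.0_amd64.deb"));
    // ...while anything else is skipped before a HEAD request is made.
    assert!(!apt_filter_str("http://apt.pop-os.org/robots.txt"));
    println!("filter ok");
}
```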

Output

The following are two snippets from the combined output.

...
Html {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/source/s/system76-cudnn-9.2/"
}
Html {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/source/t/tensorflow-1.9-cuda-9.2/"
}
Html {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/source/t/tensorflow-1.9-cpu/"
}
...
File {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/binary-amd64/a/atom/atom_1.30.0_amd64.deb",
    content_type: "application/octet-stream",
    length: 87689398,
    modified: Some(
        2018-09-25T17:54:39+00:00
    )
}
File {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/binary-amd64/a/atom/atom_1.31.1_amd64.deb",
    content_type: "application/octet-stream",
    length: 90108020,
    modified: Some(
        2018-10-03T22:29:15+00:00
    )
}
...
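The Debug output above suggests that crawled entries come in two variants: Html (a page whose links will be followed) and File (a fetched resource with metadata from its HEAD response). A hedged sketch of consuming such entries with a match, using a local enum whose shape is inferred from the output (the crate's actual type may carry different fields):

```rust
// Local stand-in for the crawler's entry type, modeled on the
// Debug output above; not the crate's own definition.
enum Entry {
    Html { url: String },
    File { url: String, content_type: String, length: u64 },
}

// Summarize an entry instead of Debug-printing it wholesale.
fn describe(entry: &Entry) -> String {
    match entry {
        Entry::Html { url } => format!("page: {}", url),
        Entry::File { url, content_type, length } => {
            format!("file: {} ({}, {} bytes)", url, content_type, length)
        }
    }
}

fn main() {
    let entries = vec![
        Entry::Html {
            url: "http://apt.pop-os.org/proprietary/pool/".into(),
        },
        Entry::File {
            url: "http://apt.pop-os.org/pool/atom_1.30.0_amd64.deb".into(),
            content_type: "application/octet-stream".into(),
            length: 87689398,
        },
    ];
    for entry in &entries {
        println!("{}", describe(entry));
    }
}
```

Matching like this lets a consumer route pages and files to different handlers as entries stream in, rather than post-processing the combined output.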