A lightweight async Web crawler in Rust, optimized for concurrent scraping while respecting `robots.txt` rules.

Related tags

Command-line crawly
Overview

πŸ•·οΈ crawly

A lightweight and efficient web crawler in Rust, optimized for concurrent scraping while respecting robots.txt rules.

Crates.io License: MIT Version Repository Homepage

πŸš€ Features

  • Concurrent crawling: Takes advantage of concurrency for efficient scraping across multiple cores;
  • Respects robots.txt: Automatically fetches and adheres to website scraping guidelines;
  • DFS algorithm: Uses a depth-first search algorithm to crawl web links;
  • Customizable with Builder Pattern: Tailor the depth of crawling, rate limits, and other parameters effortlessly;
  • Cloudflare's detection: If the destination URL is hosted with Cloudflare and a mitigation is found, the URL will be skipped;
  • Built with Rust: Guarantees memory safety and top-notch speed.

πŸ“¦ Installation

Add crawly to your Cargo.toml:

[dependencies]
crawly = "^0.1"

πŸ› οΈ Usage

A simple usage example:

use anyhow::Result;
use crawly::Crawler;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = Crawler::new()?;
    let results = crawler.crawl_url("https://example.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}

Using the Builder

For more refined control over the crawler's behavior, the CrawlerBuilder comes in handy:

use anyhow::Result;
use crawly::CrawlerBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = CrawlerBuilder::new()
        .with_max_depth(10)
        .with_max_pages(100)
        .with_max_concurrent_requests(50)
        .with_rate_limit_wait_seconds(2)
        .with_robots(true)
        .build()?;
    
    let results = crawler.crawl_url("https://www.example.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}

πŸ›‘οΈ Cloudflare

This crate will detect Cloudflare hosted sites and if the header cf-mitigated is found, the URL will be skipped without throwing any error.

πŸ“œ Tracing

Every function is instrumented, also this crate will emit some DEBUG messages for better comprehending the crawling flow.

🀝 Contributing

Contributions, issues, and feature requests are welcome!

Feel free to check issues page. You can also take a look at the contributing guide.

πŸ“ License

This project is MIT licensed.

πŸ’Œ Contact

You might also like...
A Rust synchronisation primitive for "Multiplexed Concurrent Single-Threaded Read" access

exit-left verb; 1. To exit or disappear in a quiet, non-dramatic fashion, making way for more interesting events. 2. (imperative) Leave the scene, and

Concurrent and multi-stage data ingestion and data processing with Rust+Tokio

TokioSky Build concurrent and multi-stage data ingestion and data processing pipelines with Rust+Tokio. TokioSky allows developers to consume data eff

Spawn multiple concurrent unix terminals in Discord

Using this bot can be exceedingly dangerous since you're basically granting people direct access to your shell.

A concurrent, append-only vector
A concurrent, append-only vector

The vector provided by this crate suports concurrent get and push operations. Reads are always lock-free, as are writes except when resizing is required.

Implementation of CSP for concurrent programming.
Implementation of CSP for concurrent programming.

CSPLib Communicating Sequential Processes (CSP) Background Communicating Sequential Processes (CSP) is a way of writing a concurrent application using

⚑ Blazing fast async/await HTTP client for Python written on Rust using reqwests

Reqsnaked Reqsnaked is a blazing fast async/await HTTP client for Python written on Rust using reqwests. Works 15% faster than aiohttp on average RAII

Async filesystem facade for Rust!

floppy-disk floppy disk is a WIP, async-only filesystem facade for Rust. What? Have you ever worked with std::fs? tokio::fs? Then you've probably real

A simple Rust library for OpenAI API, free from complex async operations and redundant dependencies.

OpenAI API for Rust A community-maintained library provides a simple and convenient way to interact with the OpenAI API. No complex async and redundan

⚑️ Blazing fast terminal file manager written in Rust, based on async I/O.

Yazi - ⚑️ Blazing Fast Terminal File Manager Yazi ("duck" in Chinese) is a terminal file manager written in Rust, based on non-blocking async I/O. It

Owner
CrystalSoft
We are more than a software house company.
CrystalSoft
🌊 ~ seaward is a crawler which searches for links or a specified word in a website.

?? seaward Installation cargo install seaward On NetBSD a pre-compiled binary is available from the official repositories. To install it, simply run:

null 3 Jul 16, 2023
rsv is a command line tool to deal with small and big CSV, TXT, EXCEL files (especially >10G)

csv, excel toolkit written in Rust rsv is a command line tool to deal with small and big CSV, TXT, EXCEL files (especially >10G). rsv has following fe

Zhuang Dai 39 Jan 30, 2023
'apk-yara-checker' is a little CLI tool written in Rust to check Yara rules against a folder of APK files.

apk-yara-checker 'apk-yara-checker' is a little CLI tool written in Rust to check Yara rules against a folder of APK files. You have to pass the folde

alberto__segura 15 Oct 5, 2022
Rust derive-based argument parsing optimized for code size

Argh Argh is an opinionated Derive-based argument parser optimized for code size Derive-based argument parsing optimized for code size and conformance

Google 1.3k Dec 28, 2022
Interpreted, optimized, JITed and compiled implementations of the Brainfuck lang.

Interpreted, Optimized, JITed and Compiled Brainfuck implementations This repo contains a series of brainfuck implementations based on Eli Bendersky b

Rodrigo Batista de Moraes 5 Jan 6, 2023
Solutions for exact and optimized best housing chains in BDO using popjumppush and MIP.

Work in progress. About This project is an implementation of the pop_jump_push algorithm. It uses graph data from the MMORPG Black Desert Online's tow

Thell 'Bo' Fowler 3 May 2, 2023
A simple CLI I made while practicing rust to easily make QR codes with just one command, all in your terminal.

Welcome to rust-qrcode-cli ?? A CLI I made while practicing rust to easily make QR codes with just one command, all in your terminal. Install git clon

Dhravya Shah 2 Mar 2, 2022
Firefox used to have this feature a while back (from Firefox 11 to 46) and it is so good, that I feel it needs revival.

3D WebPage Inspector By: Seanpm2001, Et; Al. Top README.md Read this article in a different language Sorted by: A-Z Sorting options unavailable ( af A

Sean P. Myrick V19.1.7.2 3 Nov 10, 2022
πŸ… A command-line tool to get and set values in toml files while preserving comments and formatting

tomato Get, set, and delete values in TOML files while preserving comments and formatting. That's it. That's the feature set. I wrote tomato to satisf

C J Silverio 15 Dec 23, 2022
zkPoEX enables white hat hackers to report live vulnerabilities in smart contracts while maintaining the confidentiality of the exploit

zkPoEX enables white hat hackers to report live vulnerabilities in smart contracts while maintaining the confidentiality of the exploit, facilitating efficient communication and collaboration between hackers and project owners for a more secure DeFi ecosystem.

zkoranges 135 Apr 16, 2023