A lightweight async Web crawler in Rust, optimized for concurrent scraping while respecting `robots.txt` rules.

Related tags

Command-line crawly
Overview

🕷️ crawly

A lightweight and efficient web crawler in Rust, optimized for concurrent scraping while respecting robots.txt rules.

Crates.io License: MIT Version Repository Homepage

🚀 Features

  • Concurrent crawling: Takes advantage of concurrency for efficient scraping across multiple cores;
  • Respects robots.txt: Automatically fetches and adheres to website scraping guidelines;
  • DFS algorithm: Uses a depth-first search algorithm to crawl web links;
  • Customizable with Builder Pattern: Tailor the depth of crawling, rate limits, and other parameters effortlessly;
  • Cloudflare's detection: If the destination URL is hosted with Cloudflare and a mitigation is found, the URL will be skipped;
  • Built with Rust: Guarantees memory safety and top-notch speed.

📦 Installation

Add crawly to your Cargo.toml:

[dependencies]
crawly = "^0.1"

🛠️ Usage

A simple usage example:

use anyhow::Result;
use crawly::Crawler;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = Crawler::new()?;
    let results = crawler.crawl_url("https://example.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}

Using the Builder

For more refined control over the crawler's behavior, the CrawlerBuilder comes in handy:

use anyhow::Result;
use crawly::CrawlerBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = CrawlerBuilder::new()
        .with_max_depth(10)
        .with_max_pages(100)
        .with_max_concurrent_requests(50)
        .with_rate_limit_wait_seconds(2)
        .with_robots(true)
        .build()?;
    
    let results = crawler.crawl_url("https://www.example.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}

🛡️ Cloudflare

This crate will detect Cloudflare hosted sites and if the header cf-mitigated is found, the URL will be skipped without throwing any error.

📜 Tracing

Every function is instrumented, also this crate will emit some DEBUG messages for better comprehending the crawling flow.

🤝 Contributing

Contributions, issues, and feature requests are welcome!

Feel free to check issues page. You can also take a look at the contributing guide.

📝 License

This project is MIT licensed.

💌 Contact

You might also like...
Cloud-optimized GeoTIFF ... Parallel I/O 🦀

cog3pio Cloud-optimized GeoTIFF ... Parallel I/O Yet another attempt at creating a GeoTIFF reader, in Rust, with Python bindings. Installation Rust ca

⚡🚀 Content Delivery Network written in Rustlang, optimized for speed and latency.
⚡🚀 Content Delivery Network written in Rustlang, optimized for speed and latency.

Supported Formats HTML Javscript Css Image PNG JPG JPEG GIF SVG Video MP4 WEBM FLV Audio OGG ACC MP3 Archives ZIP RAR Feeds & Data JSON YAML XML Docum

A simple CLI I made while practicing rust to easily make QR codes with just one command, all in your terminal.
A simple CLI I made while practicing rust to easily make QR codes with just one command, all in your terminal.

Welcome to rust-qrcode-cli 👋 A CLI I made while practicing rust to easily make QR codes with just one command, all in your terminal. Install git clon

A Rust synchronisation primitive for "Multiplexed Concurrent Single-Threaded Read" access

exit-left verb; 1. To exit or disappear in a quiet, non-dramatic fashion, making way for more interesting events. 2. (imperative) Leave the scene, and

Concurrent and multi-stage data ingestion and data processing with Rust+Tokio

TokioSky Build concurrent and multi-stage data ingestion and data processing pipelines with Rust+Tokio. TokioSky allows developers to consume data eff

Firefox used to have this feature a while back (from Firefox 11 to 46) and it is so good, that I feel it needs revival.
Firefox used to have this feature a while back (from Firefox 11 to 46) and it is so good, that I feel it needs revival.

3D WebPage Inspector By: Seanpm2001, Et; Al. Top README.md Read this article in a different language Sorted by: A-Z Sorting options unavailable ( af A

🍅 A command-line tool to get and set values in toml files while preserving comments and formatting

tomato Get, set, and delete values in TOML files while preserving comments and formatting. That's it. That's the feature set. I wrote tomato to satisf

zkPoEX enables white hat hackers to report live vulnerabilities in smart contracts while maintaining the confidentiality of the exploit
zkPoEX enables white hat hackers to report live vulnerabilities in smart contracts while maintaining the confidentiality of the exploit

zkPoEX enables white hat hackers to report live vulnerabilities in smart contracts while maintaining the confidentiality of the exploit, facilitating efficient communication and collaboration between hackers and project owners for a more secure DeFi ecosystem.

Spawn multiple concurrent unix terminals in Discord

Using this bot can be exceedingly dangerous since you're basically granting people direct access to your shell.

Owner
CrystalSoft
We are more than a software house company.
CrystalSoft
A CLI tool based on the crypto-crawler-rs library to crawl trade, level2, level3, ticker, funding rate, etc.

carbonbot A CLI tool based on the crypto-crawler-rs library to crawl trade, level2, level3, ticker, funding rate, etc. Run To quickly get started, cop

null 8 Dec 21, 2022
🌊 ~ seaward is a crawler which searches for links or a specified word in a website.

?? seaward Installation cargo install seaward On NetBSD a pre-compiled binary is available from the official repositories. To install it, simply run:

null 3 Jul 16, 2023
rsv is a command line tool to deal with small and big CSV, TXT, EXCEL files (especially >10G)

csv, excel toolkit written in Rust rsv is a command line tool to deal with small and big CSV, TXT, EXCEL files (especially >10G). rsv has following fe

Zhuang Dai 39 Jan 30, 2023
FUSE filesystem that provides FizzBuzz.txt(8 Exabyte)

FizzBuzzFS root@8a2db3fc6292:/# cd /mnt/FizzBuzz/ root@8a2db3fc6292:/mnt/FizzBuzz# ls -l total 9007199254740992 -rw-r--r-- 1 501 dialout 9223372036854

todesking 8 Oct 1, 2023
'apk-yara-checker' is a little CLI tool written in Rust to check Yara rules against a folder of APK files.

apk-yara-checker 'apk-yara-checker' is a little CLI tool written in Rust to check Yara rules against a folder of APK files. You have to pass the folde

alberto__segura 15 Oct 5, 2022
An m,n,k-game with Connect Four rules

Description A simple m,n,k-game with Connect Four rules (i.e. every token must be placed in the lowest possible position). The size of the board (m *

Elias 3 Nov 21, 2023
Rust derive-based argument parsing optimized for code size

Argh Argh is an opinionated Derive-based argument parser optimized for code size Derive-based argument parsing optimized for code size and conformance

Google 1.3k Dec 28, 2022
Interpreted, optimized, JITed and compiled implementations of the Brainfuck lang.

Interpreted, Optimized, JITed and Compiled Brainfuck implementations This repo contains a series of brainfuck implementations based on Eli Bendersky b

Rodrigo Batista de Moraes 5 Jan 6, 2023
Solutions for exact and optimized best housing chains in BDO using popjumppush and MIP.

Work in progress. About This project is an implementation of the pop_jump_push algorithm. It uses graph data from the MMORPG Black Desert Online's tow

Thell 'Bo' Fowler 3 May 2, 2023
Supporting code for the paper "Optimized Homomorphic Evaluation of Boolean Functions" submitted to Eurocrypt 2024

This repository contains the code related to the paper Optimized Homomorphic Evaluation of Boolean Functions. The folder search_algorithm contains the

CryptoExperts 3 Oct 23, 2023