Crusty - polite && scalable broad web crawler

Introduction

Broad web crawling is the activity of traversing the practically boundless web by starting from a set of locations (URLs) and following outgoing links. It usually does not matter where you start, as long as the starting point has outgoing links to external domains.

It presents a unique set of challenges one must overcome to get a stable and scalable system; Crusty is an attempt to tackle some of those challenges and see what's out there, while having fun with Rust ;)

This particular implementation can be used to quickly fetch a subset of the observable internet and, for example, discover the most popular domains/links.

Built on top of crusty-core, which handles all low-level aspects of web crawling.

Key features

  • Configurability && extensibility

    see the typical config file for an explanation of the available options

  • Fast single node performance

    Crusty is written in Rust on top of green threads running on tokio, so it can achieve quite impressive single-node performance even on a moderate PC

    Additional optimizations could improve this further, mostly around HTML parsing: some tasks do not require full DOM parsing, but this implementation parses the full DOM mainly for the sake of extensibility and configurability.

    Crusty has a small, stable, and predictable memory footprint and is usually CPU/network bound. There is no GC pressure and no war over memory.

  • Scalability

    Each Crusty node is essentially an independent unit; hundreds can run in parallel (on different machines, of course). The tricky parts are job delegation and domain discovery, which are solved by a high-performance sharded, queue-like structure built on top of clickhouse (huh!).

    One might think "clickhouse? wtf?!", but this DB is so darn fast (while providing rich querying, indexing, and filtering capabilities) that it seems like a good fit.

    The idea is basically a huge sharded table where each domain (actually, a derivative of the IP it was resolved to) belongs to some shard (crc32(addr) % number_of_shards). Each Crusty instance then reads from a unique subset of all those shards while being able to write to all of them (so-called domain discovery); a minimal sketch of this shard mapping is included right after this feature list.

    On moderate installations (roughly fewer than 16 nodes) such a system is viable as is, although taking this to mega-scale might require a dynamic shard manager...

    There is an additional challenge of domain-discovery deduplication in multi-node setups: right now we dedup both locally and in clickhouse (AggregatingMergeTree), but the more nodes we add, the less effective local deduplication becomes.

    In big setups a dedicated dedup layer might be required; alternatively, one might push more of the deduplication work onto clickhouse by ensuring there are enough shards and enough clickhouse instances to satisfy the desired performance.

  • Basic politeness

    While we can crawl thousands of domains in parallel, we should absolutely limit concurrency at the per-domain level to avoid stressing crawled sites; see job_reader.default_crawler_settings.concurrency. Moreover, testing shows that A LOT of totally different domains can live on the same physical IP, so we never try to fetch more than job_reader.domain_top_n domains from the same IP (a politeness sketch follows the feature list as well).

    It is also good practice to introduce delays between page visits; see job_reader.default_crawler_settings.delay.

    robots.txt is supported!

  • Observability

    Crusty uses tracing and stores multiple metrics in clickhouse, which can be observed with grafana, giving real-time insight into crawling performance.

(example dashboard screenshot)
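For illustration, here is a minimal sketch of the shard mapping described under Scalability. This is not Crusty's actual code: it assumes the crc32fast crate, the names shard_for / owned_by are made up for the example, and the real implementation derives its key from the resolved address in its own way.

use std::net::IpAddr;

// Hypothetical sketch: map a resolved address onto crc32(addr) % number_of_shards.
fn shard_for(addr: &IpAddr, number_of_shards: u32) -> u32 {
    let bytes = match addr {
        IpAddr::V4(v4) => v4.octets().to_vec(),
        IpAddr::V6(v6) => v6.octets().to_vec(),
    };
    crc32fast::hash(&bytes) % number_of_shards
}

// Each node reads only from the shard range it owns (shard_min .. shard_max),
// but a newly discovered domain may be written into any shard.
fn owned_by(shard: u32, shard_min: u32, shard_max: u32) -> bool {
    (shard_min..shard_max).contains(&shard)
}

fn main() {
    let addr: IpAddr = "93.184.216.34".parse().unwrap();
    let shard = shard_for(&addr, 1024);
    println!("{addr} -> shard {shard}, owned by this node: {}", owned_by(shard, 0, 256));
}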
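Similarly, below is a minimal sketch of the per-domain politeness idea: a concurrency cap plus a delay between page visits. It is not Crusty's actual implementation - the concurrency and delay parameters merely stand in for job_reader.default_crawler_settings.concurrency and .delay, crawl_domain / fetch_page are placeholder names, the tokio crate with its full feature set is assumed, and per-IP grouping (job_reader.domain_top_n) is not shown.

use std::{sync::Arc, time::Duration};
use tokio::{sync::Semaphore, time::sleep};

// Hypothetical sketch: cap concurrent requests to a single domain with a semaphore
// and space page visits with a fixed delay.
async fn crawl_domain(urls: Vec<String>, concurrency: usize, delay: Duration) {
    let permits = Arc::new(Semaphore::new(concurrency));
    let mut handles = Vec::new();
    for url in urls {
        let permits = Arc::clone(&permits);
        handles.push(tokio::spawn(async move {
            let _permit = permits.acquire_owned().await.expect("semaphore closed");
            fetch_page(&url).await;
            sleep(delay).await; // stay polite: hold the permit until the delay has passed
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}

async fn fetch_page(url: &str) {
    // placeholder for the actual HTTP fetch
    println!("fetched {url}");
}

#[tokio::main]
async fn main() {
    let urls = vec!["https://example.com/".to_string(), "https://example.com/about".to_string()];
    crawl_domain(urls, 2, Duration::from_millis(500)).await;
}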

Getting started

  • before you start

install docker && docker-compose, following the instructions at:

https://docs.docker.com/get-docker/

https://docs.docker.com/compose/install/

  • play with it
git clone https://github.com/let4be/crusty
cd crusty
# might take some time
docker-compose build
# you can use ANY seed, or even several (separated by commas); example.com works too, it just has one external link ;)
CRUSTY_SEEDS=https://example.com docker-compose up -d

additionally

  • study the config file and adapt it to your needs; there are sensible defaults for a 100 Mbit channel, but if you have more/less bandwidth/CPU you may need to adjust concurrency_profile

  • to stop a background run and retain crawling data: docker-compose down

  • to run && attach and see live logs from all containers (abort with Ctrl+C): CRUSTY_SEEDS=https://example.com docker-compose up

  • to see running containers: docker ps (there should be 3: crusty-grafana, crusty-clickhouse, and crusty)

  • to see logs: docker logs crusty


if you decide to build manually via cargo build, remember that a release build (cargo build --release) is a lot faster, and the default is debug

In a real-world usage scenario on a high-bandwidth channel docker might become a bit too expensive, so it might be a good idea to either run directly on the host or at least use network_mode: host

External service dependencies - clickhouse and grafana

just use docker-compose - it's the recommended way to play with Crusty

however...

to create / clean the DB, use this SQL (it must be fed to the clickhouse client in the context of the clickhouse docker container)

the grafana dashboard is exported as a JSON model

Development

  • make sure rustup is installed: https://rustup.rs/

  • make sure pre-commit is installed: https://pre-commit.com/

  • run ./go setup

  • run ./go check to run all pre-commit hooks and ensure everything is ready to go for git

  • run ./go release minor to release the next minor version to crates.io

Contributing

I'm open to discussions and contributions - use GitHub issues.

Pull requests are welcome.

Comments
  • Attaching a database

    Hello, let4be

    First of all I want to say it is really impressive what you have built, I am really amazed, so congratulations. Furthermore, I see that you wrote in the README file that one could attach a graph database to save the crawled data, but I can't quite understand how to do it and how would it fit in the dataflow, because I understand that crusty already saves the crawled data in some database.

    I am interested in broad crawling, particularly with Rust, because I've been working on a peer2peer search engine, and thus I need a low-resource broad crawler. I have a (untidy) Python prototype which I would like to convert to Rust.

    I would greatly appreciate if you could help me with this, so I could solve this problem for the search engine project. Thank you very much in advance. Kind regards.

    opened by rtrevinnoc 2
  • Investigate how we track errors and what is considered an error in grafana dashboard

    Right now it does not make any sense whatsoever... blocked by

    https://github.com/let4be/crusty-core/issues/9 https://github.com/let4be/crusty-core/issues/8

    bug 
    opened by let4be 2
  • Implement a first approximation of PageRank for Domains

    Right now this broad crawler is completely empty; I think it would be cool if we had something to show off ;) A good candidate for such a task could be PageRank...

    Now, calculating URL PageRank is a whole mega-task in its own right; a proper implementation (one that scales) could take months because of the requirements on throughput, memory, speed, and scalability. Such a system most likely needs a sophisticated URL -> ID mapping.

    On the other hand, we could easily calculate Domain PageRank ad hoc:

    1. collect all outbound domains for a given Job
    2. convert Job's Domain into a Second Level Domain (super-blog.tumblr.com -> tumblr.com)
    3. convert all outbound domains into unique Second Level Domains as well
    4. store all this in RedisGraph (this will work because there is only a very limited number of second-level domains and RedisGraph uses sparse matrices)

    https://oss.redislabs.com/redisgraph/ https://github.com/RedisGraph/RedisGraph/issues/398

    Depending on the underlying hardware results may vary. However, inserting a new relationship is done in O(1). RedisGraph is able to create over 1 million nodes under half a second and form 500K relations within 0.3 of a second.

    RedisGraph has PageRank built-in

    enhancement 
    opened by let4be 2
  • Review how we access DNS resolved addresses

    Right now we do not precisely control which address hyper will use when connecting; we assume it's the first one (with concurrency restrictions applied accordingly, which may backfire).

    enhancement 
    opened by let4be 1
  • Improve robots.txt support

    We should probably always try to download robots.txt before we access the index page... and if it resolves with a 4xx or 5xx code we should act accordingly and follow Google's best practices https://developers.google.com/search/docs/advanced/robots/robots_txt

    right now we download / && /robots.txt in parallel, and external links from / will most likely be added to the queue (internal ones will not)

    enhancement 
    opened by let4be 1
  • Lolhtml check if we can remove elements and if this saves some cpu cycles

    Right now it does some "wasteful serialization" which we just throw away, yet the lib is so damn fast it doesn't matter...

    ideally we would like to completely disable the HTML rewriting functionality, but I don't think that's currently possible, see https://github.com/cloudflare/lol-html/issues/91

    enhancement Low prio 
    opened by let4be 1
  • Investigate possible deadlock/hanging of clickhouse writer

    Sometimes when testing on AWS c5.metal I see that writers "hang", which leads to a pile of unprocessed messages in buffers (particularly for metrics_task, the heaviest clickhouse hitter)

    bug 
    opened by let4be 1
  • Evaluate docker overlay network performance influence on high volume setups

    It could be cool to keep the network inside a docker overlay, but I'm concerned about performance, especially on high-end setups.

    It should probably be the default configuration anyway, just to ensure everyone can try Crusty regardless of which ports are open on the system...

    opened by let4be 1
  • Implement faster HTML parsing

    As soon as crusty-core fully supports custom HTML processing I'd like to experiment a bit and find a faster way to extract links (and probably some metadata) from HTML.

    we don't need anything complex when doing broad web crawling, so it should be possible to speed this up a lot (right now we do full DOM parsing)

    Extracting links/title/meta should be easy to do with a simple tokenizer, like in https://docs.rs/html5ever/0.25.1/html5ever/tokenizer/index.html

    enhancement 
    opened by let4be 1
  • Queue sharding support

    It's partially here, but we still need to add:

    • routing to proper shard based on addr_key
    • spawn a green thread for each owned shard (shard_min .. shard_max)
    • glue it all together and test
    opened by let4be 1
  • Glitchy buffers panel in grafana dashboard

    It uses dynamic pulling of all available buffers - it displays labels incorrectly and cannot calculate max (it outputs trillions); it's either a grafana bug or I did something wrong :\

    bug 
    opened by let4be 0
  • Concurrency auto-tuning

    Figure out a way to auto-tune domain concurrency (there is a near-optimal N based on the available CPU and network bandwidth). This will need some kind of graceful adaptive algorithm that looks at metrics (tx/rx, error rates) and determines N.

    enhancement Low prio 
    opened by let4be 0