Crusty - polite && scalable broad web crawler


Introduction

Broad web crawling is the activity of traversing the practically boundless web by starting from a set of locations (URLs) and following outgoing links. Usually it doesn't matter where you start, as long as the starting point has outgoing links to external domains.

It presents a unique set of challenges one must overcome to get a stable and scalable system. Crusty is an attempt to tackle some of those challenges and see what's out there, while having fun with Rust ;)

This particular implementation could be used to quickly fetch a subset of the observable internet and, for example, discover the most popular domains/links.

Built on top of crusty-core, which handles all low-level aspects of web crawling.

Key features

  • Configurability && extensibility

    see a typical config file, which comes with explanations of the available options

  • Fast single node performance

    Crusty is written in Rust on top of green threads running on tokio, so it can achieve quite impressive single-node performance even on a moderate PC

    Additional optimizations could further improve this, mostly better HTML parsing: some tasks do not require full DOM parsing, but this implementation always builds a full DOM, mostly for the sake of extensibility and configurability.

    Crusty has a small, stable and predictable memory footprint and is usually CPU/network bound. There is no GC pressure and no war over memory.

  • Scalability

    Each Crusty node is essentially an independent unit, and we can run hundreds of them in parallel (on different machines, of course). The tricky part is job delegation and domain discovery, which is solved by a high-performance sharded, queue-like structure built on top of clickhouse (huh!).

    One might think "clickhouse? wtf?!", but this DB is so darn fast (while providing rich querying capabilities, indexing and filtering) that it seems like a good fit.

    The idea is basically a huge sharded table where each domain (actually, a derivative of the IP it was resolved to) belongs to some shard (crc32(addr) % number_of_shards). Each Crusty instance can then read from a unique subset of all those shards while being able to write to all of them (so-called domain discovery); a minimal sketch of this shard mapping follows after this feature list.

    On moderate installations (roughly under 16 nodes) such a system is viable as is, although taking this to a mega-scale would probably require a dynamic shard manager...

    There is an additional challenge of deduplicating domain discovery in multi-node setups - right now we dedup both locally and in clickhouse (AggregatingMergeTree), but the more nodes we add, the less efficient local deduplication becomes.

    In big setups a dedicated dedup layer might be required; alternatively, one might push a fair share of the deduplication work onto clickhouse by ensuring there are enough shards and enough clickhouse instances to satisfy the desired performance.

  • Basic politeness

    While we can crawl thousands of domains in parallel, we should absolutely limit concurrency at the per-domain level to avoid stressing the crawled sites, see job_reader.default_crawler_settings.concurrency. Moreover, testing shows that A LOT of totally different domains can live on the same physical IP... so we never try to fetch more than job_reader.domain_top_n domains from the same IP (a small sketch of this throttling idea also follows after this list).

    It's also a good practice to introduce delays between visiting pages, see job_reader.default_crawler_settings.delay.

    robots.txt is supported!

  • Observability

    Crusty uses tracing and stores multiple metrics in clickhouse, which we can observe with grafana - giving real-time insight into crawling performance.

[example Grafana dashboard screenshot]
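
Below is a minimal sketch of the shard mapping described under Scalability. It is not Crusty's actual code: the string-based key derivation and the crc32fast dependency are assumptions, and in practice the key is some derivative of the IP the domain resolved to.

```rust
// Illustrative sketch of crc32(addr) % number_of_shards (assumes the crc32fast crate).
use std::net::IpAddr;

fn shard_for(addr: &IpAddr, number_of_shards: u32) -> u32 {
    // Key the domain by (a derivative of) the IP it resolved to.
    let addr_key = addr.to_string();
    let mut hasher = crc32fast::Hasher::new();
    hasher.update(addr_key.as_bytes());
    hasher.finalize() % number_of_shards
}

fn main() {
    let addr: IpAddr = "93.184.216.34".parse().unwrap();
    // A node reads only from its own subset of shards, but writes newly
    // discovered domains to any shard - that is how domain discovery propagates.
    println!("shard = {}", shard_for(&addr, 16));
}
```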
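
And a small sketch of the politeness idea: a per-domain semaphore caps concurrency and a delay is kept between page visits. This is illustrative only, not Crusty's implementation; in Crusty these limits come from job_reader.default_crawler_settings.concurrency and job_reader.default_crawler_settings.delay in the config file.

```rust
// Illustrative per-domain throttling sketch (assumes the tokio crate).
use std::{sync::Arc, time::Duration};
use tokio::{sync::Semaphore, time::sleep};

#[tokio::main]
async fn main() {
    let per_domain_concurrency = 2;     // cf. job_reader.default_crawler_settings.concurrency
    let delay = Duration::from_secs(1); // cf. job_reader.default_crawler_settings.delay
    let permits = Arc::new(Semaphore::new(per_domain_concurrency));

    let urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"];
    let mut tasks = Vec::new();
    for url in urls {
        let permits = Arc::clone(&permits);
        tasks.push(tokio::spawn(async move {
            // At most `per_domain_concurrency` fetches of this domain run at once.
            let _permit = permits.acquire().await.unwrap();
            println!("fetching {url}"); // a real fetch would go here
            sleep(delay).await;         // be polite: pause before freeing the slot
        }));
    }
    for t in tasks {
        t.await.unwrap();
    }
}
```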

Getting started

  • before you start

install docker && docker-compose, follow instructions at

https://docs.docker.com/get-docker/

https://docs.docker.com/compose/install/

  • play with it
git clone https://github.com/let4be/crusty
cd crusty
# might take some time
docker-compose build
# seeds can be ANY url, or even several (separated by a comma); example.com works too, it just has only one external link ;)
CRUSTY_SEEDS=https://example.com docker-compose up -d

additionally

  • study the config file and adapt it to your needs; there are sensible defaults for a 100mbit channel, so if you have more/less bandwidth/CPU you might need to adjust concurrency_profile

  • to stop a background run and retain crawling data: docker-compose down

  • to run attached and see live logs from all containers (abort with ctrl+c): CRUSTY_SEEDS=https://example.com docker-compose up

  • to see running containers: docker ps (there should be 3 - crusty-grafana, crusty-clickhouse and crusty)

  • to see logs: docker logs crusty


if you decide to build manually via cargo build, remember - a release build is a lot faster (the default is debug)

In a real-world usage scenario on a high-bandwidth channel docker might become a bit too expensive, so it might be a good idea either to run Crusty directly on the host or at least in network_mode = host

External service dependencies - clickhouse and grafana

just use docker-compose, it's the recommended way to play with Crusty

however...

to create / clean the db, use this sql (it must be fed to the clickhouse client in the context of the clickhouse docker container)

the grafana dashboard is exported as a json model

Development

  • make sure rustup is installed: https://rustup.rs/

  • make sure pre-commit is installed: https://pre-commit.com/

  • run ./go setup

  • run ./go check to run all pre-commit hooks and ensure everything is ready to go for git

  • run ./go release minor to release the next minor version to crates.io

Contributing

I'm open to discussions and contributions - use github issues; pull requests are welcome.

Issues
  • Provide example configurations for various setups

    This could better explain what the primary scaling points are; a config for AWS's t5.metal would be quite different from one for a micro instance.

    opened by let4be 0
  • Migrate job management system to Redis

    While the current "queue-like system" on top of clickhouse worked quite well for testing, it's nowhere near as good as required for any serious high-volume use.

    Recently I did some testing on beefy AWS hardware and fixed some internal bottlenecks (not yet merged). In testing scenarios where I could temporarily alleviate the last remaining bottleneck - job distribution (writing new / updating completed / selecting jobs) - Crusty was capable of doing over 900MiB/sec, a whopping 7+ gbit/sec, on a 48-core (96 logical) c5.metal with a 25gbit/s port.

    The new job queue should be solely redis-based, using redis modules: https://redis.io/topics/modules-intro. Rust has a good enough library for writing redis module logic: https://github.com/RedisLabsModules/redismodule-rs

    We will use a pre-sharded queue (based on addr_key).

    Atomic operations:

    1. Enqueue jobs
    2. Dequeue jobs
    3. Finish jobs

    Using the correct underlying data types (mostly sets, plus a bloom filter for history) together with batching and pipelining, we can get solid throughput, low CPU usage per redis node, and decent reliability and scalability. Careful expiration could help avoid memory overflow on a redis node - we always discover domains faster than we can process them. (A rough sketch of this queue layout appears after the issue list.)

    enhancement 
    opened by let4be 0
  • Add WARC support

    seems like https://github.com/jedireza/warc could help

    enhancement 
    opened by let4be 0
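
Purely to illustrate the queue layout sketched in the Redis issue above (pre-sharded sets, pipelined batch writes, an atomic dequeue), here is a rough sketch using the plain redis crate. The issue itself proposes a redis module via redismodule-rs; the key names, the Lua script and the omission of the bloom-filter history are all assumptions, not an agreed design.

```rust
// Rough, hypothetical sketch of the pre-sharded queue idea (assumes the redis crate).
use redis::Commands;

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;

    let shard = 7; // crc32(addr_key) % number_of_shards, same idea as the clickhouse scheme
    let pending = format!("q:{shard}:pending");
    let in_flight = format!("q:{shard}:in_flight");

    // 1. Enqueue a batch of discovered domains in a single pipelined round trip.
    let _: () = redis::pipe()
        .sadd(&pending, "example.com").ignore()
        .sadd(&pending, "example.org").ignore()
        .query(&mut con)?;

    // 2. Dequeue atomically: pop a job from `pending` and record it as in-flight.
    let dequeue = redis::Script::new(
        r#"
        local job = redis.call('SPOP', KEYS[1])
        if job then redis.call('SADD', KEYS[2], job) end
        return job
        "#,
    );
    let job: Option<String> = dequeue.key(&pending).key(&in_flight).invoke(&mut con)?;

    // 3. Finish: drop the job from the in-flight set (history/bloom-filter
    //    bookkeeping and the expiration mentioned in the issue are omitted here).
    if let Some(job) = job {
        let _: () = con.srem(&in_flight, &job)?;
    }
    Ok(())
}
```
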
Owner
Sergey F.
Rust enthusiast. Rust is truly the best thing to happen to software development since the first appearance of higher-level languages. Rust is the next step.