Spider ported to Node.js

Overview

spider-rs

The spider project ported to Node.js

Getting Started

  1. npm i @spider-rs/spider-rs --save

Configure a website, crawl it, and export the collected page data.

import { Website, pageTitle } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr")
  .withHeaders({
    authorization: "somerandomjwt",
  })
  .withBudget({
    "*": 20, // crawl at most 20 pages across the whole website
    "/docs": 10, // crawl at most 10 pages under the `/docs` path
  })
  .withBlacklistUrl(["/resume"]) // paths to ignore; entries may be regexes or plain string patterns
  .build();

// optional: page event handler
const onPageEvent = (_err, page) => {
  const title = pageTitle(page); // parsing the title has a cost; remove it and its uses below if the title is not needed
  console.info(`Title of ${page.url} is '${title}'`);
  website.pushData({
    status: page.statusCode,
    html: page.content,
    url: page.url,
    title,
  });
};

await website.crawl(onPageEvent);
await website.exportJsonlData("./storage/rsseau.jsonl");
console.log(website.getLinks());
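
The exported file can be read back for later processing. The following is a minimal sketch, assuming the file holds one JSON object per line (the JSON Lines convention) and that each object carries the same fields passed to pushData above. The PageRecord interface and the parseJsonl helper are hypothetical names for illustration and are not part of @spider-rs/spider-rs.

```typescript
// Hypothetical record shape: mirrors the object passed to pushData above.
// This is an assumption, not a type exported by the library.
interface PageRecord {
  status: number;
  html: string;
  url: string;
  title: string;
}

// Parse JSONL text: one JSON document per non-empty line.
function parseJsonl(text: string): PageRecord[] {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as PageRecord);
}

// Made-up sample data standing in for the contents of an exported file.
const sample =
  '{"status":200,"html":"<html></html>","url":"https://rsseau.fr","title":"Home"}\n' +
  '{"status":404,"html":"","url":"https://rsseau.fr/missing","title":""}\n';

const records = parseJsonl(sample);
console.log(records.map((record) => `${record.status} ${record.url}`));
```

In practice the text would come from the exported file, for example via readFile("./storage/rsseau.jsonl", "utf8") from node:fs/promises.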

Collect the resources for a website.

import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr")
  .withBudget({
    "*": 20,
    "/docs": 10,
  })
  // you can use regex or string matches to ignore paths
  .withBlacklistUrl(["/resume"])
  .build();

await website.scrape();
console.log(website.getPages());
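
One way to think about the budget map is as a set of counters: "*" counts every page, and a path key such as "/docs" counts only pages under that path. The sketch below is a mental model only, assuming simple prefix matching; BudgetTracker and tryVisit are hypothetical names, and the library's actual matching rules may differ.

```typescript
type Budget = Record<string, number>;

// Illustrative model of a crawl budget; not the library's implementation.
class BudgetTracker {
  private readonly budget: Budget;
  private readonly used = new Map<string, number>();

  constructor(budget: Budget) {
    this.budget = budget;
  }

  // Returns true and records the visit if every applicable limit still has room.
  tryVisit(path: string): boolean {
    // "*" applies to every path; other keys apply by prefix (an assumption).
    const limits = Object.entries(this.budget).filter(
      ([key]) => key === "*" || path.startsWith(key),
    );
    if (limits.some(([key, limit]) => (this.used.get(key) ?? 0) >= limit)) {
      return false;
    }
    for (const [key] of limits) {
      this.used.set(key, (this.used.get(key) ?? 0) + 1);
    }
    return true;
  }
}

const tracker = new BudgetTracker({ "*": 3, "/docs": 1 });
const decisions = ["/docs/a", "/docs/b", "/", "/about", "/x"].map((path) =>
  tracker.tryVisit(path),
);
console.log(decisions);
```

Under this model the second `/docs` page is rejected by the `/docs` limit, and the last page is rejected because the site-wide limit is already used up.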

Run a crawl in the background on another thread.

import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr");

const onPageEvent = (_err, page) => {
  console.log(page);
};

// the second param runs the crawl in the background, so this call returns immediately
await website.crawl(onPageEvent, true);

Use headless Chrome rendering for crawls.

import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr").withChromeIntercept(true, true);

const onPageEvent = (_err, page) => {
  console.log(page);
};

// the third param determines headless chrome usage.
await website.crawl(onPageEvent, false, true);
console.log(website.getLinks());

Schedule recurring crawls with a cron expression.

import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://choosealicense.com").withCron(
  "1/5 * * * * *",
);
// test helper: stops the cron job after `time` milliseconds
const stopCron = (time: number, handle) => {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve(handle.stop());
    }, time);
  });
};

const links = [];

const onPageEvent = (err, value) => {
  links.push(value);
};

const handle = await website.runCron(onPageEvent);

// stop the cron in 4 seconds
await stopCron(4000, handle);
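
The expression "1/5 * * * * *" has six fields. Assuming the first field is seconds, as in cron dialects that include a seconds field, "1/5" would match second 1 and every fifth second after it. The sketch below expands a single field under that assumption; expandField is a hypothetical helper, not part of the library, and it does not handle comma-separated lists or named values.

```typescript
// Expand one cron field ("*", "N", "A-B", "N/S", "*/S", "A-B/S") into the values it matches.
function expandField(field: string, min: number, max: number): number[] {
  const [range, stepText] = field.split("/");
  const step = stepText === undefined ? 1 : Number(stepText);
  if (!Number.isInteger(step) || step < 1) {
    throw new Error(`invalid step in "${field}"`);
  }

  let start = min;
  let end = max;
  if (range !== "*") {
    const [a, b] = range.split("-").map(Number);
    start = a;
    // "A-B" ends at B; a bare "N" is a single value unless a step follows,
    // in which case it runs to the end of the allowed range.
    end = b !== undefined ? b : stepText === undefined ? a : max;
  }
  if (!Number.isInteger(start) || !Number.isInteger(end) || start < min || end > max || start > end) {
    throw new Error(`invalid range in "${field}"`);
  }

  const values: number[] = [];
  for (let value = start; value <= end; value += step) {
    values.push(value);
  }
  return values;
}

const seconds = expandField("1/5", 0, 59);
console.log(seconds);
```

Read this way, the schedule fires twelve times per minute, which is why the test helper above can observe several runs within four seconds.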

Use the crawl shortcut to get the links and page content for a URL in one call.

import { crawl } from "@spider-rs/spider-rs";

const { links, pages } = await crawl("https://rsseau.fr");
console.log(pages);

Benchmarks

View the benchmarks to see a breakdown across libraries and platforms.

Development

Install the napi CLI with npm i @napi-rs/cli --global, then build and run the tests:

  1. yarn build:test

Releases: v0.0.41
Owner: spider-rs. Your friendly neighborhood spiderbot.