# spider-rs

The spider project ported to Node.js.
## Getting Started

```sh
npm i @spider-rs/spider-rs --save
```
```ts
import { Website, pageTitle } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr")
  .withHeaders({
    authorization: "somerandomjwt",
  })
  .withBudget({
    "*": 20, // limit the crawl to 20 pages for the entire website
    "/docs": 10, // limit the `/docs` path to 10 pages
  })
  .withBlacklistUrl(["/resume"]) // ignore paths by regex or pattern matching
  .build();

// optional: page event handler
const onPageEvent = (_err, page) => {
  const title = pageTitle(page); // comment out to increase performance if the title is not needed
  console.info(`Title of ${page.url} is '${title}'`);
  website.pushData({
    status: page.statusCode,
    html: page.content,
    url: page.url,
    title,
  });
};

await website.crawl(onPageEvent);
await website.exportJsonlData("./storage/rsseau.jsonl");
console.log(website.getLinks());
```
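The export is JSON Lines, so each record passed to `pushData` above should come back as one JSON object per line. A minimal sketch of reading the file with Node's built-in `fs/promises` (the field names match the objects pushed in the handler; this reader is not part of the spider-rs API):

```ts
import { readFile } from "node:fs/promises";

// read the JSONL export back in: one JSON object per line
const raw = await readFile("./storage/rsseau.jsonl", "utf8");
const records = raw
  .split("\n")
  .filter((line) => line.trim().length > 0)
  .map((line) => JSON.parse(line));

for (const record of records) {
  console.log(record.url, record.status);
}
```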
Collect the resources for a website.
```ts
import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr")
  .withBudget({
    "*": 20,
    "/docs": 10,
  })
  // you can use regex or string matches to ignore paths
  .withBlacklistUrl(["/resume"])
  .build();

await website.scrape();
console.log(website.getPages());
```
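A short sketch of iterating over the scraped pages, assuming each entry returned by `getPages()` exposes the same `url` and `content` fields as the page objects in the crawl example above:

```ts
for (const page of website.getPages()) {
  // log each resource URL and the size of its content
  console.log(page.url, page.content?.length ?? 0);
}
```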
Run the crawls in the background on another thread.
```ts
import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr");

const onPageEvent = (_err, page) => {
  console.log(page);
};

// passing `true` as the second param runs the crawl on a background thread,
// so the call returns immediately
await website.crawl(onPageEvent, true);
```
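Since the call returns right away, a script with no other work to do can hold itself open before reading results. A sketch with an arbitrary 10-second window (the timeout value is illustrative, not part of the API):

```ts
// give the background crawl a window of time before reading results
await new Promise((resolve) => setTimeout(resolve, 10_000));
console.log(website.getLinks());
```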
Use headless Chrome rendering for crawls.
```ts
import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://rsseau.fr").withChromeIntercept(true, true);

const onPageEvent = (_err, page) => {
  console.log(page);
};

// the third param determines headless Chrome usage
await website.crawl(onPageEvent, false, true);
console.log(website.getLinks());
```
Cron jobs can be done with the following.
```ts
import { Website } from "@spider-rs/spider-rs";

const website = new Website("https://choosealicense.com").withCron(
  "1/5 * * * * *",
);

// helper that stops the cron handle after `time` ms
const stopCron = (time: number, handle) => {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve(handle.stop());
    }, time);
  });
};

const links = [];

const onPageEvent = (err, value) => {
  links.push(value);
};

const handle = await website.runCron(onPageEvent);

// stop the cron in 4 seconds
await stopCron(4000, handle);
```
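Once the handle is stopped, whatever the scheduled crawls emitted is available in the `links` array populated by the handler above:

```ts
// pages gathered by the scheduled crawls were pushed into `links` by the handler
console.log(`collected ${links.length} pages from the scheduled crawls`);
```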
Use the `crawl` shortcut to get the page content and URL.
```ts
import { crawl } from "@spider-rs/spider-rs";

const { links, pages } = await crawl("https://rsseau.fr");

console.log(pages);
```
## Benchmarks

View the benchmarks to see a breakdown across libraries and platforms.
## Development

Install the napi cli with `npm i @napi-rs/cli --global`, then build and test with:

```sh
yarn build:test
```