# rbdt
rbdt is a work in progress, currently being extracted from another (private) project for the purposes of open sourcing and better software engineering.
rbdt is a Python library (written in Rust) for parsing robots.txt files in large-scale batch processing.
rbdt features:
- MIT license, have fun.
- Written in Rust, so it is fast.
- Callable from Python, so it is useful.
- Has been and continues to be run against millions of unique robots.txt files.
- Forgiving: corrects some typical mistakes in hand-written files, like recognizing that `dissallows` probably meant `disallow`.
- Intentionally provides direct access to the parsed robots.txt representation, unlike Reppy or Google's parser (see the sketch after this list).
- Ability to compare which user agent is granted more privilege by the website owner, both heuristically and logically.
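
To give a feel for the intended use, here is a minimal sketch of what batch analysis might look like. Apart from the module name, every name in it is an assumption for illustration: `parse`, `groups`, `user_agents`, `rules`, and `compare_privilege` are hypothetical, not the confirmed API.

```python
# Hypothetical usage sketch. parse(), .groups, .user_agents, .rules,
# and compare_privilege() are illustrative names, not rbdt's confirmed
# API; check the real documentation once it exists.
import rbdt

with open("robots.txt", "rb") as f:
    robots = rbdt.parse(f.read())  # assumed entry point

# Direct access to the parsed representation:
for group in robots.groups:
    print(group.user_agents, group.rules)

# Compare how much privilege two user agents are granted
# (heuristic comparison assumed here):
print(rbdt.compare_privilege(robots, "Googlebot", "GPTBot"))
```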
rbdt anti-features:
- rbdt isn't meant to be used as part of a web crawler, but as part of large-scale analysis of robots.txt files. If it ends up being useful for web crawlers eventually, that's great, but only incidental.
## Development

```
maturin develop
python py_tests/tests.py
```
## Releases

rbdt uses GitHub CI/CD to publish releases to PyPI. Tag the commit with the version and it will end up on PyPI.
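
For example, with a placeholder version number (the exact tag format the workflow expects isn't documented here, so this is an assumption):

```
git tag 0.2.0
git push origin 0.2.0
```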
## Contributions
File a ticket or send a PR if you'd like.
## To Do
- Real Open Sourcing Hours
- Changelog
- Write documentation and put it somewhere.
- Branch protection for main: no direct writes, only PRs.
- Automated tests.
- Crawl-delay parsing and restructuring of the data representation.
- Be able to detect whether a crawler can access a specific page.
- More tests of all the various edge cases.
- Benchmarks (maybe someday, maybe never).
- Publish it as a Rust library as well (maybe).
- Get Rust tests working (maybe).