rbdt is a Python library (written in Rust) for parsing robots.txt files for large-scale batch processing.

Overview

rbdt

🚨 🚨 🚨 🚨

rbdt is a work in progress. It is currently being extracted from another (private) project so that it can be open sourced and engineered more carefully.

🚨 🚨 🚨 🚨

rbdt features:

  • MIT license, have fun.
  • Written in Rust, so it is fast.
  • Callable from Python, so it is useful.
  • Has been and continues to be run against millions of unique robots.txt files.
  • Forgiving: corrects some typical mistakes in hand-written files, such as recognizing that "dissallows" was probably meant to be "disallow".
  • Intentionally provides direct access to the parsed robots.txt representation (unlike Reppy or Google's parser).
  • Can compare which of two user agents the website owner has granted more privilege, both heuristically and logically.
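
As an illustration of the "forgiving" behavior described in the list above, here is a minimal Python sketch of fuzzy directive matching. It is not rbdt's actual implementation (which is written in Rust) and does not use rbdt's API. The names `KNOWN_DIRECTIVES`, `normalize_directive`, and `parse_line`, the directive list, and the 0.8 similarity cutoff are all assumptions made for this example.

```python
import difflib

# Directive names accepted by this sketch (an assumption, not rbdt's list).
KNOWN_DIRECTIVES = ["user-agent", "allow", "disallow", "sitemap", "crawl-delay"]


def normalize_directive(name, cutoff=0.8):
    """Map a possibly misspelled directive name to a known one, or return None."""
    key = name.strip().lower()
    if key in KNOWN_DIRECTIVES:
        return key
    # Fall back to the closest known directive above the similarity cutoff.
    matches = difflib.get_close_matches(key, KNOWN_DIRECTIVES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def parse_line(line):
    """Split one robots.txt line into (directive, value).

    Returns None for blank lines, comments, and unrecognized directives.
    """
    content = line.split("#", 1)[0].strip()  # drop trailing comments
    if ":" not in content:
        return None
    name, value = content.split(":", 1)
    directive = normalize_directive(name)
    if directive is None:
        return None
    return directive, value.strip()


if __name__ == "__main__":
    for raw in ["User-Agent: *", "Dissallow: /private  # keep out", "Foo: bar"]:
        print(repr(raw), "->", parse_line(raw))
```

A similarity cutoff keeps near misses such as "dissallow" while still rejecting unrelated keys. A production parser would probably prefer an explicit table of known misspellings, so that corrections stay predictable.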

rbdt anti-features:

  • rbdt isn't meant to be used as part of a web crawler, but as part of a large-scale analysis of robots.txt files. If it ends up being useful for web crawlers eventually, that's great, but incidental.

Development

maturin develop
python py_tests/tests.py

Releases

rbdt uses GitHub CI/CD to publish releases to PyPI. Tag a commit with the version and it will end up on PyPI.

Contributions

File a ticket or send a PR if you'd like.

To Do

  • Real Open Sourcing Hours
    • Changelog
    • Write documentation and put it somewhere
    • Branch protection for main: no direct writes, only PRs
    • Automated tests
  • Crawl-delay parsing and restructuring of the data representation.
  • Be able to detect whether a crawler can access a specific page.
  • More tests of all the various edge cases.
  • Benchmarks (maybe someday, maybe never).
  • Publish it as a Rust library as well (maybe).
  • Get Rust tests working (maybe).

Owner
Knuckleheads' Club