A native Rust port of Google's robots.txt parser and matcher C++ library.

Overview

robotstxt

  • Native Rust port with no third-party crate dependencies
  • Zero unsafe code
  • Preserves all behavior of the original library
  • API consistent with the original library
  • Passes 100% of Google's original test suite

Installation

[dependencies]
robotstxt = "0.3.0"

Quick start

use robotstxt::DefaultMatcher;

let mut matcher = DefaultMatcher::default();
let robots_body = "user-agent: FooBot\n\
                   disallow: /\n";
assert_eq!(false, matcher.one_agent_allowed_by_robots(robots_body, "FooBot", "https://foo.com/"));
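
For contrast, a rule set that only blocks part of a site reports other paths as allowed. Below is a minimal sketch using the same one_agent_allowed_by_robots call shown above; the rules and URLs are illustrative, not taken from the crate's documentation:

use robotstxt::DefaultMatcher;

// Illustrative rules: FooBot is blocked from /private/ only.
let robots_body = "user-agent: FooBot\n\
                   disallow: /private/\n";
let mut matcher = DefaultMatcher::default();
assert_eq!(true, matcher.one_agent_allowed_by_robots(robots_body, "FooBot", "https://foo.com/public/index.html"));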

About

Quoting the README from Google's robots.txt parser and matcher repo:

The Robots Exclusion Protocol (REP) is a standard that enables website owners to control which URLs may be accessed by automated clients (i.e. crawlers) through a simple text file with a specific syntax. It's one of the basic building blocks of the internet as we know it and what allows search engines to operate.

Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.

The library is slightly modified (i.e. some internal headers and equivalent symbols) production code used by Googlebot, Google's crawler, to determine which URLs it may access based on rules provided by webmasters in robots.txt files. The library is released open-source to help developers build tools that better reflect Google's robots.txt parsing and matching.

Crate robotstxt aims to be a faithful conversion, from C++ to Rust, of Google's robots.txt parser and matcher.

Testing

$ git clone https://github.com/Folyd/robotstxt
Cloning into 'robotstxt'...
$ cd robotstxt/tests 
...
$ mkdir c-build && cd c-build
...
$ cmake ..
...
$ make
...
$ make test
Running tests...
Test project ~/robotstxt/tests/c-build
    Start 1: robots-test
1/1 Test #1: robots-test ......................   Passed    0.33 sec
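
The transcript above runs Google's original C++ test suite (robots-test) from the crate's tests directory. The crate's own Rust unit tests can be run with cargo test; below is a minimal sketch of such a test, written against only the DefaultMatcher API shown in the quick start (the test name is made up for illustration):

use robotstxt::DefaultMatcher;

#[test]
fn foobot_is_blocked_from_root() {
    // Same illustrative rules as the quick start: everything is disallowed for FooBot.
    let robots_body = "user-agent: FooBot\n\
                       disallow: /\n";
    let mut matcher = DefaultMatcher::default();
    assert_eq!(false, matcher.one_agent_allowed_by_robots(robots_body, "FooBot", "https://foo.com/"));
}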

License

The robotstxt parser and matcher Rust library is licensed under the terms of the Apache License 2.0. See LICENSE for more information.


Comments
  • crashes for deviantart robots.txt

    When parsing https://www.deviantart.com/robots.txt

    User-agent: *
    Disallow: /*q=
    Disallow: /users/*?
    Disallow: /join/*?
    Disallow: /morelikethis/
    Disallow: /download/
    Disallow: /checkout/
    Disallow: /global/
    Disallow: /api/
    Disallow: /critiques/
     
    Sitemap: http://sitemaps.deviantart.net/sitemap-index.xml.gz
    

    the parser fails with

    thread 'main' panicked at 'assertion failed: !val.is_empty()', /home/me/.local/share/cargo/registry/src/github.com-1ecc6299db9ec823/robotstxt-0.2.0/src/parser.rs:207:17
    

    Reproduction:

    use robotstxt::DefaultMatcher;
    
    fn main() {
        let robots_content = r#"User-agent: *
    Disallow: /*q=
    Disallow: /users/*?
    Disallow: /join/*?
    Disallow: /morelikethis/
    Disallow: /download/
    Disallow: /checkout/
    Disallow: /global/
    Disallow: /api/
    Disallow: /critiques/
     
    Sitemap: http://sitemaps.deviantart.net/sitemap-index.xml.gz"#;
        let mut matcher = DefaultMatcher::default();
        matcher.one_agent_allowed_by_robots(&robots_content, "oldnews", "https://www.deviantart.com/");
    }
    

    I'm assuming it is because of the line between the Disallows and the Sitemap, which only contains a single space.

    opened by iyzana 2
  • panics when robots_txt is not valid

    Fetching https://install.pivpn.io/robots.txt redirects to https://raw.githubusercontent.com/pivpn/pivpn/master/auto_install/install.sh, which is a text/plain shell script.

    The parser panics when trying to parse that file.

    thread 'main' panicked at 'assertion failed: self.type_ == ParseKeyType::Unknown && !self.key_text.is_empty()', /home/me/.local/share/cargo/git/checkouts/robotstxt-269483cb38f6894f/ffe972d/src/parser.rs:88:9
    

    I'm not entirely sure what the correct behavior should be, but simply ignoring unparsable lines seems like a good option.

    opened by iyzana 1
  • Panics when slicing within char boundaries

    The parser can panic if it tries to slice a string at a byte index that is not a char boundary (i.e., inside a multi-byte character).

    thread '<unnamed>' panicked at 'byte index 58198 is not a char boundary; it is inside '\u{a0}'
    ...
    /root/.cargo/registry/src/github.com-1ecc6299db9ec823/robotstxt-0.3.0/src/parser.rs:169:57
    
    opened by samuelfekete 0
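
    For context on the report above: in Rust, slicing a &str at a byte index that falls inside a multi-byte character always panics; this crash class is independent of the crate's parsing logic. A minimal hypothetical illustration (the input string is made up, not taken from the report), using the same '\u{a0}' non-breaking space that appears in the panic message:

    fn main() {
        // "Disallow:" occupies bytes 0..9; '\u{a0}' is encoded as two bytes (9..11).
        let line = "Disallow:\u{a0}/private/";
        // Byte index 10 falls inside '\u{a0}', so this slice panics with
        // "byte index 10 is not a char boundary".
        let _ = &line[..10];
    }
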
  • Document test dependencies

    While following README.md and trying to build and run the official Google tests, the make stage fails with this error:

    /usr/bin/ld: cannot find -labsl::container
    collect2: error: ld returned 1 exit status

    What should I install in order to fix this and run the tests?

    uname -a 5.8.0-53-generic #60-Ubuntu SMP Thu May 6 07:46:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

    opened by let4be 3