A Rust library to extract useful data from HTML documents, suitable for web scraping.

Overview

select.rs


NOTE: The following example only works in the upcoming release of this library. Check out the 0.5.0 tag for the latest public release.


Examples

from examples/stackoverflow.rs

use select::document::Document;
use select::predicate::{Attr, Class, Name, Predicate};

pub fn main() {
    // stackoverflow.html was fetched from
    // http://stackoverflow.com/questions/tagged/rust?sort=votes&pageSize=50 on
    // Aug 10, 2015.
    let document = Document::from(include_str!("stackoverflow.html"));

    println!("# Menu");
    for node in document.find(Attr("id", "hmenus").descendant(Name("a"))) {
        println!("{} ({:?})", node.text(), node.attr("href").unwrap());
    }
    println!();

    println!("# Top 5 Questions");
    for node in document.find(Class("question-summary")).take(5) {
        let question = node.find(Class("question-hyperlink")).next().unwrap();
        let votes = node.find(Class("vote-count-post")).next().unwrap().text();
        let answers = node
            .find(Class("status").descendant(Name("strong")))
            .next()
            .unwrap()
            .text();
        let tags = node
            .find(Class("post-tag"))
            .map(|tag| tag.text())
            .collect::<Vec<_>>();
        let asked_on = node.find(Class("relativetime")).next().unwrap().text();
        let asker = node
            .find(Class("user-details").descendant(Name("a")))
            .next()
            .unwrap()
            .text();
        println!(" Question: {}", question.text());
        println!("  Answers: {}", answers);
        println!("    Votes: {}", votes);
        println!("   Tagged: {}", tags.join(", "));
        println!(" Asked on: {}", asked_on);
        println!("    Asker: {}", asker);
        println!(
            "Permalink: http://stackoverflow.com{}",
            question.attr("href").unwrap()
        );
        println!();
    }

    println!("# Top 10 Related Tags");
    for node in document
        .find(Attr("id", "h-related-tags"))
        .next()
        .unwrap()
        .parent()
        .unwrap()
        .find(Name("div"))
        .take(10)
    {
        let tag = node.find(Name("a")).next().unwrap().text();
        let count = node
            .find(Class("item-multiplier-count"))
            .next()
            .unwrap()
            .text();
        println!("{} ({})", tag, count);
    }
}

prints

# Menu
Questions ("/questions")
Tags ("/tags")
Users ("/users")
Badges ("/help/badges")
Unanswered ("/unanswered")
Ask Question ("/questions/ask")

# Top 5 Questions
 Question: Applications and libraries written in Rust [closed]
  Answers: 8
    Votes: 67
   Tagged: rust
 Asked on: Feb 19 '12 at 14:39
    Asker: Atom
Permalink: http://stackoverflow.com/questions/9350125/applications-and-libraries-written-in-rust

 Question: How to debug Rust programs? [closed]
  Answers: 6
    Votes: 52
   Tagged: rust
 Asked on: Apr 8 '13 at 5:30
    Asker: macropas
Permalink: http://stackoverflow.com/questions/15871885/how-to-debug-rust-programs

 Question: How to access command line parameters?
  Answers: 9
    Votes: 51
   Tagged: rust
 Asked on: Mar 25 '13 at 15:59
    Asker: shutefan
Permalink: http://stackoverflow.com/questions/15619320/how-to-access-command-line-parameters

 Question: Why are explicit lifetimes needed in Rust?
  Answers: 6
    Votes: 48
   Tagged: pointers, rust, static-analysis, lifetime
 Asked on: Jul 24 at 11:15
    Asker: jco
Permalink: http://stackoverflow.com/questions/31609137/why-are-explicit-lifetimes-needed-in-rust

 Question: What is the difference between traits in Rust and typeclasses in Haskell?
  Answers: 1
    Votes: 46
   Tagged: haskell, rust
 Asked on: Jan 24 at 7:50
    Asker: LogicChains
Permalink: http://stackoverflow.com/questions/28123453/what-is-the-difference-between-traits-in-rust-and-typeclasses-in-haskell

# Top 10 Related Tags
lifetime (165)
traits (83)
rust-cargo (79)
string (76)
ffi (62)
iterator (58)
multithreading (50)
generics (50)
arrays (49)
borrow-checker (47)

License

MIT

Comments
  • Add more CSS selectors as predicates.

    Implements #40.

    Changes:

    • Adds Node::classes which returns an iterator, and Node::id for convenience.
    • Documents which CSS selectors are equivalent to which predicates.
    • Renames Attr(s, ()) to HasAttr (breaking change; can be removed if desired).
    • Adds Nothing predicate, the opposite of Any.
    • Adds Id predicate, which calls Node::id.
    • Adds Root predicate, that ensures that the node has no parents.
    • Adds Empty predicate, which allows comment children to be compatible with CSS.
    • Adds AttrMatches predicate for convenience; it takes a string for an attribute and a function to test the string.
    • Adds AttrContains, AttrStartsWith, and AttrEndsWith which call the relevant string functions.
    • Adds AttrContainsWord and AttrLang, which are kind of esoteric but are equivalent to the [attr~=word] and [attr|=lang] selectors.

    Once all of the CSS selectors are added we can potentially allow parsing &str -> Box<Predicate>.
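The `[attr~=word]` and `[attr|=lang]` selectors mentioned in the list above have simple string semantics. The following is a minimal standard-library sketch of those semantics only, not the implementation from this pull request; the function names `attr_contains_word` and `attr_lang` are hypothetical, and matching is assumed to be case-sensitive.

```rust
/// `[attr~=word]`: true when `value`, split on ASCII whitespace, contains
/// `word` as one whole item. An empty word, or a word that itself contains
/// whitespace, never matches.
pub fn attr_contains_word(value: &str, word: &str) -> bool {
    if word.is_empty() || word.chars().any(|c| c.is_ascii_whitespace()) {
        return false;
    }
    value.split_ascii_whitespace().any(|item| item == word)
}

/// `[attr|=lang]`: true when `value` is exactly `lang`, or starts with
/// `lang` immediately followed by a hyphen (for example `en` and `en-US`).
pub fn attr_lang(value: &str, lang: &str) -> bool {
    match value.strip_prefix(lang) {
        Some(rest) => rest.is_empty() || rest.starts_with('-'),
        None => false,
    }
}
```

Under these rules `attr_contains_word("btn btn-primary", "btn")` holds, while `attr_contains_word("btn-primary", "btn")` does not, because `~=` compares whole whitespace-separated items rather than substrings.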

    opened by clarfonthey 13
  • Implement From<String> for Document

    Currently, a Document cannot be created from a String, which is what you have after reading a Hyper response body into a String. With From<String> implemented, this is now possible.

    Example of new allowed behaviour:

    let client = hyper::Client::new();
    let mut response = try!(client.get(&request_url).send());
    let mut body = String::new();
    try!(response.read_to_string(&mut body));
    
    let dom = Document::from(body);
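The general pattern behind this change can be sketched with a stand-in type. `Doc` below is hypothetical and only stores its source text; it is not select's `Document`, which parses the input. The sketch shows how providing both `From<&str>` and `From<String>` lets callers pass either a borrowed or an owned string.

```rust
/// Hypothetical stand-in for a parsed document; it only keeps the source text.
pub struct Doc {
    pub source: String,
}

impl From<String> for Doc {
    // Takes ownership of the string, so no copy is needed.
    fn from(source: String) -> Doc {
        Doc { source }
    }
}

impl From<&str> for Doc {
    // Copies the borrowed text, then reuses the owned conversion.
    fn from(source: &str) -> Doc {
        Doc::from(source.to_owned())
    }
}
```

With both impls in place, `Doc::from("<p>hi</p>")` and `Doc::from(String::from("<p>hi</p>"))` both compile, which is the convenience the pull request describes.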
    
    opened by aleksanb 13
  • Add Selection::last() for symmetry with first()

    Getting the last element leads to some non-trivial lifetime wrangling when implemented using iterators over a Selection, so adding a last() method helps avoid those difficulties.

    Fixes #17

    opened by porglezomp 7
  • Are you still developing this?

    There haven't been any commits in over a year, and I'm wondering whether @utkarshkukreti is still developing this. @utkarshkukreti, if you're reading this, do you plan to keep developing and maintaining this library?

    opened by svenstaro 4
  • iterate over attributes?

    Is it possible to access attributes using select.rs without knowing their names beforehand? If not, could an interface for iterating over (attr name, attr value) pairs be added?

    opened by sp3d 4
  • Updated to html5ever 0.18

    Hi @utkarshkukreti

    We want to show your excellent crate in rust-cookbook (https://github.com/brson/rust-cookbook/pull/183), but due to a bug in our documentation test driver we need all of the intermediate dependencies to be at the same level.

    Long story (really!) made short: we would need select to be updated to the latest html5ever, released today.

    If there is anything wrong with the PR I will gladly fix it!

    Best Regards

    opened by budziq 4
  • Update the html5ever dependency

    This requires using the newly split-off RcDom implementation from html5ever.

    Since RcDom is now clearly marked as unsuitable for production use, this raises the question: should select.rs use a production-ready DOM implementation instead?

    opened by nuxeh 3
  • Example from README isn't compilable

    It looks like the example is outdated, because I couldn't compile it:

    src/main.rs:16:52: 16:62 error: no method named `descendant` found for type `select::predicate::Attr<&str, &str>` in the current scope
    src/main.rs:16     for node in document.find(Attr("id", "hmenus").descendant(Name("a"))) {
                                                                      ^~~~~~~~~~
    src/main.rs:22:58: 22:62 error: no method named `take` found for type `select::selection::Selection<'_>` in the current scope
    src/main.rs:22     for node in document.find(Class("question-summary")).take(5) {
                                                                            ^~~~
    src/main.rs:22:58: 22:62 note: the method `take` exists but the following trait bounds were not satisfied: `select::selection::Selection<'_> : std::iter::Iterator`
    src/main.rs:25:49: 25:59 error: no method named `descendant` found for type `select::predicate::Class<&str>` in the current scope
    src/main.rs:25         let answers = node.find(Class("status").descendant(Name("strong")))
                                                                   ^~~~~~~~~~
    src/main.rs:31:53: 31:63 error: no method named `descendant` found for type `select::predicate::Class<&str>` in the current scope
    src/main.rs:31         let asker = node.find(Class("user-details").descendant(Name("a")))
                                                                       ^~~~~~~~~~
    src/main.rs:49:10: 49:16 error: no method named `unwrap` found for type `select::selection::Selection<'_>` in the current scope
    src/main.rs:49         .unwrap()
                            ^~~~~~
    error: aborting due to 5 previous errors
    
    $ cargo --version
    cargo 0.12.0-nightly (6b98d1f 2016-07-04)
    $ rustc --version
    rustc 1.11.0 (9b21dcd6a 2016-08-15)
    
    opened by php-coder 3
  • Implement `Debug` for `Find`

    As mentioned in https://github.com/utkarshkukreti/select.rs/issues/61, I find myself in need of a Debug implementation for Find.

    The only member of Find that does not necessarily implement Debug is predicate (which may be a closure), so I left it out from the Debug impl. This is in line with Rust's Filter (and other iterators).
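The approach described above, a manual `Debug` impl that skips a field which may be a closure, can be sketched with the standard library alone. `Filtered` is a hypothetical stand-in for an iterator adaptor that stores a predicate; it is not select's `Find`, and its field names are assumptions.

```rust
use std::fmt;

/// Hypothetical iterator adaptor that stores a predicate, much like
/// `std::iter::Filter` does.
pub struct Filtered<I, P> {
    pub iter: I,
    pub predicate: P,
}

impl<I: Iterator, P: FnMut(&I::Item) -> bool> Iterator for Filtered<I, P> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        // Advance the inner iterator until the predicate accepts an item.
        while let Some(item) = self.iter.next() {
            if (self.predicate)(&item) {
                return Some(item);
            }
        }
        None
    }
}

// Only `I` has to be `Debug`. `P` is left unconstrained, so closures
// (which do not implement `Debug`) are still allowed; the field is
// simply omitted from the output.
impl<I: fmt::Debug, P> fmt::Debug for Filtered<I, P> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("Filtered").field("iter", &self.iter).finish()
    }
}
```

Formatting a `Filtered` built from the range `0..3` and any closure with `{:?}` shows only the `iter` field, which matches the trade-off described in the pull request.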

    opened by phimuemue 2
  • Panic due to assertion failed: c.is_some() in html5ever-0.18.0/src/tokenizer/mod.rs:555:9

    The following code causes a panic:

    use std::fs::File;
    use std::io::prelude::*;
    
    #[macro_use]
    extern crate error_chain;
    
    use select::document::Document;
    
    error_chain! {
        foreign_links {
            Reqwest(reqwest::Error);
            Std(std::io::Error);
        }
    }
    
    fn main() -> Result<()> {
        // The commented-out lines below download the file.
        // let html = reqwest::get("http://sampsonsheriff.com/")?.text()?;
        // let mut file = File::create("src/panic.txt")?;
        // file.write_all(html.as_bytes())?;
    
    
        let document = Document::from(include_str!("panic.txt"));
    
        Ok(())
    }
    

    Here is the stack trace:

    thread 'main' panicked at 'assertion failed: c.is_some()', /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.18.0/src/tokenizer/mod.rs:555:9
    stack backtrace:
       0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
                 at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:39
       1: std::sys_common::backtrace::_print
                 at src/libstd/sys_common/backtrace.rs:70
       2: std::panicking::default_hook::{{closure}}
                 at src/libstd/sys_common/backtrace.rs:58
                 at src/libstd/panicking.rs:200
       3: std::panicking::default_hook
                 at src/libstd/panicking.rs:215
       4: std::panicking::rust_panic_with_hook
                 at src/libstd/panicking.rs:478
       5: std::panicking::begin_panic
                 at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libstd/panicking.rs:412
       6: <html5ever::tokenizer::Tokenizer<Sink>>::discard_char
                 at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/select-0.4.2/<::std::macros::panic macros>:3
       7: <html5ever::tokenizer::Tokenizer<Sink>>::step
                 at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.18.0/src/tokenizer/mod.rs:570
       8: <html5ever::tokenizer::Tokenizer<Sink>>::run
                 at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.18.0/src/tokenizer/mod.rs:362
       9: <html5ever::tokenizer::Tokenizer<Sink>>::feed
                 at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.18.0/src/tokenizer/mod.rs:220
      10: <html5ever::driver::Parser<Sink> as tendril::stream::TendrilSink<tendril::fmt::UTF8>>::process
                 at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.18.0/src/driver.rs:88
      11: tendril::stream::TendrilSink::one
                 at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/tendril-0.3.1/src/stream.rs:47
      12: <select::document::Document as core::convert::From<tendril::tendril::Tendril<tendril::fmt::UTF8>>>::from
                 at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/select-0.4.2/src/document.rs:53
      13: <select::document::Document as core::convert::From<&'a str>>::from
                 at /home/abdullah/.cargo/registry/src/github.com-1ecc6299db9ec823/select-0.4.2/src/document.rs:133
      14: reqwest_test::main
                 at src/main.rs:26
      15: std::rt::lang_start::{{closure}}
                 at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libstd/rt.rs:64
      16: std::panicking::try::do_call
                 at src/libstd/rt.rs:49
                 at src/libstd/panicking.rs:297
      17: __rust_maybe_catch_panic
                 at src/libpanic_unwind/lib.rs:92
      18: std::rt::lang_start_internal
                 at src/libstd/panicking.rs:276
                 at src/libstd/panic.rs:388
                 at src/libstd/rt.rs:48
      19: std::rt::lang_start
                 at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libstd/rt.rs:64
      20: main
      21: __libc_start_main
      22: _start
    

    I have attached the file in case the website changes: panic.txt

    opened by Alvenix 2
  • Upgraded test framework to version compatible with latest nightlies.

    Added an example of how to get just the plain text from a webpage (for machine learning, for example, you sometimes just want the plain text).

    If there's a neater way to do it, I'd love to see it. I couldn't find a dedicated root-node function on Document. Did I miss one? If not, it would be nice to expose it as a dedicated function rather than nth(0).

    opened by gilescope 2
  • transitive dependency on `time` with CVE

    https://github.com/time-rs/time/issues/293

    https://github.com/utkarshkukreti/select.rs/blob/master/Cargo.toml#L14

    https://github.com/servo/html5ever/blob/master/xml5ever/Cargo.toml

    It appears that bumping the html5ever dependency would resolve it, but I haven't checked this myself.

    opened by blakehawkins 0
  • Provide a way to recover from a potential stack overflows

    I'm using select in a broad web crawler, and everything works fine until it doesn't. The web is full of surprising input that nobody expects and that can crash almost anything, simply because of the sheer amount of data it contains. Occasionally select goes into deep recursion that causes a stack overflow, even though I have the thread stack size set to 32 MB. This might be a bug in select that can be fixed, but I cannot even log it, because it seems I need to run the crawler under gdb to catch this kind of failure.

    What I'm looking for is a way to limit the recursion depth in select's code so that it cannot overflow the stack. AFAIK, stack overflows are impossible to recover from in Rust and lead to a forced program exit.
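One general way to bound depth is to replace recursion with an explicit, heap-allocated stack and to return an error once a depth limit is exceeded. The sketch below illustrates that technique on a hypothetical tree type; `Node`, `TooDeep`, and `collect_text` are assumptions made for illustration and are not part of select or html5ever.

```rust
/// Hypothetical tree node, standing in for a parsed HTML node.
pub struct Node {
    pub text: String,
    pub children: Vec<Node>,
}

/// Error returned when a node lies deeper than the allowed limit.
#[derive(Debug, PartialEq)]
pub struct TooDeep {
    pub limit: usize,
}

/// Concatenates the text of all nodes in document order without recursion.
/// The work list lives on the heap, so deep input cannot exhaust the call
/// stack, and any node deeper than `max_depth` yields an error instead.
pub fn collect_text(root: &Node, max_depth: usize) -> Result<String, TooDeep> {
    let mut out = String::new();
    let mut stack = vec![(root, 0usize)];
    while let Some((node, depth)) = stack.pop() {
        if depth > max_depth {
            return Err(TooDeep { limit: max_depth });
        }
        out.push_str(&node.text);
        // Push children in reverse so the first child is popped first.
        for child in node.children.iter().rev() {
            stack.push((child, depth + 1));
        }
    }
    Ok(out)
}
```

Because the failure is an ordinary `Result`, a crawler could log the offending page and move on, which is not possible once the process has already overflowed its stack.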

    opened by let4be 0
  • How to use `next()` from find()

    Hi,

    What is next() doing in the code below? Is there documentation on how to use next()? Thanks!

        for node in document.find(Class("question-summary")).take(5) {
            let question = node.find(Class("question-hyperlink")).next().unwrap();
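In the README example, the result of `find` is used as an iterator (it is chained with `take` and `map`), so `next()` there is the standard `Iterator::next`: it returns the first remaining item wrapped in `Some`, or `None` when nothing matches, and `unwrap()` then panics on `None`. A minimal standard-library sketch of the same pattern, using a hypothetical helper `first_long_word`:

```rust
/// Returns the first whitespace-separated word longer than `min_len`.
/// `filter` builds a lazy iterator; `next()` pulls just its first item,
/// giving `Some(word)` on a match and `None` when nothing matches.
pub fn first_long_word(text: &str, min_len: usize) -> Option<&str> {
    text.split_whitespace()
        .filter(|word| word.len() > min_len)
        .next()
}
```

Handling the `Option` with `if let Some(word) = first_long_word(text, 2)` or with `?` avoids the panic that `unwrap()` would cause when no item is found.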
    
    opened by jihuun 1
  • Document::from get `STATUS_STACK_OVERFLOW`

    Hi, I've encountered STATUS_STACK_OVERFLOW when loading a big XML file, about 341 KB, on Windows 7 64-bit with VS2013.

    fn main() {
        use select::document::Document;
        use select::predicate::{Name, Predicate};
    
        let document = Document::from(
            include_str!("8144940")
        );
        println!("Hello, world!");
    }
    

    8144940.zip

    opened by lynnux 1
Owner
Utkarsh Kukreti
An async no_std HTTP server suitable for bare-metal environments, heavily inspired by axum

picoserve An async no_std HTTP server suitable for bare-metal environments, heavily inspired by axum. It was designed with embassy on the Raspberry Pi

Samuel Hicks 81 Oct 7, 2023
The simplest build-time framework for writing web apps with html templates and typescript

Encoped A build-time fast af tool to write static apps with html and TypeScript Features Template-based ESLint, Prettier and Rollup integration No ext

null 1 Dec 11, 2021
A html document syntax and operation library written in Rust, use APIs similar to jQuery.

Visdom A server-side html document syntax and operation library written in Rust, it uses apis similar to jQuery, left off the parts those only worked

轩子 80 Dec 21, 2022
Actix Web is a powerful, pragmatic, and extremely fast web framework for Rust.

Actix 16.2k Jan 2, 2023
Generate html/js/css with rust

null 79 Sep 29, 2022
Hot-reload static web server for deploying multiple static web sites with version control.

SPA-SERVER It is to provide a static web http server with cache and hot reload. 中文 README Feature Built with Hyper and Warp, fast and small! SSL with

null 7 Dec 18, 2022
Code template for a production Web Application using Axum: The AwesomeApp Blueprint for Professional Web Development.

AwesomeApp rust-web-app More info at: https://awesomeapp.dev/rust-web-app/ rust-web-app YouTube episodes: Episode 01 - Rust Web App - Course to Produc

null 45 Sep 6, 2023
A highly customizable, full scale web backend for web-rwkv, built on axum with websocket protocol.

web-rwkv-axum An axum web backend for web-rwkv, built on websocket. Supports BNF-constrained grammar, CFG sampling, etc., all streamed over network. St

Li Junyu 12 Sep 25, 2023
Scraper - HTML parsing and querying with CSS selectors

scraper HTML parsing and querying with CSS selectors. scraper is on Crates.io and GitHub. Scraper provides an interface to Servo's html5ever and selec

june 1.2k Dec 30, 2022
jq, but for HTML

hq jq, but for HTML. hq reads HTML and converts it into a JSON object based on a series of CSS selectors. The selectors are expressed in a similar way

Tom Forbes 511 Jan 5, 2023
A Google-like web search engine that provides the user with the most relevant websites in accordance to his/her query, using crawled and indexed textual data and PageRank.

Mini Google Course project for the Architecture of Computer Systems course. Overview: Architecture: We are working on multiple components of the web c

Max 11 Aug 10, 2022
📝 Web-based, reactive Datalog notebooks for data analysis and visualization

Percival is a declarative data query and visualization language. It provides a reactive, web-based notebook environment for exploring complex datasets, producing interactive graphics, and sharing results.

Eric Zhang 486 Dec 28, 2022
Noria: data-flow for high-performance web applications

Noria: data-flow for high-performance web applications Noria is a new streaming data-flow system designed to act as a fast storage backend for read-he

MIT PDOS 4.5k Dec 28, 2022
Silkenweb - A library for writing reactive single page web apps

Silkenweb A library for building reactive single page web apps. Features Fine grained reactivity using signals to minimize DOM API calls No VDOM. Call

null 85 Dec 26, 2022
living library; snapgene for the web

Viviteca The living library; SnapGene for the web. Setup To contribute to this project you will require: rust pre-commit Git workflow You should insta

Maximilien Rothier Bautzer 1 Feb 5, 2022
axum-serde is a library that provides multiple serde-based extractors and responders for the Axum web framework.

axum-serde Overview axum-serde is a library that provides multiple serde-based extractors / responses for the Axum web framework. It also offers a

GengTeng 3 Dec 12, 2023
A Rust web framework

cargonauts - a Rust web framework Documentation cargonauts is a Rust web framework intended for building maintainable, well-factored web apps. This pr

null 179 Dec 25, 2022
A rust web framework with safety and speed in mind.

darpi A web api framework with speed and safety in mind. One of the big goals is to catch all errors at compile time, if possible. The framework uses

null 32 Apr 11, 2022
A web framework for Rust.

Rocket Rocket is an async web framework for Rust with a focus on usability, security, extensibility, and speed. #[macro_use] extern crate rocket; #[g

Sergio Benitez 19.4k Jan 4, 2023