Scraper - HTML parsing and querying with CSS selectors

Overview

scraper

Join the chat at https://gitter.im/scraper-rs/community

HTML parsing and querying with CSS selectors.

scraper is on Crates.io and GitHub.

Scraper provides an interface to Servo's html5ever and selectors crates, for browser-grade parsing and querying.

Examples

Parsing a document

use scraper::Html;

let html = r#"
    <!DOCTYPE html>
    <meta charset="utf-8">
    <title>Hello, world!</title>
    <h1 class="foo">Hello, <i>world!</i></h1>
"#;

let document = Html::parse_document(html);

Parsing a fragment

use scraper::Html;
let fragment = Html::parse_fragment("<h1>Hello, <i>world!</i></h1>");

Parsing a selector

use scraper::Selector;
let selector = Selector::parse("h1.foo").unwrap();

Selecting elements

use scraper::{Html, Selector};

let html = r#"
    <ul>
        <li>Foo</li>
        <li>Bar</li>
        <li>Baz</li>
    </ul>
"#;

let fragment = Html::parse_fragment(html);
let selector = Selector::parse("li").unwrap();

for element in fragment.select(&selector) {
    assert_eq!("li", element.value().name());
}

Selecting descendent elements

use scraper::{Html, Selector};

let html = r#"
    <ul>
        <li>Foo</li>
        <li>Bar</li>
        <li>Baz</li>
    </ul>
"#;

let fragment = Html::parse_fragment(html);
let ul_selector = Selector::parse("ul").unwrap();
let li_selector = Selector::parse("li").unwrap();

let ul = fragment.select(&ul_selector).next().unwrap();
for element in ul.select(&li_selector) {
    assert_eq!("li", element.value().name());
}

Accessing element attributes

use scraper::{Html, Selector};

let fragment = Html::parse_fragment(r#"<input name="foo" value="bar">"#);
let selector = Selector::parse(r#"input[name="foo"]"#).unwrap();

let input = fragment.select(&selector).next().unwrap();
assert_eq!(Some("bar"), input.value().attr("value"));

Serializing HTML and inner HTML

use scraper::{Html, Selector};

let fragment = Html::parse_fragment("<h1>Hello, <i>world!</i></h1>");
let selector = Selector::parse("h1").unwrap();

let h1 = fragment.select(&selector).next().unwrap();

assert_eq!("<h1>Hello, <i>world!</i></h1>", h1.html());
assert_eq!("Hello, <i>world!</i>", h1.inner_html());

Accessing descendent text

use scraper::{Html, Selector};

let fragment = Html::parse_fragment("<h1>Hello, <i>world!</i></h1>");
let selector = Selector::parse("h1").unwrap();

let h1 = fragment.select(&selector).next().unwrap();
let text = h1.text().collect::<Vec<_>>();

assert_eq!(vec!["Hello, ", "world!"], text);

Contributing

Please feel free to open pull requests. If you're planning on implementing something big (i.e. not fixing a typo, a small bug fix, minor refactor, etc.), then please open an issue first.

License: ISC

    Comments
    • Looking for maintainer(s)

      I haven't been actively using or developing this project for quite a while now and haven't had time to respond to new issues and PRs. I would like to find someone more qualified than me to continue maintenance and/or development of this project which seems to be quite popular. If anyone would like to help out, please comment here or send me an email.

      C-help-wanted 
      opened by causal-agent 23
    • Cut a new release

      Clicked the enter key too early (oops)!

      cc @cfvescovo we should try to cut a new release (probably v0.14) at some point (we don't have many changes, but as they are stable it makes sense to push them to crates.io in my view).

      I would make a release, but I don't have permissions to publish on crates.io (I think)

      opened by teymour-aldridge 9
    • A way to get innerHTML or innerText?

      I haven't found any function like this in the Documentation. When dealing with <br> elements, it would be really handy if there was a way to just get the innerText or at least innerHTML of a NodeRef.

      opened by luxalpa 9
    • Selector doesn't work with newline after

      document.select with Selector::parse does not work when there's a newline directly after the tag.

      Code:

      let a_sel = scraper::Selector::parse("a").unwrap();
      for el in document.select(&a_sel) {
          //...
      }
      

      HTML example that triggers this:

      <a
                                  href="...")"
      

      When printing these affected elements:

      Element(<a\n href="\\\"/...
      

      Other elements in the query that are of the form Element(<a href="\\\"/... don't trigger this problem. Happy for a workaround in the meanwhile.

      opened by David-OConnor 8
    • Error using scraper in spawned process via Tokio runtime

      Environment:

      • Linux
      • Rust 1.60.0-nightly
      • Scraper 0.12.0

      Problem:

      When calling rt.spawn(my_job::run(...)); I receive these compile errors:

      generator cannot be sent between threads safely
      within ego_tree::Node<scraper::node::Node>, the trait Sync is not implemented for Cell<NonZeroUsize>
      rustc mod.rs(381, 25): required by a bound in Runtime::spawn

      generator cannot be sent between threads safely
      within ego_tree::Node<scraper::node::Node>, the trait Sync is not implemented for UnsafeCell<tendril::tendril::Buffer>
      rustc mod.rs(381, 25): required by a bound in Runtime::spawn

      generator cannot be sent between threads safely
      within ego_tree::Node<scraper::node::Node>, the trait Sync is not implemented for *mut tendril::fmt::UTF8
      rustc mod.rs(381, 25): required by a bound in Runtime::spawn

      generator cannot be sent between threads safely
      within ego_tree::Node<scraper::node::Node>, the trait Sync is not implemented for Cell<tendril::tendril::PackedUsize>
      rustc mod.rs(381, 25): required by a bound in Runtime::spawn

      UPDATE: The error only appears when I run this code within my_job.rs.

      download::run(...).await.unwrap();

      If I remove that line everything compiles. Can someone explain why?

      Code:

      main.rs

      use tokio::runtime::Runtime;
      ...
      let mut rt = Runtime::new().unwrap();
      rt.spawn(my_job::run(...))
      

      my_job.rs

      let title_selector = scraper::Selector::parse("title").unwrap();
      let title = document.select(&title_selector).next().unwrap().inner_html();
      download::run("", "./Downloads", "").await.unwrap();
      

      Question:

      It appears scraper might not be thread-safe according to the Rust compiler. Are there any workarounds so that I can still use the scraper crate?

      opened by Randall-Coding 7
    • How do I remove certain tags/nodes before selecting text?

      Sometimes a few tags need to be removed before selecting an element's text, for example:

      use scraper::{Html, Selector};

      fn main() {
          let selector = Selector::parse("body").unwrap();
          let html = r#"
          <!DOCTYPE html>
         <body>
         Hello World
         <script type="application/json" data-selector="settings-json">
         {"test":"json"}
         </script>
         </body>
      "#;
          // Parse the whole document and collect every text node under <body>.
          let document = Html::parse_document(html);
          let body = document.select(&selector).next().unwrap();
          let text = body.text().collect::<Vec<_>>();
          println!("{:?}", text);
      }
      

      Output

      ["\n Hello World\n ", "\n {\"test\":\"json\"}\n ", "\n \n"]

      The output includes the contents of the script tag. Is there any way to remove it?

      opened by naveenann 6
    • Selector accidentally edited HTML

      I'm writing a robot to fetch cn.etherscan.com's token data.

      On their site, the transfers section shows: 939,005

      [image]

      However, the following code gives me a different result:

          let transfers_selector = Selector::parse(
              ".card .card-body #ContentPlaceHolder1_trNoOfTxns #totaltxns",
          )
          .unwrap();
      
          if let Some(overview) =
              fragment.select(&overview_selector).next()
          {
              dbg!(&overview
                  .select(&transfers_selector)
                  .next()
                  .unwrap()
                  .html());
          }
      

      [image]

      opened by GopherJ 5
    • An Error Type for `Selector::parse`

      I might add different error types if needed, but for now the error type in the Result returned by Selector::parse is a custom type that is exported. This PR resolves #60. Improvements to the code are welcome.

      opened by Kiwifuit 4
    • Selector::parse transforms &gt; etc. into > etc.

      The following code produces the following prints.

          let data = "<html><body><p>Foo bar &lt;&gt; Cake ! Heh &le;</p></body></html>";
          let html = Html::parse_document(&data);
          let selector = Selector::parse("p").unwrap();
          println!("{:?}", data);
          println!("{:?}", html.select(&selector).next());
      
      "<html><body><p>Foo bar &lt;&gt; Cake ! Heh &le;</p></body></html>"
      Some(ElementRef { node: NodeRef { id: NodeId(5), tree: Tree { vec: [Node { parent: None, prev_sibling: None, next_sibling: None, children: Some((NodeId(2), NodeId(2))), value: Document }, Node { parent: Some(NodeId(1)), prev_sibling: None, next_sibling: None, children: Some((NodeId(3), NodeId(4))), value: Element(<html>) }, Node { parent: Some(NodeId(2)), prev_sibling: None, next_sibling: Some(NodeId(4)), children: None, value: Element(<head>) }, Node { parent: Some(NodeId(2)), prev_sibling: Some(NodeId(3)), next_sibling: None, children: Some((NodeId(5), NodeId(5))), value: Element(<body>) }, Node { parent: Some(NodeId(4)), prev_sibling: None, next_sibling: None, children: Some((NodeId(6), NodeId(6))), value: Element(<p>) }, Node { parent: Some(NodeId(5)), prev_sibling: None, next_sibling: None, children: None, value: Text("Foo bar <> Cake ! Heh ≤") }] }, node: Node { parent: Some(NodeId(4)), prev_sibling: None, next_sibling: None, children: Some((NodeId(6), NodeId(6))), value: Element(<p>) } } })
      

      The issue is this part: Text("Foo bar <> Cake ! Heh ≤"). It should be Text("Foo bar &lt;&gt; Cake ! Heh &le;").

      Is this a bug?

      opened by Temeez 4
    • How to parse HTML thread-safely?

      I loaded an HTML page and parsed it with scraper, but got an error.

          pub async fn comic_introduction(&self, id: i32) -> Result<()> {
              let body = self
                  .agent
                  .get(&format!("{}/galleryblock/{}.html", LTN_BASE_URL, id))
                  .send()
                  .await?
                  .error_for_status()?
                  .text()
                  .await?;
              let fragment = Html::parse_fragment(&body);
              let title_selector = Selector::parse("h1.lillie>a")?;
              let title_elements = fragment.select(&title_selector);
              let title = if let Some(item) = title_elements.last() {
                  item.inner_html()
              } else {
                  return Err(anyhow::Error::msg("not found title"));
              };
              Ok(())
          }
      
      Selector::parse("h1.lillie>a")?;
                              ^ `Rc<std::string::String>` cannot be shared between threads safely
      
      opened by niuhuan 4
    • Bump html5ever for removed time methods

      Bug Info

      The current version of html5ever uses a since-removed method (precise_time_ns):

      • https://github.com/servo/html5ever/pull/453
      • https://github.com/time-rs/time/blob/main/CHANGELOG.md#removed

        Removed

        v0.1 APIs, previously behind an enabled-by-default feature flag ...

        • precise_time_ns
      error[E0425]: cannot find function `precise_time_ns` in crate `time`
         --> ...../cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.5.4/src/macros.rs:27:26
          |
      27  |         let t0 = ::time::precise_time_ns();
          |                          ^^^^^^^^^^^^^^^ not found in `time`
          |
         ::: ...../.cargo/registry/src/github.com-1ecc6299db9ec823/html5ever-0.5.4/src/tokenizer/mod.rs:230:27
          |
      230 |             let (_, dt) = time!(self.sink.process_token(token));
          |                           ------------------------------------- in this macro invocation
          |
          = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
      

      edit: looks like

      html5ever = "0.25.1"
      

      is currently still the latest published release of html5ever :\

      opened by bigpick 4
    • Performance improvements

      I have tried to improve parsing performance by replacing the standard hash function with fnv and by replacing LocalNames with strings (in node.rs), as suggested in #45. @teymour-aldridge, are there any other changes we could make to improve parsing speed? Could someone who needs to parse many or very large HTML documents run some benchmarks and report back?

      opened by cfvescovo 1
    • Dependency `selectors` is not on the latest version

      Current version used: 0.22. Latest version: 0.23.

      It seems that just bumping the version won't be a breaking change. It is also good practice to re-export the selectors crate so that users don't have to guess which version they should be using 😊

      wontfix 
      opened by Elrendio 4
    • API idea: selector!() macro for Selector literals

      Inspired by this blog post, which describes some API struggles with writing a Rust backend program. One of the points is that Selector::parse("...").unwrap() is not ergonomic when the selectors are constants and you are writing a lot of them.

      How about a selector!() macro to define a selector from a string, which removes the boilerplate for this case? There is precedent for this in the form of vec![] provided by the language and e.g. the regex!() macro provided by the regex crate.

      C-enhancement 
      opened by rvolgers 9
    • method select should be implemented as a trait

      Both Html and ElementRef support the same select method. Implementing that method as a trait would make it possible to write a function that accepts both and calls select generically, which does not currently seem possible.

      My current workaround is to call root_element() on Html to get its ElementRef, which fortunately works even for an HTML fragment without the <html> tag, although its description does not seem to indicate so.

      C-enhancement 
      opened by joseluis 1
    • HashSet<LocalName> for classes is slow

      LocalName::new called for something that isn't in https://github.com/servo/html5ever/blob/master/markup5ever/local_names.txt locks a global Mutex. As LocalName::new is called for every class in Element::new, and most classes are unlikely to be in that list, this means that multithreading is much less of a win for HTML parsing than it should be. While it's a breaking change and probably not the best approach, I've switched to using Strings locally, which gives me about a 10% performance improvement. My program's performance is still dominated by HashSet construction, though. It might be faster to intern the entire HashSet<String>s per-document.

      opened by leo60228 5
    • Ability to get ElementRef of parent of ElementRef

      Please add the ability to get an ElementRef of an ElementRef's parent (and any ancestor/sibling), so that one can call .html() (and other ElementRef methods) on the parent :)

      opened by Boscop 3
    Releases(v0.14.0)
    • v0.14.0(Dec 19, 2022)

      What's Changed

      • Update dependencies by @teymour-aldridge in https://github.com/causal-agent/scraper/pull/81
      • Add a test for tags with newline. by @teymour-aldridge in https://github.com/causal-agent/scraper/pull/82
      • Implement serializer for Html by @TonalidadeHidrica in https://github.com/causal-agent/scraper/pull/86
      • refactor: Make selectors field private by @volsa in https://github.com/causal-agent/scraper/pull/87
      • implement DoubleEndedIterator for Select by @arctic-penguin in https://github.com/causal-agent/scraper/pull/96
      • An Error Type for Selector::parse by @Kiwifuit in https://github.com/causal-agent/scraper/pull/95

      New Contributors

      • @TonalidadeHidrica made their first contribution in https://github.com/causal-agent/scraper/pull/86
      • @volsa made their first contribution in https://github.com/causal-agent/scraper/pull/87
      • @arctic-penguin made their first contribution in https://github.com/causal-agent/scraper/pull/96
      • @Kiwifuit made their first contribution in https://github.com/causal-agent/scraper/pull/95

      Full Changelog: https://github.com/causal-agent/scraper/compare/v0.13.0...v0.14.0

    • v0.13.0(Apr 24, 2022)

      What's Changed

      • feat: add support for order-keeping attributes by @mainrs in https://github.com/causal-agent/scraper/pull/55
      • docs: add notice for deterministic attribute (de)serialization by @mainrs in https://github.com/causal-agent/scraper/pull/59
      • Add Github Actions CI workflow. by @teymour-aldridge in https://github.com/causal-agent/scraper/pull/63
      • Add a Gitter chat badge to README.md by @gitter-badger in https://github.com/causal-agent/scraper/pull/66
      • Fix Github actions workflow by @cfvescovo in https://github.com/causal-agent/scraper/pull/67
      • Add contributing guidelines. by @teymour-aldridge in https://github.com/causal-agent/scraper/pull/68
      • Update dependencies by @cfvescovo in https://github.com/causal-agent/scraper/pull/69

      New Contributors

      • @mainrs made their first contribution in https://github.com/causal-agent/scraper/pull/55
      • @teymour-aldridge made their first contribution in https://github.com/causal-agent/scraper/pull/63
      • @gitter-badger made their first contribution in https://github.com/causal-agent/scraper/pull/66
      • @cfvescovo made their first contribution in https://github.com/causal-agent/scraper/pull/67

      Full Changelog: https://github.com/causal-agent/scraper/compare/v0.12.0...v0.13.0

    Owner
    june
    Millennial princess. Artisanal programmer. I don't use GitHub anymore.