html5gum

A WHATWG-compliant HTML5 tokenizer and tag soup parser

Overview

html5gum is a WHATWG-compliant HTML tokenizer.

use std::fmt::Write;
use html5gum::{Tokenizer, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

for token in Tokenizer::new(html).infallible() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", hello_world).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");

It fully implements section 13.2 of the WHATWG HTML spec and passes html5lib's tokenizer test suite, except that:

  • this implementation requires all input to be Rust strings and therefore valid UTF-8. There is no charset detection or handling of invalid surrogates, and the relevant html5lib tests are skipped in CI.

  • there are some remaining testcases still to be decided on; see issue 5.

A distinguishing feature of html5gum is that you can bring your own token data structure and hook into token creation by implementing the Emitter trait. This allows you to:

  • Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.

  • Efficiently filter out uninteresting categories of data without ever allocating for them. For example, if the plaintext between tags is of no interest to you, you can implement the respective trait methods as no-ops and avoid any overhead of creating plaintext tokens (see the sketch below).
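
A minimal sketch of that kind of filtering, using only the iterator API from the example above (a custom Emitter goes further and never allocates the skipped tokens in the first place):

use html5gum::{Tokenizer, Token};

for token in Tokenizer::new("<p>hello</p>").infallible() {
    match token {
        // only tags are of interest here
        Token::StartTag(tag) => println!("open: {}", tag.name),
        Token::EndTag(tag) => println!("close: {}", tag.name),
        // everything else (e.g. Token::String) is skipped -- but with the
        // default emitter those tokens have already been allocated by now
        _ => {}
    }
}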

html5gum was created out of a need to parse HTML tag soup efficiently. Previous options were to:

  • use quick-xml or xmlparser with some hacks to make either one not choke on bad HTML. For a (rather large) subset of HTML input this works well (quick-xml in particular can be configured to be very lenient about parsing errors) and parsing speed is stellar. But neither can parse all HTML.

    For my own use case, html5gum is about 2x slower than quick-xml.

  • use html5ever's own tokenizer to avoid as much tree-building overhead as possible. This was functional but had poor performance for my own use case (10-15x slower than quick-xml).

  • use lol-html, which would probably perform at least as well as html5gum, but comes with a closure-based API that I didn't manage to get working for my use case.

Etymology

Why is this library called html5gum?

  • G.U.M: Giant Unreadable Match-statement

  • <insert "how it feels to chew 5 gum parse HTML" meme here>

License

Licensed under the MIT license, see ./LICENSE.

Comments
  • Added more lints

    The new lints are:

    • clippy::all
    • clippy::pedantic
    • absolute_paths_not_starting_with_crate
    • rustdoc::invalid_html_tags
    • missing_copy_implementations
    • missing_debug_implementations
    • semicolon_in_expressions_from_macros
    • unreachable_pub
    • unused_extern_crates
    • variant_size_differences
    • ~clippy::missing_const_for_fn~
    opened by lebensterben 14
  • use arrayvec crate instead

    Though the intention is to avoid using unsafe Rust, the current implementation of html5gum::arrayvec::ArrayVec needs much improvement, and there's no real unsafety if we just use the arrayvec crate.


    Consider html5gum::arrayvec::ArrayVec::push():

        pub fn push(&mut self, item: T) {
            // no capacity check: once len reaches the backing array's
            // length, the indexing below panics out of bounds
            self.content[self.len] = item;
            self.len += 1;
        }
    

    Keep pushing and you are guaranteed to get a runtime index-out-of-bounds error.

    arrayvec::ArrayVec::push(), by contrast, checks the capacity first, so there's no unsafety, and overflow is reported as an explicit CapacityError:

        fn push(&mut self, element: Self::Item) {
            self.try_push(element).unwrap()
        }
    
        fn try_push(&mut self, element: Self::Item) -> Result<(), CapacityError<Self::Item>> {
            if self.len() < Self::CAPACITY {
                unsafe {
                    self.push_unchecked(element);
                }
                Ok(())
            } else {
                Err(CapacityError::new(element))
            }
        }
    

    Consider html5gum::arrayvec::ArrayVec::drain():

        pub fn drain(&mut self) -> &[T] {
            // only hands out a view and resets len; nothing is removed
            // or moved out of the backing array
            let rv = &self.content[..self.len];
            self.len = 0;
            rv
        }
    

    It doesn't actually drain anything; it just gives you a view of the content.


    I propose to use arrayvec::ArrayVec instead. If you wish, we can use the newtype pattern and only expose the new(), push(), and drain() functions.
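
    A rough sketch of that newtype (assuming arrayvec 0.7's const-generic API; the Buffer name is made up for illustration):

        use arrayvec::{ArrayVec, Drain};

        /// Hypothetical wrapper exposing only what html5gum needs.
        pub struct Buffer<T, const CAP: usize>(ArrayVec<T, CAP>);

        impl<T, const CAP: usize> Buffer<T, CAP> {
            pub fn new() -> Self {
                Self(ArrayVec::new())
            }

            /// Panics with an explicit capacity error rather than a bare
            /// index-out-of-bounds panic.
            pub fn push(&mut self, item: T) {
                self.0.push(item);
            }

            /// Actually removes the elements, yielding them by value.
            pub fn drain(&mut self) -> Drain<'_, T, CAP> {
                self.0.drain(..)
            }
        }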

    opened by lebensterben 13
  • Unable to use HtmlString struct with new v0.5.0 version for custom emitters.

    Hello there,

    First, thanks for this crate. I was using html5ever before this, but this crate seems to be faster and has fewer dependencies.

    However, I just noticed there is a new version available (v0.5.0), and I get errors now with my current implementation.

    It seems to be using an HtmlString struct now, instead of a Vec<u8>. Even though that struct is pub, the emitter module is private.

    My current implementation can be found here: https://github.com/BlackDex/vaultwarden/blob/main/src/api/icons.rs#L419-L476 https://github.com/BlackDex/vaultwarden/blob/main/src/api/icons.rs#L827-L959

    And it gets called from here: https://github.com/BlackDex/vaultwarden/blob/main/src/api/icons.rs#L557-L558

    Not sure if this is a bug, or whether I need to change some other items on my side for this new version to work? If you have any suggestions, please let me know.

    Thanks in advance!

    opened by BlackDex 11
  • Make tokenizer operate on raw bytes

    When SimonSapin and I were looking into speeding up html5ever, one low-hanging fruit left was iterating over bytes rather than over chars. It seems iterating over chars does a lot of UTF-8 boundary checking.

    With bytes iteration and memchr, you might get a few speed-ups.
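
    For instance, the kind of byte-level scanning meant here (memchr's real API; the helper function is illustrative):

        use memchr::memchr;

        /// Find the next `<` without decoding any UTF-8 characters.
        fn next_tag_open(input: &[u8], start: usize) -> Option<usize> {
            memchr(b'<', &input[start..]).map(|i| start + i)
        }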

    enhancement 
    opened by Ygg01 8
  • rewrite entire crate to run on bytes

    As per #12, change everything to run on bytes. State-machine transitions are now per-byte; character boundaries are validated in the emitter.

    Open questions:

    • [x] do we want to allow reading from raw bytes? currently BufReadReader does not validate any input, leading to panics in emitter when converting to string
    • [x] where do we validate the characters of the input stream? maybe we want to do that in the emitter now, when assembling tokens? -- terrible idea, breaks tests in irreparable ways
    • [x] serious performance regression with hyperlink
    • [x] surrogate detection possible? seems too hard after initial attempt -- let's not
    • [x] fuzzing in comparison mode
    • [x] separate fuzzing testcases from rest
    • [x] change fuzzing code such that it fuzzes arbitrary bytes
    • [x] fuzz against crashes
    opened by untitaker 5
  • Expose tokenizer state (dealing with script data correctly)

    I think most people who use an HTML5 tokenizer will want <script><b>test</b></script> to be tokenized as

    StartTag(StartTag { self_closing: false, name: "script", attributes: {} })
    String("<b>test</b>")
    EndTag(EndTag { name: "script" })
    

    instead of

    StartTag(StartTag { self_closing: false, name: "script", attributes: {} })
    StartTag(StartTag { self_closing: false, name: "b", attributes: {} })
    String("test")
    EndTag(EndTag { name: "b" })
    EndTag(EndTag { name: "script" })
    

    Unless I am missing something, that doesn't seem to be possible with the current API?

    Sidenote: it would also be nice to have some convenience utility that automatically dealt with the state implications of script, style, title, textarea, iframe, etc. For example, the HTML tokenizer of the Python standard library automatically takes care of script and style.

    enhancement 
    opened by not-my-profile 5
  • I couldn't figure out how to nicely convert a streamable `reqwest::Response` into a `Readable`

    This is likely because I am bad at Rust, but I struggled to get this working (though in theory it shouldn't be too difficult?).

    I was hoping to be able to do something like:

    let resp = reqwest::get(u).await?;
    
    for token in html5gum::Tokenizer::new(&resp) { ... }
    

    or even with bytes_stream

    let resp = reqwest::get(u).await?.bytes_stream();
    
    for token in html5gum::Tokenizer::new(&resp) { ... }
    

    I fell back on:

    let resp = reqwest::get(u).await?.text().await?;
    
    for token in html5gum::Tokenizer::new(&resp) { ... }
    

    but I'm pretty sure that's not going to stream the response and it's going to convert it back and forth from Bytes -> String -> Vec -> String unnecessarily
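
    For reference, one way this could look with the blocking client, since reqwest::blocking::Response implements std::io::Read. This is an editor's sketch: the BufReadReader type (mentioned in #25) is assumed here, not verified against any released version:

    // hypothetical: bridge a blocking response into the tokenizer
    let resp = reqwest::blocking::get(u)?;
    let reader = html5gum::BufReadReader::new(std::io::BufReader::new(resp));

    for token in html5gum::Tokenizer::new(reader) { ... }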

    question 
    opened by asottile 2
  • Bump tests/html5lib-tests from `e3e6e15` to `dd0d815`

    Bumps tests/html5lib-tests from e3e6e15 to dd0d815.

    Commits
    • dd0d815 test: fix <template> tests to not use document-fragment
    • 038c066 Test spec change: Remove parse error for <template><tr></tr> </template>
    • See full diff in compare view

    dependencies submodules 
    opened by dependabot[bot] 1
  • Bump tests/html5lib-tests from `e3e6e15` to `038c066`

    Bumps tests/html5lib-tests from e3e6e15 to 038c066.

    Commits
    • 038c066 Test spec change: Remove parse error for <template><tr></tr> </template>
    • See full diff in compare view

    dependencies submodules 
    opened by dependabot[bot] 1
  • Audit code for potential panics

    @lebensterben pointed out in #32 that it's currently hard to understand why certain potential panics (supposedly) don't occur in practice.

    We should

    • start documenting the relevant invariants in code comments
    • write more explicit assertion messages when those fail (either by adding more debug_asserts on top or doing something else)
    • statically enforce the above (this is probably impossible)

    Labelling as documentation, as I'm not aware of actual panics being hit in usage.
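
    For example (illustrative only, not code from the crate), an assertion whose message documents the invariant it relies on:

        /// Hypothetical helper: the message explains *why* the index stays in bounds.
        fn push_byte(buf: &mut [u8; 4], len: &mut usize, b: u8) {
            debug_assert!(
                *len < buf.len(),
                "char buffer overflow: at most 4 bytes (one UTF-8 char) are ever buffered"
            );
            buf[*len] = b;
            *len += 1;
        }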

    documentation good first issue 
    opened by untitaker 1
  • use arrayvec::ArrayVec instead

    Added a wrapper type html5gum::arrayvec::ArrayVec which in turn uses arrayvec::ArrayVec but only exposes the following interface:

    • new() -> html5gum::arrayvec::ArrayVec<T, CAP>
    • push(&mut self, item: T)
    • drain(&mut self) -> arrayvec::Drain<T, CAP>

    closes #32

    opened by lebensterben 1
  • Improve performance of DefaultEmitter

    While implementing https://github.com/lycheeverse/lychee/pull/480 I realized how slow the default emitter really is. It makes link extraction 10-40% slower than html5ever. It is currently not really possible to beat html5ever at all unless a custom emitter is implemented.

    We could:

    • build another emitter that reuses strings and calls a callback with borrowed strings instead, bringing it much closer to lol-html's API (see the sketch below)
    • allow for custom allocators for all the strings we create -- similar to the StrTendril magic html5ever does (but definitely not using that crate)
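
    An editor's sketch (hypothetical, not the crate's API) of the first option's shape: one reused buffer, borrowed out to a callback instead of allocating per token:

        /// Reuses a single buffer across tokens.
        struct CallbackEmitter<F: FnMut(&[u8])> {
            buffer: Vec<u8>,
            callback: F,
        }

        impl<F: FnMut(&[u8])> CallbackEmitter<F> {
            fn flush(&mut self) {
                (self.callback)(&self.buffer); // borrowed: no per-token String
                self.buffer.clear(); // capacity is retained for reuse
            }
        }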
    bug good first issue 
    opened by untitaker 5
  • html5gum does not detect lone surrogates

    Lone surrogates are invalid UTF-8, so html5gum 0.3.0, which takes &str/String, is not able to handle them.

    After merging #25, html5gum will be able to read arbitrary bytes. At that point the expectation might be that lone surrogates produce error tokens, but they do not.

    Note: lone surrogates have no impact on parsing behavior; only some error tokens are missing from the token stream.

    bug 
    opened by untitaker 2
  • Add tree builder

    We should add a real tree builder, and a lol-html-like API on top of it, to this crate. As part of that we need to move the tokenizer into a submodule and rework the README.

    enhancement 
    opened by untitaker 1
  • Attach input location to tokens (add spans feature)

    Hey, thanks for this library ... it looks really promising :) I am working on an HTML linter for which I require the spans of parser errors, tag names, attribute names and attribute values. These spans would ideally be reported as core::ops::Range<usize>, so that I can pass them directly to the codespan_reporting library (codespan_reporting::diagnostic::Label::range in particular). Since span tracking is of course overhead it would be behind an off-by-default feature flag.

    I recently implemented this in my fork of the html5ever tokenizer ... which I frankly would love to abandon for a more sound library :) If you are interested in this I can probably implement it.
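
    For concreteness, a hypothetical shape for the feature (names made up; spans are byte offsets into the source):

        use core::ops::Range;

        struct Spanned<T> {
            token: T,
            span: Range<usize>, // can be passed straight to codespan_reporting
        }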

    enhancement 
    opened by not-my-profile 5
  • Make tokenizer buffers customizable

    The tokenizer type itself allocates some strings on self; one should be able to replace these with their own buffers and allocation behavior. Create another trait, much like Emitter, that exposes push/pop/clear-type methods.
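
    An editor's sketch of what such a trait might look like (hypothetical, not the crate's API):

        trait TokenizerBuffers {
            fn push(&mut self, b: u8); // append to the current scratch buffer
            fn pop(&mut self) -> Option<u8>; // undo the last push
            fn clear(&mut self); // reset before the next token
        }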

    enhancement 
    opened by untitaker 0