html5gum

A WHATWG-compliant HTML5 tokenizer and tag soup parser

Overview

html5gum is a WHATWG-compliant HTML tokenizer.

use std::fmt::Write;
use html5gum::{Tokenizer, Token};

let html = "<title   >hello world</title>";
let mut new_html = String::new();

for token in Tokenizer::new(html).infallible() {
    match token {
        Token::StartTag(tag) => {
            write!(new_html, "<{}>", tag.name).unwrap();
        }
        Token::String(hello_world) => {
            write!(new_html, "{}", hello_world).unwrap();
        }
        Token::EndTag(tag) => {
            write!(new_html, "</{}>", tag.name).unwrap();
        }
        _ => panic!("unexpected input"),
    }
}

assert_eq!(new_html, "<title>hello world</title>");

It fully implements section 13.2 of the WHATWG HTML spec and passes html5lib's tokenizer test suite, except that:

  • this implementation requires all input to be Rust strings and therefore valid UTF-8. There is no charset detection or handling of invalid surrogates, and the relevant html5lib tests are skipped in CI.

  • there are some remaining testcases still to be decided on; see issue 5.

A distinguishing feature of html5gum is that you can bring your own token data structure and hook into token creation by implementing the Emitter trait. This allows you to:

  • Rewrite all per-HTML-tag allocations to use a custom allocator or data structure.

  • Efficiently filter out uninteresting categories of data without ever allocating for them. For example, if the plaintext between tags is of no interest to you, you can implement the respective trait methods as no-ops and avoid any overhead of creating plaintext tokens (see the sketch below).
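
A minimal sketch of that kind of filtering, using only the iterator API from the example above (a custom Emitter goes further and never allocates the skipped tokens in the first place):

use html5gum::{Tokenizer, Token};

for token in Tokenizer::new("<p>hello</p>").infallible() {
    match token {
        // only tags are of interest here
        Token::StartTag(tag) => println!("open: {}", tag.name),
        Token::EndTag(tag) => println!("close: {}", tag.name),
        // everything else (e.g. Token::String) is skipped -- but with the
        // default emitter those tokens have already been allocated by now
        _ => {}
    }
}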

html5gum was created out of a need to parse HTML tag soup efficiently. Previous options were to:

  • use quick-xml or xmlparser with some hacks to make either one not choke on bad HTML. For a (rather large) subset of HTML input this works well (quick-xml in particular can be configured to be very lenient about parsing errors) and parsing speed is stellar. But neither can parse all HTML.

    For my own use case, html5gum is about 2x slower than quick-xml.

  • use html5ever's own tokenizer to avoid as much tree-building overhead as possible. This was functional but had poor performance for my own use case (10-15x slower than quick-xml).

  • use lol-html, which would probably perform at least as well as html5gum, but comes with a closure-based API that I didn't manage to get working for my use case.

Etymology

Why is this library called html5gum?

  • G.U.M: Giant Unreadable Match-statement

  • <insert "how it feels to chew 5 gum parse HTML" meme here>

License

Licensed under the MIT license, see ./LICENSE.

Comments
  • Added more lints

    The new lints are:

    • clippy::all
    • clippy::pedantic
    • absolute_paths_not_starting_with_crate
    • rustdoc::invalid_html_tags
    • missing_copy_implementations
    • missing_debug_implementations
    • semicolon_in_expressions_from_macros
    • unreachable_pub
    • unused_extern_crates
    • variant_size_differences
    • ~clippy::missing_const_for_fn~
    opened by lebensterben 14
  • use arrayvec crate instead

    Though the intention is to avoid using unsafe Rust, the current implementation of html5gum::arrayvec::ArrayVec needs much improvement, and there's no real unsafety if we just use the arrayvec crate.


    Consider html5gum::arrayvec::ArrayVec::push():

        pub fn push(&mut self, item: T) {
            // no capacity check: once len reaches the backing array's
            // length, the indexing below panics out of bounds
            self.content[self.len] = item;
            self.len += 1;
        }
    

    Keep pushing and you are guaranteed to get a runtime index-out-of-bounds error.

    arrayvec::ArrayVec::push(), by contrast, checks the capacity first, so there's no unsafety, and overflow is reported as an explicit CapacityError:

        fn push(&mut self, element: Self::Item) {
            self.try_push(element).unwrap()
        }
    
        fn try_push(&mut self, element: Self::Item) -> Result<(), CapacityError<Self::Item>> {
            if self.len() < Self::CAPACITY {
                unsafe {
                    self.push_unchecked(element);
                }
                Ok(())
            } else {
                Err(CapacityError::new(element))
            }
        }
    

    Consider html5gum::arrayvec::ArrayVec::drain():

        pub fn drain(&mut self) -> &[T] {
            // only hands out a view and resets len; nothing is removed
            // or moved out of the backing array
            let rv = &self.content[..self.len];
            self.len = 0;
            rv
        }
    

    It doesn't actually drain anything; it just gives you a view of the content.


    I propose to use arrayvec::ArrayVec instead. If you wish, we can use the newtype pattern and only expose the new(), push(), and drain() functions.
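
    A rough sketch of that newtype (assuming arrayvec 0.7's const-generic API; the Buffer name is made up for illustration):

        use arrayvec::{ArrayVec, Drain};

        /// Hypothetical wrapper exposing only what html5gum needs.
        pub struct Buffer<T, const CAP: usize>(ArrayVec<T, CAP>);

        impl<T, const CAP: usize> Buffer<T, CAP> {
            pub fn new() -> Self {
                Self(ArrayVec::new())
            }

            /// Panics with an explicit capacity error rather than a bare
            /// index-out-of-bounds panic.
            pub fn push(&mut self, item: T) {
                self.0.push(item);
            }

            /// Actually removes the elements, yielding them by value.
            pub fn drain(&mut self) -> Drain<'_, T, CAP> {
                self.0.drain(..)
            }
        }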

    opened by lebensterben 13
  • Unable to use HtmlString struct with new v0.5.0 version for custom emitters.

    Hello there,

    First, thanks for this crate. I was using html5ever before this, but this crate seems to be faster and has fewer dependencies.

    However, I just noticed there is a new version available (v0.5.0), and I get errors now with my current implementation.

    It seems to be using an HtmlString struct now, instead of a Vec<u8>. Even though that struct is pub, the emitter module is private.

    My current implementation can be found here: https://github.com/BlackDex/vaultwarden/blob/main/src/api/icons.rs#L419-L476 https://github.com/BlackDex/vaultwarden/blob/main/src/api/icons.rs#L827-L959

    And it gets called from here: https://github.com/BlackDex/vaultwarden/blob/main/src/api/icons.rs#L557-L558

    Not sure if this is a bug, or whether I need to change some other items on my side for this new version to work? If you have any suggestions, please let me know.

    Thanks in advance!

    opened by BlackDex 11
  • Make tokenizer operate on raw bytes

    When SimonSapin and I were looking into speeding up html5ever, one low-hanging fruit left was iterating over bytes rather than over chars. It seems iterating over chars does a lot of UTF-8 boundary checking.

    With bytes iteration and memchr, you might get a few speed-ups.
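
    For instance, the kind of byte-level scanning meant here (memchr's real API; the helper function is illustrative):

        use memchr::memchr;

        /// Find the next `<` without decoding any UTF-8 characters.
        fn next_tag_open(input: &[u8], start: usize) -> Option<usize> {
            memchr(b'<', &input[start..]).map(|i| start + i)
        }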

    enhancement 
    opened by Ygg01 8
  • rewrite entire crate to run on bytes

    As per #12, change everything to run on bytes. State-machine transitions are now per-byte; character boundaries are validated in the emitter.

    Open questions:

    • [x] do we want to allow reading from raw bytes? currently BufReadReader does not validate any input, leading to panics in emitter when converting to string
    • [x] where do we validate the characters of the input stream? maybe we want to do that in the emitter now, when assembling tokens? -- terrible idea, breaks tests in irreparable ways
    • [x] serious performance regression with hyperlink
    • [x] surrogate detection possible? seems too hard after initial attempt -- let's not
    • [x] fuzzing in comparison mode
    • [x] separate fuzzing testcases from rest
    • [x] change fuzzing code such that it fuzzes arbitrary bytes
    • [x] fuzz against crashes
    opened by untitaker 5
  • Expose tokenizer state (dealing with script data correctly)

    I think most people who use an HTML5 tokenizer will want <script><b>test</b></script> to be tokenized as

    StartTag(StartTag { self_closing: false, name: "script", attributes: {} })
    String("<b>test</b>")
    EndTag(EndTag { name: "script" })
    

    instead of

    StartTag(StartTag { self_closing: false, name: "script", attributes: {} })
    StartTag(StartTag { self_closing: false, name: "b", attributes: {} })
    String("test")
    EndTag(EndTag { name: "b" })
    EndTag(EndTag { name: "script" })
    

    Unless I am missing something, that doesn't seem to be possible with the current API?

    Sidenote: it would also be nice to have some convenience utility that automatically dealt with the state implications of script, style, title, textarea, iframe, etc. For example, the HTML tokenizer of the Python standard library automatically takes care of script and style.

    enhancement 
    opened by not-my-profile 5
  • I couldn't figure out how to nicely convert a streamable `reqwest::Response` into a `Readable`

    This is likely because I am bad at Rust, but I struggled to get this working (though in theory it shouldn't be too difficult?).

    I was hoping to be able to do something like:

    let resp = reqwest::get(u).await?;
    
    for token in html5gum::Tokenizer::new(&resp) { ... }
    

    or even with bytes_stream

    let resp = reqwest::get(u).await?.bytes_stream();
    
    for token in html5gum::Tokenizer::new(&resp) { ... }
    

    I fell back on:

    let resp = reqwest::get(u).await?.text().await?;
    
    for token in html5gum::Tokenizer::new(&resp) { ... }
    

    but I'm pretty sure that's not going to stream the response and it's going to convert it back and forth from Bytes -> String -> Vec -> String unnecessarily
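
    For reference, one way this could look with the blocking client, since reqwest::blocking::Response implements std::io::Read. This is an editor's sketch: the BufReadReader type (mentioned in #25) is assumed here, not verified against any released version:

    // hypothetical: bridge a blocking response into the tokenizer
    let resp = reqwest::blocking::get(u)?;
    let reader = html5gum::BufReadReader::new(std::io::BufReader::new(resp));

    for token in html5gum::Tokenizer::new(reader) { ... }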

    question 
    opened by asottile 2
  • Bump tests/html5lib-tests from `e3e6e15` to `dd0d815`

    Bumps tests/html5lib-tests from e3e6e15 to dd0d815.

    Commits
    • dd0d815 test: fix <template> tests to not use document-fragment
    • 038c066 Test spec change: Remove parse error for <template><tr></tr> </template>
    • See full diff in compare view

    dependencies submodules 
    opened by dependabot[bot] 1
  • Bump tests/html5lib-tests from `e3e6e15` to `038c066`

    Bumps tests/html5lib-tests from e3e6e15 to 038c066.

    Commits
    • 038c066 Test spec change: Remove parse error for <template><tr></tr> </template>
    • See full diff in compare view

    dependencies submodules 
    opened by dependabot[bot] 1
  • Audit code for potential panics

    @lebensterben pointed out in #32 that it's currently hard to understand why certain potential panics (supposedly) don't occur in practice.

    We should

    • start documenting the relevant invariants in code comments
    • write more explicit assertion messages when those fail (either by adding more debug_asserts on top or doing something else)
    • statically enforce the above (this is probably impossible)

    Labelling as documentation, as I'm not aware of actual panics being hit in usage.
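
    For example (illustrative only, not code from the crate), an assertion whose message documents the invariant it relies on:

        /// Hypothetical helper: the message explains *why* the index stays in bounds.
        fn push_byte(buf: &mut [u8; 4], len: &mut usize, b: u8) {
            debug_assert!(
                *len < buf.len(),
                "char buffer overflow: at most 4 bytes (one UTF-8 char) are ever buffered"
            );
            buf[*len] = b;
            *len += 1;
        }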

    documentation good first issue 
    opened by untitaker 1
  • use arrayvec::ArrayVec instead

    Added a wrapper type html5gum::arrayvec::ArrayVec which in turn uses arrayvec::ArrayVec but only exposes the following interface:

    • new() -> html5gum::arrayvec::ArrayVec<T, CAP>
    • push(&mut self, item: T)
    • drain(&mut self) -> arrayvec::Drain<T, CAP>

    closes #32

    opened by lebensterben 1
  • Improve performance of DefaultEmitter

    While implementing https://github.com/lycheeverse/lychee/pull/480 I realized how slow the default emitter really is. It makes link extraction 10-40% slower than html5ever. It is currently not really possible to beat html5ever at all unless a custom emitter is implemented.

    We could:

    • build another emitter that reuses strings and calls a callback with borrowed strings instead, bringing it much closer to lol-html's API (see the sketch below)
    • allow for custom allocators for all the strings we create -- similar to the StrTendril magic html5ever does (but definitely not using that crate)
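
    An editor's sketch (hypothetical, not the crate's API) of the first option's shape: one reused buffer, borrowed out to a callback instead of allocating per token:

        /// Reuses a single buffer across tokens.
        struct CallbackEmitter<F: FnMut(&[u8])> {
            buffer: Vec<u8>,
            callback: F,
        }

        impl<F: FnMut(&[u8])> CallbackEmitter<F> {
            fn flush(&mut self) {
                (self.callback)(&self.buffer); // borrowed: no per-token String
                self.buffer.clear(); // capacity is retained for reuse
            }
        }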
    bug good first issue 
    opened by untitaker 5
  • html5gum does not detect lone surrogates

    Lone surrogates are invalid UTF-8, so html5gum 0.3.0, which takes &str/String, is not able to handle them.

    After merging #25, html5gum will be able to read arbitrary bytes. At that point the expectation might be that lone surrogates produce error tokens, but they do not.

    Note: lone surrogates have no impact on parsing behavior; only some error tokens are missing from the token stream.

    bug 
    opened by untitaker 2
  • Add tree builder

    We should add a real tree builder, and a lol-html-like API on top of it, to this crate. As part of that we need to move the tokenizer into a submodule and rework the README.

    enhancement 
    opened by untitaker 1
  • Attach input location to tokens (add spans feature)

    Hey, thanks for this library ... it looks really promising :) I am working on an HTML linter for which I require the spans of parser errors, tag names, attribute names and attribute values. These spans would ideally be reported as core::ops::Range<usize>, so that I can pass them directly to the codespan_reporting library (codespan_reporting::diagnostic::Label::range in particular). Since span tracking is of course overhead it would be behind an off-by-default feature flag.

    I recently implemented this in my fork of the html5ever tokenizer ... which I frankly would love to abandon for a more sound library :) If you are interested in this I can probably implement it.
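
    For concreteness, a hypothetical shape for the feature (names made up; spans are byte offsets into the source):

        use core::ops::Range;

        struct Spanned<T> {
            token: T,
            span: Range<usize>, // can be passed straight to codespan_reporting
        }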

    enhancement 
    opened by not-my-profile 5
  • Make tokenizer buffers customizable

    The tokenizer type itself allocates some strings on self; one should be able to replace these with their own buffers and allocation behavior. Create another trait, much like Emitter, that exposes push/pop/clear-type methods.
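
    An editor's sketch of what such a trait might look like (hypothetical, not the crate's API):

        trait TokenizerBuffers {
            fn push(&mut self, b: u8); // append to the current scratch buffer
            fn pop(&mut self) -> Option<u8>; // undo the last push
            fn clear(&mut self); // reset before the next token
        }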

    enhancement 
    opened by untitaker 0