Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.

Michael Maclean

Last update: Jan 3, 2023

Overview

htmlq

Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files. Mozilla's MDN has a good reference for CSS selector syntax.

Usage

$ htmlq -h
htmlq 0.0.1
Runs CSS selectors on HTML

USAGE:
    htmlq [FLAGS] [OPTIONS] <selector>...

FLAGS:
    -h, --help                 Prints help information
    -w, --ignore-whitespace    When printing text nodes, ignore those that consist entirely of whitespace
    -p, --pretty               Pretty-print the serialised output
    -t, --text                 Output only the contents of text nodes inside selected elements
    -V, --version              Prints version information

OPTIONS:
    -a, --attribute <attribute>    Only return this attribute (if present) from selected elements
    -f, --filename <FILE>          The input file. Defaults to stdin
    -o, --output <FILE>            The output file. Defaults to stdout

ARGS:
    <selector>...    The CSS expression to select
$

Examples

Using with cURL to find part of a page by ID

$ curl -s https://www.rust-lang.org/ | htmlq '#get-help'
<div class="four columns mt3 mt0-l" id="get-help">
        <h4>Get help!</h4>
        <ul>
          <li><a href="https://doc.rust-lang.org">Documentation</a></li>
          <li><a href="https://users.rust-lang.org">Ask a Question on the Users Forum</a></li>
          <li><a href="http://ping.rust-lang.org">Check Website Status</a></li>
        </ul>
        <div class="languages">
            <label class="hidden" for="language-footer">Language</label>
            <select id="language-footer">
                <option title="English (US)" value="en-US">English (en-US)</option>
<option title="French" value="fr">Français (fr)</option>
<option title="German" value="de">Deutsch (de)</option>

            </select>
        </div>
      </div>

Find all the links in a page

$ curl -s https://www.rust-lang.org/ | htmlq -a href a
/
/tools/install
/learn
/tools
/governance
/community
https://blog.rust-lang.org/
/learn/get-started
https://blog.rust-lang.org/2019/04/25/Rust-1.34.1.html
https://blog.rust-lang.org/2018/12/06/Rust-1.31-and-rust-2018.html
[...]
$

Get the text content of a post

$ curl -s https://nixos.org/nixos/about.html | htmlq -t .main

          About NixOS

NixOS is a GNU/Linux distribution that aims to
improve the state of the art in system configuration management.  In
existing distributions, actions such as upgrades are dangerous:
upgrading a package can cause other packages to break, upgrading an
entire system is much less reliable than reinstalling from scratch,
you can’t safely test what the results of a configuration change will
be, you cannot easily undo changes to the system, and so on.  We want
to change that.  NixOS has many innovative features:

[...]

Pretty print HTML

(This is a bit of a work in progress)

$ curl -s https://mgdm.net | htmlq -p '#posts'

  I write about...
  
  
    
      
        29/04/2019
        Debugging network connections on macOS with nettop
        
      Using nettop to find out what network connections a program is trying to make.
      
    
[...]

Comments

Improve display of code-blocks in `README.md`

This PR combines a bunch of cosmetic enhancements to the readme file, plus an extra example to showcase how bat can be used to add syntax highlighting.

The latter includes a screenshot that's uploaded as a file attachment to https://github.com/mgdm/htmlq/pull/17#issuecomment-915206105, since I didn't feel comfortable asking the author to accept a 19 KB blob of uncompressible image data in addition to a bunch of lightweight improvements.

opened by Alhadis 4

Add option for converting relative href to absolute.

In the example curl -s https://www.rust-lang.org/ | htmlq -a href a the links are output as-is, for example, /policies. To use this output with other tools, it would be useful to make these links absolute. For example, curl -s https://www.rust-lang.org/ | htmlq -u https://www.rust-lang.org/ -a href a would result in https://www.rust-lang.org/policies (i.e. any relative href attributes are converted to absolute using the base URL specified with -u).
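Independently of any htmlq option, the conversion requested here can be done downstream. The following is a minimal Python sketch, not part of htmlq: it assumes the hrefs arrive one per line, as in the output of the example above, and that the base URL is known to the caller. The helper name `absolutize` is hypothetical.

```python
from urllib.parse import urljoin


def absolutize(hrefs, base_url):
    """Resolve each href against base_url; absolute URLs pass through unchanged."""
    # Blank lines are skipped; surrounding whitespace (e.g. newlines) is stripped.
    return [urljoin(base_url, href.strip()) for href in hrefs if href.strip()]


# Sample hrefs in the shape shown in the example output above.
sample = ["/policies", "/learn/get-started", "https://blog.rust-lang.org/"]
for url in absolutize(sample, "https://www.rust-lang.org/"):
    print(url)
```

To use it as a pipe filter, `sys.stdin` can be passed as `hrefs`, since iterating over it yields one line at a time.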

opened by Chaz6 3

How to install this

Could you please write a couple of lines in the README.md on how one can download, compile, and install this? As it is now, I believe that only a Rust native knows how to install this.

opened by alexanderkoponen 3

add a binary build github workflow

Hi! Thanks for writing such a great tool. I've added a GitHub action to automatically build the tool for Windows, macOS, and Linux on x86_64; it runs for any tag of the form v<semver> (e.g. v1.0.0). Pushing the tag to the repo starts the workflow, which creates a draft release and attaches binaries for each of the platforms and architectures. I've also attached a gif of the process of cutting a release as some added documentation for how it works!

Fixes #6.

opened by chrisdickinson 3

[Feature request]

I would truly appreciate an inverted selector: something that displays everything other than the matched elements. It would be very useful for excluding and removing unwanted JavaScript and Google ads.

opened by Aeres-u99 3

Does htmlq support very large XML?

Hi

I downloaded htmlq to process a large XML database (1.4GB, link) before data analysis.

When I run cat 'full database' | htmlq 'drug', the command runs for 10 seconds before htmlq runs out of memory.

Is that behaviour expected or is this a memory bug?
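The report does not establish the cause. If the whole document tree is built in memory before the selector runs (an assumption, not something confirmed here), memory use would grow with input size, and a streaming parser avoids that. The following is a minimal sketch using Python's standard library, assuming the input is well-formed XML and that the records of interest are `drug` elements, as in the selector above. The function name `count_elements` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count <tag> elements in a file-like object, parsing incrementally."""
    count = 0
    # "end" events fire once an element and all of its children have been read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Release the element's children and text; an empty shell
            # stays attached to its parent, which costs far less memory.
            elem.clear()
    return count


# Small in-memory stand-in for a large file (assumed structure).
sample = io.BytesIO(b"<db><drug><name>a</name></drug><drug><name>b</name></drug></db>")
print(count_elements(sample, "drug"))
```

For a namespaced document, `tag` would need the `{namespace}drug` form that ElementTree uses for qualified names.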

opened by she3o 2

jq-like DSL?

I decided to try htmlq since I have quite a bit of positive experience with jq. But the first practical task I tried to implement doesn't seem possible with htmlq because of limitations of CSS. Specifically, I need to extract a list of elements matching some CSS selector that also have some child matching some other selector. Example:

# select elements that have the data-attr1 attribute
htmlq -f input.html '[data-attr1]'
# select <span> elements
htmlq -f input.html 'span'
# select <span> elements that are descendants of elements with the data-attr1 attribute
htmlq -f input.html '[data-attr1] span'
# select elements that have the data-attr1 attribute and some <span> descendants
# impossible

I imagine it would be possible with some DSL, for example (very crude):

css("[data-attr1]") | select(isempty(css("span")) | not)
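The filter described above can be expressed outside CSS as a two-step check: match on the attribute, then test for a descendant. The sketch below uses only Python's standard library and assumes the input parses as well-formed XML, which real-world HTML often does not. The function name `select_with_descendant` is hypothetical. Separately, the CSS Selectors Level 4 `:has()` pseudo-class is designed for this kind of query; whether htmlq accepts it is not established here.

```python
import xml.etree.ElementTree as ET


def select_with_descendant(root, attr, descendant_tag):
    """Return elements carrying `attr` that contain at least one `descendant_tag`."""
    return [
        elem
        for elem in root.iter()  # root itself and every descendant
        # find(".//tag") searches strictly below elem, never elem itself
        if attr in elem.attrib and elem.find(f".//{descendant_tag}") is not None
    ]


doc = ET.fromstring(
    "<div>"
    "<p data-attr1='a'><span>kept</span></p>"
    "<p data-attr1='b'>dropped</p>"
    "<p><span>no attribute</span></p>"
    "</div>"
)
for elem in select_with_descendant(doc, "data-attr1", "span"):
    print(elem.get("data-attr1"))
```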
opened by cyberhuman 2

Add specific permissions to workflows under .github/workflows

This PR adds specific permissions to the existing workflows under .github/workflows.

Background

I have implemented a GitHub App to automatically restrict permissions for the GITHUB_TOKEN in workflows. This is a security best practice as per the GitHub Actions hardening guide.

I am trying the App out on public repositories, by forking them, installing the App on the fork, and manually creating PRs with the fixed workflows. The App automatically fixes permissions when a PR is created that creates a new workflow, so feel free to install it for future workflows, or try it out on other repos.

I have manually reviewed the changes, and they do look good to me. If something looks off, please let me know. If you have feedback, would love to hear it. Thanks!

opened by varunsh-coder 2

Add Windows Support

Tried to install this under Windows (not WSL2) and it fails to compile:

error: failed to build archive: function not supported
error: aborting due to previous error
error: could not compile `rand_core`

Any chance you could add Windows support or just release a Windows binary?

opened by dmoath 2

Selecting an arbitrary property in tags

Sometimes there isn't much to reference in HTML, but elements do still have something like <div data-testid="foobar">. As far as I know, htmlq doesn't have anything to handle this, or does it? If not, how could it be implemented?
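Standard CSS attribute selectors cover this case: `[data-testid="foobar"]` matches any element whose `data-testid` attribute equals `foobar`, in the same form as the `[data-attr1]` selector used with htmlq elsewhere on this page. As an illustration of the matching logic only, and not of htmlq's internals, here is a short Python sketch using the standard library's limited XPath support. It assumes well-formed markup, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample; only the data-testid value comes from the question above.
doc = ET.fromstring(
    "<body>"
    "<div data-testid='foobar'>target</div>"
    "<div data-testid='other'>ignored</div>"
    "<div>no attribute</div>"
    "</body>"
)

# [@name='value'] is the XPath counterpart of the CSS selector [name="value"].
matches = doc.findall(".//*[@data-testid='foobar']")
for elem in matches:
    print(elem.text)
```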

opened by jtagcat 1

Improving the `-B` logic

Most sites will not have a <base...> element. Is there any way to ascertain the domain the page came from? I guess not, if we're piping from curl output. Maybe use tee to pass both the domain and the body?

opened by ralyodio 1