htmlq

Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files. Mozilla's MDN has a good reference for CSS selector syntax.

Usage

$ htmlq -h
htmlq 0.0.1
Runs CSS selectors on HTML

USAGE:
    htmlq [FLAGS] [OPTIONS] ...

FLAGS:
    -h, --help                 Prints help information
    -w, --ignore-whitespace    When printing text nodes, ignore those that consist entirely of whitespace
    -p, --pretty               Pretty-print the serialised output
    -t, --text                 Output only the contents of text nodes inside selected elements
    -V, --version              Prints version information

OPTIONS:
    -a, --attribute    Only return this attribute (if present) from selected elements
    -f, --filename     The input file. Defaults to stdin
    -o, --output       The output file. Defaults to stdout

ARGS:
    ...    The CSS expression to select
$
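
The flags above can be combined in a single invocation. As a minimal sketch, the two wrapper functions below use only options listed in the help text; the names `page_text` and `attr_values` are made up for illustration and are not part of htmlq.

```shell
# Hypothetical helpers built only from the flags shown in the help text.

# page_text FILE SELECTOR
# Print the text nodes inside the matched elements, skipping nodes that
# consist entirely of whitespace.
page_text() {
    htmlq --filename "$1" --text --ignore-whitespace "$2"
}

# attr_values FILE ATTRIBUTE SELECTOR
# Print the given attribute (if present) from each matched element.
attr_values() {
    htmlq --filename "$1" --attribute "$2" "$3"
}
```

For example, `page_text saved.html h1` would print the heading text of a saved page and `attr_values saved.html src img` the `src` of each image, assuming a file named `saved.html` exists.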

Examples

Using with cURL to find part of a page by ID

$ curl -s https://www.rust-lang.org/ | htmlq '#get-help'
<div class="four columns mt3 mt0-l" id="get-help">
        <h4>Get help!</h4>
        <ul>
          <li><a href="https://doc.rust-lang.org">Documentation</a></li>
          <li><a href="https://users.rust-lang.org">Ask a Question on the Users Forum</a></li>
          <li><a href="http://ping.rust-lang.org">Check Website Status</a></li>
        </ul>
        <div class="languages">
            <label class="hidden" for="language-footer">Language</label>
            <select id="language-footer">
                <option title="English (US)" value="en-US">English (en-US)</option>
<option title="French" value="fr">Français (fr)</option>
<option title="German" value="de">Deutsch (de)</option>

            </select>
        </div>
      </div>

Find all the links in a page

$ curl -s https://www.rust-lang.org/ | htmlq -a href a
/
/tools/install
/learn
/tools
/governance
/community
https://blog.rust-lang.org/
/learn/get-started
https://blog.rust-lang.org/2019/04/25/Rust-1.34.1.html
https://blog.rust-lang.org/2018/12/06/Rust-1.31-and-rust-2018.html
[...]
$
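
The links are printed exactly as they appear in the page, so many of them are relative. The help text above lists no option for resolving them, so one possible workaround is to post-process the output in the shell. The `absolutize` function below is a hypothetical helper, not part of htmlq, and it is only a sketch: it handles already-absolute URLs, protocol-relative URLs, root-relative paths and plain relative paths, and it does not normalise `../` segments or treat fragments specially.

```shell
# absolutize BASE
# Read one href per line on stdin and print an absolute URL per line.
# Hypothetical helper, not part of htmlq. The body runs in a subshell
# so the working variables do not leak into the caller.
absolutize() (
    base=${1%/}    # drop one trailing slash from the base URL
    # Keep only "scheme://host" from the base URL.
    origin=$(printf '%s\n' "$base" | sed -E 's#^([A-Za-z][A-Za-z0-9+.-]*://[^/]+).*#\1#')
    scheme=${origin%%//*}    # e.g. "https:"
    while IFS= read -r href; do
        case $href in
            *://*|mailto:*) printf '%s\n' "$href" ;;            # already absolute
            //*)            printf '%s%s\n' "$scheme" "$href" ;;  # protocol-relative
            /*)             printf '%s%s\n' "$origin" "$href" ;;  # root-relative
            *)              printf '%s/%s\n' "$base" "$href" ;;   # relative to the base
        esac
    done
)
```

It would be used at the end of the pipeline, for instance `curl -s https://www.rust-lang.org/ | htmlq -a href a | absolutize https://www.rust-lang.org/`, which turns `/tools/install` into `https://www.rust-lang.org/tools/install` and leaves `https://blog.rust-lang.org/` unchanged.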

Get the text content of a post

$ curl -s https://nixos.org/nixos/about.html | htmlq -t .main

          About NixOS

NixOS is a GNU/Linux distribution that aims to
improve the state of the art in system configuration management.  In
existing distributions, actions such as upgrades are dangerous:
upgrading a package can cause other packages to break, upgrading an
entire system is much less reliable than reinstalling from scratch,
you can’t safely test what the results of a configuration change will
be, you cannot easily undo changes to the system, and so on.  We want
to change that.  NixOS has many innovative features:

[...]

Pretty print HTML

(This is a bit of a work in progress)

Comments
  • Improve display of code-blocks in `README.md`

    This PR combines a bunch of cosmetic enhancements to the readme file, plus an extra example to showcase how bat can be used to add syntax highlighting.

    The latter includes a screenshot that's uploaded as a file attachment to https://github.com/mgdm/htmlq/pull/17#issuecomment-915206105, since I didn't feel comfortable asking the author to accept a 19 KB blob of uncompressible image data in addition to a bunch of lightweight improvements.

    opened by Alhadis 4
  • Add option for converting relative href to absolute.

    In the example `curl -s https://www.rust-lang.org/ | htmlq -a href a`, the links are output as-is, for example `/policies`. To use this with other tools, it would be useful to make these links absolute. For example, `curl -s https://www.rust-lang.org/ | htmlq -u https://www.rust-lang.org/ -a href a` would result in `https://www.rust-lang.org/policies` (i.e. any relative href attributes are converted to absolute using the base URL specified with `-u`).

    opened by Chaz6 3
  • How to install this

    Could you please write a couple of lines in the README.md on how to download, compile and install this? As it is now, I believe that only a Rust native would know how to install it.

    opened by alexanderkoponen 3
  • add a binary build github workflow

    Hi! Thanks for writing such a great tool. I've added a GitHub action to automatically build the tool for Windows, macOS, and Linux on x86_64; it runs for any tag of the form v<semver> (e.g. v1.0.0). Pushing the tag to the repo starts the workflow, which creates a draft release and attaches binaries for each of the platforms and architectures. I've also attached a gif of the process of cutting a release as some added documentation for how it works!

    Fixes #6.

    opened by chrisdickinson 3
  • [Feature request]

    I would truly appreciate an inverted selector: something that displays everything other than the selected elements. It would be very useful for excluding and removing weird JavaScript and Google ads.

    opened by Aeres-u99 3
  • Does htmlq support very large XML?

    Hi

    I downloaded htmlq to process a large XML database (1.4GB, link) before data analysis.

    When I run `cat 'full database' | htmlq 'drug'`, the command runs for 10 seconds before htmlq runs out of memory.

    Is that behaviour expected or is this a memory bug?

    opened by she3o 2
  • jq-like DSL?

    I decided to try htmlq since I have quite some positive experience with jq. But the first practical task I tried to implement doesn't seem possible with htmlq because of limitations of CSS. Specifically, I need to extract a list of elements matching some CSS selector, that also have some child matching some other selector. Example:

    # select elements that have the data-attr1 attribute
    htmlq -f input.html '[data-attr1]'
    
    # select <span> elements
    htmlq -f input.html 'span'
    
    # select <span> elements that are descendants of elements with the data-attr1 attribute
    htmlq -f input.html '[data-attr1] span'
    
    # select elements that have the data-attr1 attribute and some <span> descendants
    # impossible
    

    I imagine it would be possible with some DSL, for example (very crude):

    css("[data-attr1]") | select(isempty(css("span")) | not)
    
    opened by cyberhuman 2
  • Add specific permissions to workflows under .github/workflows

    This PR adds specific permissions to the existing workflows under .github/workflows.

    Background

    I have implemented a GitHub App to automatically restrict permissions for the GITHUB_TOKEN in workflows. This is a security best practice as per the GitHub Actions hardening guide.

    I am trying the App out on public repositories, by forking them, installing the App on the fork, and manually creating PRs with the fixed workflows. The App automatically fixes permissions when a PR is created that creates a new workflow, so feel free to install it for future workflows, or try it out on other repos.

    I have manually reviewed the changes, and they do look good to me. If something looks off, please let me know. If you have feedback, would love to hear it. Thanks!

    opened by varunsh-coder 2
  • Add Windows Support

    Tried to install this under windows (not WSL2) and it fails to compile:

    error: failed to build archive: function not supported
    
    error: aborting due to previous error
    
    error: could not compile `rand_core`
    
    

    Any chance you could add Windows support or just release a windows binary?

    opened by dmoath 2
  • Selecting an arbitrary property in tags

    Sometimes there isn't much to reference in HTML, but elements do still have something like <div data-testid="foobar">. As far as I know, htmlq doesn't have anything to handle this, or does it? If not, how could it be implemented?

    opened by jtagcat 1
  • Improving the `-B` logic

    Most sites will not have a <base...> element. Is there any way to ascertain the domain the page came from? I guess not, if we're piping from curl output. Maybe use tee to pass both the domain and the body?

    opened by ralyodio 1
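
On the question above about elements that only carry something like `data-testid="foobar"`: a CSS attribute selector is the usual way to match on an arbitrary attribute, and the `[data-attr1]` commands quoted in the jq-like DSL comment use the same family of selectors. The following is a minimal sketch, assuming htmlq accepts the `[name="value"]` form; `by_attr` is a hypothetical wrapper, not part of htmlq.

```shell
# by_attr FILE NAME VALUE
# Print the elements in FILE whose attribute NAME is exactly VALUE.
# Hypothetical wrapper, not part of htmlq. VALUE is pasted into the
# selector unescaped, so it must not contain double quotes or backslashes.
by_attr() {
    htmlq --filename "$1" "[$2=\"$3\"]"
}
```

With this, `by_attr page.html data-testid foobar` runs `htmlq --filename page.html '[data-testid="foobar"]'`, where `page.html` stands for any saved page.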
Owner
Michael Maclean