WIP: Parse archived parler pages into structured html

Christopher Tarquini

Last update: Feb 16, 2021

Overview

parler-parse

Parler HTML goes in (stdin), structured JSON comes out (stdout)

Might be useful for feeding into Elasticsearch or cross-referencing with the video/images dump.

Usage

You will need a Rust compiler (the easiest way to get one is via rustup) to build from source. After that, run the following commands in your terminal:

# clone the repo
git clone https://github.com/ilsken/parler-parse.git && cd parler-parse

# run the example
cargo run < examples/echo--parent-no-comment.html

CLI options

USAGE:
    parler-indexer [FLAGS] [OPTIONS] [--] [path]...

FLAGS:
    -c, --compact      Output compact (single line) JSON. Defaults to true if stdin is not a terminal
    -h, --help         Prints help information
    -r, --recursive    Recursively search directories
    -V, --version      Prints version information

OPTIONS:
        --fail-log <fail file>              Write failed paths to a file
        --paths-from-file <path file>...    Read paths from a file
        --success-log <success file>        Write successfully processed paths to a file

ARGS:
    <path>...    HTML File(s) or directory of HTML File(s) to parse

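As an illustration of what the -r / --recursive flag implies, here is a minimal std-only sketch that collects HTML files from a path. The function name collect_html_files and the ".html" extension filter are assumptions made for this example; the project's own directory walker (which the roadmap says uses crossbeam + rayon) may behave differently.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect `.html` files under `root`.
/// Hypothetical helper for illustration; not taken from parler-parse.
fn collect_html_files(root: &Path, recursive: bool) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();

    // A path that names a file directly is accepted as-is.
    if root.is_file() {
        found.push(root.to_path_buf());
        return Ok(found);
    }

    // Iterative walk: directories still to visit are kept on a stack.
    let mut pending = vec![root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                // Only descend into subdirectories when asked to.
                if recursive {
                    pending.push(path);
                }
            } else if path
                .extension()
                .map_or(false, |ext| ext.eq_ignore_ascii_case("html"))
            {
                found.push(path);
            }
        }
    }

    // Sort so the result does not depend on directory iteration order.
    found.sort();
    Ok(found)
}
```

Without the recursive flag this sketch only looks at the files directly inside the given directory.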
Where do I get the archives?

This project was developed against the "partial parler post text" archive that is available from Distributed Denial of Secrets.

Currently parses:

- OG Meta
- Posts + Echoes
  - Author (username + name + avatar + badge)
  - Body
  - Media Attachments (URL, Title, Excerpt, Type, ID (numeric and base62/hex encoded))
- Comments + Replies + Engagements
- Metrics (impressions, echoes, comment count, etc.)
- All mentioned usernames in the post
- Profile pages + all posts
- Estimated timestamp offset ("3 days ago" -> -259200, i.e. minus 3 days in seconds)
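The timestamp offset in the last item can be sketched roughly as follows. This is a minimal illustration, not the project's parser: the function name approx_ts_offset is borrowed from the JSON field of the same name, and the accepted phrase shape ("<count> <unit> ago") and the list of units are assumptions. Real Parler pages may use other phrasings.

```rust
/// Convert a relative timestamp such as "3 days ago" into a negative
/// offset in seconds.
/// Assumption: the phrase has the shape "<count> <unit> ago"; anything
/// else yields None.
fn approx_ts_offset(rel_ts: &str) -> Option<i64> {
    let mut parts = rel_ts.split_whitespace();
    let count: u32 = parts.next()?.parse().ok()?;
    let unit = parts.next()?.to_ascii_lowercase();

    // Require exactly one trailing word, and it must be "ago".
    if parts.next()? != "ago" || parts.next().is_some() {
        return None;
    }

    // Accept singular and plural unit names ("day" / "days").
    let seconds_per_unit: i64 = match unit.trim_end_matches('s') {
        "second" => 1,
        "minute" => 60,
        "hour" => 3_600,
        "day" => 86_400,
        "week" => 604_800,
        _ => return None,
    };

    // The post lies in the past, so the offset is negative.
    Some(-(i64::from(count) * seconds_per_unit))
}
```

With this sketch, "2 days ago" maps to -172800, which matches the rel_ts / approx_ts_offset pair in the example output.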

Roadmap

✅ Bug: Author field will be null if a user just echoed a post (only the author of the echoed post is present). We can populate it from the OG meta title field
✅ Multi-threaded, recursive directory processing (crossbeam + rayon)
✅ Allow bulk / multi-threaded processing of all files in a directory for quick import into Elasticsearch/MeiliSearch/Tantivy
[TODO] Add file metadata (create/modified date/path)
[TODO] WARC support + metadata
[TODO] Fix up timestamps based on metadata

Example output

{
  "opengraph_meta": {
    "title": "@AnthonyDaubs - AnthonyDaubs -",
    "owner": {
      "name": "AnthonyDaubs",
      "username": "@AnthonyDaubs"
    },
    "url": "/post/8c36602d9568482dacfc55d9b63d5a07",
    "image_url": "https://images.parler.com/af00acf47ba74651998fb9676aabd117_256"
  },
  "posts": [
    {
      "echo_by": null,
      "cards": [
        {
          "kind": "Post",
          "author": {
            "name": "AnthonyDaubs",
            "username": "@AnthonyDaubs",
            "avatar": {
              "url_raw": "https://images.parler.com/af00acf47ba74651998fb9676aabd117_256",
              "url": "https://images.parler.com/af00acf47ba74651998fb9676aabd117_256",
              "host": "images.parler.com",
              "is_external": false,
              "id": "af00acf47ba74651998fb9676aabd117"
            }
          },
          "rel_ts": "2 days ago",
          "approx_ts_offset": -172800,
          "body": "",
          "impression_count": 3,
          "is_sensitive_content": true,
          "media_items": [
            {
              "kind": "Video",
              "title": "",
              "link": {
                "label": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "url_raw": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "url": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "host": "video.parler.com",
                "is_external": false,
                "id": "Q2s5oVN1pfgk",
                "id_b62_dec": 1355361448748163000000
              },
              "excerpt": "",
              "source": {
                "label": "",
                "url_raw": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "url": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "host": "video.parler.com",
                "is_external": false,
                "id": "Q2s5oVN1pfgk",
                "id_b62_dec": 1355361448748163000000
              },
              "numeric_id": null
            }
          ]
        }
      ],
      "comments": [],
      "post_id": null,
      "mentions": [],
      "engagements": {
        "comment_count": 0,
        "echo_count": 0,
        "upvote_count": 0
      }
    }
  ]
}
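The id_b62_dec field in the output above is the media ID read as a base62 number. The sketch below shows one way to decode it. The digit order (0-9, then A-Z, then a-z) is an assumption, chosen because it reproduces the leading digits of the sample value for "Q2s5oVN1pfgk"; the function name base62_decode is hypothetical. A 12-character ID does not fit in 64 bits, so the sketch uses u128. The trailing zeros in the sample suggest the number was rounded to floating-point precision somewhere along the way, so only the leading digits are comparable.

```rust
/// Decode a base62 string into a u128.
/// Assumption: digits are ordered 0-9, then A-Z, then a-z.
/// Returns None for empty input, invalid characters, or overflow.
fn base62_decode(id: &str) -> Option<u128> {
    if id.is_empty() {
        return None;
    }
    let mut value: u128 = 0;
    for byte in id.bytes() {
        let digit = match byte {
            b'0'..=b'9' => byte - b'0',
            b'A'..=b'Z' => byte - b'A' + 10,
            b'a'..=b'z' => byte - b'a' + 36,
            _ => return None,
        };
        // Shift the accumulated value by one base62 place, then add the digit.
        value = value.checked_mul(62)?.checked_add(u128::from(digit))?;
    }
    Some(value)
}
```

If you store these IDs in a search index, keeping the original string alongside the decoded number avoids depending on the rounded value.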

License

MIT licensed, feel free to use it. If you want to use it for research, I'd love to hear about it and help if I can. Shoot me an email or message me on Twitter (@chris_tarquini).


Comments

Add minimal usage example
Hi, I'm new to Rust, could you please provide a minimal guide on how to use the script from a terminal? I managed to install rust and build the project (I guess) but it seems I'm not able to run it.

My commands were:

cargo build
cargo run /examples/echo--parent-no-comment.html

Am I missing something?

Thanks in advance, Giulio
opened by GiulioRossetti 5
Add echo timestamp / by line
Add parsing for echo metadata (https://github.com/dwillis/parler-parse/commit/457c0bcb10176ffdbbd84e9dfac51dfbf4cae8ae) with the new RelTimestamp type.

I would also like to add a "fix timestamps" option that gives you an estimate by either:

Taking a "scrape time" argument and using that to calculate an estimated real timestamp

Using file metadata (path names with timestamps in them or create date)

Accepting WARC files as input and using that metadata
opened by tarqd 1
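The first option in the issue above (a "scrape time" argument) amounts to adding the relative offset to the time the page was scraped. A minimal sketch, assuming the offset is in seconds as in the approx_ts_offset field; the function names estimate_post_time and unix_seconds are hypothetical and this option is not described as implemented in the source.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Estimate when a post was made from the scrape time and a relative
/// offset in seconds (negative for "N days ago").
/// Returns None if the result cannot be represented as a SystemTime.
fn estimate_post_time(scrape_time: SystemTime, approx_ts_offset: i64) -> Option<SystemTime> {
    let delta = Duration::from_secs(approx_ts_offset.unsigned_abs());
    if approx_ts_offset < 0 {
        scrape_time.checked_sub(delta)
    } else {
        scrape_time.checked_add(delta)
    }
}

/// Express a SystemTime as whole seconds since the Unix epoch.
fn unix_seconds(time: SystemTime) -> Option<u64> {
    time.duration_since(UNIX_EPOCH).ok().map(|d| d.as_secs())
}
```

The estimate is only as precise as the relative label: "2 days ago" could be anywhere within roughly a day of the computed time.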

Videos showing as anchors

{
  "kind": "Anchor",
  "label": "https://video.parler.com/Nr/4w/Nr4wNawMApbI_small.mp4",
  "location": "https://video.parler.com/Nr/4w/Nr4wNawMApbI_small.mp4",
  "id": "Nr4wNawMApbI"
}

Data is still good but the kind is wrong :/

opened by tarqd 1
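One possible way to address the issue above is to derive the kind from the URL itself. The heuristic below is a hypothetical sketch, not the project's fix: the LinkKind enum and guess_link_kind function are invented for illustration, and the rule (host video.parler.com or an .mp4 path means "Video") is an assumption based only on the URLs shown on this page.

```rust
#[derive(Debug, PartialEq)]
enum LinkKind {
    Video,
    Anchor,
}

/// Guess a link kind from its URL.
/// Hypothetical heuristic: Parler's video host or an .mp4 path is a video;
/// everything else stays an anchor.
fn guess_link_kind(url: &str) -> LinkKind {
    // Drop any query string or fragment before inspecting the path.
    let path = url.split(|c| c == '?' || c == '#').next().unwrap_or(url);
    let lower = path.to_ascii_lowercase();

    if lower.starts_with("https://video.parler.com/") || lower.ends_with(".mp4") {
        LinkKind::Video
    } else {
        LinkKind::Anchor
    }
}
```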
