WIP: Parse archived Parler pages into structured JSON

parler-parse

Parler HTML goes in (stdin), structured JSON comes out (stdout)

Might be useful for feeding into Elasticsearch or cross-referencing with the video/images dump.

Usage

You will need a Rust compiler (the easiest way to get one is via rustup) to build from source. After that, run the following commands in your terminal:

# clone the repo
git clone https://github.com/ilsken/parler-parse.git && cd parler-parse

# run the example
cargo run < examples/echo--parent-no-comment.html

CLI options

USAGE:
    parler-indexer [FLAGS] [OPTIONS] [--] [path]...

FLAGS:
    -c, --compact      Output compact (single line) JSON. Defaults to true if stdin is not a terminal
    -h, --help         Prints help information
    -r, --recursive    Recursively search directories
    -V, --version      Prints version information

OPTIONS:
        --fail-log <fail file>              Write failed paths to a file
        --paths-from-file <path file>...    Read paths from a file
        --success-log <success file>        Write successfully processed paths to a file

ARGS:
    <path>...    HTML File(s) or directory of HTML File(s) to parse

Where do I get the archives?

This project was developed against the "partial parler post text" archive that is available from Distributed Denial of Secrets.

Currently parses:

  • OG Meta
  • Posts + Echos
    • Author (username + name + avatar + badge)
    • Body
    • Media Attachments (Url, Title, Excerpt, Type, ID (numeric and base62/hex encoded))
  • Comments + Replies + Engagements
  • Metrics (impressions, echoes, comment count, etc)
  • All mentioned usernames in the post
  • Profile pages + all posts
  • Estimated timestamp offset in seconds ("3 days ago" -> -259200, i.e. minus 3 days)
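
To make the timestamp offset concrete, here is a minimal sketch of how a relative timestamp could be turned into a signed number of seconds. It is not the parser's actual code: the function names, the list of accepted units, and the scrape time used in `main` are assumptions made for illustration. Months and years are left out because their length in seconds is ambiguous.

```rust
/// Convert a relative timestamp such as "3 days ago" into a signed offset in
/// seconds (negative means "in the past"). Returns None for anything else.
/// The accepted units are an assumption, not taken from parler-parse.
fn approx_offset_seconds(rel: &str) -> Option<i64> {
    let mut parts = rel.split_whitespace();
    let amount: i64 = parts.next()?.parse().ok()?;
    // Accept both "day" and "days" by dropping a trailing plural "s".
    let unit = parts.next()?.trim_end_matches('s');
    // The third word must be "ago" and nothing may follow it.
    if parts.next()? != "ago" || parts.next().is_some() {
        return None;
    }
    let unit_seconds: i64 = match unit {
        "second" => 1,
        "minute" => 60,
        "hour" => 3_600,
        "day" => 86_400,
        "week" => 7 * 86_400,
        _ => return None,
    };
    amount.checked_mul(unit_seconds).map(|seconds| -seconds)
}

/// Estimate an absolute Unix timestamp from the time the page was scraped.
fn estimate_unix_ts(scrape_unix_ts: i64, offset_seconds: i64) -> i64 {
    scrape_unix_ts + offset_seconds
}

fn main() {
    let offset = approx_offset_seconds("2 days ago");
    println!("{:?}", offset); // prints Some(-172800)

    // Hypothetical scrape time, only here to show how the offset is applied.
    let scrape_unix_ts = 1_610_000_000;
    if let Some(offset) = offset {
        println!("{}", estimate_unix_ts(scrape_unix_ts, offset)); // prints 1609827200
    }
}
```

The estimate is only as precise as the relative timestamp: "2 days ago" could be anywhere within a 24 hour window.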

Roadmap

  • Bug: Author field will be null if a user just echoed a post (the output only has the author of the echoed post). We can populate it from the OG meta title field
  • Multi-threaded, recursive directory processing (crossbeam + rayon)
  • Allow bulk / multi-threaded processing for all files in a directory for quickly importing into Elasticsearch/Meilisearch/Tantivy
  • [TODO] Add file metadata (create/modified date/path)
  • [TODO] WARC support + metadata
  • [TODO] Fix up timestamps based on metadata
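
As a rough illustration of the multi-threaded, recursive directory processing item, the sketch below walks a directory tree and spreads the files over a few threads. It deliberately uses only the standard library (`std::thread::scope`) rather than the crossbeam + rayon combination named above, so it shows the shape of the idea and not the planned implementation. The function names are hypothetical, and the "work" done per file is a placeholder that only checks the file is non-empty.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread;

/// Recursively collect every `.html` file below `root`.
/// Note: `is_dir` follows symlinks, so a cyclic symlink would recurse forever.
fn collect_html_files(root: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_html_files(&path, out)?;
        } else if path.extension().and_then(|ext| ext.to_str()) == Some("html") {
            out.push(path);
        }
    }
    Ok(())
}

/// Run `work` over `paths` on up to `workers` threads and return how many
/// calls reported success.
fn process_in_parallel<F>(paths: &[PathBuf], workers: usize, work: F) -> usize
where
    F: Fn(&Path) -> bool + Sync,
{
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `chunks(0)` would panic.
    let chunk_size = ((paths.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = paths
            .chunks(chunk_size)
            .map(|chunk| {
                let work = &work;
                scope.spawn(move || chunk.iter().filter(|p| work(p.as_path())).count())
            })
            .collect();
        // A panicked worker contributes 0 successes.
        handles
            .into_iter()
            .map(|handle| handle.join().unwrap_or(0))
            .sum::<usize>()
    })
}

fn main() -> io::Result<()> {
    let root = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    let mut files = Vec::new();
    collect_html_files(Path::new(&root), &mut files)?;

    // Placeholder for real parsing: a file "succeeds" if it is readable and non-empty.
    let ok = process_in_parallel(&files, 4, |path| {
        fs::read_to_string(path).map(|html| !html.is_empty()).unwrap_or(false)
    });
    println!("{} of {} files processed", ok, files.len());
    Ok(())
}
```

Splitting the file list into fixed chunks is the simplest scheme; a work-stealing pool such as rayon would balance better when file sizes vary a lot.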

Example output

{
  "opengraph_meta": {
    "title": "@AnthonyDaubs - AnthonyDaubs -",
    "owner": {
      "name": "AnthonyDaubs",
      "username": "@AnthonyDaubs"
    },
    "url": "/post/8c36602d9568482dacfc55d9b63d5a07",
    "image_url": "https://images.parler.com/af00acf47ba74651998fb9676aabd117_256"
  },
  "posts": [
    {
      "echo_by": null,
      "cards": [
        {
          "kind": "Post",
          "author": {
            "name": "AnthonyDaubs",
            "username": "@AnthonyDaubs",
            "avatar": {
              "url_raw": "https://images.parler.com/af00acf47ba74651998fb9676aabd117_256",
              "url": "https://images.parler.com/af00acf47ba74651998fb9676aabd117_256",
              "host": "images.parler.com",
              "is_external": false,
              "id": "af00acf47ba74651998fb9676aabd117"
            }
          },
          "rel_ts": "2 days ago",
          "approx_ts_offset": -172800,
          "body": "",
          "impression_count": 3,
          "is_sensitive_content": true,
          "media_items": [
            {
              "kind": "Video",
              "title": "",
              "link": {
                "label": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "url_raw": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "url": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "host": "video.parler.com",
                "is_external": false,
                "id": "Q2s5oVN1pfgk",
                "id_b62_dec": 1355361448748163000000
              },
              "excerpt": "",
              "source": {
                "label": "",
                "url_raw": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "url": "https://video.parler.com/Q2/s5/Q2s5oVN1pfgk_small.mp4",
                "host": "video.parler.com",
                "is_external": false,
                "id": "Q2s5oVN1pfgk",
                "id_b62_dec": 1355361448748163000000
              },
              "numeric_id": null
            }
          ]
        }
      ],
      "comments": [],
      "post_id": null,
      "mentions": [],
      "engagements": {
        "comment_count": 0,
        "echo_count": 0,
        "upvote_count": 0
      }
    }
  ]
}
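
The `id_b62_dec` field in the output above is the media `id` read as a base62 number. Below is a small sketch of that decoding, assuming the alphabet order 0-9, A-Z, a-z; that order is an assumption, but it does reproduce the leading digits of the example value for "Q2s5oVN1pfgk". The function name is hypothetical. A 12-character ID can exceed both `u64` and the exact integer range of a 64-bit float, which may be why the value in the example ends in a run of zeros, so the sketch uses `u128`.

```rust
/// Decode a base62 string into an integer, assuming the digit order
/// 0-9, A-Z, a-z. Returns None on an invalid character or on overflow.
/// An empty string decodes to Some(0).
fn decode_base62(id: &str) -> Option<u128> {
    let mut value: u128 = 0;
    for byte in id.bytes() {
        let digit = match byte {
            b'0'..=b'9' => byte - b'0',
            b'A'..=b'Z' => byte - b'A' + 10,
            b'a'..=b'z' => byte - b'a' + 36,
            _ => return None,
        };
        // Shift the accumulated value one base62 place and add the new digit.
        value = value.checked_mul(62)?.checked_add(u128::from(digit))?;
    }
    Some(value)
}

fn main() {
    match decode_base62("Q2s5oVN1pfgk") {
        Some(value) => println!("{}", value),
        None => println!("not a valid base62 id"),
    }
}
```

If you store these IDs in a JSON document, consider keeping the decoded value as a string so that consumers that read numbers as floats do not lose the low digits.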

License

MIT licensed, feel free to use it. If you want to use it for research, I'd love to hear about it and help if I can. Shoot me an email or message me on Twitter (@chris_tarquini).

Comments
  • Add minimal usage example

    Hi, I'm new to Rust, could you please provide a minimal guide on how to use the script from a terminal? I managed to install rust and build the project (I guess) but it seems I'm not able to run it.

    My commands were:

    cargo build
    cargo run /examples/echo--parent-no-comment.html
    

    Am I missing something?

    Thanks in advance, Giulio

    opened by GiulioRossetti 5
  • Add echo timestamp / by line

    Add parsing for echo metadata (https://github.com/dwillis/parler-parse/commit/457c0bcb10176ffdbbd84e9dfac51dfbf4cae8ae) using the new RelTimestamp type.

    I would also like to add a "fix timestamps" option that gives an estimate by either:

    • Taking a "scrape time" argument and using that to calculate an estimated real timestamp
    • Using file metadata (path names with timestamps in them or create date)
    • Accepting WARC files as input and using that metadata
    opened by tarqd 1
  • Videos showing as anchors

    {
      "kind": "Anchor",
      "label": "https://video.parler.com/Nr/4w/Nr4wNawMApbI_small.mp4",
      "location": "https://video.parler.com/Nr/4w/Nr4wNawMApbI_small.mp4",
      "id": "Nr4wNawMApbI"
    }
    

    Data is still good but the kind is wrong :/

    opened by tarqd 1
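
For the "Videos showing as anchors" report, one possible approach is to decide the kind from the link itself instead of from the surrounding markup. The sketch below is not the project's fix: the `LinkKind` enum, the `classify_link` function, and the list of video file extensions are all assumptions. The two hosts are the ones that appear in the example output.

```rust
#[derive(Debug, PartialEq)]
enum LinkKind {
    Video,
    Image,
    Anchor,
}

/// Guess the kind of a link from its host and file extension.
fn classify_link(url: &str) -> LinkKind {
    // Drop the scheme ("https://"), then separate host from path.
    let rest = url.split_once("://").map_or(url, |(_, rest)| rest);
    let (host, path) = rest.split_once('/').unwrap_or((rest, ""));
    // Ignore any query string or fragment when looking at the extension.
    let path = path
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("")
        .to_ascii_lowercase();

    // Assumed list of video extensions; extend as needed.
    let video_exts = [".mp4", ".webm", ".mov"];
    let has_video_ext = video_exts.iter().any(|ext| path.ends_with(*ext));

    if host == "video.parler.com" || has_video_ext {
        LinkKind::Video
    } else if host == "images.parler.com" {
        LinkKind::Image
    } else {
        LinkKind::Anchor
    }
}

fn main() {
    let url = "https://video.parler.com/Nr/4w/Nr4wNawMApbI_small.mp4";
    println!("{:?}", classify_link(url)); // prints Video
}
```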
Owner
Christopher Tarquini
System Security Engineer at @linode. Pretty good at Scrabble.