Anglosaxon is a command line tool to parse XML files using SAX

Overview

anglosaxon - Convert large XML files to other formats

anglosaxon is a command line tool to parse XML files using SAX. It performs simple transformations of XML files into other textual formats in a streaming fashion. Because it uses SAX, it doesn't load the entire XML file into memory before processing, so it can work with large XML files, such as some of the OpenStreetMap data dump files.

Example Usage

bzcat ~/osm/data/changeset-examples.osm.bz2  | anglosaxon -S -o changeset_id,tag_key,tag_value --nl -s tag -v ../id -o,  -v k -o , -v v --nl

This converts the OSM changeset dump file to a CSV of changeset_id, changeset_tag_key, changeset_tag_value, allowing you to use standard Unix tools to analyze OSM changesets. As of January 2022, the changesets file is 4 GB when bzip2-compressed (40+ GB of uncompressed XML), which is too large for DOM-based tools.
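To make the behaviour of that command concrete, here is a minimal Python sketch of the same transformation using the standard library's `xml.sax`. It is not how anglosaxon is implemented; the class name `TagCsvHandler` and the function `tags_to_csv` are hypothetical names chosen for illustration. Like the command above, it does no CSV quoting, so values containing commas would need extra handling.

```python
import io
import xml.sax


class TagCsvHandler(xml.sax.ContentHandler):
    """Write one comma-separated line per <tag>, using the parent's id."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.stack = []  # attribute dicts of every currently open element

    def startDocument(self):
        # Corresponds to: -S -o changeset_id,tag_key,tag_value --nl
        self.out.write("changeset_id,tag_key,tag_value\n")

    def startElement(self, name, attrs):
        if name == "tag":
            # Corresponds to: -s tag -v ../id -o, -v k -o , -v v --nl
            parent = self.stack[-1]  # IndexError if <tag> has no parent
            self.out.write(f"{parent['id']},{attrs['k']},{attrs['v']}\n")
        self.stack.append(dict(attrs.items()))

    def endElement(self, name):
        self.stack.pop()


def tags_to_csv(xml_bytes):
    """Stream-parse xml_bytes and return the generated text."""
    out = io.StringIO()
    xml.sax.parse(io.BytesIO(xml_bytes), TagCsvHandler(out))
    return out.getvalue()
```

Only the attributes of the currently open elements are kept in memory, which is why this style of processing scales to very large inputs.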

Installation

cargo install anglosaxon

Documentation

anglosaxon reads an XML file from stdin and writes to stdout.

Output is controlled by the CLI flags. Specify a SAX event with -S/-s/-e/-E, and then one or more output actions to take for that event. Unlike most CLI programmes, the order of flags is relevant.

SAX events

  • -S/--startdoc: Happens once at the start of the XML document
  • -s TAG/--start TAG: Happens when TAG is opened, i.e. at the start of the tag. The XML attributes on this tag are available
  • -e TAG/--end TAG: Happens when TAG is closed, i.e. at the end of the tag
  • -E/--end: Happens once at the end of the XML document

XML Tag names are simple strings.
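As an illustration of when these four events fire, the following sketch records them in order using Python's standard `xml.sax` module. The labels `startdoc`, `start`, `end`, and `enddoc` in the recorded list are names made up for this example, and `EventLogger` and `list_events` are hypothetical helpers, not part of anglosaxon.

```python
import io
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record SAX events in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startdoc")  # analogous to -S

    def startElement(self, name, attrs):
        self.events.append(f"start {name}")  # analogous to -s TAG

    def endElement(self, name):
        self.events.append(f"end {name}")  # analogous to -e TAG

    def endDocument(self):
        self.events.append("enddoc")  # analogous to -E


def list_events(xml_bytes):
    """Parse xml_bytes and return the list of recorded event labels."""
    handler = EventLogger()
    xml.sax.parse(io.BytesIO(xml_bytes), handler)
    return handler.events
```

For the input `<a><b/></a>`, `list_events` returns `['startdoc', 'start a', 'start b', 'end b', 'end a', 'enddoc']`: inner tags open and close between the start and end events of their parent.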

Actions to take

One or more actions can be specified and are processed in the order you give.

  • -o TEXT: Print TEXT as is
  • --nl: Print a newline
  • --tab: Print a tab
  • -v ATTRIBUTE: Print the value of this XML attribute. An error happens if the tag doesn't have that attribute
  • -V ATTRIBUTE DEFAULT: Print the value of this XML attribute, and DEFAULT if that attribute doesn't exist.

XML Attributes are plain text. Parent node attributes are specified by ../ATTRIBUTE (e.g. ../../id is the id attribute of the XML node that's the parent of the parent of the current XML node). An error occurs if this required parent doesn't exist.
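The `../` lookup can be pictured as walking up a stack of open elements. The sketch below is one possible way to model it in Python, not anglosaxon's actual code; `resolve` and its parameters are hypothetical. It assumes that a default (as with -V) covers only a missing attribute and that a missing parent is always an error, which the description above does not spell out.

```python
def resolve(path, stack, default=None):
    """Look up an attribute path such as "k", "../id" or "../../id".

    stack is a list of attribute dicts for the open elements,
    outermost first, with the current element last.
    """
    levels = 0
    while path.startswith("../"):
        path = path[3:]
        levels += 1
    if levels >= len(stack):
        # Assumption: a missing parent is an error even when a default is given.
        raise LookupError(f"no parent {levels} level(s) up")
    attrs = stack[-1 - levels]
    if path in attrs:
        return attrs[path]
    if default is not None:
        return default  # behaves like -V ATTRIBUTE DEFAULT
    raise KeyError(path)  # behaves like -v ATTRIBUTE on a missing attribute
```

With `stack = [{"id": "7"}, {"k": "a", "v": "b"}]`, `resolve("../id", stack)` returns `"7"` and `resolve("missing", stack, default="-")` returns `"-"`.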

Similar Projects

  • xmlstarlet's sel/selection functionality was the inspiration, but it is unable to handle large XML files.

Owner
Amanda