wikt

Experimental playground for Wiktionary data.

This document might not update as often as the code does.

Set up

You'll want a minimum of 10 GB of free space, a decent internet connection to download the dump (it's about 1 GB), an SSD (the tooling is very dependent on I/O), and a multi-core processor (it's very parallel). It will work on a single core; just multiply every duration below by 8–10×.

Install

As this is an experiment/playground, it's probably best to clone this repo locally.

Then call the tool with cargo run --release -- OPTIONS. For brevity, below I write wikt OPTIONS, but I'm actually using the cargo invocation.

Building without --release compiles somewhat faster, but is 40–100× slower at processing data, so it's really not worth it.

A useful global option is -V, which takes a log level. Set it to debug for a little more logging, to trace for a lot more (including dumps of intermediate data), or to warn or error to suppress the default (info) logging.
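
For example, to enable debug logging (assuming the global flag goes before the subcommand):

wikt -V debug OPTIONS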

To get a dump:

  1. Go to https://dumps.wikimedia.org/enwiktionary/
  2. Select the penultimate dated folder; the latest one might still be incomplete. (Use the latest if you're sure it's complete.)
  3. You want the pages-articles-multistream XML file, not the index file; we don't use that.
  4. Download it and unpack it (see the example below).

It might be helpful to keep the packed version around so you can just delete the unpacked file once done with it, to save space and time. Alternatively, you can repack it with zstd; it will be faster to decompress should you need it again and takes up about the same space. If you're working on btrfs you can probably use transparent zstd compression for the same effect without having to unpack again.
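
For example, something like this (the date in the path is illustrative; check the folder listing for the real one):

wget https://dumps.wikimedia.org/enwiktionary/20240501/enwiktionary-20240501-pages-articles-multistream.xml.bz2
bunzip2 -k enwiktionary-20240501-pages-articles-multistream.xml.bz2

The -k flag keeps the .bz2 around, so you can delete the unpacked XML once you're done with it.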

The tooling is focussed on the English Wiktionary; it may or may not work on other languages.

Do once: extract the dump into the store

The dump is a single massive XML file. That's pretty much impossible to query, or do anything else with, in any kind of efficient way.

So, the first step with wikt is to extract just the useful information into a custom binary format, which I designed to be trivial to parse and massively parallelisable, with some random-access capability as well. That's stored in (by default) a store folder, and actually works out a bit smaller than the dump itself. At writing, my store folder contains 1472 Zstandard-compressed files in this custom format.

You generate the store with:

wikt store make path/to/dump.xml

This will take hours.

Each file in the store is called a "block"; each block contains up to 10k "entries", which hold the raw title and body of a wiktionary page. Blocks have a short header with the number of entries within and an array of byte offsets into the subsequent data section where each entry starts. Blocks are zstd-compressed by wikt, with a dictionary trained on the first block. Entries have a header with two byte lengths, one each for the title and body data.
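
As a rough Rust sketch of that layout (field names and integer widths here are illustrative assumptions, not the actual wikt format):

    // Illustrative sketch of the block layout described above; the real
    // format's field names and integer widths may differ.
    struct BlockHeader {
        entry_count: u32,   // number of entries in this block (up to 10k)
        offsets: Vec<u32>,  // byte offset of each entry within the data section
    }

    struct EntryHeader {
        title_len: u32, // byte length of the title
        body_len: u32,  // byte length of the body
    }
    // An entry is its header followed by the title bytes, then the body bytes.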

So you can read an entry given the name of the block and the number of the entry within that block. That's expressed as a "ref" or "refid": two u32s separated by a slash in the human/textual form, or a u64 containing the concatenation of the two u32s in machine form.
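
A minimal sketch of that packing, assuming the block number sits in the high 32 bits (wikt's actual bit order may differ):

    // Illustrative only: pack/unpack a refid; wikt's actual bit order may differ.
    fn pack(block: u32, entry: u32) -> u64 {
        ((block as u64) << 32) | entry as u64
    }

    fn unpack(refid: u64) -> (u32, u32) {
        ((refid >> 32) as u32, refid as u32)
    }

    // The textual form "10000/1234" corresponds to pack(10000, 1234).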

And you can read all entries by iterating (in parallel) over the entire store folder, opening each block, decompressing it, and, after parsing the block header, parsing every entry in parallel.
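
In Rust terms, the scan might look roughly like this, using the rayon and zstd crates; error handling, the trained dictionary, and entry parsing are elided:

    // Rough sketch of the parallel scan; the trained dictionary and
    // entry parsing are omitted for brevity.
    use rayon::prelude::*;
    use std::path::PathBuf;

    fn scan_store(blocks: Vec<PathBuf>) {
        blocks.par_iter().for_each(|path| {
            let compressed = std::fs::read(path).expect("read block file");
            let _data = zstd::decode_all(&compressed[..]).expect("decompress block");
            // parse the block header from _data here, then parse every entry;
            // entry parsing can itself be parallelised within the block
        });
    }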

It could be made faster by decompressing only the block header and then seeking to the required position (for random access), or by chunking along byte offsets and parsing each chunk out of the zstd stream (for sequential access). There might also be facilities in the zstd format itself for this pattern of use that we don't currently take advantage of.

Query the store

You can search the store for substrings in the text of entries, or for negative matches. This is fairly slow for specific queries because it literally iterates the entire store and runs the matcher on every entry, but on my machine a single substring search takes ~30 seconds to run through it all, so it's not that terrible.

You can query the store while it's extracting the dump.

wikt store query word "phrase with spaces" ~negative

Each entry returned is just the title prefixed by the refid in [brackets]; you can use that refid to get the full text of the entry:

wikt store get 10000/1234

You can use the --count flag to return the number of matched entries instead; this is faster simply by virtue of not having to write output for every entry.
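
For example, something like:

wikt store query --count word "phrase with spaces"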

Build the index

Once you have a full store, you can build the index:

wikt index make

This will take 5–15 minutes.

This is one of the places where you can play around, by changing how the index is built. To make it easier to see the effect of changes, there's a --limited N option: set N to e.g. 10 and it will stop after reading 10 blocks into the index.
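
For example:

wikt index make --limited 10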

Each entry is read at least once into the index. A "document" is an indexed entry or subentry. As of writing, the full index reads ~7.3 million entries and indexes out to ~20 million documents.

The index stores the title of each entry, so it can return titles fast, and the refid, so the text of an entry can be fetched from the store. The full text of the entry is not stored, which saves considerable space. Still, as of writing, the index was several gigabytes in size.

Query the index

You pass a Tantivy full-text query, and it returns the top-scored results.

wikt index query 'star system'

You can query phrases with "phrase", mark a term as required with a + prefix, or exclude a term with a - prefix. See the Tantivy documentation for more.

By default it searches in the body; you can query specific fields with field:expression. For example, to get results for English nouns:

wikt index query '+lang:english +gram:noun'

Obviously the fields depend on how you built your index.

You can't query an index with different fields than the ones it was created with, so if you make changes to the schema you'll need to rebuild the index before querying. Unlike the store, you can't query the index until changes are committed, and the index make process only commits once, at the end.

The query output contains a bit of metadata:

score=24.352657 [2440000/439] (english/?) double star system
        ===Noun=== {{en-noun|head=[[double]] [[star system]]}}  # {{lb|en|star}} a [[bi…

That's:

  • the index search score
  • the refid
  • the lang/gram indicator (here english language, unset grammatical category)
  • the title of the entry (in the actual output it's in bold)
  • an excerpt (80 chars) of the entry

By default it fetches an excerpt of the text for display. You can have it show the entire entry with --full, or skip fetching the text entirely (which is faster) with --titles.

Use -n to change the number of results returned (default 20).
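
Combining these, something like:

wikt index query -n 5 --titles '+lang:english +gram:noun'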
