A unix-friendly map-reduce parallelization alternative

Related tags

Data processing slb
Overview

slb: sharded load balancer

Like parallel --pipe --roundrobin but load balancing is performed based on input line hashing. When performing keyed aggregations in child processes this is crucial since then only one shard contains a given key. Here's a word count example on a 16-physical-cpu machine:

wikawk.txt # 203.97 sec /usr/bin/time -f "%e sec" slb \ --mapper 'tr " " "\n" | rg -v "^$"' \ --folder "awk '{a[\$0]++}END{for(k in a)print k,a[k]}'" \ --infile enwik9.clean \ --outprefix wikslb. # 6.20 sec diff <(sort wikawk.txt) <(cat wikslb.* | sort) ; echo $? # 0 ">
curl -o enwik9.bz2 https://cs.fit.edu/~mmahoney/compression/enwik9.bz2
bunzip2 enwik9.bz2
examples/clean.sh < enwik9 > enwik9.clean ; rm enwik9

/usr/bin/time -f "%e sec" awk -f examples/wc.awk enwik9.clean > wikawk.txt
# 203.97 sec

/usr/bin/time -f "%e sec" slb \
  --mapper 'tr " " "\n" | rg -v "^$"' \
  --folder "awk '{a[\$0]++}END{for(k in a)print k,a[k]}'" \
  --infile enwik9.clean \
  --outprefix wikslb.
# 6.20 sec

diff <(sort wikawk.txt) <(cat wikslb.* | sort) ; echo $?
# 0

This demonstrates a "flatmap-fold" paradigm over the typical "map-reduce" one.

Each line

a    b c d -> flatmapper 1
f g   a b -> flatmapper 2

is handed off to an independent flat mapper tr " " "\n" | rg -v "^$" which puts a word on each line

flatmapper 1 ->
a
b
c
d

flatmapper 2 ->
f
g
a
b

whose outputs are then inspected line-by-line. The first word of each line is hashed (in this case, the entire line). Assuming hash(a) == hash(b) == 1 and hash(c) == hash(d) == hash(g) == hash(f) == 0 we'll input the corresponding keys from each flatmapper into a couple awk '{a[$0]++}END{for(k in a)print k,a[k]}' folders. And the outputs are then written to output files.

a b a b -> awk 1 -> {a: 2, b: 2} -> outprefix1
f g c d -> awk 0 -> {f: 1, g: 1, c: 1, d: 1} -> outprefix0

Feature Frequency Calculation

Here's an example of counting the frequency of features in sparse SVMlight format of a large dataset, benchmarked on the large KDD12 dataset on a 16-physical-cpu machine (assumes ripgrep, GNU Parallel are installed).

results-awk.txt # 1032.18 sec 13721032 KB /usr/bin/time -f "%e sec %M KB" slb \ --mapper 'sed -E "s/^[^ ]+ //" | sed -E "s/:[^ ]+//g" | tr " " "\n" | rg -v "^$"' \ --folder "awk '{a[\$0]++}END{for(k in a)print k,a[k]}'" \ --infile kdd12.tr \ --outprefix results-slb. # 122.50 sec 881436 KB # note above doesn't count child memory # eyeballing htop, max memory use is ~12.3GB # check we're correct cat results-slb.* > results-slb && rm results-slb.* sort --parallel=$(($(nproc) / 2)) -k2nr -k1n -o results-slb results-slb & \ sort --parallel=$(($(nproc) / 2)) -k2nr -k1n -o results-awk.txt results-awk.txt & \ wait diff results-slb results-awk.txt >/dev/null ; echo $? # 0 ">
echo 'will cite' | parallel --citation 1>/dev/null 2>/dev/null
curl -o kdd12.tr.bz2 "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kdd12.tr.bz2"
bunzip2 kdd12.tr.bz2
du -hs kdd12.tr 
# 17G     kdd12.tr
parallel --pipepart -a kdd12.tr wc -l | awk '{a+=$0}END{print a}'
# num rows: 119705032
parallel --pipepart -a kdd12.tr wc -w | awk '{a+=$0}END{print a}'
# num nnz: 1436460384 - 119705032 = 1316755352

/usr/bin/time -f "%e sec %M KB" awk -f examples/svm-featurecount.awk kdd12.tr > results-awk.txt
# 1032.18 sec 13721032 KB

/usr/bin/time -f "%e sec %M KB" slb \
  --mapper 'sed -E "s/^[^ ]+ //" | sed -E "s/:[^ ]+//g" | tr " " "\n" | rg -v "^$"' \
  --folder "awk '{a[\$0]++}END{for(k in a)print k,a[k]}'" \
  --infile kdd12.tr \
  --outprefix results-slb.
# 122.50 sec 881436 KB
# note above doesn't count child memory
# eyeballing htop, max memory use is ~12.3GB

# check we're correct
cat results-slb.* > results-slb && rm results-slb.*
sort --parallel=$(($(nproc) / 2)) -k2nr -k1n -o results-slb results-slb & \
sort --parallel=$(($(nproc) / 2)) -k2nr -k1n -o results-awk.txt results-awk.txt & \
wait

diff results-slb results-awk.txt >/dev/null ; echo $?
# 0

Count Distinct Feature Values

As another, similar example we could count the number of distinct values for each feature. In particular, for each feature we're looking to get the minimum of its total number of distinct values with 100 (as we might be inclined to consider anything with more than 99 values to be continuous).

cdawk.txt # 388.72 sec 23895104 KB /usr/bin/time -f "%e sec %M KB" slb \ --mapper 'sed -E "s/^[^ ]+ //" | tr " " "\n" | tr ":" " " | rg -v "^$"' \ --folder "awk '{if(!(\$1 in a)||length(a[\$1])<100)a[\$1][\$2]=1}END{for(k in a)print k,length(a[k])}'" \ --infile kdda \ --outprefix cdslb. # 26.79 sec 1499992 KB diff \ <(sort --parallel=$(($(nproc) / 2)) -k2nr -k1n cdawk.txt) \ <(cat cdslb.* | sort --parallel=$(($(nproc) / 2)) -k2nr -k1n) \ > /dev/null ; echo $? # 0 ">
curl -o kdda.bz2 "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kdda.bz2"
bunzip2 kdda.bz2
du -hs kdda
# 2.5G    kdda

/usr/bin/time -f "%e sec %M KB" awk -f examples/svm-countdistinct.awk kdda > cdawk.txt
# 388.72 sec 23895104 KB

/usr/bin/time -f "%e sec %M KB" slb \
  --mapper 'sed -E "s/^[^ ]+ //" | tr " " "\n" | tr ":" " " | rg -v "^$"' \
  --folder "awk '{if(!(\$1 in a)||length(a[\$1])<100)a[\$1][\$2]=1}END{for(k in a)print k,length(a[k])}'" \
  --infile kdda \
  --outprefix cdslb.
# 26.79 sec 1499992 KB

diff \
  <(sort --parallel=$(($(nproc) / 2)) -k2nr -k1n cdawk.txt) \
  <(cat cdslb.* | sort --parallel=$(($(nproc) / 2)) -k2nr -k1n) \
  > /dev/null ; echo $?
# 0

Installation

Note the above examples demonstrate the convenience of the tool:

  • For large datasets, parallelism is essential.
  • Compared to an equivalent map-reduce, we use less memory, less time, and less code.

The last point holds because slb ensures each parallel invocation recieves a unique partition of the key space. In turn, we use less memory because each folder is only tracking aggregates for its own key space and less code because we do not need to write a combiner that merges two maps.

To install locally from crates.io, run

cargo install slb

Dev Stuff

Rudimentary testing via ./test.sh.

Re-publish to crates.io with cd slb && cargo publish.

You might also like...
A Bevy plugin for loading the LDtk 2D tile map format.
A Bevy plugin for loading the LDtk 2D tile map format.

bevy_ldtk ( Tileset from "Cavernas" by Adam Saltsman ) A Bevy plugin for loading LDtk tile maps. Usage use bevy::prelude::*; use bevy_ldtk::*; fn mai

Everyday-use client-side map-aware Arch Linux mirror ranking tool

Rate Arch Mirrors This is a tool, which fetches mirrors, skips outdated/syncing Arch Linux mirrors, then uses info about submarine cables and internet

Typed any map for rust

TypedMap TypedMap is a typed HashMap. It allows you to define different value type depending on a Key type. It's useful if you want to store different

Image density/height map to mesh generator
Image density/height map to mesh generator

Image density/height map to mesh generator About Crates used to generate 2D mesh from images representing density/height map. Algorithm gets source im

Pitch-perfect copy of map generation algorithm from Slay the Spire

sts_map_oracle Pitch-perfect copy of map generation algorithm from Slay the Spire Usage Prints out map layouts in console for given seed: sts_map_orac

A simple Vec-based Map inspired on JavaScript for rust.

A simple alternative to HashMap inspired on JavaScript's Map.

A rusty, spiky, heat-seeking quake map parser.

Shalrath   A rusty, spiky, heat-seeking quake map parser shalrath is a rust representation, [nom] parser and string serializer for Quake map files. It

A program written in Rust, that allows the user to find the current location of the International Space Station and see it on a map.

ISS Location ViewFinder A program written in Rust, that allows the user to find the current location of the International Space Station and see it on

Firecracker takes your HTTP logs and uses them to map your API flows and to detect anomalies in them.
Firecracker takes your HTTP logs and uses them to map your API flows and to detect anomalies in them.

Who is BLST and what do we do? BLST (Business Logic Security Testing) is a startup company that's developing an automatic penetration tester, replacin

A Rust implementation of generic prefix tree (trie) map with wildcard capture support

prefix_tree_map A Rust implementation of generic prefix tree (trie) map with wildcard capture support. Design Trie is a good data structure for storin

B-Tree map for pub/sub services

submap B-tree map for pub/sub services. Create a new subscription map let mut smap: SubMapClient = SubMap::new(); where "Client" is a pub/sub client

Rust crates with map and set with interval keys (ranges x..y).

This crates implements map and set with interval keys (ranges x..y). IntervalMap is implemented using red-black binary tree, where each node contains

Special FUSE filesystem to map /etc/resolv.conf to different files depending on Linux network namespace

Linux network namespaces allow separate networking environment for a group of processes (sharing uid or from a separate user). DNS settings (/etc/resolv.conf) are however shared between all those environments, which may be inconvenient in some setups.

A set of safe Least Recently Used (LRU) map/cache types for Rust

LruMap A set of safe Least-Recently-Used (LRU) cache types aimed at providing flexible map-like structures that automatically evict the least recently

Show active TCP connections on a TUI world map.
Show active TCP connections on a TUI world map.

Maperick Show active TCP connections on a TUI world map. Still WIP, but it's gonna be good. Setup git clone [email protected]:schlunsen/maperick.git cd m

Serializable map of any type.

🗂️ type_reg Serializable map of any type. This library provides a map that can store any serializable type, and retrieve it as the strong type. Seria

Map the Teenage Engineering OP-1 MIDI output to keyboard commands

OP1NPUT Maps the Teenage Engineering OP-1's MIDI output to keyboard keypresses so it may be used as a game controller. This exists because many of the

Work-in-Progress NES / Famicon Image Editor & Map Creator
Work-in-Progress NES / Famicon Image Editor & Map Creator

NESImg An extremely work-in-progress tool for making NES/Famicom-compatible images. When faced with the challenge of formatting artwork in a way that

Sparse Merkle tree for a key-value map.

LSMTree A Rust library that implements a Sparse Merkle tree for a key-value store. The tree implements the same optimisations specified in the Libra w

Owner
Vladimir Feinberg
Vladimir Feinberg
enum-map enum-map xfix/enum-map [enum-map] — An optimized map implementation for enums using an array to store values.

enum-map A library providing enum map providing type safe enum array. It is implemented using regular Rust arrays, so using them is as fast as using r

Konrad Borowski 57 Dec 19, 2022
SelfOrgMap 5 Nov 4, 2020
Extension to `thiserror` that helps reduce the amount of handwriting

justerror This macro piggybacks on thiserror crate and is supposed to reduce the amount of handwriting when you want errors in your app to be describe

ShakaCode 2 Nov 16, 2022
A code generator to reduce repetitive tasks and build high-quality Rust libraries. 🦀

LibMake A code generator to reduce repetitive tasks and build high-quality Rust libraries Welcome to libmake ?? Website • Documentation • Report Bug •

Sebastien Rousseau 27 Mar 12, 2023
zink! is a library for developing ink! smart contracts with useful Rust macros that extend functionality and reduce boilerplate code.

zink! Smart Contract Macros This is a helper library for developing ink! smart contracts. It contains useful Rust macros that extend functionality and

Scio Labs 3 Nov 3, 2023
fcp is a significantly faster alternative to the classic Unix cp(1) command

A significantly faster alternative to the classic Unix cp(1) command, copying large files and directories in a fraction of the time.

Kevin Svetlitski 532 Jan 3, 2023
A simple, fast and user-friendly alternative to 'find'

fd [中文] [한국어] fd is a program to find entries in your filesytem. It is a simple, fast and user-friendly alternative to find. While it does not aim to

David Peter 25.8k Dec 30, 2022
This is choose, a human-friendly and fast alternative to cut and (sometimes) awk

Choose This is choose, a human-friendly and fast alternative to cut and (sometimes) awk Features terse field selection syntax similar to Python's list

Ryan Geary 1.4k Jan 7, 2023
fd is a program to find entries in your filesystem. It is a simple, fast and user-friendly alternative to find

fd is a program to find entries in your filesystem. It is a simple, fast and user-friendly alternative to find. While it does not aim to support all of find's powerful functionality, it provides sensible (opinionated) defaults for a majority of use cases.

David Peter 25.9k Jan 9, 2023
ASCII terminal hexagonal map roguelike written in Rust

rhex Contributors welcome! Rhex is looking for contributors. See Contributing page for details. Introduction Simple ASCII terminal hexagonal map rogue

Dawid Ciężarkiewicz 137 Dec 2, 2022