A unix-friendly map-reduce parallelization alternative

Vladimir Feinberg

Last update: Dec 8, 2021

Related tags

Data processing slb

Overview

slb: sharded load balancer

Like parallel --pipe --roundrobin but load balancing is performed based on input line hashing. When performing keyed aggregations in child processes this is crucial since then only one shard contains a given key. Here's a word count example on a 16-physical-cpu machine:

wikawk.txt # 203.97 sec /usr/bin/time -f "%e sec" slb \ --mapper 'tr " " "\n" | rg -v "^$"' \ --folder "awk '{a[\$0]++}END{for(k in a)print k,a[k]}'" \ --infile enwik9.clean \ --outprefix wikslb. # 6.20 sec diff <(sort wikawk.txt) <(cat wikslb.* | sort) ; echo $? # 0 ">

curl -o enwik9.bz2 https://cs.fit.edu/~mmahoney/compression/enwik9.bz2
bunzip2 enwik9.bz2
examples/clean.sh < enwik9 > enwik9.clean ; rm enwik9

/usr/bin/time -f "%e sec" awk -f examples/wc.awk enwik9.clean > wikawk.txt
# 203.97 sec

/usr/bin/time -f "%e sec" slb \
  --mapper 'tr " " "\n" | rg -v "^$"' \
  --folder "awk '{a[\$0]++}END{for(k in a)print k,a[k]}'" \
  --infile enwik9.clean \
  --outprefix wikslb.
# 6.20 sec

diff <(sort wikawk.txt) <(cat wikslb.* | sort) ; echo $?
# 0

This demonstrates a "flatmap-fold" paradigm over the typical "map-reduce" one.

Each line

a    b c d -> flatmapper 1
f g   a b -> flatmapper 2

is handed off to an independent flat mapper tr " " "\n" | rg -v "^$" which puts a word on each line

flatmapper 1 ->
a
b
c
d

flatmapper 2 ->
f
g
a
b

whose outputs are then inspected line-by-line. The first word of each line is hashed (in this case, the entire line). Assuming hash(a) == hash(b) == 1 and hash(c) == hash(d) == hash(g) == hash(f) == 0 we'll input the corresponding keys from each flatmapper into a couple awk '{a[$0]++}END{for(k in a)print k,a[k]}' folders. And the outputs are then written to output files.

a b a b -> awk 1 -> {a: 2, b: 2} -> outprefix1
f g c d -> awk 0 -> {f: 1, g: 1, c: 1, d: 1} -> outprefix0

Feature Frequency Calculation

Here's an example of counting the frequency of features in sparse SVMlight format of a large dataset, benchmarked on the large KDD12 dataset on a 16-physical-cpu machine (assumes ripgrep, GNU Parallel are installed).

results-awk.txt # 1032.18 sec 13721032 KB /usr/bin/time -f "%e sec %M KB" slb \ --mapper 'sed -E "s/^[^ ]+ //" | sed -E "s/:[^ ]+//g" | tr " " "\n" | rg -v "^$"' \ --folder "awk '{a[\$0]++}END{for(k in a)print k,a[k]}'" \ --infile kdd12.tr \ --outprefix results-slb. # 122.50 sec 881436 KB # note above doesn't count child memory # eyeballing htop, max memory use is ~12.3GB # check we're correct cat results-slb.* > results-slb && rm results-slb.* sort --parallel=$(($(nproc) / 2)) -k2nr -k1n -o results-slb results-slb & \ sort --parallel=$(($(nproc) / 2)) -k2nr -k1n -o results-awk.txt results-awk.txt & \ wait diff results-slb results-awk.txt >/dev/null ; echo $? # 0 ">

echo 'will cite' | parallel --citation 1>/dev/null 2>/dev/null
curl -o kdd12.tr.bz2 "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kdd12.tr.bz2"
bunzip2 kdd12.tr.bz2
du -hs kdd12.tr 
# 17G     kdd12.tr
parallel --pipepart -a kdd12.tr wc -l | awk '{a+=$0}END{print a}'
# num rows: 119705032
parallel --pipepart -a kdd12.tr wc -w | awk '{a+=$0}END{print a}'
# num nnz: 1436460384 - 119705032 = 1316755352

/usr/bin/time -f "%e sec %M KB" awk -f examples/svm-featurecount.awk kdd12.tr > results-awk.txt
# 1032.18 sec 13721032 KB

/usr/bin/time -f "%e sec %M KB" slb \
  --mapper 'sed -E "s/^[^ ]+ //" | sed -E "s/:[^ ]+//g" | tr " " "\n" | rg -v "^$"' \
  --folder "awk '{a[\$0]++}END{for(k in a)print k,a[k]}'" \
  --infile kdd12.tr \
  --outprefix results-slb.
# 122.50 sec 881436 KB
# note above doesn't count child memory
# eyeballing htop, max memory use is ~12.3GB

# check we're correct
cat results-slb.* > results-slb && rm results-slb.*
sort --parallel=$(($(nproc) / 2)) -k2nr -k1n -o results-slb results-slb & \
sort --parallel=$(($(nproc) / 2)) -k2nr -k1n -o results-awk.txt results-awk.txt & \
wait

diff results-slb results-awk.txt >/dev/null ; echo $?
# 0

Count Distinct Feature Values

As another, similar example we could count the number of distinct values for each feature. In particular, for each feature we're looking to get the minimum of its total number of distinct values with 100 (as we might be inclined to consider anything with more than 99 values to be continuous).

cdawk.txt # 388.72 sec 23895104 KB /usr/bin/time -f "%e sec %M KB" slb \ --mapper 'sed -E "s/^[^ ]+ //" | tr " " "\n" | tr ":" " " | rg -v "^$"' \ --folder "awk '{if(!(\$1 in a)||length(a[\$1])<100)a[\$1][\$2]=1}END{for(k in a)print k,length(a[k])}'" \ --infile kdda \ --outprefix cdslb. # 26.79 sec 1499992 KB diff \ <(sort --parallel=$(($(nproc) / 2)) -k2nr -k1n cdawk.txt) \ <(cat cdslb.* | sort --parallel=$(($(nproc) / 2)) -k2nr -k1n) \ > /dev/null ; echo $? # 0 ">

curl -o kdda.bz2 "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kdda.bz2"
bunzip2 kdda.bz2
du -hs kdda
# 2.5G    kdda

/usr/bin/time -f "%e sec %M KB" awk -f examples/svm-countdistinct.awk kdda > cdawk.txt
# 388.72 sec 23895104 KB

/usr/bin/time -f "%e sec %M KB" slb \
  --mapper 'sed -E "s/^[^ ]+ //" | tr " " "\n" | tr ":" " " | rg -v "^$"' \
  --folder "awk '{if(!(\$1 in a)||length(a[\$1])<100)a[\$1][\$2]=1}END{for(k in a)print k,length(a[k])}'" \
  --infile kdda \
  --outprefix cdslb.
# 26.79 sec 1499992 KB

diff \
  <(sort --parallel=$(($(nproc) / 2)) -k2nr -k1n cdawk.txt) \
  <(cat cdslb.* | sort --parallel=$(($(nproc) / 2)) -k2nr -k1n) \
  > /dev/null ; echo $?
# 0

Installation

Note the above examples demonstrate the convenience of the tool:

For large datasets, parallelism is essential.
Compared to an equivalent map-reduce, we use less memory, less time, and less code.

The last point holds because slb ensures each parallel invocation recieves a unique partition of the key space. In turn, we use less memory because each folder is only tracking aggregates for its own key space and less code because we do not need to write a combiner that merges two maps.

To install locally from crates.io, run

cargo install slb

Dev Stuff

Rudimentary testing via ./test.sh.

Re-publish to crates.io with cd slb && cargo publish.

You might also like...

ASCII terminal hexagonal map roguelike written in Rust

rhex Contributors welcome! Rhex is looking for contributors. See Contributing page for details. Introduction Simple ASCII terminal hexagonal map rogue

137 Dec 2, 2022

A bash-like Unix shell written in Rust

Cicada Unix Shell Cicada is a simple Unix shell written in Rust. Documents Install cicada Environment Variables Cicada Builtins Completion RC File His

921 Dec 28, 2022

A Rust curses library, supports Unix platforms and Windows

pancurses pancurses is a curses library for Rust that supports both Linux and Windows by abstracting away the backend that it uses (ncurses-rs and pdc

360 Jan 7, 2023

A Bevy plugin for loading the LDtk 2D tile map format.

bevy_ldtk ( Tileset from "Cavernas" by Adam Saltsman ) A Bevy plugin for loading LDtk tile maps. Usage use bevy::prelude::*; use bevy_ldtk::*; fn mai

23 Jul 4, 2022

Reviving the Research Edition Unix speak command

This repository contains the source code of Unix speak program that appeared in the Third (1973) to Sixth (1975) Research Unix editions, slightly adjusted to run on a modern computer. Details on the code's provenance and the methods employed for reviving it can be found in this blog post.

31 Jul 27, 2022

Everyday-use client-side map-aware Arch Linux mirror ranking tool

Rate Arch Mirrors This is a tool, which fetches mirrors, skips outdated/syncing Arch Linux mirrors, then uses info about submarine cables and internet

196 Jan 2, 2023

Aero is a new modern, unix based operating system. It is being developed for educational purposes.

Areo Aero is a new modern, unix based operating system written in Rust and is being developed for educational purposes. Aero follows the monolithic ke

623 Dec 24, 2022

Typed any map for rust

TypedMap TypedMap is a typed HashMap. It allows you to define different value type depending on a Key type. It's useful if you want to store different

23 Oct 29, 2022

Image density/height map to mesh generator

Image density/height map to mesh generator About Crates used to generate 2D mesh from images representing density/height map. Algorithm gets source im

69 Nov 9, 2022

Pitch-perfect copy of map generation algorithm from Slay the Spire

sts_map_oracle Pitch-perfect copy of map generation algorithm from Slay the Spire Usage Prints out map layouts in console for given seed: sts_map_orac

12 Dec 20, 2022

Spawn multiple concurrent unix terminals in Discord

Using this bot can be exceedingly dangerous since you're basically granting people direct access to your shell.

11 Jun 1, 2021

the file filesystem: mount semi-structured data (like JSON) as a Unix filesystem

ffs: the file filesystem ffs, the file filessytem, let's you mount semi-structured data as a fileystem---a tree structure you already know how to work

176 Dec 31, 2022

A tool for automating terminal applications in Unix.

expectrl A tool for automating terminal applications in Unix. Using the library you can: Spawn process Control process Expect/Verify responces It was

132 Dec 14, 2022

A super minimal wrapper around unix sockets for IPC on top of tokio.

tokio-unix-ipc This crate implements a minimal abstraction over UNIX domain sockets for the purpose of IPC on top of tokio.

26 Nov 18, 2022

A simple Vec-based Map inspired on JavaScript for rust.

A simple alternative to HashMap inspired on JavaScript's Map.

2 Oct 3, 2021

Buildomat manages the provisioning of ephemeral UNIX systems on which to run software builds

B U I L D O M A T a software build labour-saving device Buildomat manages the provisioning of ephemeral UNIX systems (e.g., instances in AWS EC2) on w

33 Dec 4, 2022

A rusty, spiky, heat-seeking quake map parser.

Shalrath A rusty, spiky, heat-seeking quake map parser shalrath is a rust representation, [nom] parser and string serializer for Quake map files. It

18 Oct 11, 2022

A program written in Rust, that allows the user to find the current location of the International Space Station and see it on a map.

ISS Location ViewFinder A program written in Rust, that allows the user to find the current location of the International Space Station and see it on

2 Nov 8, 2021

A small unix and windows lib to search for executables in PATH folders.

A small unix and windows lib to search for executables in path folders.

2 Dec 25, 2021