# parquet-wasm
WebAssembly bindings to read and write the Parquet format to Apache Arrow.
This is designed to be used alongside a JavaScript Arrow implementation, such as the canonical JS Arrow library or potentially `arrow-wasm`.
Including all compression codecs, the generated brotli-encoded WASM bundle is 881KB.
## Install
`parquet-wasm` is published to NPM. Install with:

```bash
yarn add parquet-wasm
# or
npm install parquet-wasm
```
## API
### readParquet

```ts
readParquet(parquet_file: Uint8Array): Uint8Array
```

Takes as input a `Uint8Array` containing bytes from a loaded Parquet file. Returns a `Uint8Array` with data in Arrow IPC Stream format [^1]. To parse this into an Arrow table, use `arrow.tableFromIPC` in the JS bindings on the result from `readParquet`.
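For illustration, a minimal sketch of that flow (the URL is a placeholder and top-level `await` is assumed; any way of obtaining the Parquet bytes as a `Uint8Array` works):

```js
import {tableFromIPC} from 'apache-arrow';
import {readParquet} from 'parquet-wasm';

// Fetch a Parquet file (placeholder URL) and get its raw bytes.
const resp = await fetch('https://example.com/data.parquet');
const parquetBytes = new Uint8Array(await resp.arrayBuffer());

// Decode Parquet -> Arrow IPC Stream bytes, then parse with Arrow JS.
const arrowIPCBytes = readParquet(parquetBytes);
const table = tableFromIPC(arrowIPCBytes);
```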
### writeParquet

```ts
writeParquet(arrow_file: Uint8Array): Uint8Array
```

Takes as input a `Uint8Array` containing bytes in Arrow IPC File format [^2]. If you have an Arrow table, call `arrow.tableToIPC(table, 'file')` and pass the result to `writeParquet`.

For the initial release, `writeParquet` is hard-coded to use Snappy compression and Plain encoding. In the future these should be made configurable.
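A minimal sketch of the JS side, assuming you already have an Arrow `table` in memory (for example the `rainfall` table built in the Example section below):

```js
import {tableToIPC} from 'apache-arrow';
import {writeParquet} from 'parquet-wasm';

// Serialize the Arrow table to IPC File format bytes, then encode as Parquet.
// `table` is assumed to be an existing arrow.Table.
const arrowFileBytes = tableToIPC(table, 'file');
const parquetBytes = writeParquet(arrowFileBytes); // Uint8Array of Parquet bytes
```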
### setPanicHook

```ts
setPanicHook(): void
```

Sets `console_error_panic_hook` in Rust, which provides better debugging of panics by having more informative `console.error` messages. Initialize this first if you're getting errors such as `RuntimeError: Unreachable executed`.
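A sketch of typical usage, calling it once before any other parquet-wasm function:

```js
import {setPanicHook, readParquet} from 'parquet-wasm';

// Install the panic hook once so Rust panics surface as readable
// console.error messages instead of opaque "unreachable" errors.
setPanicHook();

// Later calls, e.g. readParquet(parquetBytes), will now report panics clearly.
```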
## Using

`parquet-wasm` is distributed with three bindings for use in different environments.
- Default, to be used in bundlers such as Webpack: `import * as parquet from 'parquet-wasm';`
- Node, to be used with `require` in NodeJS: `const parquet = require('parquet-wasm/node');`
- ESM, to be used directly from the Web as an ES Module (see the initialization sketch below): `import * as parquet from 'parquet-wasm/web';`
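With the `web` target, wasm-pack's generated module typically exposes a default `init` export that fetches and instantiates the `.wasm` binary; a minimal sketch, assuming that default export is present and a context that supports top-level `await`:

```js
import init, * as parquet from 'parquet-wasm/web';

// Instantiate the WASM binary before calling any bindings
// (assumption based on wasm-pack's standard web-target output).
await init();

// After initialization, functions such as parquet.readParquet are usable.
```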
## Example
```js
import {tableFromArrays, tableFromIPC, tableToIPC} from 'apache-arrow';
import {readParquet, writeParquet} from 'parquet-wasm';

// Create Arrow Table in JS
const LENGTH = 2000;
const rainAmounts = Float32Array.from(
  { length: LENGTH },
  () => Number((Math.random() * 20).toFixed(1)));

const rainDates = Array.from(
  { length: LENGTH },
  (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i));

const rainfall = tableFromArrays({
  precipitation: rainAmounts,
  date: rainDates
});

// Write Arrow Table to Parquet
const parquetBuffer = writeParquet(tableToIPC(rainfall, 'file'));

// Read Parquet buffer back to Arrow Table
const table = tableFromIPC(readParquet(parquetBuffer));
console.log(table.schema.toString());
// Schema<{ 0: precipitation: Float32, 1: date: Date64 }>
```
## Compression support
The Parquet specification permits several compression codecs. This library currently supports:
- Uncompressed
- Snappy
- Gzip
- Brotli
- ZSTD
- LZ4
LZ4 compression appears not to work yet. When trying to parse a file with LZ4 compression, I get an error: `Uncaught (in promise) External format error: underlying IO error: WrongMagicNumber`.
## Future work

- Tests 😄
- User-specified column-specific encodings when writing
- User-specified compression codec when writing
## Development

- Install `wasm-pack`
- Compile: `wasm-pack build`, or change targets, e.g. `wasm-pack build --target nodejs`
- Publish: `wasm-pack publish`
## Publishing

`wasm-pack` supports three different targets:

- `bundler` (used with bundlers like Webpack)
- `nodejs` (used with Node, supports `require`)
- `web` (used as an ES module directly from the web)
There are good reasons to distribute as any of these... so why not distribute as all three? `wasm-pack` doesn't support this directly, but the build script in `scripts/build.sh` calls `wasm-pack` three times and merges the outputs. This means that bundler users can use the default entry point, Node users can use `parquet-wasm/node`, and ES Module users can use `parquet-wasm/web` in their imports.
To publish:

```bash
bash ./scripts/build.sh
wasm-pack publish
```
## Acknowledgements
A starting point of my work came from @my-liminal-space's read-parquet-browser (which is also dual-licensed MIT and Apache 2).

@domoritz's arrow-wasm was a very helpful reference for bootstrapping Rust-WASM bindings.
## Footnotes
[^1]: I originally decoded Parquet files to the Arrow IPC File format, but Arrow JS occasionally produced bugs such as `Error: Expected to read 1901288 metadata bytes, but only read 644` when parsing using `arrow.tableFromIPC`. When testing the same buffer in Pyarrow, `pa.ipc.open_file` succeeded but `pa.ipc.open_stream` failed, leading me to believe that the Arrow JS implementation has some bugs in deciding when `arrow.tableFromIPC` should internally use the `RecordBatchStreamReader` vs the `RecordBatchFileReader`.
[^2]: I'm not great at Rust, and the IPC File format seemed easier to parse in Rust than the IPC Stream format 🙂