Make ETLs Great Again!

Related tags

Command-line MEGA
Overview

A serverless ETL runtime for cloud databases

MEGA stands for Make ETLs Great Again! Checkout a video demo!

This project is a cloud-native ETL (Extract, Transform, Load) application framework based on the WasmEdge WebAssembly runtime for developers to filter, map, and transform data pipelines going into cloud databases. We are currently targetting any MySQL compatible database as the backend.

ETL tools are crucial for the modern data analytics pipeline. However, ETL for cloud databases has its own challenges. Since the public cloud is fundamentally a multi-tenancy environment, all user-defined ETL functions are isolated outside of the database in separate VMs or secure containers. That is a complex and heavyweight setup, which is not suited for simple functions that need to process sporadic streams of data.

With the MEGA framework, developers will be able to create secure, lightweight, fast and cross-platform ETL functions that are located close to or even embedded in cloud databases' infrastructure. The MEGA ETL functions can be deployed as serverless functions and receive data from a variety of sources including event queues, webhook callbacks and data streaming pipelines. The outcomes are written into the designated cloud database for later analysis.

Examples

  • examples/order is an example to take orders from an e-commerce application via a HTTP webhook, and store the orders into a database. It is the example we will go through in this document. There is also an alternative implementation for this example -- it uses a direct connection to the backend database for more flexibility.
  • examples/kafka is an example to take those e-commerce orders from a Kafka / Redpanda queue, and store the orders into a database.
  • examples/ethereum is an example to filter, transform, and store Ethereum transactions in a relational database.

Prerequisites

The WasmEdge WebAssembly Runtime is an open source project under the CNCF. It provides a safer and lighter alternative than Linux containers to run compiled (i.e., high-performance) ETL functions. They can be deployed to the edge cloud close to the data source or even colocate with the cloud database servers in the same firewall. Specially, you will need

  • Install Rust. The framework is currently written in the Rust language. A JavaScript version is in the works.
  • Install WasmEdge. You need it to run the ETL functions.
  • Install a MySQL compatible analytical database or sign up for a cloud database. We recommend TiDB Cloud. The ETL transformed data is written into this database for later analysis.

On Linux, you can use the following commands to install Rust and WasmEdge.

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
# Install WebAssembly target for Rust
rustup target add wasm32-wasi

# Install WasmEdge
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- -e all
source $HOME/.wasmedge/env

Create the ETL function

First, add the MEGA crate to your Rust project.

[dependencies]
mega_etl = "0.1"

Next, in your Rust code, you will need to implement the following.

  • Define a struct that models database table. Each column in the table is represented by a data field in the struct.
  • Implement a required transform() function to give the above struct the Transformer trait. The function takes a Vec<u8> byte array as input argument, and returns a SQL string for the database.
  • Set variables for the connection string to TiDB and configurations for the inbound connector where the input Vec<u8> would be retrieved (eg from a Kafka queue or a HTTP service or a temp database table in Redis).

First, let's define the data structure for the database table. It is a table for order records for an e-commerce web site.

#[derive(Serialize, Deserialize, Debug)]
struct Order {
    order_id: i32,
    product_id: i32,
    quantity: i32,
    amount: f32,
    shipping: f32,
    tax: f32,
    shipping_address: String,
}

Next, define the ETL transform() function that transforms inbound data into a set of SQL statements for the database. The inbound data is simply a byte array that is recived from any data source (e.g., a POST request on the web hook, or a message in Kafka). In this example, the inbound data is a JSON string that represents the order.

#[async_trait]
impl Transformer for Order {
    async fn transform(inbound_data: Vec<u8>) -> TransformerResult<Vec<String>> {
        let s = std::str::from_utf8(&inbound_data)
            .map_err(|e| TransformerError::Custom(e.to_string()))?;
        let order: Order = serde_json::from_str(String::from(s).as_str())
            .map_err(|e| TransformerError::Custom(e.to_string()))?;
        log::info!("{:?}", &order);
        let mut ret = vec![];
        let sql_string = format!(
            r"INSERT INTO orders VALUES ({:?}, {:?}, {:?}, {:?}, {:?}, {:?}, {:?}, CURRENT_TIMESTAMP);",
            order.order_id,
            order.product_id,
            order.quantity,
            order.amount,
            order.shipping,
            order.tax,
            order.shipping_address,
        );
        dbg!(sql_string.clone());
        ret.push(sql_string);
        Ok(ret)
    }
}

Finally, in the main application we will configure an outbound database (a cloud database instance specified in DATABASE_URL) and an inbound data source (a webhook at http://my.ip:3344). Other inbound methods are also supported. For example, you can configure the ETL function to receive messages from a Kafka or Redpanda queue or a Redis table.

#[tokio::main(flavor = "current_thread")]
async fn main() -> anyhow::Result<()> {
    env_logger::init();

    let uri = std::env::var("DATABASE_URL")?;
    let mut pipe = Pipe::new(uri, "http://0.0.0.0:3344".to_string()).await;

    // This is async because this calls the async transform() function in Order
    pipe.start::<Order>().await?;
    Ok(())
}

Optionally, you can define an init() function. It will be executed the first time when the ETL starts up. Here, we use the init() to create and empty orders table in the database.

#[async_trait]
impl Transformer for Order {
    async fn init() -> TransformerResult<String> {
        Ok(String::from(
            r"CREATE TABLE IF NOT EXISTS orders (order_id INT, product_id INT, quantity INT, amount FLOAT, shipping FLOAT, tax FLOAT, shipping_address VARCHAR(50), date_registered TIMESTAMP DEFAULT CURRENT_TIMESTAMP);",
        ))
    }
}

Build

Use the Rust cargo tool to build the ETL application.

cargo build --target wasm32-wasi --release

Optionally, you could AOT compile it to improve performance (could be 100x faster for compute-intensive ETL functions).

wasmedgec target/wasm32-wasi/release/order.wasm order.wasm

Run

With WasmEdge, you have many deployment options. You could run the compiled ETL function program in any serverless infra that supports WasmEdge, which includes almost all Kubernetes variations, Dapr, Docker, Podman and hosted function schedulers such as essa-rs and flows.network.

But in this example, we will just use the good old wasmedge CLI tool to run the ETL function-as-a-service.

wasmedge --env "DATABASE_URL=mysql://user:[email protected]:3306/mysql" order.wasm

It starts an HTTP server on port 3344 and waits for the inbound data. Open another terminal, and send it some inbound data via curl.

curl http://localhost:3344/ -X POST -d @order.json

The JSON data in order.json is sent to the ETL transform() function as inbound data. The function parses it and generates the SQL string, which is automatically executed on the connected TiDB Cloud instance. You can now connect to TiDB Cloud from your database browser and see the order record in the database.

Resources

Join us!

You might also like...
Workflows make it easy to browse, search, execute and share commands (or a series of commands)--without needing to leave your terminal.
Workflows make it easy to browse, search, execute and share commands (or a series of commands)--without needing to leave your terminal.

Workflows The repo for all public Workflows that appear within Warp and within commands.dev. To learn how to create local or repository workflows, see

Like a cell, but make lifetimes dynamic instead of ownership

LendingCell is a mutable container that allows you to get an owned reference to the same object. When the owned reference is dropped, ownership return

Croc-look is a tool to make testing and debuging proc macros easier

croc-look croc-look is a tool to make testing and debuging proc macros easier by these two features Printing the implementation specific generated cod

Experimental language build in Rust to make it fast and robust

Reg-lang Experimental language build with Rust. Its aim is : To be simple to help learning programmation with, and in a second hand, to be robust enou

A gui tool written in Dioxus to make it easy to release a workspace of crates to crates.io
A gui tool written in Dioxus to make it easy to release a workspace of crates to crates.io

Easy-Release: a visual tool for releasing workspaces of libraries A work-in-progress GUI for releasing a large workspace of crates manually, but easil

Set of tools that make it easier for the operator to manage a TAPLE network.
Set of tools that make it easier for the operator to manage a TAPLE network.

⚠️ TAPLE is in early development and should not be used in production ⚠️ TAPLE Tools TAPLE (pronounced T+ 🍎 ['tapΙ™l]) stands for Tracking (Autonomous

πŸ¦€Rust + Large Language Models - Make AI Services Freely and Easily. Inspired by LangChain

llmchain: Modern Data Transformations with LLM πŸ¦€ + Large Language Models, inspired by LangChain. Features Models: LLMs & Chat Models & Embedding Mode

First project in rust which will be to make an accounts system & Leaderboard/Score system

rust-backend this is my first project in rust which will be to make a backend for compsci project it will include: Accounts, Player Achievements (if I

Programming language made by me to learn other people how to make programming languages :3
Programming language made by me to learn other people how to make programming languages :3

Spectra programming language Programming language made for my tutorial videos (my youtube channel): Syntax Declaring a variable: var a = 3; Function

Owner
Second State
Fast, safe, portable & serverless. Deploy Rust functions in edge computing, Jamstack, SaaS and service mesh applications.
Second State
Custom module for showing the weather in Waybar, using the great wttr.io

wttrbar a simple but detailed weather indicator for Waybar using wttr.in. Installation Compile yourself using cargo build --release, or download the p

Yo'av Moshe 10 Apr 23, 2023
Command line tool to make conventional commit messages

Commit This project is a copy of cz-cli with some minor changes. I made this project for my own use, because I don't want to mess with the original cz

Pedro Mendes 80 Dec 26, 2022
A fun rust terminal program so you can make Christmas trees!

Xmastree 2021 A fun christmas tree where you can add ornaments! Then, you can export the file into either: a rust file a txt file About Taking a break

Junhao 2 Dec 21, 2021
A command line utility to easily make dank memes

meme-cli A command line utility to easily make dank memes. Yes, really. Installation cargo install meme-cli Alternatively, install from source using g

null 196 Dec 26, 2022
CLI tool that make it easier to perform multiple lighthouse runs towards a single target and output the result in a plotable format.

Lighthouse Aggregator CLI tool that make it easier to perform multiple lighthouse runs towards a single target and output the result in a "plotable" f

Polestar 1 Jan 12, 2022
A simple CLI I made while practicing rust to easily make QR codes with just one command, all in your terminal.

Welcome to rust-qrcode-cli ?? A CLI I made while practicing rust to easily make QR codes with just one command, all in your terminal. Install git clon

Dhravya Shah 2 Mar 2, 2022
A tiny crate to make it easy to share and apply Git hooks for Rust projects

Shareable git hooks for Rust project. Sloughi is a friend of Husky from North Africa! :algeria:

Walid ZIOUCHE 24 Oct 6, 2022
Workflows make it easy to browse, search, execute and share commands (or a series of commands)--without needing to leave your terminal.

Workflows The repo for all public Workflows that appear within Warp and within commands.dev. To learn how to create local or repository workflows, see

Warp 369 Jan 2, 2023
Make data-driven table rendering easy with Dioxus

Dioxus Table Make data-driven table rendering easy with Dioxus Installation Until the next release of Dioxus this requires Dioxus nightly from git. Th

null 9 Oct 9, 2022
Attempt to make a git profile switcher to switch profiles.

Git Profile Switcher β–ˆβ–€β–€β€ƒβ–ˆβ€ƒβ–€β–ˆβ–€β€ƒ β€ƒβ–ˆβ–€β€ƒβ–ˆβ–‘β–ˆβ–‘β–ˆβ€ƒβ–ˆβ€ƒβ–€β–ˆβ–€β€ƒβ–ˆβ–€β–€β€ƒβ–ˆβ–‘β–ˆβ€ƒβ–ˆβ–€β–€β€ƒβ–ˆβ–€β–ˆ β–ˆβ–„β–ˆβ€ƒβ–ˆβ€ƒβ–‘β–ˆβ–‘β€ƒ β€ƒβ–„β–ˆβ€ƒβ–€β–„β–€β–„β–€β€ƒβ–ˆβ€ƒβ–‘β–ˆβ–‘β€ƒβ–ˆβ–„β–„β€ƒβ–ˆβ–€β–ˆ

Srikanth K 2 Sep 23, 2022