A highly efficient daemon for streaming data from Kafka into Delta Lake

Overview

kafka-delta-ingest

The kafka-delta-ingest project aims to build a highly efficient daemon for streaming data through Apache Kafka into Delta Lake.

This project is currently highly experimental and evolving in tandem with the delta-rs bindings.

Developing

  • Compile: cargo build

  • Launch Kafka: docker-compose up

  • Run kafka-delta-ingest (with a short 10s allowed_latency):

RUST_LOG=debug cargo run ingest example ./tests/data/example --allowed_latency 10 -t 'modified_date: substr(modified,`0`,`10`)' 'kafka_offset: kafka.offset'
  • In a separate shell, produce messages to the example topic, e.g.:

echo "{\"id\":\"1\",\"value\":1,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"2\",\"value\":2,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"3\",\"value\":3,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"4\",\"value\":4,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"5\",\"value\":5,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"6\",\"value\":6,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"7\",\"value\":7,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"8\",\"value\":8,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"9\",\"value\":9,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
echo "{\"id\":\"10\",\"value\":10,\"modified\":\"2021-03-16T14:38:58Z\"}" | kafkacat -P -b localhost:9092 -t example -p -1;
  • Watch the delta table folder for new files

    • watch ls tests/data/example

    • cat tests/data/example/_delta_log/00000000000000000001.json

  • Check committed offsets

$KAFKA_HOME/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group kafka-delta-ingest:example

Tests

  • For unit tests, run cargo test.

  • Integration tests that depend on Kafka are currently marked with #[ignore]. Run RUST_LOG=debug cargo test -- --ignored --nocapture to run these locally after running docker-compose up to start a local Kafka.

Comments
  • Kafka timestamp and timestamp type meta

    This PR fixes handling of the Kafka timestamp well-known meta field and adds support for "timestamp_type" so they can be extracted from the Kafka message and merged into the delta record.

    opened by xianwill 9
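
    For illustration only, here is a minimal sketch of what "merging the Kafka timestamp and timestamp type into the delta record" can look like, assuming the rdkafka Message API and serde_json; this is not the project's actual implementation, and the output field names are made up:

        use rdkafka::message::{BorrowedMessage, Message};
        use rdkafka::Timestamp;
        use serde_json::{json, Value};

        // Copy the Kafka timestamp well-known meta fields into the JSON record
        // before it is written to the Delta table (illustrative field names).
        fn merge_kafka_timestamp(msg: &BorrowedMessage, record: &mut Value) {
            let (ts, ts_type) = match msg.timestamp() {
                Timestamp::NotAvailable => (None, "NotAvailable"),
                Timestamp::CreateTime(ms) => (Some(ms), "CreateTime"),
                Timestamp::LogAppendTime(ms) => (Some(ms), "LogAppendTime"),
            };
            if let Value::Object(map) = record {
                map.insert("kafka_timestamp".to_string(), json!(ts));
                map.insert("kafka_timestamp_type".to_string(), json!(ts_type));
            }
        }
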
  • No snapshot or version 0 found

    Hello guys, I'm trying to set up kafka-delta-ingest with AWS S3, but get an error

    Ingest service exited with error DeltaTable { source: NotATable("No snapshot or version 0 found, perhaps s3://mybucket/rust-deltalake/raw.test.mytopic is an empty dir?") }
    

    I'm running the command in a local Docker container using a local Kafka broker.

    command: ["/usr/local/bin/kafka-delta-ingest", "ingest",
                  "raw.test.mytopic",                                                   
                  "s3://mybucket/rust-deltalake/raw.test.mytopic", 
                  "--allowed_latency", "300",
                  "--kafka", "host.docker.internal:9092",
                  "--consumer_group_id", "my_consumer_group"]
    

    Did I miss a step or a configuration option? Should I create the empty deltalake myself?

    Thanks, T

    opened by timopetmanson 4
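
    Regarding the question above about creating the empty delta lake yourself: as a hedged sketch (assuming delta-rs's open_table function and its NotATable error variant, which the error message above comes from), detecting this case could look like:

        use deltalake::{open_table, DeltaTableError};

        // Returns true if the URI already holds a Delta table, false if it is an
        // empty location (the "No snapshot or version 0 found" case above).
        async fn table_exists(uri: &str) -> Result<bool, DeltaTableError> {
            match open_table(uri).await {
                Ok(_table) => Ok(true),
                Err(DeltaTableError::NotATable(_)) => Ok(false),
                Err(other) => Err(other),
            }
        }
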
  • Kafka-connect sink connector

    Hello, I would like to know whether it is on the roadmap of this project to provide a kafka-connect sink connector rather than a daemon.

    kafka -> deltaLakeSinkConnector -> deltaLake

    Thank you all

    opened by raphaelauv 4
  • Use delta log transactions for offset tracking

    WIP, but opening for early review and CI tests, since there are three possible ways this test can execute, depending on how the OS distributes threads.

    opened by mosyp 4
  • Add seek-offsets config to set the starting point for the kafka ingestion

    The current approach of relying on Kafka seeks is unreliable, since the consumer could ingest topics from earliest before the actual seek and thereby corrupt the delta store. Simply filtering on consumed/stored offsets, as with txn actions, is not enough, since there are more places where it could crash, e.g. a rebalance event with writer reseek and state clearance.

    As such, we're moving towards the well-tested and bulletproof usage of txn actions, which are guarded by the delta protocol and an optimistic concurrency loop with DynamoDB locking.

    The changes introduced here first write a delta log version with txn actions for each partition, as given by the startingOffsets param. This is done only once, by the very first succeeding writer. Once this is achieved, the delta table is protected from duplicates and Kafka seek inconsistency by the delta protocol.

    opened by mosyp 2
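
    For context on the txn actions mentioned above: in the Delta protocol a transaction-application action is a JSON line in the delta log carrying an appId and a version. A rough sketch of producing one with serde_json (the appId naming scheme here is an assumption, not necessarily what kafka-delta-ingest writes):

        use serde_json::{json, Value};

        // Build a Delta `txn` action recording the last Kafka offset written for
        // one partition; the appId format is illustrative only.
        fn txn_action(app_id: &str, partition: i32, offset: i64) -> Value {
            json!({
                "txn": {
                    "appId": format!("{}-{}", app_id, partition),
                    "version": offset
                }
            })
        }
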
  • no support for gzip?

    Here's a test case with redpanda and vector.

    git clone https://github.com/bbigras/test-kdi-gzip

    docker-compose up -d
    # start kdi
    # add a couple of lines to the `log` file
    # wait for one "Delta write for version x has completed in x millis"
    # stop kdi
    
    # uncomment "compression" in vector.toml
    docker-compose up -d --force-recreate
    
    # start kdi again
    

    run kdi with:

    target/release/kafka-delta-ingest ingest my-topic ~/checkout_folder/delta \
      --checkpoints \
      -l 5 \
      --max_messages_per_batch 2 \
      --kafka 127.0.0.1:9092 \
      -K "auto.offset.reset=earliest" \
      -t \
          'date: substr(timestamp, `0`, `10`)' \
          'message: message' \
          'timestamp: timestamp' \
          'meta.kafka.offset: kafka.offset' \
          'meta.kafka.partition: kafka.partition' \
          'meta.kafka.topic: kafka.topic'
    

    and you'll get:

    [2021-10-22T20:06:43Z ERROR kafka_delta_ingest] Error getting BorrowedMessage while processing stream KafkaError (Message consumption error: NotImplemented (Local: Not implemented))
    
    opened by bbigras 2
  • How to ingest the key?

    I have looked in the repo but could not find the configuration to include the key while ingesting. Can someone help with this please?

    Sample message in my topic: student_id,{"first_name":"john","last_name":"doe","courses":["math","english"]} I want to write all these fields (including the key student_id) to the Delta table.

    opened by rymail 1
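
    As a hedged note on the question above: whether kafka-delta-ingest exposes the record key as a transform source is not shown here, but reading the key itself with the rdkafka Message API would look roughly like this:

        use rdkafka::message::{BorrowedMessage, Message};

        // Read the Kafka record key (e.g. "student_id") as a UTF-8 string, if present.
        fn message_key(msg: &BorrowedMessage) -> Option<String> {
            msg.key().map(|k| String::from_utf8_lossy(k).into_owned())
        }
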
  • Remove async API from stats logger

    I need to introduce some buffering for metric reporting to reduce async overhead, and this is a somewhat low-priority fix, so this PR might linger as a draft for a while.

    Eventually fixes https://github.com/delta-io/kafka-delta-ingest/issues/24

    opened by xianwill 1
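
    As an illustration of the buffering idea described above (a sketch only, not the PR's code): collect metric points synchronously and flush them in one batch, instead of awaiting the reporter per metric.

        // Minimal synchronous stats buffer; the metrics backend call is left abstract.
        struct BufferedStats {
            buffer: Vec<(String, f64)>,
            capacity: usize,
        }

        impl BufferedStats {
            fn new(capacity: usize) -> Self {
                BufferedStats { buffer: Vec::with_capacity(capacity), capacity }
            }

            fn record(&mut self, name: &str, value: f64) {
                self.buffer.push((name.to_string(), value));
                if self.buffer.len() >= self.capacity {
                    self.flush();
                }
            }

            fn flush(&mut self) {
                for (name, value) in self.buffer.drain(..) {
                    // Send to the metrics backend here.
                    let _ = (name, value);
                }
            }
        }
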
  • Update arrow_schema from metadata after write

    arrow_schema_ref becomes stale once the arrow writer is created. The delta table's schema is expected to change over time, but those changes are currently not tracked.

    AC: Update arrow_schema_ref from table.get_metadata periodically, for example after writes or when a new table version appears.

    bug 
    opened by mosyp 1
  • The buffer_len function in DeltaWriter is inefficient

    Description

    The body of buffer_len is

    self.cursor.data().len()
    

    The .data() call copies all bytes from the cursor and returns a new vector. Since we only need the length, this is inefficient; moreover, it's called for each new message. I couldn't find an easy workaround, since we don't have access to the internal buffer of the in-memory cursor. However, introducing a new function on parquet::file::writer::InMemoryWriteableCursor should resolve this, e.g.

        pub fn len(&self) -> usize {
            let inner = self.buffer.lock().unwrap();
            inner.get_ref().len()
        }
    
    enhancement 
    opened by mosyp 1
  • Safe seek to the next messages

    Whenever the consumer state is reset, it reads the latest offsets from delta, adds 1, and seeks to them. However, this will not work if the resulting offset is missing (offsets are not contiguous in Kafka) or not yet occupied (no new messages since). In the latter case the consumer will read messages from the beginning.

    bug 
    opened by mosyp 1
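
    A hedged sketch of one way to make the seek safe, assuming the rdkafka consumer API (fetch_watermarks and seek); this is illustrative, not the project's implementation:

        use std::time::Duration;

        use rdkafka::consumer::{BaseConsumer, Consumer};
        use rdkafka::error::KafkaError;
        use rdkafka::Offset;

        // Clamp the "stored offset + 1" target to the partition's valid range so that
        // seeking to a missing or not-yet-occupied offset cannot fall back to the beginning.
        fn safe_seek(
            consumer: &BaseConsumer,
            topic: &str,
            partition: i32,
            next_offset: i64,
        ) -> Result<(), KafkaError> {
            let timeout = Duration::from_secs(5);
            let (low, high) = consumer.fetch_watermarks(topic, partition, timeout)?;
            let target = next_offset.clamp(low, high);
            consumer.seek(topic, partition, Offset::Offset(target), timeout)
        }
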
  • Can't write to a table with `DECIMAL(38,0)` field.

    I have a column with type DECIMAL(38,0), but when I try to start ingestion I get the following error:

    Ingest service exited with error DeltaWriteFailed { ending_offsets: "{\"0\":4999}", partition_counts: "{\"0\":5000}", source: Arrow { source: JsonError("Decimal(38, 0) type is not supported") } }
    

    I believe this is a compatibility issue with www.github.com/delta-io/delta-rs; any help would be appreciated. I am ready to provide more information.

    Thanks in advance !

    opened by denizsurmeli 1
  • Possibility of adding Python bindings to this project

    Would it be possible to add Python bindings to this project?

    I think a lot of programmers are comfortable writing Python, but aren't comfortable with Rust. I'm assuming the Python bindings could be added like they were for the delta-rs project. Thoughts?

    opened by MrPowers 1
  • Add Dockerfile, update README

    • Created a Dockerfile that uses a multi-stage build to avoid a > 4GB image size
    • Added a note in the README about writing to S3 and doing DynamoDB locking
    opened by caseybrown89 0
  • Kafka client collaboration

    Hi there,

    I see you are using rdkafka for your Kafka interaction. We (= InfluxDB IOx) implemented our own pure-Rust async client called rskafka which you might find useful. There is also an intro blog post: https://www.influxdata.com/blog/building-simple-pure-rust-async-apache-kafka-client/

    I think you might be missing a feature or two, but nothing fundamental (to my knowledge).

    Let me know if you have questions, or want to chat.

    Cheers, Marco

    PS: Also feel free to close this ticket.

    opened by crepererum 1
  • Add in logic for "Create Delta Table if not exist"

    In playing around with KDI I realized it doesn't have logic to "Create Delta Table if not exist". If you point the container to a filesystem that has no Delta table, it complains with an error (screenshot in the original issue).

    Whereas if you point it to an existing Delta Table, it works fine.

    I think this should be trivial to add in. I created a "KDI Java" just for the heck of it after chatting with @thovoll, and it's a matter of checking whether the Trx log version is !(log.snapshot().getVersion() > -1): https://github.com/mdrakiburrahman/kafka-delta-ingest-adls/blob/2802eead5174e5fc00da047470572d5fd4c76981/1.KDI-Java/src/main/java/com/microsoft/kdi/KDI.java#L261

    My use case is I'm pointing KDI at a large number of Change Data Capture Topics from Debezium - so it can start pumping out Delta Tables at scale. I can't assume the Delta Tables already exist - so this would be great to have!

    opened by mdrakiburrahman 0