Bytewax is an open source Python framework for building highly scalable dataflows.

Bytewax

Last update: Jan 6, 2023

Related tags

Data processing bytewax

Overview

Bytewax

Bytewax is an open source Python framework for building highly scalable dataflows.

Bytewax uses PyO3 to provide Python bindings to the Timely Dataflow Rust library.

Usage

Install the latest release with pip:

pip install bytewax

Example

Here is an example of a simple dataflow program using Bytewax:

from bytewax import Executor

ec = Executor()
flow = ec.Dataflow(enumerate(range(10)))
flow.map(lambda x: x * x)
flow.inspect(print)


if __name__ == "__main__":
    ec.build_and_run()

Running the program:

python ./pyexamples/wordcount.py
0
1
4
9
16
25
36
49
81
64

For a more complete example, and documentation on the available operators, check out the User Guide.

License

Bytewax is licensed under the Apache-2.0 license.

Comments

Add additional Kafka configs

Updates the KafkaInputConfig to accept the congis possible in librdkafka as listed here - https://github.com/edenhill/librdkafka/blob/9b72ca3aa6c49f8f57eea02f70aadb1453d3ba1f/CONFIGURATION.md

Tested it locally for now by modifying the configuration.

Fails if the wrong configuration is passed in:

thread 'timely:work-0' panicked at 'Error building input Kafka consumer: KafkaError (Client config error: No such configuration property: "fail" fail fail)', src/inputs/kafka_input.rs:216:14

You can test it by adding an authentication mechanism. For example, configure ACL on your kafka or redpanda cluster and then pass the following code in the KafkaConfigInput:

additional_configs = {
        "sasl.username":"admin",
        "sasl.password":"pass",
        "sasl.mechanisms":"SCRAM-SHA-256",
    }

flow = Dataflow()
flow.input(
    "inp",
    KafkaInputConfig(
        brokers=["localhost:9092"],
        topic="test",
        starting_offset="beginning",
        tail=False,
        additional_properties=additional_configs,
    ),
)

Closes out #107 and #106

opened by awmatheson 5

New execution interface
As you can tell by the branch name, this started out small and then snowballed! Let me know how you think these examples feel and we can discuss if this is the right course for our API. I'm open to new ideas, but I also think this helps guide users along to better use cases.

Backstory

Previously, on Bytewax:

Executor.build_and_run() wasn't great because a single function is used for starting up a dataflow you want to execute locally and manually assembling a cluster of processes.

The difference between "singleton" vs "partitioned" input iterators and making the input part of the Dataflow definition wasn't great because it ties the behavior of the dataflow to the behavior of the input.

The capture operator wasn't great because it didn't give you the control you needed to return data from multiple workers or background processes.

I think all these things are symptoms of trying to handle all use cases from all points in the stack, rather than building primitives and abstraction layers.

Changes

See the docstrings and tests for examples!

Execution

Let's make explicit that there are different contexts you might run a dataflow. Removes Executor.build_and_run and adds four different entry points in a "stack" of complexity:

run_sync() which takes a dataflow and some input, runs it synchronously as a single worker in the existing Python thread, and returns the output to that thread. This is what you'd use in tests and simple notebook work.

run_cluster() which takes a dataflow and some input, starts a local cluster of processes, runs it, waits for the cluster to finish work, then collects thre results, and returns the output to that thread. This is what you'd use in a notebook if you need parallelism or higher throughput.

main_cluster() which starts up a cluster of local processes, coordinates the addresses and process IDs between them, runs a dataflow on it, and waits for it to finish. This has a partitioned "input builder" and an "output builder" (discussed below). This is what you'd use if you'd want to write a standalone script or example that does some higher throughput processing.

main_proc() which lets you manually craft a single process for use in a cluster. This is what you'd use when crafting your k8s cluster.

main_cluster() is built upon proc_main() but adds in some process pool managment.

run_cluster() is built upon main_cluster() but adds in the IPC necessary to get your data back to the calling process super conveniently.

Input

Since you can express "singleton" input as a "partitioned" input, the latter is the more fundamental concept, so that's what the lowest level main_proc() and main_cluster() functions take: an input builder function which will be called once on each worker that returns the input that worker should work on.

The higher level execution contexts that you'd want to use from a notebook or a single thread (run_sync() and run_cluster()) then handle paritioning for you and provide a nice interface where you can send in a whole Python list and not need to think more about it.

Output

Make yours, like mine.

Since the above scheme for input feels good, let's try copying the approach for output: At the lowest level there's a "partitioned output builder" which returns a callback that each worker thread can use to write the output it sees.

The higher level execution contexts can then make an output builder function that collects data to send back to the main process for you.

This change means that the capture operator doesn't need to take any functions; it just marks what parts of the dataflow are output. I think this will change slightly with the introduction of branching dataflows (something like marking different captures with a tag maybe?).

Arg Parsers

Since the above execution entry points all are Python functions, add some convenience methods which parse arguments from command line or env vars. This will make it easier to craft your "main" function in a cluster or standalone script context.

Updates all the examples to use these parsers so we can go back to using -w2 and -n2 on the command line. Although some of the examples show off using main_proc() and so would need different command line arguments.
opened by davidselassie 5
Python code formatting
Is your feature request related to a problem? Please describe. I see some code format inconsistencies in Python source files. Is the team favourable towards enforcing a styling guide for Python code so future contributions from multiple contributors can follow a single style guide easily?

Describe the solution you'd like

python-black I see that we are using python-black.

Skipping .ipynb files as Jupyter dependencies are not installed. You can fix this by running ``pip install black[jupyter]`` All done! ✨ 🍰 ✨ 37 files left unchanged.

Shall we introduce pre-commit [1] tool to the project so we can run python-black at git commits?

python-isort I see some minor import statement inconsistencies and these can be easily fixed using python-isort [2] isort can automatically sort different import types into groups ( standard lib, third party, byteswax local imports ) and also sort imports alphabetically so it's easier when reading import statements.

pep8

pytests/test_inputs.py:5:1: F401 'bytewax.Dataflow' imported but unused pytests/test_inputs.py:5:1: F401 'bytewax.Emit' imported but unused pytests/test_inputs.py:5:1: F401 'bytewax.run' imported but unused pytests/test_operators.py:1:1: F401 'os' imported but unused pytests/test_operators.py:7:1: F401 'pytest.mark' imported but unused pytests/test_execution.py:3:1: F401 'threading' imported but unused pytests/test_execution.py:62:89: E501 line too long (128 > 88 characters) pytests/test_execution.py:82:89: E501 line too long (138 > 88 characters) pytests/test_execution.py:118:89: E501 line too long (138 > 88 characters) pytests/test_execution.py:159:89: E501 line too long (138 > 88 characters)

We can use a tool like flake8 to make sure our code complies with pep8.

[1] https://pre-commit.com/ [2] https://github.com/PyCQA/isort

Describe alternatives you've considered

Additional context If the team agree to this we can further discuss the rules we need to follow and I would be happy to submit a PR for this.
type:feature
opened by kasun 4
Broken links in simple-example page

Bug Description

https://docs.bytewax.io/getting-started/simple-example

In the above page docs to operators are broken. Below are some of the links. They seem to be referring to "operators/operators" page and I guess they should be referring to "apidocs" page instead.

https://docs.bytewax.io/operators/operators/#map (correct link - https://docs.bytewax.io/apidocs#bytewax.Dataflow.map) https://docs.bytewax.io/operators/operators/#flat-map https://docs.bytewax.io/operators/operators#reduce-epoch

Python version (python -V)

3.10.0

Bytewax version (pip list | grep bytewax)

latest

Relevant log output

No response

Steps to Reproduce

https://docs.bytewax.io/getting-started/simple-example has links to docs that are broken.

opened by tirkarthi 4
Renaming of stateful operators
Here's a prototype of a set of slightly different, and hopefully more uniform, stateful operators. I took a long look at how we used exchange, accumulate, state_machine, and aggregate in our existing examples and also what kind of interfaces the build in Python types give us.

I'm excited that things fell into a shape that has some nice parallel construction and I think that means these will hopefully be easier to understand.

exchange is banished. My current thinking is that it's a "scheduling implementation detail" that shouldn't be user visible. I don't like that it was possible to write a dataflow that worked with a single worker, then broke with multiple workers. Every time we previously used exchange, we really wanted a "group by" operation, so these changes make this first class. (I understand the concept of exchange needs to be there, but I think it's something that should only be exposed when you're writing Timely operators and actively coming up with a way to orchestrate state.)

If we want to support both mutable and immutable types as state aggregators, we have to have our function interface return updated state and not modify in place. I don't think we should force users to do things like

class Counter: def __init__(self): self.c = 0

just to count things, so keep immutables working. Thankfully, Python has the operator package which lets you have access to + as a function and all the operators return updated values. This makes it possible to write in the functional style necessary without much boilerplate. But you can still write out lambdas or helper functions too.

Many of our stateful examples are of the form "collect some items and stop at some point". Since this is a dataflow, the stream could be endless, so we need a way to signal that the aggregator is complete. Noting that the epoch is complete is super common "is complete" case and we can include that for convenience, but the general case exists too. Since they're similarly shaped problems, make sure the interface looks similar.

So what happened?

I want to put forth key_fold, key_fold_epoch, key_fold_epoch_local, and key_transition as the state operators. I wrote out usage docstrings and updated examples, but tl;dr:

They all start with key because they require (key, value) as input and group by key and exchange automatically. They output (key, aggregator) too.

The ones with fold do a "fold": build an initial aggregator, then incrementaly add new values as you see them, then return the entire aggregator.

The ones with epoch automatically stop folding at the end of each epoch since that's such a common use case. Otherwise you provide a function to designate "complete".

local means only aggregate within a worker. We haven't found a use case for this yet that isn't a performance optimisation (and one that maybe could be done automatically?).

transition isn't a "fold" in that it's always emitting events (not just when the aggregation is complete).

Input comes in undefined order within an epoch, but in epoch order if multiple epochs.

If you feel like you know the shape of the old versions, then: key_fold_epoch ~= aggregate, key_transition ~= state_machine, key_fold is state_machine but shaped like aggregate, and key_fold_epoch_local is accumulate but shaped like aggregate.

I've updated the examples too. I wanted to try out seeing how concise you could get using the built in types to see how the library might feel in that dimension. But I understand that some of these examples might be a bit too terse. We don't have to necessarily go that way in the docs, but it's cool to see that it can work and you can count like:

flow.key_fold_epoch(lambda: 0, operator.add)

Let me know what you think!
opened by davidselassie 4
Introduce fate
Previously I made the questionable API design of having StatefulLogic::snapshot return a StateUpdate. This meant that the snapshotting process affected the behavior of the stateful operator, since it could return Reset to signal up to StatefulUnary to discard the logic.

This unwinds that tangle by introducing a new method with a perhaps overly poetic name StatefulLogic::fate which should return what StatefulUnary should do with the logic when it's done processing via a LogicFate enum. There are three options:

Retain it until a new item for this key comes in.

Retain it and awaken it after a timeout.

Discard it.

fate is attempting to encapsulate the problem of StatefulUnary is the owner of the logics, so they can't drop themselves. And since awakening timeouts are part of that process, they're in there too.

This is nice because it simplifies the return value of exec to just output. The awaken delay is handled in fate.

It also uses this pattern within Windower::fate with WindowerFate for the same kind of problem: the WindowStatefulLogic is the owner, and we need to communicate back when it's safe to discard a Windower. This fixes the bug of never discarding window state if a key is never seen again: we now discard that state whenever all windows for a key are closed.

A few other small changes:

Standardises on the language of "awaken" a logic. Still uses Timely's "activate" for a Timely operator, though. Renames StatefulLogic::exec to StatefulLogic::awake_with to make more explicit when it is called. Renames WindowLogic::exec to WindowLogic::with_next to make explicit when it's called.

All stateful operator tests should test recovery as part of testing the logic. We should do this to excercise the serde round-trip. I found three? bugs via this where recovery would panic because the deserialization wasn't to the correct type. As part of this, I added explicit type annotations to all StateBytes::ser and StateBytes::de calls so you can compare them.

Clarifies more comments in the giant chunk of code for StatefulUnary::stateful_unary. I was also able to optimize it a little and get rid of two of our temporary buffers. I think it makes the process slightly clearer.
opened by davidselassie 3

TypeError: KafkaInputConfig.new() got an unexpected keyword argument 'sasl.username'

Bug Description

Trying to use the SASL config with KafkaInputConfig on Bytewax 0.11.1 that was introduced in https://github.com/bytewax/bytewax/pull/119, however recieving an input validation error:

TypeError: KafkaInputConfig.__new__() got an unexpected keyword argument 'sasl.username'

Python version (`python -V`)

3.9.7

Bytewax version (pip list | grep bytewax)

0.11.1

Relevant log output

Traceback (most recent call last):
  File "/Users/shahnewazkhan/projects/dapper/data-platform/wranglers/stream_processors/auth0_logs/bytewax/sasl.py", line 24, in <module>
    KafkaInputConfig(
TypeError: KafkaInputConfig.__new__() got an unexpected keyword argument 'sasl.username'



### Steps to Reproduce

from bytewax.dataflow import Dataflow
from bytewax.inputs import KafkaInputConfig
from bytewax.execution import run_main
from bytewax.outputs import StdOutputConfig
import os
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

additional_configs = {
        "sasl.username":os.environ['kafkaReaderUsername'],
        "sasl.password":os.environ['kafkaReaderPassword'],
        "sasl.mechanisms":"PLAIN",
        "security.protocol":"SASL_SSL"
    }

flow = Dataflow()
flow.input(
    "inp",
    KafkaInputConfig(
        brokers=[f"{os.environ['kafkaURL']}"],
        topic=f"{os.environ['topic']}",
        starting_offset="beginning",
        tail=False,
        **additional_configs,
    ),
)

def deserialize(key_bytes__payload_bytes):
    key_bytes, payload_bytes = key_bytes__payload_bytes
    logger.info(f'TYPE OF BYTES {type(key_bytes)}')
    return key_bytes__payload_bytes


flow.map(deserialize)

flow.capture(StdOutputConfig())

run_main(flow)

needs triage

opened by ShahNewazKhan 3

Only use UTC datetimes

This PR forces datetimes passed to bytewax as config parameters to be in UTC, using the new chrono integration added in PyO3 (not merged yet).

We'll need to wait for the PR on PyO3 to be merged so that we can open a PR on pyo3-log to make everything work. The PyO3 author said he wants to add the chrono integration in PyO3 0.17.2 which is coming soon, so this should not take too long.

Previous description (ignore this)

This PR makes all datetimes in Bytewax timezone aware.

It uses a non-released version of PyO3 that adds the chrono integration, and removes the dependency from pyo3-chrono, which was doing the python->rust conversion deleting timezone info, without properly converting to UTC first.

The approach taken in this PR is to add a chrono_tz integration on top of the one implemented in PyO3, so that any datetime in the codebase has timezone info attached, and can be correctly compared to any other datetime.

The only edge case we have to handle, is when we receive naive datetimes from the user in python. Right now the approach is to convert the datetime to utc using the python interpreter. This means that any naive datetime coming from python will be assumed to be local (relative to where the dataflow is ran) and converted to UTC before being passed down to rust. This could be wrong in case the datetime comes from data generated somewhere with a different timezone, but we don't have enough informations to make a proper conversion anyway. Another possible approach would be to assume the datetime is UTC rather than local, and we would attach the UTC timezone info without converting first, but this would be equally wrong most of the times. The application should warn anytime we receive a naive datetime, so that users are aware of the fact and can fix the data if needed.

For (de)serialization of datetimes, I had to implement serde's traits manually, since there is no standard way to serialize datetimes with IANA timezone info. To do this I just serialize the local naivedatetime and the timezone name separately, so that we can rebuild the same ChronoDateTime when deserializing.

The PR is still in Draft because I'm waiting for PyO3 to release the new version, and depending on the versioning chosen (0.17.2 or 0.18) we might also need to wait for an update on pyo3-log which depends on PyO3 0.17.

To make things easier, for the next release I propose to implement Consensus Time: https://xkcd.com/2594/

Original description (ignore this too)

The test at the end of the file should explain the situation I'm trying to fix here: pyo3-chrono ignores timezone info even if they are present. I think we should instead take that into account. With this conversion traits, we always convert datetimes to UTC internally. If the user sends us timezone aware datetimes, we do the conversion. If the user sends us naive datetimes, we assume they are already UTC (debatable).

This is not the only possible approach, and might not be the best one, so I'm opening this PR to discuss what you think about this.

My point is that I don't think we should ignore tzinfo if present, but the flow should still work if we receive naive datetimes (and maybe warn the user about the assumption).

This PR is rough, and if we want to replace pyo3-chrono I'll need to add the conversion traits for Time and Date too, and some more tests.

edit: I just found this PR: https://github.com/chronotope/chrono/pull/542 which does the conversion properly. Reading the discussion, it looks like they plan to integrate this into PyO3 (rather than in chrono) under a feature flag, so maybe we'll have this for free in the future

opened by Psykopear 3
Adds an "order" input helper and allows tumbling "event time"

Order

Timely, and thus Bytewax, requires that the input timestamp / epoch never "goes backwards". This is because Timely is able to know that it can kick off final computation once a timestamp has closed.

Our current "input is an iterator of (epoch, item)" API doesn't have the flexibility to allow multiple "open" epochs, although Timely supports this.

To help bridge this gap, a common use case is: allow late data in a "buffering window" and drop anything after that. Here's an ordering() function which implements that and you can wrap your source input iterator in it.

Event Time

Allows tumbling_epoch to take a time_getter argument which will produce the time for each item from the data within, instead of always just looking at the wall clock time.

This will let you bucket into epoch of a delayed data stream. Also plays well with order.

Other

Also renames inp to inputs since we're renaming everything now.

opened by davidselassie 3

Traits refactoring

This is a refactoring to make the pattern we use to build concrete objects from python config objects more explicit. I also moved the clocks stuff in a submodule, but nothing should really change.

The first trait introduced is ParentClass. This trait should be implemented by any class we expose in Python that is a parent class of config objects, and it requires a method to convert from parent to one of the children:

pub(crate) trait ParentClass {
    type Children;
    fn get_subclass(&self, py: Python) -> StringResult<Self::Children>;
}

Then we have one trait for each module, that specific configurations have to implement. This trait requires a method to build a concrete object from a config one, so in the inputs modules it builds an InputReader from a child of the InputConfig class.

For example in the input module we have:

pub(crate) trait InputBuilder {
    fn build(
        &self,
        py: Python,
        worker_index: WorkerIndex,
        worker_count: usize,
        resume_snapshot: Option<StateBytes>,
    ) -> StringResult<Box<dyn InputReader<TdPyAny>>>;
}

And inputs have to implement this trait, example in ManualInputconfig:

impl InputBuilder for ManualInputConfig {
    fn build(
        &self,
        py: Python,
        worker_index: WorkerIndex,
        worker_count: usize,
        resume_snapshot: Option<StateBytes>,
    ) -> StringResult<Box<dyn InputReader<TdPyAny>>> {
        Ok(Box::new(ManualInput::new(
            py,
            self.input_builder.clone(),
            worker_index,
            worker_count,
            resume_snapshot,
        )))
    }
}

The nice thing is that this allows us to implement InputBuilder for the generic InputConfig if it implements the ParentClass trait:

impl InputBuilder for Py<InputConfig> {
    fn build(
        &self,
        py: Python,
        worker_index: WorkerIndex,
        worker_count: usize,
        resume_snapshot: Option<StateBytes>,
    ) -> StringResult<Box<dyn InputReader<TdPyAny>>> {
        self.get_subclass(py)?
            .build(py, worker_index, worker_count, resume_snapshot)
    }
}

So that we can do this:

let input_reader = input_config.build(py, worker_index, worker_count, step_resume_state)?;

instead of what we used to do:

let input_reader = build_input_reader(
    py,
    input_config,
    worker_index,
    worker_count,
    step_resume_state,
)?;

NOTE: I stopped here, but we could think of a more generic PyBuilder trait so that we don't need a new trait for each module, but after trying it I think it makes the implementation a bit harder to read:

pub(crate) trait PyBuilder {
    type Input;
    type Output;
    fn build(&self, py: Python, params: Self::Input) -> StringResult<Self::Output>;
}

Which turn the impl for ManualInputConfig into this

impl PyBuilder for ManualInputConfig {
    type Input = (WorkerIndex, usize, Option<StateBytes>);
    type Output = Box<dyn InputReader<TdPyAny>>;

    fn build(
        &self,
        py: Python,
        (worker_index, worker_count, resume_snapshot): Self::Input,
    ) -> StringResult<Self::Output> {
        Ok(Box::new(ManualInput::new(
            py,
            self.input_builder.clone(),
            worker_index,
            worker_count,
            resume_snapshot,
        )))
    }
}

And the ParentClass impl to something like:

impl ParentClass for Py<InputConfig> {
    type Children = Box<
        dyn PyBuilder<
            Input = (WorkerIndex, usize, Option<StateBytes>),
            Output = Box<dyn InputReader<TdPyAny>>,
        >,
    >;

    fn get_subclass(&self, py: Python) -> StringResult<Self::Children> {
        if let Ok(conf) = self.extract::<ManualInputConfig>(py) {
            Ok(Box::new(conf))
        } else if let Ok(conf) = self.extract::<KafkaInputConfig>(py) {
            Ok(Box::new(conf))
        } else {
            let pytype = self.as_ref(py).get_type();
            Err(format!("Unknown input_config type: {pytype}"))
        }
    }
}

Final edit: The end goal of this was to write a blanket implementation of PyBuilder for any struct that implement ParentClass with Children = Box<dyn PyBuilder>, but it doesn't work as it is. I'm going to write it here as a reference, maybe I was close enough:

impl<T, B> PyBuilder for T
where
    B: PyBuilder,
    T: ParentClass<Children = Box<B>>,
{
    type Input = B::Input;
    type Output = B::Output;

    fn build(&self, py: Python, params: Self::Input) -> StringResult<Self::Output> {
        self.get_subclass(py)?.build(py, params)
    }
}

opened by Psykopear 2

Observability
This PR adds observability in Bytewax.

This does not use the python logging module, but allows for external configuration anyway. What we gain is not having to keep the GIL when logging something, but it's not integrated with the python logging's module anymore.

By default, bytewax logs ERROR messages to stdout. The logging level can be configured as an argument to the setup_tracing function.

Logs are part of the tracing infrastructure. Every log is now structured, so we can use those logs and the additional instrumentation to send "spans" and "traces" to an opentelemetry compatible backend.

This PR adds support for two backends:

Jaeger

Opentelemetry collector

The Jaeger backend sends data directly to Jaeger. This makes the setup simpler if that's what you are using, but it only works with Jaeger.

The Opentelemetry collector backend sends data to the opentelemetry collector. This can be configured to send traces and logs to different backends, including Jaeger, and it's the most flexible way of configuring tracing, since you can change your tracing backend without touching the dataflow, just change the collector's configuration and everything should keep working.
opened by Psykopear 2
[FEATURE] Publish package for Python 3.11

Is your feature request related to a problem? Please describe. We currently publish packages for python 3.7-3.10

Describe the solution you'd like Add Python 3.11 to the CI matrix
needs triage

opened by Psykopear 0
Missing Autocomplete in VSCode / PyCharm
Bug Description

Hello together,

I use bytewax within a virtual environment. Environment was created under Python 3.10.9 using the command python -m venv .venv and then I activated the environment. I installed bytewax via pip and can create imports.

from bytewax.dataflow import Dataflow

The problem is that "Dataflow" is not found within the IDE. This means that autocomplete, etc. do not work.

However, I can execute and use Dataflow without any problems. It is only annoying within the IDEs.

Python version (python -V)

Python 3.10.9

Bytewax version (pip list | grep bytewax)

0.14.0

Relevant log output

Cannot find reference 'Dataflow' in 'dataflow.py'

Steps to Reproduce

Create new project in PyCharm / VSCode (under MacOS 13.1)

Create and activate venv

pip install bytewax

create new .py file with from bytewax.dataflow import Dataflow

needs triage
opened by wlwwt 2
Recovery state will be incorrect if any worker crashes during the initial epoch

Bug Description

During dataflow execution, we write out whenever a worker completes an epoch. We use these written epochs to re-derive the progress of the entire cluster. Unfortunately, though, on resume we do not know beforehand how many worker's progress messages we're waiting for. Thus if the cluster crashes after worker 1 has written progress but worker 2 has not, then upon resume, it'll look like you're resuming from just a cluster of size 1 and will silently skip data.

We need to have some way of synchronously waiting for all workers to confirm they've written a marker of their existence to the recovery store before each execution (resume or not).

Python version (python -V)

Python 3.10.6

Bytewax version (pip list | grep bytewax)

0.13.1

Relevant log output

No response

Steps to Reproduce

I don't have some example code yet, but this should theoretically be possible.
type:bug

opened by davidselassie 0
[FEATURE] Support stream/table joins

Is it possible to use bytewax for joining content of different kafka topics (similar to what ksqldb is doing) ?

doing this will be an example of integrating "persistent queries" (permanent background processes that never stops). is this a good use case for bytewax?

also comparing to ksqldb, what happens if the bytewax workers are killed? will those persistent queries restart automatically when workers come back up and will they use the latest/earliest committed offset on the kafka topics ? or is the restart manual?

cheers
question

opened by dberardo-com 1
[FEATURE] Support session window

Is your feature request related to a problem? Please describe.

https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions#session-window

Describe the solution you'd like A clear and concise description of what you want to happen.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.
type:feature

opened by judahrand 0

Releases(v0.14.0)

v0.14.0(Dec 22, 2022)
Overview

Dataflow continuation now works. If you run a dataflow over a finite input, all state will be persisted via recovery so if you re-run the same dataflow pointing at the same input, but with more data appended at the end, it will correctly continue processing from the previous end-of-stream.

Fixes issue with multi-worker recovery. Previously resume data was being routed to the wrong worker so state would be missing.

The above two changes require that the recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.

Adds an introspection web server to dataflow workers.

Adds collect_window operator.

What's Changed

Adding manylinux_2_27 wheel building to CI by @miccioest in https://github.com/bytewax/bytewax/pull/169

Adds webserver by @whoahbot in https://github.com/bytewax/bytewax/pull/175

Added EXPOSE command in Dockerfiles by @Psykopear in https://github.com/bytewax/bytewax/pull/176

Adding 3.8, 3.9, and 3.10 python versions to colab CI job by @miccioest in https://github.com/bytewax/bytewax/pull/179

Use cbfmt with pre-commit by @whoahbot in https://github.com/bytewax/bytewax/pull/180

Fix issue with resume state not being routed to correct worker by @davidselassie in https://github.com/bytewax/bytewax/pull/182

Adds collect window operator by @davidselassie in https://github.com/bytewax/bytewax/pull/183

Preps for v0.14.0 release by @davidselassie in https://github.com/bytewax/bytewax/pull/184

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.13.1...v0.14.0
Source code(tar.gz)
Source code(zip)
v0.13.1(Nov 10, 2022)
Overview

Added Google Colab support.

What's Changed

Prepare for v0.13.1 release. by @miccioest in https://github.com/bytewax/bytewax/pull/168

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.13.0...v0.13.1
Source code(tar.gz)
Source code(zip)
v0.13.0(Nov 9, 2022)
Overview

Added tracing instrumentation and configurations for tracing backends.

What's Changed

CDC-esque recovery operators by @davidselassie in https://github.com/bytewax/bytewax/pull/160

Observability by @Psykopear in https://github.com/bytewax/bytewax/pull/157

Fix dockerfile by @Psykopear in https://github.com/bytewax/bytewax/pull/163

Traits refactoring by @Psykopear in https://github.com/bytewax/bytewax/pull/162

Prepare for v0.13.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/165

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.12.0...v0.13.0
Source code(tar.gz)
Source code(zip)
v0.12.0(Oct 21, 2022)
Overview

Fixes bug where window is never closed if recovery occurs after last item but before window close.

Recovery logging is reduced.

Breaking change Recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.

Adds a DynamoDB and BigQuery output connector.

What's Changed

New execution interface by @davidselassie in https://github.com/bytewax/bytewax/pull/19

Container and Kubernetes related improvements by @miccioest in https://github.com/bytewax/bytewax/pull/24

Restructure Python imports by @whoahbot in https://github.com/bytewax/bytewax/pull/22

Restructure rust by @whoahbot in https://github.com/bytewax/bytewax/pull/25

Renames test_run to test_execution by @davidselassie in https://github.com/bytewax/bytewax/pull/27

Adds an "order" input helper and allows tumbling "event time" by @davidselassie in https://github.com/bytewax/bytewax/pull/26

Maturin develop before running tests. by @whoahbot in https://github.com/bytewax/bytewax/pull/28

Tests ability to interrupt execution by @davidselassie in https://github.com/bytewax/bytewax/pull/23

Jupyter Notebook Anomaly Detection Example by @awmatheson in https://github.com/bytewax/bytewax/pull/30

Debug operators by @blakestier in https://github.com/bytewax/bytewax/pull/29

Running tests using whl file already build by @miccioest in https://github.com/bytewax/bytewax/pull/32

More comprehensive readme by @awmatheson in https://github.com/bytewax/bytewax/pull/33

modify inputs by @awmatheson in https://github.com/bytewax/bytewax/pull/35

Remove Criterion benchmark by @whoahbot in https://github.com/bytewax/bytewax/pull/37

Add a key param to the stateful_map operator. by @whoahbot in https://github.com/bytewax/bytewax/pull/36

Runs doctests in CI by @davidselassie in https://github.com/bytewax/bytewax/pull/38

API Docs by @davidselassie in https://github.com/bytewax/bytewax/pull/39

Fixes sorted_window() to support items with identical times by @davidselassie in https://github.com/bytewax/bytewax/pull/41

Apidocs templates by @konradsienkowski in https://github.com/bytewax/bytewax/pull/40

Testable examples by @davidselassie in https://github.com/bytewax/bytewax/pull/34

Examples metadata by @konradsienkowski in https://github.com/bytewax/bytewax/pull/43

Update to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/45

Runs CI on release publish by @davidselassie in https://github.com/bytewax/bytewax/pull/46

Feast example by @blakestier in https://github.com/bytewax/bytewax/pull/42

Adds run_main() by @davidselassie in https://github.com/bytewax/bytewax/pull/48

Example of Apriori algorithm by @TheBits in https://github.com/bytewax/bytewax/pull/50

Dockerfile enhancements by @miccioest in https://github.com/bytewax/bytewax/pull/52

Integrate docs with main repository by @konradsienkowski in https://github.com/bytewax/bytewax/pull/51

Fix le feast typos by @blakestier in https://github.com/bytewax/bytewax/pull/54

Adding a Kubernetes example based on manual_cluster by @miccioest in https://github.com/bytewax/bytewax/pull/55

Update docs with capture operator by @awmatheson in https://github.com/bytewax/bytewax/pull/56

Update Notebook Example to Bytewax Version 0.8 by @awmatheson in https://github.com/bytewax/bytewax/pull/57

Re-enable documentation tests and upgrade longform docs to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/58

Frontier psychiatry by @whoahbot in https://github.com/bytewax/bytewax/pull/59

Move operator descriptions into API docs by @davidselassie in https://github.com/bytewax/bytewax/pull/60

Prepare for v0.9.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/61

Moves deployment docs into repo by @davidselassie in https://github.com/bytewax/bytewax/pull/62

State Recovery by @davidselassie in https://github.com/bytewax/bytewax/pull/53

Uploading wheels to S3 in CI and using them in CD without building again by @miccioest in https://github.com/bytewax/bytewax/pull/64

Fix broken links in longform documentation by @konradsienkowski in https://github.com/bytewax/bytewax/pull/63

Clean up error messages for inputs by @whoahbot in https://github.com/bytewax/bytewax/pull/69

Adds distribute() helper by @davidselassie in https://github.com/bytewax/bytewax/pull/71

Order Book example Using Websockets by @awmatheson in https://github.com/bytewax/bytewax/pull/67

Automatic state recovery / garbage collection by @davidselassie in https://github.com/bytewax/bytewax/pull/73

Pre commit integration by @kasun in https://github.com/bytewax/bytewax/pull/74

Fix getting started guide documentation by @kasun in https://github.com/bytewax/bytewax/pull/76

Fix inputs helper by @blakestier in https://github.com/bytewax/bytewax/pull/77

cargo fmt hook by @davidselassie in https://github.com/bytewax/bytewax/pull/79

Uses new JoinHandle::is_finished by @davidselassie in https://github.com/bytewax/bytewax/pull/78

Kafka recovery store by @davidselassie in https://github.com/bytewax/bytewax/pull/81

Require state keys to be strings by @davidselassie in https://github.com/bytewax/bytewax/pull/82

Move recovery_wordcount.py into examples by @davidselassie in https://github.com/bytewax/bytewax/pull/83

A few touch ups to docs by @blakestier in https://github.com/bytewax/bytewax/pull/85

Fix API Docs crawling by @konradsienkowski in https://github.com/bytewax/bytewax/pull/88

Kafka Consumer by @blakestier in https://github.com/bytewax/bytewax/pull/84

Multi-worker recovery by @davidselassie in https://github.com/bytewax/bytewax/pull/89

Prep for 0.10.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/91

Updates release steps by @davidselassie in https://github.com/bytewax/bytewax/pull/92

Fix argument formatting in API docs by @davidselassie in https://github.com/bytewax/bytewax/pull/93

Windowing by @davidselassie in https://github.com/bytewax/bytewax/pull/96

Input source operators by @davidselassie in https://github.com/bytewax/bytewax/pull/98

Cooperative manual input by @davidselassie in https://github.com/bytewax/bytewax/pull/99

Fold window operator by @Psykopear in https://github.com/bytewax/bytewax/pull/95

Added cargo tests, doctests, small fixes by @Psykopear in https://github.com/bytewax/bytewax/pull/108

Update examples by @blakestier in https://github.com/bytewax/bytewax/pull/105

Solved some clippy warnings, enabled doctests by @Psykopear in https://github.com/bytewax/bytewax/pull/109

Stateful input sources by @davidselassie in https://github.com/bytewax/bytewax/pull/103

Prepare for 0.11.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/110

[Fix]: Remove 'Understanding epochs' article from docs metadata by @konradsienkowski in https://github.com/bytewax/bytewax/pull/111

Update and fix documentation by @Psykopear in https://github.com/bytewax/bytewax/pull/112

Break up large modules into smaller files. by @whoahbot in https://github.com/bytewax/bytewax/pull/113

Upgrade dependencies by @Psykopear in https://github.com/bytewax/bytewax/pull/114

Update KafkaInput to prevent listening to non-existent topics by @awmatheson in https://github.com/bytewax/bytewax/pull/117

Add additional Kafka configs by @awmatheson in https://github.com/bytewax/bytewax/pull/119

Kafka Output by @blakestier in https://github.com/bytewax/bytewax/pull/118

Prepare for v0.11.1 release by @whoahbot in https://github.com/bytewax/bytewax/pull/125

Adding a CD step to upgrade pip by @miccioest in https://github.com/bytewax/bytewax/pull/126

Add CONTRIBUTING.md by @awmatheson in https://github.com/bytewax/bytewax/pull/123

downcast instead of extract by @whoahbot in https://github.com/bytewax/bytewax/pull/128

Only snapshot for recovery once per state key by @davidselassie in https://github.com/bytewax/bytewax/pull/129

Add readme about profiling by @davidselassie in https://github.com/bytewax/bytewax/pull/131

Fix key usage in docstring by @awmatheson in https://github.com/bytewax/bytewax/pull/133

Re-enable ssl and gssapi by @whoahbot in https://github.com/bytewax/bytewax/pull/134

Prepare for v0.11.2 by @whoahbot in https://github.com/bytewax/bytewax/pull/136

Allow manually adjusting TestingClockConfig by @davidselassie in https://github.com/bytewax/bytewax/pull/139

Eagerly build logic and assert built by @davidselassie in https://github.com/bytewax/bytewax/pull/141

Fixed examples for dataflow reduce and fold window. by @PiePra in https://github.com/bytewax/bytewax/pull/142

Fixes recovery serde by @davidselassie in https://github.com/bytewax/bytewax/pull/143

Better recovery tests by @davidselassie in https://github.com/bytewax/bytewax/pull/144

Only use UTC datetimes by @Psykopear in https://github.com/bytewax/bytewax/pull/115

Introduce fate by @davidselassie in https://github.com/bytewax/bytewax/pull/137

Event time processing by @Psykopear in https://github.com/bytewax/bytewax/pull/127

Persist awake times by @davidselassie in https://github.com/bytewax/bytewax/pull/147

Event time in tests by @Psykopear in https://github.com/bytewax/bytewax/pull/146

Updated to PyO3 0.17.2 by @Psykopear in https://github.com/bytewax/bytewax/pull/148

Docs v0.11.2 - update link schema by @konradsienkowski in https://github.com/bytewax/bytewax/pull/149

Update Readme and Add Examples by @awmatheson in https://github.com/bytewax/bytewax/pull/135

Adding a docs article of waxctl aws command by @miccioest in https://github.com/bytewax/bytewax/pull/151

Add a DynamoDB output adapter. by @whoahbot in https://github.com/bytewax/bytewax/pull/152

Adding an article about gcp command by @miccioest in https://github.com/bytewax/bytewax/pull/153

Add a Bigquery output adapter by @blakestier in https://github.com/bytewax/bytewax/pull/154

Cleanup examples by @blakestier in https://github.com/bytewax/bytewax/pull/155

Prepare for v0.12.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/156

New Contributors

@konradsienkowski made their first contribution in https://github.com/bytewax/bytewax/pull/40

@TheBits made their first contribution in https://github.com/bytewax/bytewax/pull/50

@kasun made their first contribution in https://github.com/bytewax/bytewax/pull/74

@Psykopear made their first contribution in https://github.com/bytewax/bytewax/pull/95

@PiePra made their first contribution in https://github.com/bytewax/bytewax/pull/142

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.7.1...v0.12.0
Source code(tar.gz)
Source code(zip)
v0.11.2(Sep 20, 2022)
Performance improvements. :sparkles:

Support SASL and SSL for bytewax.inputs.KafkaInputConfig.

What's Changed

Add CONTRIBUTING.md by @awmatheson in https://github.com/bytewax/bytewax/pull/123

downcast instead of extract by @whoahbot in https://github.com/bytewax/bytewax/pull/128

Only snapshot for recovery once per state key by @davidselassie in https://github.com/bytewax/bytewax/pull/129

Add readme about profiling by @davidselassie in https://github.com/bytewax/bytewax/pull/131

Fix key usage in docstring by @awmatheson in https://github.com/bytewax/bytewax/pull/133

Re-enable ssl and gssapi by @whoahbot in https://github.com/bytewax/bytewax/pull/134

Prepare for v0.11.2 by @whoahbot in https://github.com/bytewax/bytewax/pull/136

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.11.1...0.11.2
Source code(tar.gz)
Source code(zip)
v0.11.1(Sep 8, 2022)
KafkaInputConfig now accepts additional properties. See bytewax.inputs.KafkaInputConfig.

Support for a pre-built Kafka output component. See bytewax.outputs.KafkaOutputConfig.

What's Changed

Break up large modules into smaller files. by @whoahbot in https://github.com/bytewax/bytewax/pull/113

Upgrade dependencies by @Psykopear in https://github.com/bytewax/bytewax/pull/114

Update KafkaInput to prevent listening to non-existent topics by @awmatheson in https://github.com/bytewax/bytewax/pull/117

Add additional Kafka configs by @awmatheson in https://github.com/bytewax/bytewax/pull/119

Kafka Output by @blakestier in https://github.com/bytewax/bytewax/pull/118

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.11.0...v0.11.1
Source code(tar.gz)
Source code(zip)
v0.11.0(Aug 26, 2022)
Added the fold_window operator, works like reduce_window but allows the user to build the initial accumulator for each key in a builder function.

Output is no longer specified using an output_builder for the entire dataflow, but you supply an "output config" per capture. See bytewax.outputs for more info.

Input is no longer specified on the execution entry point (like run_main), it is instead using the Dataflow.input operator.

Epochs are no longer user-facing as part of the input system. Any custom Python-based input components you write just need to be iterators and emit items. Recovery snapshots and backups now happen periodically, defaulting to every 10 seconds.

Recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.

The reduce_epoch operator has been replaced with reduce_window. It takes a "clock" and a "windower" to define the kind of aggregation you want to do.

run and run_cluster have been removed and the remaining execution entry points moved into bytewax.execution. You can now get similar prototyping functionality with bytewax.execution.run_main and bytewax.execution.spawn_cluster using Testing{Input,Output}Configs.

Dataflow has been moved into bytewax.dataflow.Dataflow.

What's Changed

Windowing by @davidselassie in https://github.com/bytewax/bytewax/pull/96

Input source operators by @davidselassie in https://github.com/bytewax/bytewax/pull/98

Cooperative manual input by @davidselassie in https://github.com/bytewax/bytewax/pull/99

Fold window operator by @Psykopear in https://github.com/bytewax/bytewax/pull/95

Added cargo tests, doctests, small fixes by @Psykopear in https://github.com/bytewax/bytewax/pull/108

Update examples by @blakestier in https://github.com/bytewax/bytewax/pull/105

Solved some clippy warnings, enabled doctests by @Psykopear in https://github.com/bytewax/bytewax/pull/109

Stateful input sources by @davidselassie in https://github.com/bytewax/bytewax/pull/103

Prepare for 0.11.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/110

New Contributors

@Psykopear made their first contribution in https://github.com/bytewax/bytewax/pull/95

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.10.0...v0.11.0
Source code(tar.gz)
Source code(zip)
v0.10.1(Aug 17, 2022)
Overview

Bugfix: Resolves pickling error. KafkaInputConfig now works with spawn_cluster.

What's Changed

Pickling logic for auto_commit in KafkaInputConfig by @blakestier in #102

Source code(tar.gz)
Source code(zip)
v0.10.0(Jul 13, 2022)
Overview

Input is no longer specified using an input_builder, but now an input_config which allows you to use pre-built input components. See bytewax.inputs for more info.

Preliminary support for a pre-built Kafka input component. See bytewax.inputs.KafkaInputConfig.

Keys used in the (key, value) 2-tuples to route data for stateful operators (like stateful_map and reduce_epoch) must now be strings. Because of this bytewax.exhash is no longer necessary and has been removed.

Recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.

Slight changes to bytewax.recovery.RecoveryConfig config options due to recovery system changes.

bytewax.run() and bytewax.run_cluster() no longer take recovery_config as they don't support recovery.

What's Changed

New execution interface by @davidselassie in https://github.com/bytewax/bytewax/pull/19

Container and Kubernetes related improvements by @miccioest in https://github.com/bytewax/bytewax/pull/24

Restructure Python imports by @whoahbot in https://github.com/bytewax/bytewax/pull/22

Restructure rust by @whoahbot in https://github.com/bytewax/bytewax/pull/25

Renames test_run to test_execution by @davidselassie in https://github.com/bytewax/bytewax/pull/27

Adds an "order" input helper and allows tumbling "event time" by @davidselassie in https://github.com/bytewax/bytewax/pull/26

Maturin develop before running tests. by @whoahbot in https://github.com/bytewax/bytewax/pull/28

Tests ability to interrupt execution by @davidselassie in https://github.com/bytewax/bytewax/pull/23

Jupyter Notebook Anomaly Detection Example by @awmatheson in https://github.com/bytewax/bytewax/pull/30

Debug operators by @blakestier in https://github.com/bytewax/bytewax/pull/29

Running tests using whl file already build by @miccioest in https://github.com/bytewax/bytewax/pull/32

More comprehensive readme by @awmatheson in https://github.com/bytewax/bytewax/pull/33

modify inputs by @awmatheson in https://github.com/bytewax/bytewax/pull/35

Remove Criterion benchmark by @whoahbot in https://github.com/bytewax/bytewax/pull/37

Add a key param to the stateful_map operator. by @whoahbot in https://github.com/bytewax/bytewax/pull/36

Runs doctests in CI by @davidselassie in https://github.com/bytewax/bytewax/pull/38

API Docs by @davidselassie in https://github.com/bytewax/bytewax/pull/39

Fixes sorted_window() to support items with identical times by @davidselassie in https://github.com/bytewax/bytewax/pull/41

Apidocs templates by @konradsienkowski in https://github.com/bytewax/bytewax/pull/40

Testable examples by @davidselassie in https://github.com/bytewax/bytewax/pull/34

Examples metadata by @konradsienkowski in https://github.com/bytewax/bytewax/pull/43

Update to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/45

Runs CI on release publish by @davidselassie in https://github.com/bytewax/bytewax/pull/46

Feast example by @blakestier in https://github.com/bytewax/bytewax/pull/42

Adds run_main() by @davidselassie in https://github.com/bytewax/bytewax/pull/48

Example of Apriori algorithm by @TheBits in https://github.com/bytewax/bytewax/pull/50

Dockerfile enhancements by @miccioest in https://github.com/bytewax/bytewax/pull/52

Integrate docs with main repository by @konradsienkowski in https://github.com/bytewax/bytewax/pull/51

Fix le feast typos by @blakestier in https://github.com/bytewax/bytewax/pull/54

Adding a Kubernetes example based on manual_cluster by @miccioest in https://github.com/bytewax/bytewax/pull/55

Update docs with capture operator by @awmatheson in https://github.com/bytewax/bytewax/pull/56

Update Notebook Example to Bytewax Version 0.8 by @awmatheson in https://github.com/bytewax/bytewax/pull/57

Re-enable documentation tests and upgrade longform docs to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/58

Frontier psychiatry by @whoahbot in https://github.com/bytewax/bytewax/pull/59

Move operator descriptions into API docs by @davidselassie in https://github.com/bytewax/bytewax/pull/60

Prepare for v0.9.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/61

Moves deployment docs into repo by @davidselassie in https://github.com/bytewax/bytewax/pull/62

State Recovery by @davidselassie in https://github.com/bytewax/bytewax/pull/53

Uploading wheels to S3 in CI and using them in CD without building again by @miccioest in https://github.com/bytewax/bytewax/pull/64

Fix broken links in longform documentation by @konradsienkowski in https://github.com/bytewax/bytewax/pull/63

Clean up error messages for inputs by @whoahbot in https://github.com/bytewax/bytewax/pull/69

Adds distribute() helper by @davidselassie in https://github.com/bytewax/bytewax/pull/71

Order Book example Using Websockets by @awmatheson in https://github.com/bytewax/bytewax/pull/67

Automatic state recovery / garbage collection by @davidselassie in https://github.com/bytewax/bytewax/pull/73

Pre commit integration by @kasun in https://github.com/bytewax/bytewax/pull/74

Fix getting started guide documentation by @kasun in https://github.com/bytewax/bytewax/pull/76

Fix inputs helper by @blakestier in https://github.com/bytewax/bytewax/pull/77

cargo fmt hook by @davidselassie in https://github.com/bytewax/bytewax/pull/79

Uses new JoinHandle::is_finished by @davidselassie in https://github.com/bytewax/bytewax/pull/78

Kafka recovery store by @davidselassie in https://github.com/bytewax/bytewax/pull/81

Require state keys to be strings by @davidselassie in https://github.com/bytewax/bytewax/pull/82

Move recovery_wordcount.py into examples by @davidselassie in https://github.com/bytewax/bytewax/pull/83

A few touch ups to docs by @blakestier in https://github.com/bytewax/bytewax/pull/85

Fix API Docs crawling by @konradsienkowski in https://github.com/bytewax/bytewax/pull/88

Kafka Consumer by @blakestier in https://github.com/bytewax/bytewax/pull/84

Multi-worker recovery by @davidselassie in https://github.com/bytewax/bytewax/pull/89

Prep for 0.10.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/91

New Contributors

@blakestier made their first contribution in https://github.com/bytewax/bytewax/pull/29

@konradsienkowski made their first contribution in https://github.com/bytewax/bytewax/pull/40

@TheBits made their first contribution in https://github.com/bytewax/bytewax/pull/50

@kasun made their first contribution in https://github.com/bytewax/bytewax/pull/74

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.7.1...v0.10.0
Source code(tar.gz)
Source code(zip)
v0.9.0(Apr 22, 2022)
Overview

Adds bytewax.AdvanceTo and bytewax.Emit to control when processing happens.

Adds bytewax.run_main() as a way to test input and output builders without starting a cluster.

Adds a bytewax.testing module with helpers for testing.

bytewax.run_cluster() and bytewax.spawn_cluster() now take a mp_ctx argument to allow you to change the multiprocessing behavior. E.g. from "fork" to "spawn". Defaults now to "spawn".

What's Changed

New execution interface by @davidselassie in https://github.com/bytewax/bytewax/pull/19

Container and Kubernetes related improvements by @miccioest in https://github.com/bytewax/bytewax/pull/24

Restructure Python imports by @whoahbot in https://github.com/bytewax/bytewax/pull/22

Restructure rust by @whoahbot in https://github.com/bytewax/bytewax/pull/25

Renames test_run to test_execution by @davidselassie in https://github.com/bytewax/bytewax/pull/27

Adds an "order" input helper and allows tumbling "event time" by @davidselassie in https://github.com/bytewax/bytewax/pull/26

Maturin develop before running tests. by @whoahbot in https://github.com/bytewax/bytewax/pull/28

Tests ability to interrupt execution by @davidselassie in https://github.com/bytewax/bytewax/pull/23

Jupyter Notebook Anomaly Detection Example by @awmatheson in https://github.com/bytewax/bytewax/pull/30

Debug operators by @blakestier in https://github.com/bytewax/bytewax/pull/29

Running tests using whl file already build by @miccioest in https://github.com/bytewax/bytewax/pull/32

More comprehensive readme by @awmatheson in https://github.com/bytewax/bytewax/pull/33

modify inputs by @awmatheson in https://github.com/bytewax/bytewax/pull/35

Remove Criterion benchmark by @whoahbot in https://github.com/bytewax/bytewax/pull/37

Add a key param to the stateful_map operator. by @whoahbot in https://github.com/bytewax/bytewax/pull/36

Runs doctests in CI by @davidselassie in https://github.com/bytewax/bytewax/pull/38

API Docs by @davidselassie in https://github.com/bytewax/bytewax/pull/39

Fixes sorted_window() to support items with identical times by @davidselassie in https://github.com/bytewax/bytewax/pull/41

Apidocs templates by @konradsienkowski in https://github.com/bytewax/bytewax/pull/40

Testable examples by @davidselassie in https://github.com/bytewax/bytewax/pull/34

Examples metadata by @konradsienkowski in https://github.com/bytewax/bytewax/pull/43

Update to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/45

Runs CI on release publish by @davidselassie in https://github.com/bytewax/bytewax/pull/46

Feast example by @blakestier in https://github.com/bytewax/bytewax/pull/42

Adds run_main() by @davidselassie in https://github.com/bytewax/bytewax/pull/48

Example of Apriori algorithm by @TheBits in https://github.com/bytewax/bytewax/pull/50

Dockerfile enhancements by @miccioest in https://github.com/bytewax/bytewax/pull/52

Integrate docs with main repository by @konradsienkowski in https://github.com/bytewax/bytewax/pull/51

Fix le feast typos by @blakestier in https://github.com/bytewax/bytewax/pull/54

Adding a Kubernetes example based on manual_cluster by @miccioest in https://github.com/bytewax/bytewax/pull/55

Update docs with capture operator by @awmatheson in https://github.com/bytewax/bytewax/pull/56

Update Notebook Example to Bytewax Version 0.8 by @awmatheson in https://github.com/bytewax/bytewax/pull/57

Re-enable documentation tests and upgrade longform docs to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/58

Frontier psychiatry by @whoahbot in https://github.com/bytewax/bytewax/pull/59

Move operator descriptions into API docs by @davidselassie in https://github.com/bytewax/bytewax/pull/60

Prepare for v0.9.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/61

New Contributors

@blakestier made their first contribution in https://github.com/bytewax/bytewax/pull/29

@konradsienkowski made their first contribution in https://github.com/bytewax/bytewax/pull/40

@TheBits made their first contribution in https://github.com/bytewax/bytewax/pull/50

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.7.1...v0.9.0
Source code(tar.gz)
Source code(zip)
v0.8.0(Mar 29, 2022)
Overview

Capture operator no longer takes arguments. Items that flow through those points in the dataflow graph will be processed by the output handlers setup by each execution entry point. Every dataflow requires at least one capture.

Executor.build_and_run() is replaced with four entry points for specific use cases:

run() for exeuction in the current process. It returns all captured items to the calling process for you. Use this for prototyping in notebooks and basic tests.

run_cluster() for execution on a temporary machine-local cluster that Bytewax coordinates for you. It returns all captured items to the calling process for you. Use this for notebook analysis where you need parallelism.

spawn_cluster() for starting a machine-local cluster with more control over input and output. Use this for standalone scripts where you might need partitioned input and output.

cluster_main() for starting a process that will participate in a cluster you are coordinating manually. Use this when starting a Kubernetes cluster.

Adds bytewax.parse module to help with reading command line arguments and environment variables for the above entrypoints.

Renames bytewax.inp to bytewax.inputs.

What's Changed

New execution interface by @davidselassie in https://github.com/bytewax/bytewax/pull/19

Container and Kubernetes related improvements by @miccioest in https://github.com/bytewax/bytewax/pull/24

Restructure Python imports by @whoahbot in https://github.com/bytewax/bytewax/pull/22

Restructure rust by @whoahbot in https://github.com/bytewax/bytewax/pull/25

Renames test_run to test_execution by @davidselassie in https://github.com/bytewax/bytewax/pull/27

Adds an "order" input helper and allows tumbling "event time" by @davidselassie in https://github.com/bytewax/bytewax/pull/26

Maturin develop before running tests. by @whoahbot in https://github.com/bytewax/bytewax/pull/28

Tests ability to interrupt execution by @davidselassie in https://github.com/bytewax/bytewax/pull/23

Jupyter Notebook Anomaly Detection Example by @awmatheson in https://github.com/bytewax/bytewax/pull/30

Debug operators by @blakestier in https://github.com/bytewax/bytewax/pull/29

Running tests using whl file already build by @miccioest in https://github.com/bytewax/bytewax/pull/32

More comprehensive readme by @awmatheson in https://github.com/bytewax/bytewax/pull/33

modify inputs by @awmatheson in https://github.com/bytewax/bytewax/pull/35

Remove Criterion benchmark by @whoahbot in https://github.com/bytewax/bytewax/pull/37

Add a key param to the stateful_map operator. by @whoahbot in https://github.com/bytewax/bytewax/pull/36

Runs doctests in CI by @davidselassie in https://github.com/bytewax/bytewax/pull/38

API Docs by @davidselassie in https://github.com/bytewax/bytewax/pull/39

Fixes sorted_window() to support items with identical times by @davidselassie in https://github.com/bytewax/bytewax/pull/41

Apidocs templates by @konradsienkowski in https://github.com/bytewax/bytewax/pull/40

Testable examples by @davidselassie in https://github.com/bytewax/bytewax/pull/34

Examples metadata by @konradsienkowski in https://github.com/bytewax/bytewax/pull/43

Update to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/45

Runs CI on release publish by @davidselassie in https://github.com/bytewax/bytewax/pull/46

New Contributors

@blakestier made their first contribution in https://github.com/bytewax/bytewax/pull/29

@konradsienkowski made their first contribution in https://github.com/bytewax/bytewax/pull/40

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.7.1...v0.8.0
Source code(tar.gz)
Source code(zip)
v0.8.0-beta.2(Feb 24, 2022)
Beta release

What's Changed

New execution interface by @davidselassie in https://github.com/bytewax/bytewax/pull/19

Container and Kubernetes related improvements by @miccioest in https://github.com/bytewax/bytewax/pull/24

Updated execution interface

run() now takes a dataflow and some input, runs it synchronously as a single worker in the existing Python thread, and returns the output to that thread. This is what you'd use in tests and simple notebook work.

run_cluster() takes a dataflow and some input, starts a local cluster of processes, runs it, waits for the cluster to finish work, then collects thre results, and returns the output to that thread. This is what you'd use in a notebook if you need parallelism or higher throughput.

cluster_main() starts up a cluster of local processes, coordinates the addresses and process IDs between them, runs a dataflow on it, and waits for it to finish. This has a partitioned "input builder" and an "output builder" (discussed below). This is what you'd use if you'd want to write a standalone script or example that does some higher throughput processing.

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.7.1...v0.8.0-beta.0
Source code(tar.gz)
Source code(zip)
v0.7.1(Feb 17, 2022)
v0.7.1

Updates to build_and_run() to support running in notebook environments.

What's Changed

Parse arguments in build_and_run. by @whoahbot in https://github.com/bytewax/bytewax/pull/10

Adds a capture operator to capture output by @davidselassie in https://github.com/bytewax/bytewax/pull/9

Introducing: exhash by @davidselassie in https://github.com/bytewax/bytewax/pull/14

Exceptions and interrupts in multiple worker threads by @davidselassie in https://github.com/bytewax/bytewax/pull/16

Fix typing for inp.py by @mttcnnff in https://github.com/bytewax/bytewax/pull/17

Add in tests for inputs by @whoahbot in https://github.com/bytewax/bytewax/pull/18

New Contributors

@mttcnnff made their first contribution in https://github.com/bytewax/bytewax/pull/17

Full Changelog: https://github.com/bytewax/bytewax/compare/v0.7.0...v0.7.1
Source code(tar.gz)
Source code(zip)