Bytewax is an open source Python framework for building highly scalable dataflows.

Overview

Bytewax

Actions Status PyPI Bytewax User Guide

Bytewax is an open source Python framework for building highly scalable dataflows.

Bytewax uses PyO3 to provide Python bindings to the Timely Dataflow Rust library.

Usage

Install the latest release with pip:

pip install bytewax

Example

Here is an example of a simple dataflow program using Bytewax:

from bytewax import Executor

ec = Executor()
flow = ec.Dataflow(enumerate(range(10)))
flow.map(lambda x: x * x)
flow.inspect(print)


if __name__ == "__main__":
    ec.build_and_run()

Running the program:

python ./pyexamples/wordcount.py
0
1
4
9
16
25
36
49
81
64

For a more complete example, and documentation on the available operators, check out the User Guide.

License

Bytewax is licensed under the Apache-2.0 license.

Comments
  • Add additional Kafka configs

    Add additional Kafka configs

    Updates the KafkaInputConfig to accept the congis possible in librdkafka as listed here - https://github.com/edenhill/librdkafka/blob/9b72ca3aa6c49f8f57eea02f70aadb1453d3ba1f/CONFIGURATION.md

    Tested it locally for now by modifying the configuration.

    Fails if the wrong configuration is passed in:

    thread 'timely:work-0' panicked at 'Error building input Kafka consumer: KafkaError (Client config error: No such configuration property: "fail" fail fail)', src/inputs/kafka_input.rs:216:14
    

    You can test it by adding an authentication mechanism. For example, configure ACL on your kafka or redpanda cluster and then pass the following code in the KafkaConfigInput:

    additional_configs = {
            "sasl.username":"admin",
            "sasl.password":"pass",
            "sasl.mechanisms":"SCRAM-SHA-256",
        }
    
    flow = Dataflow()
    flow.input(
        "inp",
        KafkaInputConfig(
            brokers=["localhost:9092"],
            topic="test",
            starting_offset="beginning",
            tail=False,
            additional_properties=additional_configs,
        ),
    )
    

    Closes out #107 and #106

    opened by awmatheson 5
  • New execution interface

    New execution interface

    As you can tell by the branch name, this started out small and then snowballed! Let me know how you think these examples feel and we can discuss if this is the right course for our API. I'm open to new ideas, but I also think this helps guide users along to better use cases.

    Backstory

    Previously, on Bytewax:

    Executor.build_and_run() wasn't great because a single function is used for starting up a dataflow you want to execute locally and manually assembling a cluster of processes.

    The difference between "singleton" vs "partitioned" input iterators and making the input part of the Dataflow definition wasn't great because it ties the behavior of the dataflow to the behavior of the input.

    The capture operator wasn't great because it didn't give you the control you needed to return data from multiple workers or background processes.

    I think all these things are symptoms of trying to handle all use cases from all points in the stack, rather than building primitives and abstraction layers.

    Changes

    See the docstrings and tests for examples!

    Execution

    Let's make explicit that there are different contexts you might run a dataflow. Removes Executor.build_and_run and adds four different entry points in a "stack" of complexity:

    • run_sync() which takes a dataflow and some input, runs it synchronously as a single worker in the existing Python thread, and returns the output to that thread. This is what you'd use in tests and simple notebook work.

    • run_cluster() which takes a dataflow and some input, starts a local cluster of processes, runs it, waits for the cluster to finish work, then collects thre results, and returns the output to that thread. This is what you'd use in a notebook if you need parallelism or higher throughput.

    • main_cluster() which starts up a cluster of local processes, coordinates the addresses and process IDs between them, runs a dataflow on it, and waits for it to finish. This has a partitioned "input builder" and an "output builder" (discussed below). This is what you'd use if you'd want to write a standalone script or example that does some higher throughput processing.

    • main_proc() which lets you manually craft a single process for use in a cluster. This is what you'd use when crafting your k8s cluster.

    main_cluster() is built upon proc_main() but adds in some process pool managment.

    run_cluster() is built upon main_cluster() but adds in the IPC necessary to get your data back to the calling process super conveniently.

    Input

    Since you can express "singleton" input as a "partitioned" input, the latter is the more fundamental concept, so that's what the lowest level main_proc() and main_cluster() functions take: an input builder function which will be called once on each worker that returns the input that worker should work on.

    The higher level execution contexts that you'd want to use from a notebook or a single thread (run_sync() and run_cluster()) then handle paritioning for you and provide a nice interface where you can send in a whole Python list and not need to think more about it.

    Output

    Make yours, like mine.

    Since the above scheme for input feels good, let's try copying the approach for output: At the lowest level there's a "partitioned output builder" which returns a callback that each worker thread can use to write the output it sees.

    The higher level execution contexts can then make an output builder function that collects data to send back to the main process for you.

    This change means that the capture operator doesn't need to take any functions; it just marks what parts of the dataflow are output. I think this will change slightly with the introduction of branching dataflows (something like marking different captures with a tag maybe?).

    Arg Parsers

    Since the above execution entry points all are Python functions, add some convenience methods which parse arguments from command line or env vars. This will make it easier to craft your "main" function in a cluster or standalone script context.

    Updates all the examples to use these parsers so we can go back to using -w2 and -n2 on the command line. Although some of the examples show off using main_proc() and so would need different command line arguments.

    opened by davidselassie 5
  • Python code formatting

    Python code formatting

    Is your feature request related to a problem? Please describe. I see some code format inconsistencies in Python source files. Is the team favourable towards enforcing a styling guide for Python code so future contributions from multiple contributors can follow a single style guide easily?

    Describe the solution you'd like

    • python-black I see that we are using python-black.
    Skipping .ipynb files as Jupyter dependencies are not installed.
    You can fix this by running ``pip install black[jupyter]``
    All done! ✨ 🍰 ✨
    37 files left unchanged.
    

    Shall we introduce pre-commit [1] tool to the project so we can run python-black at git commits?

    • python-isort I see some minor import statement inconsistencies and these can be easily fixed using python-isort [2] isort can automatically sort different import types into groups ( standard lib, third party, byteswax local imports ) and also sort imports alphabetically so it's easier when reading import statements.

    • pep8

    pytests/test_inputs.py:5:1: F401 'bytewax.Dataflow' imported but unused
    pytests/test_inputs.py:5:1: F401 'bytewax.Emit' imported but unused
    pytests/test_inputs.py:5:1: F401 'bytewax.run' imported but unused
    pytests/test_operators.py:1:1: F401 'os' imported but unused
    pytests/test_operators.py:7:1: F401 'pytest.mark' imported but unused
    pytests/test_execution.py:3:1: F401 'threading' imported but unused
    pytests/test_execution.py:62:89: E501 line too long (128 > 88 characters)
    pytests/test_execution.py:82:89: E501 line too long (138 > 88 characters)
    pytests/test_execution.py:118:89: E501 line too long (138 > 88 characters)
    pytests/test_execution.py:159:89: E501 line too long (138 > 88 characters)
    

    We can use a tool like flake8 to make sure our code complies with pep8.

    [1] https://pre-commit.com/ [2] https://github.com/PyCQA/isort

    Describe alternatives you've considered

    Additional context If the team agree to this we can further discuss the rules we need to follow and I would be happy to submit a PR for this.

    type:feature 
    opened by kasun 4
  • Broken links in simple-example page

    Broken links in simple-example page

    Bug Description

    https://docs.bytewax.io/getting-started/simple-example

    In the above page docs to operators are broken. Below are some of the links. They seem to be referring to "operators/operators" page and I guess they should be referring to "apidocs" page instead.

    https://docs.bytewax.io/operators/operators/#map (correct link - https://docs.bytewax.io/apidocs#bytewax.Dataflow.map) https://docs.bytewax.io/operators/operators/#flat-map https://docs.bytewax.io/operators/operators#reduce-epoch

    Python version (python -V)

    3.10.0

    Bytewax version (pip list | grep bytewax)

    latest

    Relevant log output

    No response

    Steps to Reproduce

    https://docs.bytewax.io/getting-started/simple-example has links to docs that are broken.

    opened by tirkarthi 4
  • Renaming of stateful operators

    Renaming of stateful operators

    Here's a prototype of a set of slightly different, and hopefully more uniform, stateful operators. I took a long look at how we used exchange, accumulate, state_machine, and aggregate in our existing examples and also what kind of interfaces the build in Python types give us.

    I'm excited that things fell into a shape that has some nice parallel construction and I think that means these will hopefully be easier to understand.

    exchange is banished. My current thinking is that it's a "scheduling implementation detail" that shouldn't be user visible. I don't like that it was possible to write a dataflow that worked with a single worker, then broke with multiple workers. Every time we previously used exchange, we really wanted a "group by" operation, so these changes make this first class. (I understand the concept of exchange needs to be there, but I think it's something that should only be exposed when you're writing Timely operators and actively coming up with a way to orchestrate state.)

    If we want to support both mutable and immutable types as state aggregators, we have to have our function interface return updated state and not modify in place. I don't think we should force users to do things like

    class Counter:
        def __init__(self):
            self.c = 0
    

    just to count things, so keep immutables working. Thankfully, Python has the operator package which lets you have access to + as a function and all the operators return updated values. This makes it possible to write in the functional style necessary without much boilerplate. But you can still write out lambdas or helper functions too.

    Many of our stateful examples are of the form "collect some items and stop at some point". Since this is a dataflow, the stream could be endless, so we need a way to signal that the aggregator is complete. Noting that the epoch is complete is super common "is complete" case and we can include that for convenience, but the general case exists too. Since they're similarly shaped problems, make sure the interface looks similar.

    So what happened?

    I want to put forth key_fold, key_fold_epoch, key_fold_epoch_local, and key_transition as the state operators. I wrote out usage docstrings and updated examples, but tl;dr:

    • They all start with key because they require (key, value) as input and group by key and exchange automatically. They output (key, aggregator) too.

    • The ones with fold do a "fold": build an initial aggregator, then incrementaly add new values as you see them, then return the entire aggregator.

    • The ones with epoch automatically stop folding at the end of each epoch since that's such a common use case. Otherwise you provide a function to designate "complete".

    • local means only aggregate within a worker. We haven't found a use case for this yet that isn't a performance optimisation (and one that maybe could be done automatically?).

    • transition isn't a "fold" in that it's always emitting events (not just when the aggregation is complete).

    • Input comes in undefined order within an epoch, but in epoch order if multiple epochs.

    If you feel like you know the shape of the old versions, then: key_fold_epoch ~= aggregate, key_transition ~= state_machine, key_fold is state_machine but shaped like aggregate, and key_fold_epoch_local is accumulate but shaped like aggregate.

    I've updated the examples too. I wanted to try out seeing how concise you could get using the built in types to see how the library might feel in that dimension. But I understand that some of these examples might be a bit too terse. We don't have to necessarily go that way in the docs, but it's cool to see that it can work and you can count like:

    flow.key_fold_epoch(lambda: 0, operator.add)
    

    Let me know what you think!

    opened by davidselassie 4
  • Introduce fate

    Introduce fate

    Previously I made the questionable API design of having StatefulLogic::snapshot return a StateUpdate. This meant that the snapshotting process affected the behavior of the stateful operator, since it could return Reset to signal up to StatefulUnary to discard the logic.

    This unwinds that tangle by introducing a new method with a perhaps overly poetic name StatefulLogic::fate which should return what StatefulUnary should do with the logic when it's done processing via a LogicFate enum. There are three options:

    1. Retain it until a new item for this key comes in.
    2. Retain it and awaken it after a timeout.
    3. Discard it.

    fate is attempting to encapsulate the problem of StatefulUnary is the owner of the logics, so they can't drop themselves. And since awakening timeouts are part of that process, they're in there too.

    This is nice because it simplifies the return value of exec to just output. The awaken delay is handled in fate.

    It also uses this pattern within Windower::fate with WindowerFate for the same kind of problem: the WindowStatefulLogic is the owner, and we need to communicate back when it's safe to discard a Windower. This fixes the bug of never discarding window state if a key is never seen again: we now discard that state whenever all windows for a key are closed.

    A few other small changes:

    • Standardises on the language of "awaken" a logic. Still uses Timely's "activate" for a Timely operator, though. Renames StatefulLogic::exec to StatefulLogic::awake_with to make more explicit when it is called. Renames WindowLogic::exec to WindowLogic::with_next to make explicit when it's called.

    • All stateful operator tests should test recovery as part of testing the logic. We should do this to excercise the serde round-trip. I found three? bugs via this where recovery would panic because the deserialization wasn't to the correct type. As part of this, I added explicit type annotations to all StateBytes::ser and StateBytes::de calls so you can compare them.

    • Clarifies more comments in the giant chunk of code for StatefulUnary::stateful_unary. I was also able to optimize it a little and get rid of two of our temporary buffers. I think it makes the process slightly clearer.

    opened by davidselassie 3
  • TypeError: KafkaInputConfig.__new__() got an unexpected keyword argument 'sasl.username'

    TypeError: KafkaInputConfig.__new__() got an unexpected keyword argument 'sasl.username'

    Bug Description

    Trying to use the SASL config with KafkaInputConfig on Bytewax 0.11.1 that was introduced in https://github.com/bytewax/bytewax/pull/119, however recieving an input validation error:

    TypeError: KafkaInputConfig.__new__() got an unexpected keyword argument 'sasl.username'
    

    Python version (python -V)

    3.9.7

    Bytewax version (pip list | grep bytewax)

    0.11.1

    Relevant log output

    Traceback (most recent call last):
      File "/Users/shahnewazkhan/projects/dapper/data-platform/wranglers/stream_processors/auth0_logs/bytewax/sasl.py", line 24, in <module>
        KafkaInputConfig(
    TypeError: KafkaInputConfig.__new__() got an unexpected keyword argument 'sasl.username'
    
    
    
    ### Steps to Reproduce
    
    from bytewax.dataflow import Dataflow
    from bytewax.inputs import KafkaInputConfig
    from bytewax.execution import run_main
    from bytewax.outputs import StdOutputConfig
    import os
    import logging
    
    logger = logging.getLogger(__name__)
    logging.basicConfig(level=logging.INFO)
    
    additional_configs = {
            "sasl.username":os.environ['kafkaReaderUsername'],
            "sasl.password":os.environ['kafkaReaderPassword'],
            "sasl.mechanisms":"PLAIN",
            "security.protocol":"SASL_SSL"
        }
    
    flow = Dataflow()
    flow.input(
        "inp",
        KafkaInputConfig(
            brokers=[f"{os.environ['kafkaURL']}"],
            topic=f"{os.environ['topic']}",
            starting_offset="beginning",
            tail=False,
            **additional_configs,
        ),
    )
    
    def deserialize(key_bytes__payload_bytes):
        key_bytes, payload_bytes = key_bytes__payload_bytes
        logger.info(f'TYPE OF BYTES {type(key_bytes)}')
        return key_bytes__payload_bytes
    
    
    flow.map(deserialize)
    
    flow.capture(StdOutputConfig())
    
    run_main(flow)
    needs triage 
    opened by ShahNewazKhan 3
  • Only use UTC datetimes

    Only use UTC datetimes

    This PR forces datetimes passed to bytewax as config parameters to be in UTC, using the new chrono integration added in PyO3 (not merged yet).

    We'll need to wait for the PR on PyO3 to be merged so that we can open a PR on pyo3-log to make everything work. The PyO3 author said he wants to add the chrono integration in PyO3 0.17.2 which is coming soon, so this should not take too long.

    Previous description (ignore this)

    This PR makes all datetimes in Bytewax timezone aware.

    It uses a non-released version of PyO3 that adds the chrono integration, and removes the dependency from pyo3-chrono, which was doing the python->rust conversion deleting timezone info, without properly converting to UTC first.

    The approach taken in this PR is to add a chrono_tz integration on top of the one implemented in PyO3, so that any datetime in the codebase has timezone info attached, and can be correctly compared to any other datetime.

    The only edge case we have to handle, is when we receive naive datetimes from the user in python. Right now the approach is to convert the datetime to utc using the python interpreter. This means that any naive datetime coming from python will be assumed to be local (relative to where the dataflow is ran) and converted to UTC before being passed down to rust. This could be wrong in case the datetime comes from data generated somewhere with a different timezone, but we don't have enough informations to make a proper conversion anyway. Another possible approach would be to assume the datetime is UTC rather than local, and we would attach the UTC timezone info without converting first, but this would be equally wrong most of the times. The application should warn anytime we receive a naive datetime, so that users are aware of the fact and can fix the data if needed.

    For (de)serialization of datetimes, I had to implement serde's traits manually, since there is no standard way to serialize datetimes with IANA timezone info. To do this I just serialize the local naivedatetime and the timezone name separately, so that we can rebuild the same ChronoDateTime when deserializing.

    The PR is still in Draft because I'm waiting for PyO3 to release the new version, and depending on the versioning chosen (0.17.2 or 0.18) we might also need to wait for an update on pyo3-log which depends on PyO3 0.17.

    To make things easier, for the next release I propose to implement Consensus Time: https://xkcd.com/2594/

    Original description (ignore this too)

    The test at the end of the file should explain the situation I'm trying to fix here: pyo3-chrono ignores timezone info even if they are present. I think we should instead take that into account. With this conversion traits, we always convert datetimes to UTC internally. If the user sends us timezone aware datetimes, we do the conversion. If the user sends us naive datetimes, we assume they are already UTC (debatable).

    This is not the only possible approach, and might not be the best one, so I'm opening this PR to discuss what you think about this.

    My point is that I don't think we should ignore tzinfo if present, but the flow should still work if we receive naive datetimes (and maybe warn the user about the assumption).

    This PR is rough, and if we want to replace pyo3-chrono I'll need to add the conversion traits for Time and Date too, and some more tests.

    edit: I just found this PR: https://github.com/chronotope/chrono/pull/542 which does the conversion properly. Reading the discussion, it looks like they plan to integrate this into PyO3 (rather than in chrono) under a feature flag, so maybe we'll have this for free in the future

    opened by Psykopear 3
  • Adds an

    Adds an "order" input helper and allows tumbling "event time"

    Order

    Timely, and thus Bytewax, requires that the input timestamp / epoch never "goes backwards". This is because Timely is able to know that it can kick off final computation once a timestamp has closed.

    Our current "input is an iterator of (epoch, item)" API doesn't have the flexibility to allow multiple "open" epochs, although Timely supports this.

    To help bridge this gap, a common use case is: allow late data in a "buffering window" and drop anything after that. Here's an ordering() function which implements that and you can wrap your source input iterator in it.

    Event Time

    Allows tumbling_epoch to take a time_getter argument which will produce the time for each item from the data within, instead of always just looking at the wall clock time.

    This will let you bucket into epoch of a delayed data stream. Also plays well with order.

    Other

    Also renames inp to inputs since we're renaming everything now.

    opened by davidselassie 3
  • Traits refactoring

    Traits refactoring

    This is a refactoring to make the pattern we use to build concrete objects from python config objects more explicit. I also moved the clocks stuff in a submodule, but nothing should really change.

    The first trait introduced is ParentClass. This trait should be implemented by any class we expose in Python that is a parent class of config objects, and it requires a method to convert from parent to one of the children:

    pub(crate) trait ParentClass {
        type Children;
        fn get_subclass(&self, py: Python) -> StringResult<Self::Children>;
    }
    

    Then we have one trait for each module, that specific configurations have to implement. This trait requires a method to build a concrete object from a config one, so in the inputs modules it builds an InputReader from a child of the InputConfig class.

    For example in the input module we have:

    pub(crate) trait InputBuilder {
        fn build(
            &self,
            py: Python,
            worker_index: WorkerIndex,
            worker_count: usize,
            resume_snapshot: Option<StateBytes>,
        ) -> StringResult<Box<dyn InputReader<TdPyAny>>>;
    }
    

    And inputs have to implement this trait, example in ManualInputconfig:

    impl InputBuilder for ManualInputConfig {
        fn build(
            &self,
            py: Python,
            worker_index: WorkerIndex,
            worker_count: usize,
            resume_snapshot: Option<StateBytes>,
        ) -> StringResult<Box<dyn InputReader<TdPyAny>>> {
            Ok(Box::new(ManualInput::new(
                py,
                self.input_builder.clone(),
                worker_index,
                worker_count,
                resume_snapshot,
            )))
        }
    }
    

    The nice thing is that this allows us to implement InputBuilder for the generic InputConfig if it implements the ParentClass trait:

    impl InputBuilder for Py<InputConfig> {
        fn build(
            &self,
            py: Python,
            worker_index: WorkerIndex,
            worker_count: usize,
            resume_snapshot: Option<StateBytes>,
        ) -> StringResult<Box<dyn InputReader<TdPyAny>>> {
            self.get_subclass(py)?
                .build(py, worker_index, worker_count, resume_snapshot)
        }
    }
    

    So that we can do this:

    let input_reader = input_config.build(py, worker_index, worker_count, step_resume_state)?;
    

    instead of what we used to do:

    let input_reader = build_input_reader(
        py,
        input_config,
        worker_index,
        worker_count,
        step_resume_state,
    )?;
    

    NOTE: I stopped here, but we could think of a more generic PyBuilder trait so that we don't need a new trait for each module, but after trying it I think it makes the implementation a bit harder to read:

    pub(crate) trait PyBuilder {
        type Input;
        type Output;
        fn build(&self, py: Python, params: Self::Input) -> StringResult<Self::Output>;
    }
    

    Which turn the impl for ManualInputConfig into this

    impl PyBuilder for ManualInputConfig {
        type Input = (WorkerIndex, usize, Option<StateBytes>);
        type Output = Box<dyn InputReader<TdPyAny>>;
    
        fn build(
            &self,
            py: Python,
            (worker_index, worker_count, resume_snapshot): Self::Input,
        ) -> StringResult<Self::Output> {
            Ok(Box::new(ManualInput::new(
                py,
                self.input_builder.clone(),
                worker_index,
                worker_count,
                resume_snapshot,
            )))
        }
    }
    

    And the ParentClass impl to something like:

    impl ParentClass for Py<InputConfig> {
        type Children = Box<
            dyn PyBuilder<
                Input = (WorkerIndex, usize, Option<StateBytes>),
                Output = Box<dyn InputReader<TdPyAny>>,
            >,
        >;
    
        fn get_subclass(&self, py: Python) -> StringResult<Self::Children> {
            if let Ok(conf) = self.extract::<ManualInputConfig>(py) {
                Ok(Box::new(conf))
            } else if let Ok(conf) = self.extract::<KafkaInputConfig>(py) {
                Ok(Box::new(conf))
            } else {
                let pytype = self.as_ref(py).get_type();
                Err(format!("Unknown input_config type: {pytype}"))
            }
        }
    }
    
    

    Final edit: The end goal of this was to write a blanket implementation of PyBuilder for any struct that implement ParentClass with Children = Box<dyn PyBuilder>, but it doesn't work as it is. I'm going to write it here as a reference, maybe I was close enough:

    impl<T, B> PyBuilder for T
    where
        B: PyBuilder,
        T: ParentClass<Children = Box<B>>,
    {
        type Input = B::Input;
        type Output = B::Output;
    
        fn build(&self, py: Python, params: Self::Input) -> StringResult<Self::Output> {
            self.get_subclass(py)?.build(py, params)
        }
    }
    
    opened by Psykopear 2
  • Observability

    Observability

    This PR adds observability in Bytewax.

    This does not use the python logging module, but allows for external configuration anyway. What we gain is not having to keep the GIL when logging something, but it's not integrated with the python logging's module anymore.

    By default, bytewax logs ERROR messages to stdout. The logging level can be configured as an argument to the setup_tracing function.

    Logs are part of the tracing infrastructure. Every log is now structured, so we can use those logs and the additional instrumentation to send "spans" and "traces" to an opentelemetry compatible backend.

    This PR adds support for two backends:

    • Jaeger
    • Opentelemetry collector

    The Jaeger backend sends data directly to Jaeger. This makes the setup simpler if that's what you are using, but it only works with Jaeger.

    The Opentelemetry collector backend sends data to the opentelemetry collector. This can be configured to send traces and logs to different backends, including Jaeger, and it's the most flexible way of configuring tracing, since you can change your tracing backend without touching the dataflow, just change the collector's configuration and everything should keep working.

    opened by Psykopear 2
  • [FEATURE] Publish package for Python 3.11

    [FEATURE] Publish package for Python 3.11

    Is your feature request related to a problem? Please describe. We currently publish packages for python 3.7-3.10

    Describe the solution you'd like Add Python 3.11 to the CI matrix

    needs triage 
    opened by Psykopear 0
  • Missing Autocomplete in VSCode / PyCharm

    Missing Autocomplete in VSCode / PyCharm

    Bug Description

    Hello together,

    I use bytewax within a virtual environment. Environment was created under Python 3.10.9 using the command python -m venv .venv and then I activated the environment. I installed bytewax via pip and can create imports.

    from bytewax.dataflow import Dataflow

    The problem is that "Dataflow" is not found within the IDE. This means that autocomplete, etc. do not work.

    However, I can execute and use Dataflow without any problems. It is only annoying within the IDEs.

    Python version (python -V)

    Python 3.10.9

    Bytewax version (pip list | grep bytewax)

    0.14.0

    Relevant log output

    Cannot find reference 'Dataflow' in 'dataflow.py'
    

    Steps to Reproduce

    1. Create new project in PyCharm / VSCode (under MacOS 13.1)
    2. Create and activate venv
    3. pip install bytewax
    4. create new .py file with from bytewax.dataflow import Dataflow
    needs triage 
    opened by wlwwt 2
  • Recovery state will be incorrect if any worker crashes during the initial epoch

    Recovery state will be incorrect if any worker crashes during the initial epoch

    Bug Description

    During dataflow execution, we write out whenever a worker completes an epoch. We use these written epochs to re-derive the progress of the entire cluster. Unfortunately, though, on resume we do not know beforehand how many worker's progress messages we're waiting for. Thus if the cluster crashes after worker 1 has written progress but worker 2 has not, then upon resume, it'll look like you're resuming from just a cluster of size 1 and will silently skip data.

    We need to have some way of synchronously waiting for all workers to confirm they've written a marker of their existence to the recovery store before each execution (resume or not).

    Python version (python -V)

    Python 3.10.6

    Bytewax version (pip list | grep bytewax)

    0.13.1

    Relevant log output

    No response

    Steps to Reproduce

    I don't have some example code yet, but this should theoretically be possible.

    type:bug 
    opened by davidselassie 0
  • [FEATURE] Support stream/table joins

    [FEATURE] Support stream/table joins

    Is it possible to use bytewax for joining content of different kafka topics (similar to what ksqldb is doing) ?

    doing this will be an example of integrating "persistent queries" (permanent background processes that never stops). is this a good use case for bytewax?

    also comparing to ksqldb, what happens if the bytewax workers are killed? will those persistent queries restart automatically when workers come back up and will they use the latest/earliest committed offset on the kafka topics ? or is the restart manual?

    cheers

    question 
    opened by dberardo-com 1
  • [FEATURE] Support session window

    [FEATURE] Support session window

    Is your feature request related to a problem? Please describe.

    https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions#session-window

    Describe the solution you'd like A clear and concise description of what you want to happen.

    Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

    Additional context Add any other context or screenshots about the feature request here.

    type:feature 
    opened by judahrand 0
Releases(v0.14.0)
  • v0.14.0(Dec 22, 2022)

    Overview

    • Dataflow continuation now works. If you run a dataflow over a finite input, all state will be persisted via recovery so if you re-run the same dataflow pointing at the same input, but with more data appended at the end, it will correctly continue processing from the previous end-of-stream.

    • Fixes issue with multi-worker recovery. Previously resume data was being routed to the wrong worker so state would be missing.

    • The above two changes require that the recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.

    • Adds an introspection web server to dataflow workers.

    • Adds collect_window operator.

    What's Changed

    • Adding manylinux_2_27 wheel building to CI by @miccioest in https://github.com/bytewax/bytewax/pull/169
    • Adds webserver by @whoahbot in https://github.com/bytewax/bytewax/pull/175
    • Added EXPOSE command in Dockerfiles by @Psykopear in https://github.com/bytewax/bytewax/pull/176
    • Adding 3.8, 3.9, and 3.10 python versions to colab CI job by @miccioest in https://github.com/bytewax/bytewax/pull/179
    • Use cbfmt with pre-commit by @whoahbot in https://github.com/bytewax/bytewax/pull/180
    • Fix issue with resume state not being routed to correct worker by @davidselassie in https://github.com/bytewax/bytewax/pull/182
    • Adds collect window operator by @davidselassie in https://github.com/bytewax/bytewax/pull/183
    • Preps for v0.14.0 release by @davidselassie in https://github.com/bytewax/bytewax/pull/184

    Full Changelog: https://github.com/bytewax/bytewax/compare/v0.13.1...v0.14.0

    Source code(tar.gz)
    Source code(zip)
  • v0.13.1(Nov 10, 2022)

    Overview

    • Added Google Colab support.

    What's Changed

    • Prepare for v0.13.1 release. by @miccioest in https://github.com/bytewax/bytewax/pull/168

    Full Changelog: https://github.com/bytewax/bytewax/compare/v0.13.0...v0.13.1

    Source code(tar.gz)
    Source code(zip)
  • v0.13.0(Nov 9, 2022)

    Overview

    • Added tracing instrumentation and configurations for tracing backends.

    What's Changed

    • CDC-esque recovery operators by @davidselassie in https://github.com/bytewax/bytewax/pull/160
    • Observability by @Psykopear in https://github.com/bytewax/bytewax/pull/157
    • Fix dockerfile by @Psykopear in https://github.com/bytewax/bytewax/pull/163
    • Traits refactoring by @Psykopear in https://github.com/bytewax/bytewax/pull/162
    • Prepare for v0.13.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/165

    Full Changelog: https://github.com/bytewax/bytewax/compare/v0.12.0...v0.13.0

    Source code(tar.gz)
    Source code(zip)
  • v0.12.0(Oct 21, 2022)

    Overview

    • Fixes bug where window is never closed if recovery occurs after last item but before window close.

    • Recovery logging is reduced.

    • Breaking change Recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.

    • Adds a DynamoDB and BigQuery output connector.

    What's Changed

    • New execution interface by @davidselassie in https://github.com/bytewax/bytewax/pull/19
    • Container and Kubernetes related improvements by @miccioest in https://github.com/bytewax/bytewax/pull/24
    • Restructure Python imports by @whoahbot in https://github.com/bytewax/bytewax/pull/22
    • Restructure rust by @whoahbot in https://github.com/bytewax/bytewax/pull/25
    • Renames test_run to test_execution by @davidselassie in https://github.com/bytewax/bytewax/pull/27
    • Adds an "order" input helper and allows tumbling "event time" by @davidselassie in https://github.com/bytewax/bytewax/pull/26
    • Maturin develop before running tests. by @whoahbot in https://github.com/bytewax/bytewax/pull/28
    • Tests ability to interrupt execution by @davidselassie in https://github.com/bytewax/bytewax/pull/23
    • Jupyter Notebook Anomaly Detection Example by @awmatheson in https://github.com/bytewax/bytewax/pull/30
    • Debug operators by @blakestier in https://github.com/bytewax/bytewax/pull/29
    • Running tests using whl file already build by @miccioest in https://github.com/bytewax/bytewax/pull/32
    • More comprehensive readme by @awmatheson in https://github.com/bytewax/bytewax/pull/33
    • modify inputs by @awmatheson in https://github.com/bytewax/bytewax/pull/35
    • Remove Criterion benchmark by @whoahbot in https://github.com/bytewax/bytewax/pull/37
    • Add a key param to the stateful_map operator. by @whoahbot in https://github.com/bytewax/bytewax/pull/36
    • Runs doctests in CI by @davidselassie in https://github.com/bytewax/bytewax/pull/38
    • API Docs by @davidselassie in https://github.com/bytewax/bytewax/pull/39
    • Fixes sorted_window() to support items with identical times by @davidselassie in https://github.com/bytewax/bytewax/pull/41
    • Apidocs templates by @konradsienkowski in https://github.com/bytewax/bytewax/pull/40
    • Testable examples by @davidselassie in https://github.com/bytewax/bytewax/pull/34
    • Examples metadata by @konradsienkowski in https://github.com/bytewax/bytewax/pull/43
    • Update to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/45
    • Runs CI on release publish by @davidselassie in https://github.com/bytewax/bytewax/pull/46
    • Feast example by @blakestier in https://github.com/bytewax/bytewax/pull/42
    • Adds run_main() by @davidselassie in https://github.com/bytewax/bytewax/pull/48
    • Example of Apriori algorithm by @TheBits in https://github.com/bytewax/bytewax/pull/50
    • Dockerfile enhancements by @miccioest in https://github.com/bytewax/bytewax/pull/52
    • Integrate docs with main repository by @konradsienkowski in https://github.com/bytewax/bytewax/pull/51
    • Fix le feast typos by @blakestier in https://github.com/bytewax/bytewax/pull/54
    • Adding a Kubernetes example based on manual_cluster by @miccioest in https://github.com/bytewax/bytewax/pull/55
    • Update docs with capture operator by @awmatheson in https://github.com/bytewax/bytewax/pull/56
    • Update Notebook Example to Bytewax Version 0.8 by @awmatheson in https://github.com/bytewax/bytewax/pull/57
    • Re-enable documentation tests and upgrade longform docs to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/58
    • Frontier psychiatry by @whoahbot in https://github.com/bytewax/bytewax/pull/59
    • Move operator descriptions into API docs by @davidselassie in https://github.com/bytewax/bytewax/pull/60
    • Prepare for v0.9.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/61
    • Moves deployment docs into repo by @davidselassie in https://github.com/bytewax/bytewax/pull/62
    • State Recovery by @davidselassie in https://github.com/bytewax/bytewax/pull/53
    • Uploading wheels to S3 in CI and using them in CD without building again by @miccioest in https://github.com/bytewax/bytewax/pull/64
    • Fix broken links in longform documentation by @konradsienkowski in https://github.com/bytewax/bytewax/pull/63
    • Clean up error messages for inputs by @whoahbot in https://github.com/bytewax/bytewax/pull/69
    • Adds distribute() helper by @davidselassie in https://github.com/bytewax/bytewax/pull/71
    • Order Book example Using Websockets by @awmatheson in https://github.com/bytewax/bytewax/pull/67
    • Automatic state recovery / garbage collection by @davidselassie in https://github.com/bytewax/bytewax/pull/73
    • Pre commit integration by @kasun in https://github.com/bytewax/bytewax/pull/74
    • Fix getting started guide documentation by @kasun in https://github.com/bytewax/bytewax/pull/76
    • Fix inputs helper by @blakestier in https://github.com/bytewax/bytewax/pull/77
    • cargo fmt hook by @davidselassie in https://github.com/bytewax/bytewax/pull/79
    • Uses new JoinHandle::is_finished by @davidselassie in https://github.com/bytewax/bytewax/pull/78
    • Kafka recovery store by @davidselassie in https://github.com/bytewax/bytewax/pull/81
    • Require state keys to be strings by @davidselassie in https://github.com/bytewax/bytewax/pull/82
    • Move recovery_wordcount.py into examples by @davidselassie in https://github.com/bytewax/bytewax/pull/83
    • A few touch ups to docs by @blakestier in https://github.com/bytewax/bytewax/pull/85
    • Fix API Docs crawling by @konradsienkowski in https://github.com/bytewax/bytewax/pull/88
    • Kafka Consumer by @blakestier in https://github.com/bytewax/bytewax/pull/84
    • Multi-worker recovery by @davidselassie in https://github.com/bytewax/bytewax/pull/89
    • Prep for 0.10.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/91
    • Updates release steps by @davidselassie in https://github.com/bytewax/bytewax/pull/92
    • Fix argument formatting in API docs by @davidselassie in https://github.com/bytewax/bytewax/pull/93
    • Windowing by @davidselassie in https://github.com/bytewax/bytewax/pull/96
    • Input source operators by @davidselassie in https://github.com/bytewax/bytewax/pull/98
    • Cooperative manual input by @davidselassie in https://github.com/bytewax/bytewax/pull/99
    • Fold window operator by @Psykopear in https://github.com/bytewax/bytewax/pull/95
    • Added cargo tests, doctests, small fixes by @Psykopear in https://github.com/bytewax/bytewax/pull/108
    • Update examples by @blakestier in https://github.com/bytewax/bytewax/pull/105
    • Solved some clippy warnings, enabled doctests by @Psykopear in https://github.com/bytewax/bytewax/pull/109
    • Stateful input sources by @davidselassie in https://github.com/bytewax/bytewax/pull/103
    • Prepare for 0.11.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/110
    • [Fix]: Remove 'Understanding epochs' article from docs metadata by @konradsienkowski in https://github.com/bytewax/bytewax/pull/111
    • Update and fix documentation by @Psykopear in https://github.com/bytewax/bytewax/pull/112
    • Break up large modules into smaller files. by @whoahbot in https://github.com/bytewax/bytewax/pull/113
    • Upgrade dependencies by @Psykopear in https://github.com/bytewax/bytewax/pull/114
    • Update KafkaInput to prevent listening to non-existent topics by @awmatheson in https://github.com/bytewax/bytewax/pull/117
    • Add additional Kafka configs by @awmatheson in https://github.com/bytewax/bytewax/pull/119
    • Kafka Output by @blakestier in https://github.com/bytewax/bytewax/pull/118
    • Prepare for v0.11.1 release by @whoahbot in https://github.com/bytewax/bytewax/pull/125
    • Adding a CD step to upgrade pip by @miccioest in https://github.com/bytewax/bytewax/pull/126
    • Add CONTRIBUTING.md by @awmatheson in https://github.com/bytewax/bytewax/pull/123
    • downcast instead of extract by @whoahbot in https://github.com/bytewax/bytewax/pull/128
    • Only snapshot for recovery once per state key by @davidselassie in https://github.com/bytewax/bytewax/pull/129
    • Add readme about profiling by @davidselassie in https://github.com/bytewax/bytewax/pull/131
    • Fix key usage in docstring by @awmatheson in https://github.com/bytewax/bytewax/pull/133
    • Re-enable ssl and gssapi by @whoahbot in https://github.com/bytewax/bytewax/pull/134
    • Prepare for v0.11.2 by @whoahbot in https://github.com/bytewax/bytewax/pull/136
    • Allow manually adjusting TestingClockConfig by @davidselassie in https://github.com/bytewax/bytewax/pull/139
    • Eagerly build logic and assert built by @davidselassie in https://github.com/bytewax/bytewax/pull/141
    • Fixed examples for dataflow reduce and fold window. by @PiePra in https://github.com/bytewax/bytewax/pull/142
    • Fixes recovery serde by @davidselassie in https://github.com/bytewax/bytewax/pull/143
    • Better recovery tests by @davidselassie in https://github.com/bytewax/bytewax/pull/144
    • Only use UTC datetimes by @Psykopear in https://github.com/bytewax/bytewax/pull/115
    • Introduce fate by @davidselassie in https://github.com/bytewax/bytewax/pull/137
    • Event time processing by @Psykopear in https://github.com/bytewax/bytewax/pull/127
    • Persist awake times by @davidselassie in https://github.com/bytewax/bytewax/pull/147
    • Event time in tests by @Psykopear in https://github.com/bytewax/bytewax/pull/146
    • Updated to PyO3 0.17.2 by @Psykopear in https://github.com/bytewax/bytewax/pull/148
    • Docs v0.11.2 - update link schema by @konradsienkowski in https://github.com/bytewax/bytewax/pull/149
    • Update Readme and Add Examples by @awmatheson in https://github.com/bytewax/bytewax/pull/135
    • Adding a docs article of waxctl aws command by @miccioest in https://github.com/bytewax/bytewax/pull/151
    • Add a DynamoDB output adapter. by @whoahbot in https://github.com/bytewax/bytewax/pull/152
    • Adding an article about gcp command by @miccioest in https://github.com/bytewax/bytewax/pull/153
    • Add a Bigquery output adapter by @blakestier in https://github.com/bytewax/bytewax/pull/154
    • Cleanup examples by @blakestier in https://github.com/bytewax/bytewax/pull/155
    • Prepare for v0.12.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/156

    New Contributors

    • @konradsienkowski made their first contribution in https://github.com/bytewax/bytewax/pull/40
    • @TheBits made their first contribution in https://github.com/bytewax/bytewax/pull/50
    • @kasun made their first contribution in https://github.com/bytewax/bytewax/pull/74
    • @Psykopear made their first contribution in https://github.com/bytewax/bytewax/pull/95
    • @PiePra made their first contribution in https://github.com/bytewax/bytewax/pull/142

    Full Changelog: https://github.com/bytewax/bytewax/compare/v0.7.1...v0.12.0

    Source code(tar.gz)
    Source code(zip)
  • v0.11.2(Sep 20, 2022)

    • Performance improvements. :sparkles:

    • Support SASL and SSL for bytewax.inputs.KafkaInputConfig.

    What's Changed

    • Add CONTRIBUTING.md by @awmatheson in https://github.com/bytewax/bytewax/pull/123
    • downcast instead of extract by @whoahbot in https://github.com/bytewax/bytewax/pull/128
    • Only snapshot for recovery once per state key by @davidselassie in https://github.com/bytewax/bytewax/pull/129
    • Add readme about profiling by @davidselassie in https://github.com/bytewax/bytewax/pull/131
    • Fix key usage in docstring by @awmatheson in https://github.com/bytewax/bytewax/pull/133
    • Re-enable ssl and gssapi by @whoahbot in https://github.com/bytewax/bytewax/pull/134
    • Prepare for v0.11.2 by @whoahbot in https://github.com/bytewax/bytewax/pull/136

    Full Changelog: https://github.com/bytewax/bytewax/compare/v0.11.1...0.11.2

    Source code(tar.gz)
    Source code(zip)
  • v0.11.1(Sep 8, 2022)

    • KafkaInputConfig now accepts additional properties. See bytewax.inputs.KafkaInputConfig.

    • Support for a pre-built Kafka output component. See bytewax.outputs.KafkaOutputConfig.

    What's Changed

    • Break up large modules into smaller files. by @whoahbot in https://github.com/bytewax/bytewax/pull/113
    • Upgrade dependencies by @Psykopear in https://github.com/bytewax/bytewax/pull/114
    • Update KafkaInput to prevent listening to non-existent topics by @awmatheson in https://github.com/bytewax/bytewax/pull/117
    • Add additional Kafka configs by @awmatheson in https://github.com/bytewax/bytewax/pull/119
    • Kafka Output by @blakestier in https://github.com/bytewax/bytewax/pull/118

    Full Changelog: https://github.com/bytewax/bytewax/compare/v0.11.0...v0.11.1

    Source code(tar.gz)
    Source code(zip)
  • v0.11.0(Aug 26, 2022)

    • Added the fold_window operator, works like reduce_window but allows the user to build the initial accumulator for each key in a builder function.

    • Output is no longer specified using an output_builder for the entire dataflow, but you supply an "output config" per capture. See bytewax.outputs for more info.

    • Input is no longer specified on the execution entry point (like run_main), it is instead using the Dataflow.input operator.

    • Epochs are no longer user-facing as part of the input system. Any custom Python-based input components you write just need to be iterators and emit items. Recovery snapshots and backups now happen periodically, defaulting to every 10 seconds.

    • Recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.

    • The reduce_epoch operator has been replaced with reduce_window. It takes a "clock" and a "windower" to define the kind of aggregation you want to do.

    • run and run_cluster have been removed and the remaining execution entry points moved into bytewax.execution. You can now get similar prototyping functionality with bytewax.execution.run_main and bytewax.execution.spawn_cluster using Testing{Input,Output}Configs.

    • Dataflow has been moved into bytewax.dataflow.Dataflow.

    What's Changed

    • Windowing by @davidselassie in https://github.com/bytewax/bytewax/pull/96
    • Input source operators by @davidselassie in https://github.com/bytewax/bytewax/pull/98
    • Cooperative manual input by @davidselassie in https://github.com/bytewax/bytewax/pull/99
    • Fold window operator by @Psykopear in https://github.com/bytewax/bytewax/pull/95
    • Added cargo tests, doctests, small fixes by @Psykopear in https://github.com/bytewax/bytewax/pull/108
    • Update examples by @blakestier in https://github.com/bytewax/bytewax/pull/105
    • Solved some clippy warnings, enabled doctests by @Psykopear in https://github.com/bytewax/bytewax/pull/109
    • Stateful input sources by @davidselassie in https://github.com/bytewax/bytewax/pull/103
    • Prepare for 0.11.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/110

    New Contributors

    • @Psykopear made their first contribution in https://github.com/bytewax/bytewax/pull/95

    Full Changelog: https://github.com/bytewax/bytewax/compare/v0.10.0...v0.11.0

    Source code(tar.gz)
    Source code(zip)
  • v0.10.1(Aug 17, 2022)

    Overview

    • Bugfix: Resolves pickling error. KafkaInputConfig now works with spawn_cluster.

    What's Changed

    • Pickling logic for auto_commit in KafkaInputConfig by @blakestier in #102
    Source code(tar.gz)
    Source code(zip)
  • v0.10.0(Jul 13, 2022)

    Overview

    • Input is no longer specified using an input_builder, but now an input_config which allows you to use pre-built input components. See bytewax.inputs for more info.

    • Preliminary support for a pre-built Kafka input component. See bytewax.inputs.KafkaInputConfig.

    • Keys used in the (key, value) 2-tuples to route data for stateful operators (like stateful_map and reduce_epoch) must now be strings. Because of this bytewax.exhash is no longer necessary and has been removed.

    • Recovery format has been changed for all recovery stores. You cannot resume from recovery data written with an older version.

    • Slight changes to bytewax.recovery.RecoveryConfig config options due to recovery system changes.

    • bytewax.run() and bytewax.run_cluster() no longer take recovery_config as they don't support recovery.

    What's Changed

    • New execution interface by @davidselassie in https://github.com/bytewax/bytewax/pull/19
    • Container and Kubernetes related improvements by @miccioest in https://github.com/bytewax/bytewax/pull/24
    • Restructure Python imports by @whoahbot in https://github.com/bytewax/bytewax/pull/22
    • Restructure rust by @whoahbot in https://github.com/bytewax/bytewax/pull/25
    • Renames test_run to test_execution by @davidselassie in https://github.com/bytewax/bytewax/pull/27
    • Adds an "order" input helper and allows tumbling "event time" by @davidselassie in https://github.com/bytewax/bytewax/pull/26
    • Maturin develop before running tests. by @whoahbot in https://github.com/bytewax/bytewax/pull/28
    • Tests ability to interrupt execution by @davidselassie in https://github.com/bytewax/bytewax/pull/23
    • Jupyter Notebook Anomaly Detection Example by @awmatheson in https://github.com/bytewax/bytewax/pull/30
    • Debug operators by @blakestier in https://github.com/bytewax/bytewax/pull/29
    • Running tests using whl file already build by @miccioest in https://github.com/bytewax/bytewax/pull/32
    • More comprehensive readme by @awmatheson in https://github.com/bytewax/bytewax/pull/33
    • modify inputs by @awmatheson in https://github.com/bytewax/bytewax/pull/35
    • Remove Criterion benchmark by @whoahbot in https://github.com/bytewax/bytewax/pull/37
    • Add a key param to the stateful_map operator. by @whoahbot in https://github.com/bytewax/bytewax/pull/36
    • Runs doctests in CI by @davidselassie in https://github.com/bytewax/bytewax/pull/38
    • API Docs by @davidselassie in https://github.com/bytewax/bytewax/pull/39
    • Fixes sorted_window() to support items with identical times by @davidselassie in https://github.com/bytewax/bytewax/pull/41
    • Apidocs templates by @konradsienkowski in https://github.com/bytewax/bytewax/pull/40
    • Testable examples by @davidselassie in https://github.com/bytewax/bytewax/pull/34
    • Examples metadata by @konradsienkowski in https://github.com/bytewax/bytewax/pull/43
    • Update to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/45
    • Runs CI on release publish by @davidselassie in https://github.com/bytewax/bytewax/pull/46
    • Feast example by @blakestier in https://github.com/bytewax/bytewax/pull/42
    • Adds run_main() by @davidselassie in https://github.com/bytewax/bytewax/pull/48
    • Example of Apriori algorithm by @TheBits in https://github.com/bytewax/bytewax/pull/50
    • Dockerfile enhancements by @miccioest in https://github.com/bytewax/bytewax/pull/52
    • Integrate docs with main repository by @konradsienkowski in https://github.com/bytewax/bytewax/pull/51
    • Fix le feast typos by @blakestier in https://github.com/bytewax/bytewax/pull/54
    • Adding a Kubernetes example based on manual_cluster by @miccioest in https://github.com/bytewax/bytewax/pull/55
    • Update docs with capture operator by @awmatheson in https://github.com/bytewax/bytewax/pull/56
    • Update Notebook Example to Bytewax Version 0.8 by @awmatheson in https://github.com/bytewax/bytewax/pull/57
    • Re-enable documentation tests and upgrade longform docs to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/58
    • Frontier psychiatry by @whoahbot in https://github.com/bytewax/bytewax/pull/59
    • Move operator descriptions into API docs by @davidselassie in https://github.com/bytewax/bytewax/pull/60
    • Prepare for v0.9.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/61
    • Moves deployment docs into repo by @davidselassie in https://github.com/bytewax/bytewax/pull/62
    • State Recovery by @davidselassie in https://github.com/bytewax/bytewax/pull/53
    • Uploading wheels to S3 in CI and using them in CD without building again by @miccioest in https://github.com/bytewax/bytewax/pull/64
    • Fix broken links in longform documentation by @konradsienkowski in https://github.com/bytewax/bytewax/pull/63
    • Clean up error messages for inputs by @whoahbot in https://github.com/bytewax/bytewax/pull/69
    • Adds distribute() helper by @davidselassie in https://github.com/bytewax/bytewax/pull/71
    • Order Book example Using Websockets by @awmatheson in https://github.com/bytewax/bytewax/pull/67
    • Automatic state recovery / garbage collection by @davidselassie in https://github.com/bytewax/bytewax/pull/73
    • Pre commit integration by @kasun in https://github.com/bytewax/bytewax/pull/74
    • Fix getting started guide documentation by @kasun in https://github.com/bytewax/bytewax/pull/76
    • Fix inputs helper by @blakestier in https://github.com/bytewax/bytewax/pull/77
    • cargo fmt hook by @davidselassie in https://github.com/bytewax/bytewax/pull/79
    • Uses new JoinHandle::is_finished by @davidselassie in https://github.com/bytewax/bytewax/pull/78
    • Kafka recovery store by @davidselassie in https://github.com/bytewax/bytewax/pull/81
    • Require state keys to be strings by @davidselassie in https://github.com/bytewax/bytewax/pull/82
    • Move recovery_wordcount.py into examples by @davidselassie in https://github.com/bytewax/bytewax/pull/83
    • A few touch ups to docs by @blakestier in https://github.com/bytewax/bytewax/pull/85
    • Fix API Docs crawling by @konradsienkowski in https://github.com/bytewax/bytewax/pull/88
    • Kafka Consumer by @blakestier in https://github.com/bytewax/bytewax/pull/84
    • Multi-worker recovery by @davidselassie in https://github.com/bytewax/bytewax/pull/89
    • Prep for 0.10.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/91

    New Contributors

    • @blakestier made their first contribution in https://github.com/bytewax/bytewax/pull/29
    • @konradsienkowski made their first contribution in https://github.com/bytewax/bytewax/pull/40
    • @TheBits made their first contribution in https://github.com/bytewax/bytewax/pull/50
    • @kasun made their first contribution in https://github.com/bytewax/bytewax/pull/74

    Full Changelog: https://github.com/bytewax/bytewax/compare/v0.7.1...v0.10.0

    Source code(tar.gz)
    Source code(zip)
  • v0.9.0(Apr 22, 2022)

    Overview

    • Adds bytewax.AdvanceTo and bytewax.Emit to control when processing happens.

    • Adds bytewax.run_main() as a way to test input and output builders without starting a cluster.

    • Adds a bytewax.testing module with helpers for testing.

    • bytewax.run_cluster() and bytewax.spawn_cluster() now take a mp_ctx argument to allow you to change the multiprocessing behavior. E.g. from "fork" to "spawn". Defaults now to "spawn".

    What's Changed

    • New execution interface by @davidselassie in https://github.com/bytewax/bytewax/pull/19
    • Container and Kubernetes related improvements by @miccioest in https://github.com/bytewax/bytewax/pull/24
    • Restructure Python imports by @whoahbot in https://github.com/bytewax/bytewax/pull/22
    • Restructure rust by @whoahbot in https://github.com/bytewax/bytewax/pull/25
    • Renames test_run to test_execution by @davidselassie in https://github.com/bytewax/bytewax/pull/27
    • Adds an "order" input helper and allows tumbling "event time" by @davidselassie in https://github.com/bytewax/bytewax/pull/26
    • Maturin develop before running tests. by @whoahbot in https://github.com/bytewax/bytewax/pull/28
    • Tests ability to interrupt execution by @davidselassie in https://github.com/bytewax/bytewax/pull/23
    • Jupyter Notebook Anomaly Detection Example by @awmatheson in https://github.com/bytewax/bytewax/pull/30
    • Debug operators by @blakestier in https://github.com/bytewax/bytewax/pull/29
    • Running tests using whl file already build by @miccioest in https://github.com/bytewax/bytewax/pull/32
    • More comprehensive readme by @awmatheson in https://github.com/bytewax/bytewax/pull/33
    • modify inputs by @awmatheson in https://github.com/bytewax/bytewax/pull/35
    • Remove Criterion benchmark by @whoahbot in https://github.com/bytewax/bytewax/pull/37
    • Add a key param to the stateful_map operator. by @whoahbot in https://github.com/bytewax/bytewax/pull/36
    • Runs doctests in CI by @davidselassie in https://github.com/bytewax/bytewax/pull/38
    • API Docs by @davidselassie in https://github.com/bytewax/bytewax/pull/39
    • Fixes sorted_window() to support items with identical times by @davidselassie in https://github.com/bytewax/bytewax/pull/41
    • Apidocs templates by @konradsienkowski in https://github.com/bytewax/bytewax/pull/40
    • Testable examples by @davidselassie in https://github.com/bytewax/bytewax/pull/34
    • Examples metadata by @konradsienkowski in https://github.com/bytewax/bytewax/pull/43
    • Update to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/45
    • Runs CI on release publish by @davidselassie in https://github.com/bytewax/bytewax/pull/46
    • Feast example by @blakestier in https://github.com/bytewax/bytewax/pull/42
    • Adds run_main() by @davidselassie in https://github.com/bytewax/bytewax/pull/48
    • Example of Apriori algorithm by @TheBits in https://github.com/bytewax/bytewax/pull/50
    • Dockerfile enhancements by @miccioest in https://github.com/bytewax/bytewax/pull/52
    • Integrate docs with main repository by @konradsienkowski in https://github.com/bytewax/bytewax/pull/51
    • Fix le feast typos by @blakestier in https://github.com/bytewax/bytewax/pull/54
    • Adding a Kubernetes example based on manual_cluster by @miccioest in https://github.com/bytewax/bytewax/pull/55
    • Update docs with capture operator by @awmatheson in https://github.com/bytewax/bytewax/pull/56
    • Update Notebook Example to Bytewax Version 0.8 by @awmatheson in https://github.com/bytewax/bytewax/pull/57
    • Re-enable documentation tests and upgrade longform docs to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/58
    • Frontier psychiatry by @whoahbot in https://github.com/bytewax/bytewax/pull/59
    • Move operator descriptions into API docs by @davidselassie in https://github.com/bytewax/bytewax/pull/60
    • Prepare for v0.9.0 release. by @whoahbot in https://github.com/bytewax/bytewax/pull/61

    New Contributors

    • @blakestier made their first contribution in https://github.com/bytewax/bytewax/pull/29
    • @konradsienkowski made their first contribution in https://github.com/bytewax/bytewax/pull/40
    • @TheBits made their first contribution in https://github.com/bytewax/bytewax/pull/50

    Full Changelog: https://github.com/bytewax/bytewax/compare/v0.7.1...v0.9.0

    Source code(tar.gz)
    Source code(zip)
  • v0.8.0(Mar 29, 2022)

    Overview

    • Capture operator no longer takes arguments. Items that flow through those points in the dataflow graph will be processed by the output handlers setup by each execution entry point. Every dataflow requires at least one capture.

    • Executor.build_and_run() is replaced with four entry points for specific use cases:

      • run() for exeuction in the current process. It returns all captured items to the calling process for you. Use this for prototyping in notebooks and basic tests.

      • run_cluster() for execution on a temporary machine-local cluster that Bytewax coordinates for you. It returns all captured items to the calling process for you. Use this for notebook analysis where you need parallelism.

      • spawn_cluster() for starting a machine-local cluster with more control over input and output. Use this for standalone scripts where you might need partitioned input and output.

      • cluster_main() for starting a process that will participate in a cluster you are coordinating manually. Use this when starting a Kubernetes cluster.

    • Adds bytewax.parse module to help with reading command line arguments and environment variables for the above entrypoints.

    • Renames bytewax.inp to bytewax.inputs.

    What's Changed

    • New execution interface by @davidselassie in https://github.com/bytewax/bytewax/pull/19
    • Container and Kubernetes related improvements by @miccioest in https://github.com/bytewax/bytewax/pull/24
    • Restructure Python imports by @whoahbot in https://github.com/bytewax/bytewax/pull/22
    • Restructure rust by @whoahbot in https://github.com/bytewax/bytewax/pull/25
    • Renames test_run to test_execution by @davidselassie in https://github.com/bytewax/bytewax/pull/27
    • Adds an "order" input helper and allows tumbling "event time" by @davidselassie in https://github.com/bytewax/bytewax/pull/26
    • Maturin develop before running tests. by @whoahbot in https://github.com/bytewax/bytewax/pull/28
    • Tests ability to interrupt execution by @davidselassie in https://github.com/bytewax/bytewax/pull/23
    • Jupyter Notebook Anomaly Detection Example by @awmatheson in https://github.com/bytewax/bytewax/pull/30
    • Debug operators by @blakestier in https://github.com/bytewax/bytewax/pull/29
    • Running tests using whl file already build by @miccioest in https://github.com/bytewax/bytewax/pull/32
    • More comprehensive readme by @awmatheson in https://github.com/bytewax/bytewax/pull/33
    • modify inputs by @awmatheson in https://github.com/bytewax/bytewax/pull/35
    • Remove Criterion benchmark by @whoahbot in https://github.com/bytewax/bytewax/pull/37
    • Add a key param to the stateful_map operator. by @whoahbot in https://github.com/bytewax/bytewax/pull/36
    • Runs doctests in CI by @davidselassie in https://github.com/bytewax/bytewax/pull/38
    • API Docs by @davidselassie in https://github.com/bytewax/bytewax/pull/39
    • Fixes sorted_window() to support items with identical times by @davidselassie in https://github.com/bytewax/bytewax/pull/41
    • Apidocs templates by @konradsienkowski in https://github.com/bytewax/bytewax/pull/40
    • Testable examples by @davidselassie in https://github.com/bytewax/bytewax/pull/34
    • Examples metadata by @konradsienkowski in https://github.com/bytewax/bytewax/pull/43
    • Update to 0.8.0 by @davidselassie in https://github.com/bytewax/bytewax/pull/45
    • Runs CI on release publish by @davidselassie in https://github.com/bytewax/bytewax/pull/46

    New Contributors

    • @blakestier made their first contribution in https://github.com/bytewax/bytewax/pull/29
    • @konradsienkowski made their first contribution in https://github.com/bytewax/bytewax/pull/40

    Full Changelog: https://github.com/bytewax/bytewax/compare/v0.7.1...v0.8.0

    Source code(tar.gz)
    Source code(zip)
  • v0.8.0-beta.2(Feb 24, 2022)

    Beta release

    What's Changed

    • New execution interface by @davidselassie in https://github.com/bytewax/bytewax/pull/19
    • Container and Kubernetes related improvements by @miccioest in https://github.com/bytewax/bytewax/pull/24

    Updated execution interface

    run() now takes a dataflow and some input, runs it synchronously as a single worker in the existing Python thread, and returns the output to that thread. This is what you'd use in tests and simple notebook work.

    run_cluster() takes a dataflow and some input, starts a local cluster of processes, runs it, waits for the cluster to finish work, then collects thre results, and returns the output to that thread. This is what you'd use in a notebook if you need parallelism or higher throughput.

    cluster_main() starts up a cluster of local processes, coordinates the addresses and process IDs between them, runs a dataflow on it, and waits for it to finish. This has a partitioned "input builder" and an "output builder" (discussed below). This is what you'd use if you'd want to write a standalone script or example that does some higher throughput processing.

    Full Changelog: https://github.com/bytewax/bytewax/compare/v0.7.1...v0.8.0-beta.0

    Source code(tar.gz)
    Source code(zip)
  • v0.7.1(Feb 17, 2022)

    v0.7.1

    Updates to build_and_run() to support running in notebook environments.

    What's Changed

    • Parse arguments in build_and_run. by @whoahbot in https://github.com/bytewax/bytewax/pull/10
    • Adds a capture operator to capture output by @davidselassie in https://github.com/bytewax/bytewax/pull/9
    • Introducing: exhash by @davidselassie in https://github.com/bytewax/bytewax/pull/14
    • Exceptions and interrupts in multiple worker threads by @davidselassie in https://github.com/bytewax/bytewax/pull/16
    • Fix typing for inp.py by @mttcnnff in https://github.com/bytewax/bytewax/pull/17
    • Add in tests for inputs by @whoahbot in https://github.com/bytewax/bytewax/pull/18

    New Contributors

    • @mttcnnff made their first contribution in https://github.com/bytewax/bytewax/pull/17

    Full Changelog: https://github.com/bytewax/bytewax/compare/v0.7.0...v0.7.1

    Source code(tar.gz)
    Source code(zip)
Owner
Bytewax
Bytewax
An example repository on how to start building graph applications on streaming data. Just clone and start building 💻 💪

An example repository on how to start building graph applications on streaming data. Just clone and start building ?? ??

Memgraph 40 Dec 20, 2022
A highly efficient daemon for streaming data from Kafka into Delta Lake

A highly efficient daemon for streaming data from Kafka into Delta Lake

Delta Lake 172 Dec 23, 2022
AppFlowy is an open-source alternative to Notion. You are in charge of your data and customizations

AppFlowy is an open-source alternative to Notion. You are in charge of your data and customizations. Built with Flutter and Rust.

null 30.7k Jan 7, 2023
A fast, powerful, flexible and easy to use open source data analysis and manipulation tool written in Rust

fisher-rs fisher-rs is a Rust library that brings powerful data manipulation and analysis capabilities to Rust developers, inspired by the popular pan

Syed Vilayat Ali Rizvi 5 Aug 31, 2023
A fast, powerful, flexible and easy to use open source data analysis and manipulation tool written in Rust

fisher-rs fisher-rs is a Rust library that brings powerful data manipulation and analysis capabilities to Rust developers, inspired by the popular pan

null 5 Sep 6, 2023
A rust library built to support building time-series based projection models

TimeSeries TimeSeries is a framework for building analytical models in Rust that have a time dimension. Inspiration The inspiration for writing this i

James MacAdie 12 Dec 7, 2022
ConnectorX - Fastest library to load data from DB to DataFrames in Rust and Python

ConnectorX enables you to load data from databases into Python in the fastest and most memory efficient way.

SFU Database Group 939 Jan 5, 2023
Read specialized NGS formats as data frames in R, Python, and more.

oxbow Read specialized bioinformatic file formats as data frames in R, Python, and more. File formats create a lot of friction for computational biolo

null 12 Jun 7, 2023
Fill Apache Arrow record batches from an ODBC data source in Rust.

arrow-odbc Fill Apache Arrow arrays from ODBC data sources. This crate is build on top of the arrow and odbc-api crate and enables you to read the dat

Markus Klein 21 Dec 27, 2022
Orkhon: ML Inference Framework and Server Runtime

Orkhon: ML Inference Framework and Server Runtime Latest Release License Build Status Downloads Gitter What is it? Orkhon is Rust framework for Machin

Theo M. Bulut 129 Dec 21, 2022
Xaynet represents an agnostic Federated Machine Learning framework to build privacy-preserving AI applications.

xaynet Xaynet: Train on the Edge with Federated Learning Want a framework that supports federated learning on the edge, in desktop browsers, integrate

XayNet 196 Dec 22, 2022
Arrow User-Defined Functions Framework on WebAssembly.

Arrow User-Defined Functions Framework on WebAssembly Example Build the WebAssembly module: cargo build --release -p arrow-udf-wasm-example --target w

RisingWave Labs 3 Dec 14, 2023
Easy c̵̰͠r̵̛̠ö̴̪s̶̩̒s̵̭̀-t̶̲͝h̶̯̚r̵̺͐e̷̖̽ḁ̴̍d̶̖̔ ȓ̵͙ė̶͎ḟ̴͙e̸̖͛r̶̖͗ë̶̱́ṉ̵̒ĉ̷̥e̷͚̍ s̷̹͌h̷̲̉a̵̭͋r̷̫̊ḭ̵̊n̷̬͂g̵̦̃ f̶̻̊ơ̵̜ṟ̸̈́ R̵̞̋ù̵̺s̷̖̅ţ̸͗!̸̼͋

Rust S̵̓i̸̓n̵̉ I̴n̴f̶e̸r̵n̷a̴l mutability! Howdy, friendly Rust developer! Ever had a value get m̵̯̅ð̶͊v̴̮̾ê̴̼͘d away right under your nose just when

null 294 Dec 23, 2022
A highly scalable MySQL Proxy framework written in Rust

mysql-proxy-rs An implementation of a MySQL proxy server built on top of tokio-core. Overview This crate provides a MySQL proxy server that you can ex

AgilData 175 Dec 19, 2022
Open-source Rust framework for building event-driven live-trading & backtesting systems

Barter Barter is an open-source Rust framework for building event-driven live-trading & backtesting systems. Algorithmic trade with the peace of mind

Barter 157 Feb 18, 2023
H2O Open Source Kubernetes operator and a command-line tool to ease deployment (and undeployment) of H2O open-source machine learning platform H2O-3 to Kubernetes.

H2O Kubernetes Repository with official tools to aid the deployment of H2O Machine Learning platform to Kubernetes. There are two essential tools to b

H2O.ai 16 Nov 12, 2022
Elemental System Designs is an open source project to document system architecture design of popular apps and open source projects that we want to study

Elemental System Designs is an open source project to document system architecture design of popular apps and open source projects that we want to study

Jason Shin 9 Apr 10, 2022
Fish Game for Macroquad is an online multiplayer game, created as a demonstration of Nakama, an open-source scalable game server, using Rust-lang and the Macroquad game engine.

Fish Game for Macroquad is an online multiplayer game, created as a demonstration of Nakama, an open-source scalable game server, using Rust-lang and the Macroquad game engine.

Heroic Labs 130 Dec 29, 2022
📊 Cube.js — Open-Source Analytics API for Building Data Apps

?? Cube.js — Open-Source Analytics API for Building Data Apps

Cube.js 14.4k Jan 8, 2023
Create, open, manage your Python projects with ease, a project aimed to make python development experience a little better

Create, open, manage your Python projects with ease, a project aimed to make python development experience a little better

Dhravya Shah 7 Nov 18, 2022