Distributed compute platform implemented in Rust, and powered by Apache Arrow.

Overview

Ballista: Distributed Compute Platform

License Crates.io Discord chat

Overview

Ballista is a distributed compute platform primarily implemented in Rust, powered by Apache Arrow. It is built on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as first-class citizens without paying a penalty for serialization costs.

Technologies

The foundational technologies in Ballista are:

Ballista can be deployed as a standalone cluster and also supports Kubernetes. In either case, the scheduler can be configured to use etcd as a backing store to (eventually) provide redundancy in the case of a scheduler failing.

Architecture

The following diagram highlights some of the integrations that will be possible with this unique architecture. Note that not all components shown here are available yet.

Ballista Architecture Diagram

How does this compare to Apache Spark?

Although Ballista is largely inspired by Apache Spark, there are some key differences.

  • The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of GC pauses.
  • Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still largely row-based today.
  • The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.
  • The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.

Examples

The following examples should help illustrate the current capabilities of Ballista:

Project Status

To follow the progress of this project, please refer to the "This Week in Ballista" series of blog posts. Follow @BallistaCompute on Twitter to receive notifications when the blog is updated.

Releases

Ballista releases are now available on crates.io, Maven Central and Docker Hub. Please refer to the user guide for instructions on using a released version of Ballista.

Documentation

The user guide is hosted at https://ballistacompute.org, along with the blog where news and release notes are posted.

Developer documentation can be found in the docs directory.

Contributing

See CONTRIBUTING.md for information on contributing to this project.

Comments
  • Donate Ballista to Apache Arrow project

    Donate Ballista to Apache Arrow project

    The Ballista project has recently reached a point where I believe that the basic architecture has been proven to work. The project has also suddenly become very popular and generated a lot of interest (more than 2k stars on GitHub).

    For these reasons, I think that the project has grown too large for me to continue maintaining as a personal project and I think it is now time to move the code to a foundation to ensure its continued success.

    Given the deep dependencies on Apache Arrow (the core Arrow, DataFusion, and Parquet crates) and the fact that there is already some overlap between Arrow and Ballista committers, I believe that the obvious choice would be to donate the project to Apache Arrow.

    Some of the benefits of donating the project to Arrow are:

    • Ballista unit tests will be part of Arrow CI which means that any changes to Arrow or DataFusion APIs that Ballista depends on will also require that the corresponding Ballista code is updated as part of the same PR.
    • DataFusion also needs a scheduler and it would probably make sense to push some parts of the Ballista scheduler down a level in the stack so that the same approach is used to scale across cores in DataFusion and to scale across nodes in Ballista.
    • There is a team of committers that understand Arrow and DataFusion that can help with PR reviews so I will no longer be a bottleneck.
    • Companies are more likely to commit resources to contributing to an Apache project compared to a personal project.

    I am going to start a discussion on the Arrow mailing list to propose this idea as well.

    help wanted 
    opened by andygrove 31
  • Decouple planning from execution

    Decouple planning from execution

    Big PR... I got carried away, sorry!

    Here's what this PR does: it takes a step in the direction of https://github.com/ballista-compute/ballista/issues/463 and decouples the planning phase from the execution phase.

    When a client submits a query/plan, the only thing the scheduler does is calculate the stages and saves them to etcd. That enables submitting queries even when there are no executors available: once an executor comes online the job will transition to the Running state. It also enables adding more executors to the cluster and the scheduler will be able to dynamically send them whichever tasks are ready to be run.

    How are tasks run? The executor registration endpoint in the scheduler has been renamed to "poll_work", and it now does various things:

    1. It works as a heartbeat for executor liveness
    2. It allows executors to report the status of the tasks that have finished since the last call to poll_work
    3. It allows executors to request more work from the scheduler

    The changes in the scheduler have been made keeping in mind that we should support a scheduler cluster. I haven't tested it, but right now I'm basically doing a big fat distributed lock through etcd, which is very inefficient but also very easy to get right ๐Ÿ˜‚

    The bad parts of this PR:

    1. ~No unit tests.~ (added!)
    2. ~There seems to be a big perf hit when using etcd. Query 1 takes 5 seconds on standalone and 10 seconds with etcd. Not sure why. The scheduler code can still be heavily optimized (I've preferred correctess and simplicity as a first approach), but that huge impact is somewhat surprising.~ (see next comment)
    opened by edrevo 17
  • Implement web UI for scheduler

    Implement web UI for scheduler

    We should implement a web UI in the scheduler that can be used to monitor the state of the cluster and view information on queries that are pending/running/completed/failed.

    This issue is for creating the basic mechanism and then we can file additional issues for functionality, such as:

    • view list of executors
    • view query plan
    • view query metrics
    • cancel query
    help wanted rust 
    opened by andygrove 14
  • Research supporting WASM for custom code as part of a physical plan

    Research supporting WASM for custom code as part of a physical plan

    As a user of Ballista, I would like the ability to execute arbitrary code as part of my distributed job. I want the ability to use multiple languages depending on my requirements (perhaps there are third party libraries I want to use).

    WASM seems like a good potential choice for this?

    design 
    opened by andygrove 10
  • Add helm chart to the project

    Add helm chart to the project

    Currently we have 4 kubernetes configurations, each with small variations of each other. I think that it would be beneficial to provide an helm chart that templates those configurations thus allowing users to configure the cluster and manage releases.

    IMO this will also allow us to more easily benchmark with a variable number of executors and spark vs rust.

    opened by jorgecarleitao 8
  • Implement minimal REST API in the scheduler

    Implement minimal REST API in the scheduler

    We should start building a REST API in the scheduler using rocket. The web UI can use this REST API and it would also allow users to use other tools (such as curl) to write scripts that get metadata from the scheduler.

    For this issue, I would suggest just implementing one method to return a list of executors. We can then write up issues for additional functionality.

    rust 
    opened by andygrove 7
  • Implement physical plan serde for these expressions

    Implement physical plan serde for these expressions

    See ballista/src/serde/physical_plan/to_proto.rs

            } else if let Some(_expr) = expr.downcast_ref::<Literal>() {
                unimplemented!()
            } else if let Some(_expr) = expr.downcast_ref::<CaseExpr>() {
                unimplemented!()
            } else if let Some(_expr) = expr.downcast_ref::<CastExpr>() {
                unimplemented!()
            } else if let Some(_expr) = expr.downcast_ref::<NotExpr>() {
                unimplemented!()
            } else if let Some(_expr) = expr.downcast_ref::<IsNullExpr>() {
                unimplemented!()
            } else if let Some(_expr) = expr.downcast_ref::<IsNotNullExpr>() {
                unimplemented!()
            } else if let Some(_expr) = expr.downcast_ref::<InListExpr>() {
                unimplemented!()
            } else if let Some(_expr) = expr.downcast_ref::<InListExpr>() {
                unimplemented!()
            } else if let Some(_expr) = expr.downcast_ref::<NegativeExpr>() {
                unimplemented!()
            } else if let Some(_expr) = expr.downcast_ref::<ScalarFunctionExpr>() {
                unimplemented!()
    
    help wanted rust 
    opened by andygrove 7
  • Review use of HashMap in ColumnarBatch in branch-0.4

    Review use of HashMap in ColumnarBatch in branch-0.4

    I tried merging latest from main branch into branch-0.4 this morning and there was an issue with the changes in ColumnarBatch to use a HashMap of field name to column rather than just using a vec. I'd like to understand the motivation for this change and whether we can revert to using a vec.

    The output from any operator should be deterministic and have a fixed schema, IMO, but maybe I am missing something?

    rust 
    opened by andygrove 7
  • [design] generic system testing infrastructure

    [design] generic system testing infrastructure

    Going forward I think Ballista will need some generic system testing infrastructure. I am proposing we write this in Python and make use of the PyUnit framework. While this library is intended for unit testing, I have found it works quite well for generic system testing. I think python is a good choice for this because it is system agnostic and will allow us to write system tests for all of the Ballista components in a single place.

    I have three high level goals:

    • test correctness
    • make debugging for developers easy
    • portability, we don't want to tightly integrate testing infrastructure with CI system and tests should be able to run locally

    The core component is the Harness class which derives from unittest.TestCase. Harnesses for specific tests can derive from our base Harness class, e.g for testing python bindings, rust bindings and JDBC. This provides a nice testing life cycle. For example if we have a harness with testing methods test_x() and test_y() the life cycle looks like:

    • setUpClass() # Set up for the entire Harness - create resources on a kubernetes cluster.
    • setUp() # set up before each test -
    • test_x()
    • tearDown() # tear down after each test
    • setUp()
    • test_y()
    • tearDown()
    • tearDownClass() # tear down after all tests are complete - remove created resources

    In setUpClass() we can spin up a local kubernetes cluster using minikube/microk8s. I plan to write a nice wrapper script for cluster utilities. (cluster.py in the diagram)

    We need an entry point to run our tests (runner.py in the diagram). This script should be usable in CI and locally for debugging. I think we could use something like xml runner to get xml reports which can be consumed by a CI server.

    I am not sure how to integrate this with buildkite yet, but my goal is to write infrastructure that should work anywhere and not integrate too tightly with any one CI system.

    I am going to start with the cluster.py utility.

    image

    design 
    opened by Shamazo 6
  • Can't compile because of `packed_simd` version 0.3.3

    Can't compile because of `packed_simd` version 0.3.3

    Hi guys, the dependency of packed_simd v0.3.3 seemed broken on crates.io[0]

    When I follow the user guide, it can't compile because of packed_simd v0.3.3:

    error[E0432]: unresolved import `crate::arch::x86_64::_mm_shuffle_pi8`
       --> /Users/yxie/.cargo/registry/src/github.com-1ecc6299db9ec823/packed_simd-0.3.3/src/codegen/shuffle1_dyn.rs:40:29
        |
    40  |                         use crate::arch::x86_64::_mm_shuffle_pi8;
        |                             ^^^^^^^^^^^^^^^^^^^^^---------------
        |                             |                    |
        |                             |                    help: a similar name exists in the module: `_mm_shuffle_epi8`
        |                             no `_mm_shuffle_pi8` in `arch::x86_64`
    

    ...

    error: aborting due to 7 previous errors
    
    For more information about this error, try `rustc --explain E0432`.
    error: could not compile `packed_simd`
    

    And it seems that the original packed_simd package will no longer be updated, they choose to use another package name packed_simd_2[1] instead.

    So do we have plans to update this dependency?

    [0] https://github.com/rust-lang/packed_simd#the-cratesio-version-can-no-longer-be-updated [1] https://crates.io/crates/packed_simd_2

    bug rust 
    opened by pengye91 6
  • [Design] High level API proposal for ballista

    [Design] High level API proposal for ballista

    I've been thinking lately about what the best API is for the different parts of ballista. For now, I'm still tabling the discussion around resource managers (K8s, Mesos, YARN, etc.) and I'll focus on scheduler, executors and clients.

    I see that right now the executors implement the flight protocol, which makes perfect sense for arrow data transmission. I think this is a great fit for the "data plane" in ballista: executor<->executor communication and executor <-> client communication (for .collect).

    When it comes to the "control plane", though, I think stream-based mechanisms like the flight protocol (or even the bidirectional streams in gRPC) aren't great: they pin the communication of the client to a specific server (scheduler, in this case), so that makes it harder to dynamically increase the number of instances in a scheduler cluster (you could increase the instances, but all existing clients would still be talking the the same initial scheduler).

    I also think that unary RPCs are easier to implement, and have a higher degree of self-documentation through the protobuf definition.

    So, here's my proposal for the control plane:

    image

    The arrows go from client to server in this above image.

    The scheduler would have an API that would look something like this:

    service SchedulerGrpc {
      // This is the only call that executors need. It works similar to how Kafka's consumer/producers poll the brokers.
      // The executor needs to poll the scheduler through this method every X seconds to be part of the cluster.
      // As a reply, it gets information about the cluster state and configuration, along with a potential work item
      // that needs to be added to the executor's work queue.
      rpc PollWork (ExecutorState) returns (GetExecutorMetadataResult) {}
    
    
      // These methods are for scheduler-client interaction. These are the methods that would need to be called from
      // any client wrappers we want to provide, so they should be kept as simple as possible.
    
      rpc Execute (Action) returns (JobId) {}
    
      // JobStatus includes metadata about the job's progress along with the necessary Flight tickets to be able to
      // retrieve results through the data plane
      rpc GetJobStatus (JobId) returns (JobStatus) {}
    }
    

    I can drill down into the message definitions if you want, but I'm still not 100% sure of the complete amount of info that needs to go there (I know part of it).

    The proposal for data plane is much simpler:

    image

    All of these would be based on the Flight Protocol, but we wouldn't use any of the control messages that flight offers (i.e. no DoAction).

    Thoughts?

    design 
    opened by edrevo 6
Releases(v0.4.1)
Owner
Ballista
Distributed compute platform based on Apache Arrow
Ballista
A cross-platform, OpenGL terminal emulator.

Alacritty - A fast, cross-platform, OpenGL terminal emulator About Alacritty is a modern terminal emulator that comes with sensible defaults, but allo

Alacritty 43.7k Dec 28, 2022
Cross-platform tool to update DNS such as Gandi.net with your dynamic IP address

GDU | Generic DNS Update A cross-platform tool to update DNS zonefiles (such as Gandi.net) when you have a dynamic public IP address. It's a DynDNS or

Damien Lecan 10 Jan 20, 2022
A Rust serverless function to retrieve and relay a playlist for Twitch livestreams/VODs.

City17 A Rust serverless function to retrieve and relay a playlist for Twitch livestreams/VODs. By running this in specific countries and using a brow

Malloc Voidstar 5 Dec 15, 2021
A secure JavaScript and TypeScript runtime

Deno Deno is a simple, modern and secure runtime for JavaScript and TypeScript that uses V8 and is built in Rust. Features Secure by default. No file,

Deno Land 87.1k Jan 5, 2023
Provides a single TUI-based registry for drm-free, wine and steam games on linux, accessed through a rofi launch menu.

eidolon A conversion of steam_suite to rust with additional features. Provides a single TUI-based registry for drm-free, wine and steam games on linux

Nico Hickman 113 Dec 27, 2022
๐Ÿ‘พ Modern and minimalist pixel editor

โ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆ โ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆ โ–ˆโ–ˆ โ–ˆโ–ˆ โ–ˆโ–ˆ โ–ˆโ–ˆ โ–ˆโ–ˆ `rx` is a modern and minimalist pixel editor. Designed with g

Alexis Sellier 2.6k Jan 1, 2023
A genetic algorithm for bechmark problems, written to learn Rust.

Genetic Algorithm A genetic algorithm in Rust for the following benchmark problems: Ackley Griewangk Rastrigin Rosenbrock Schwefel Sphere Usage: Insta

Andrew Schwartzmeyer 73 Dec 25, 2022
interative assembly shell written in rust

Overview this project is inspired by https://github.com/poppycompass/asmshell Preview Build from source git clone https://github.com/keystone-engine/k

Xargin 236 Dec 23, 2022
Userspace WireGuardยฎ Implementation in Rust

BoringTun BoringTun is an implementation of the WireGuardยฎ protocol designed for portability and speed. BoringTun is successfully deployed on millions

Cloudflare 4.8k Jan 8, 2023
Drill is a HTTP load testing application written in Rust inspired by Ansible syntax

Drill Drill is a HTTP load testing application written in Rust. The main goal for this project is to build a really lightweight tool as alternative to

Ferran Basora 1.5k Dec 28, 2022
An experimental HTTP load testing application written in Rust.

Herd Herd was a small side project in building a HTTP load testing application in Rust with a main focus on being easy to use and low on OS level depe

Jacob Clark 100 Dec 27, 2022
A fast data collector in Rust

Flowgger is a fast, simple and lightweight data collector written in Rust. It reads log entries over a given protocol, extracts them, decodes them usi

Amazon Web Services - Labs 739 Jan 7, 2023
kytan: High Performance Peer-to-Peer VPN in Rust

kytan: High Performance Peer-to-Peer VPN kytan is a high performance peer to peer VPN written in Rust. The goal is to to minimize the hassle of config

Chang Lan 368 Dec 31, 2022
A purpose-built proxy for the Linkerd service mesh. Written in Rust.

This repo contains the transparent proxy component of Linkerd2. While the Linkerd2 proxy is heavily influenced by the Linkerd 1.X proxy, it comprises

Linkerd 1.7k Jan 7, 2023
Full fake REST API generator written with Rust

Weld Full fake REST API generator. This project is heavily inspired by json-server, written with rust. Synopsis Our first aim is to generate a fake ap

Seray Uzgur 243 Dec 31, 2022
pastebin written in pure rust. A rewrite of ptpb/pb.

rspb rust fork of ptpb/pb TL;DR Create a new paste from the output of cmd: cmd | curl -F c=@- https://pb.mgt.moe/ Usage Creating pastes > echo hi | c

mgt 39 Jan 4, 2023
The LibreTranslate API for Rust.

libretranslate-rs A LibreTranslate API for Rust. libretranslate = "0.2.4" libretranslate allows you to use open source machine translation in your pr

Grant Handy 51 Jan 5, 2023
Yet another pager in Rust

rust-pager Yet another pager in Rust Features Vim like keybindings Search substring Mouse wheel support Install cargo install rust-pager Usage <comman

null 19 Dec 7, 2022
Fork of async-raft, the Tokio-based Rust implementation of the Raft protocol.

Agreed Fork of async-raft, the Tokio-based Rust implementation of the Raft distributed consensus protocol. Agreed is an implementation of the Raft con

NLV8 Technologies 8 Jul 5, 2022