Diagnostic tools for timely dataflow computations

Timely Diagnostics

Timely dataflows are data-parallel and scale from single-threaded execution on your laptop to distributed execution across clusters of computers. Each thread of execution is called a worker.

The tools in this repository have the shared goal of providing insights into timely dataflows of any scale, in order to understand the structure and resource usage of a dataflow.

Each timely worker can be instructed to publish low-level event streams over a TCP socket by setting the TIMELY_WORKER_LOG_ADDR environment variable. To cope with the high volume of these logging streams, the diagnostic tools in this repository are themselves timely computations that can be scaled out. To avoid confusion, we will refer to the workers of the dataflow being analysed as the source peers, and to the workers of the diagnostic computation as the inspector peers.

This repository contains a library, tdiag-connect, and a command line interface to the diagnostic tools, tdiag.

tdiag-connect (in /connect) is a library of utilities that can be used by inspector peers to source event streams from source peers.

tdiag (in /tdiag) is a unified command line interface to all diagnostic tools (only one is currently available; more are coming).

Getting Started with tdiag

tdiag is the command line interface to all diagnostic tools. Install it via cargo:

cargo install tdiag

All diagnostic computations require you to specify the number of workers running in the source computation via the --source-peers parameter. This is required in order to know when all source event streams have connected.

graph - Visualize the Source Dataflow

In order to better understand what is happening inside of a dataflow computation, it can be invaluable to visualize the structure of the dataflow. Start the graph diagnosis:

tdiag --source-peers 2 graph --out graph.html

You should be presented with a notice informing you that tdiag is waiting for as many connections as specified via --source-peers (two in this case).

In a separate shell, start your source computation. In this case, we will analyse the Timely PageRank example. From inside the timely-dataflow/timely sub-directory, run:

env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" cargo run --example pagerank 1000 1000000 -w 2

Most importantly, env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" will cause the source workers to connect to our diagnostic computation. The -w parameter specifies the number of workers we want to run the PageRank example with. Whatever we specify here therefore has to match the --source-peers parameter we used when starting tdiag.

Once the computation is running, head back to the diagnostic shell, where you should now see something like the following:

$ tdiag --source-peers 2 graph --out graph.html

Listening for 2 connections on 127.0.0.1:51317
Trace sources connected
Press enter to generate graph (this will crash the source computation if it hasn't terminated).

At any point, press enter as instructed. This will produce a fully self-contained HTML file at the path specified via --out (graph.html in this example). Open that file in any modern browser and you should see a rendering of the dataflow graph at the time you pressed enter. For the PageRank computation, the rendering should look similar to the following:

PageRank Graph

You can use your mouse or touchpad to move the graph around, and to zoom in and out.

profile - Profile the Source Dataflow

The profile subcommand reports aggregate runtime for each scope/operator.

tdiag --source-peers 2 profile

You should be presented with a notice informing you that tdiag is waiting for as many connections as specified via --source-peers (two in this case).

In a separate shell, start your source computation. In this case, we will analyse the Timely PageRank example. From inside the timely-dataflow/timely sub-directory, run:

env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" cargo run --example pagerank 1000 1000000 -w 2

Most importantly, env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" will cause the source workers to connect to our diagnostic computation. The -w parameter specifies the number of workers we want to run the PageRank example with. Whatever we specify here therefore has to match the --source-peers parameter we used when starting tdiag.

Once the computation is running, head back to the diagnostic shell, where you should now see something like the following:

$ tdiag --source-peers 2 profile

Listening for 2 connections on 127.0.0.1:51317
Trace sources connected
Press enter to stop collecting profile data (this will crash the source computation if it hasn't terminated).

At any point, press enter as instructed. This will produce an aggregate summary of runtime for each scope/operator. Note that the aggregates for the scopes (denoted by [scope]) include the time of all contained operators.

[scope]	Dataflow	(id=0, addr=[0]):	1.17870668e-1 s
	PageRank	(id=3, addr=[0, 3]):	1.17197194e-1 s
	Feedback	(id=2, addr=[0, 2]):	3.56249e-4 s
	Probe	(id=6, addr=[0, 4]):	7.86e-6 s
	Input	(id=1, addr=[0, 1]):	3.408e-6 s

Diagnosing Differential Dataflows

The differential subcommand groups diagnostic tools that are only relevant to timely dataflows that make use of differential dataflow. To enable Differential logging in your own computation, add the following snippet to your code:

// Stream differential log events to the address in DIFFERENTIAL_LOG_ADDR, if set.
if let Ok(addr) = ::std::env::var("DIFFERENTIAL_LOG_ADDR") {
    if let Ok(stream) = ::std::net::TcpStream::connect(&addr) {
        differential_dataflow::logging::enable(worker, stream);
        info!("enabled DIFFERENTIAL logging to {}", addr);
    } else {
        panic!("Could not connect to differential log address: {:?}", addr);
    }
}

With this snippet included in your executable, you can use any of the following tools to analyse differential-specific aspects of your computation.

differential arrangements - Track the Size of Differential Arrangements

Stateful differential dataflow operators often maintain indexed input traces called arrangements. You will want to understand how these traces grow (through the accumulation of new inputs) and shrink (through compaction) in size, as your computation executes.

tdiag --source-peers 2 differential arrangements

You should be presented with a notice informing you that tdiag is waiting for as many connections as specified via --source-peers (two in this case).

In a separate shell, start your source computation. In this case, we will analyse the Differential BFS example. From inside the differential dataflow repository, run:

export TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317"
export DIFFERENTIAL_LOG_ADDR="127.0.0.1:51318"

cargo run --example bfs 1000 10000 100 20 false -w 2

When analysing differential dataflows (in contrast to pure timely computations), both TIMELY_WORKER_LOG_ADDR and DIFFERENTIAL_LOG_ADDR must be set for the source workers to connect to our diagnostic computation. The -w parameter specifies the number of workers we want to run the BFS example with. Whatever we specify here therefore has to match the --source-peers parameter we used when starting tdiag.

Once the computation is running, head back to the diagnostic shell, where you should now see something like the following:

$ tdiag --source-peers 2 differential arrangements

Listening for 2 Timely connections on 127.0.0.1:51317
Listening for 2 Differential connections on 127.0.0.1:51318
Will report every 1000ms
Trace sources connected

ms	Worker	Op. Id	Name	# of tuples
1000	0	18	Arrange ([0, 4, 6])	654
1000	0	20	Arrange ([0, 4, 7])	5944
1000	0	28	Arrange ([0, 4, 10])	3790
1000	0	30	Reduce ([0, 4, 11])	654
1000	1	18	Arrange ([0, 4, 6])	679
1000	1	20	Arrange ([0, 4, 7])	6006
1000	1	28	Arrange ([0, 4, 10])	3913
1000	1	30	Reduce ([0, 4, 11])	678
2000	0	18	Arrange ([0, 4, 6])	654
2000	0	18	Arrange ([0, 4, 6])	950
2000	0	20	Arrange ([0, 4, 7])	5944
2000	0	20	Arrange ([0, 4, 7])	6937
2000	0	28	Arrange ([0, 4, 10])	3790

Each row of output specifies the time of the measurement, the worker and operator ids, the name of the arrangement, and the number of tuples it maintains. Updated sizes are reported every second by default; this can be controlled via the --output-interval parameter.

The tdiag-connect library

Crates.io Docs

tdiag-connect (in /connect) is a library of utilities that can be used by inspector peers to source event streams from source peers.

Documentation is at docs.rs/tdiag-connect.

Comments
  • A browser-based tool to display the dataflow graph


    This is a work-in-progress tool to gather a timely dataflow trace and render the dataflow graph.

    Please give it a try!

    • clone this repo, switch to this branch (graph-tool);
    • start the tool with cargo run -- --source-peers 2 graph --out graph.html;
    • in another shell, start the computation you'd like to visualize, and instruct timely to send trace data to the tool: TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" cargo run ... -- -w 2
    • once the computation you'd like to visualize has started, press enter at the tool prompt;
    • open graph.html with a modern browser.

It should look something like this: (screenshot)

    opened by utaal 20
  • Command to log logical trace sizes over time


Goal of this is to track the number of arranged tuples across all arrangements in the computation, as the computation runs. Right now the only thing it does is ingest Batch and Merge events into a Differential collection (arrangement_id, t, ±tuples) and report accumulated counts every second.

    Remaining TODOs.

    • [x] join with Operates to extract arrangement names
    • [x] make output interval configurable
    • [x] make differential logging port configurable

This is not as plug-and-play, because it requires augmenting a Differential computation with the following snippet:

    if let Ok(addr) = ::std::env::var("DIFFERENTIAL_LOG_ADDR") {
        info!("enabled DIFFERENTIAL logging to {}", addr);
    
        if let Ok(stream) = ::std::net::TcpStream::connect(&addr) {
            use differential_dataflow::logging::DifferentialEvent;
    
            let writer = ::timely::dataflow::operators::capture::EventWriter::new(stream);
            let mut logger = ::timely::logging::BatchLogger::new(writer);
    
            worker.log_register()
               .insert::<DifferentialEvent,_>("differential/arrange", move |time, data| logger.publish_batch(time, data));
        } else {
            panic!("Could not connect to differential log address: {:?}", addr);
        }
    }
    
    opened by comnik 14
  • Diagnostics PageRank Example Stuck


    Hello. I was trying to follow the example on the README.md but tdiag gets stuck.

    1. On one terminal I execute: cargo run --release -- --source-peers 2 graph --out graph.html
    2. On a second terminal I execute: env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" cargo run --release --example pagerank 1000 100000 -w 2
    3. pagerank runs to completion.
    4. tdiag acknowledges connection via:
    Listening for 2 connections on 127.0.0.1:51317
    Trace sources connected
    Press enter to generate graph (this will crash the source computation if it hasn't terminated).
    
    5. I press enter but tdiag hangs indefinitely.

    Looking at the stack trace of tdiag there are two threads. The main thread is waiting on a thread join. Thread2 also seems stuck on await_events. Stack trace for Thread2:

    futex_wait_cancelable 0x00007ffff7d9c376
    __pthread_cond_wait_common 0x00007ffff7d9c376
    __pthread_cond_wait 0x00007ffff7d9c376
    std::sys::unix::condvar::Condvar::wait condvar.rs:73
    std::sys_common::condvar::Condvar::wait condvar.rs:50
    std::sync::condvar::Condvar::wait condvar.rs:200
    std::thread::park mod.rs:923
    <timely_communication::allocator::thread::Thread as timely_communication::allocator::Allocate>::await_events thread.rs:44
    <timely_communication::allocator::generic::Generic as timely_communication::allocator::Allocate>::await_events generic.rs:99
    timely::worker::Worker<A>::step_or_park worker.rs:216
    timely::execute::execute::{{closure}} execute.rs:206
    timely_communication::initialize::initialize_from::{{closure}} initialize.rs:269
    std::sys_common::backtrace::__rust_begin_short_backtrace backtrace.rs:130
    std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}} mod.rs:475
    <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once panic.rs:318
    std::panicking::try::do_call panicking.rs:297
    __rust_try 0x000055555661a74d
    std::panicking::try panicking.rs:274
    std::panic::catch_unwind panic.rs:394
    std::thread::Builder::spawn_unchecked::{{closure}} mod.rs:474
    core::ops::function::FnOnce::call_once{{vtable-shim}} function.rs:232
    <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once boxed.rs:1034
    <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once boxed.rs:1034
    std::sys::unix::thread::Thread::new::thread_start thread.rs:87
    start_thread 0x00007ffff7d95609
    clone 0x00007ffff7ed1103
    

    Please let me know if I missed something when executing the commands.

    opened by gatoWololo 3
  • Command line tool to "profile" timely

    Reports aggregate runtime for each scope/operator.

    See the main README for usage instructions (no need for --output here).

    Sample output:

    $ cargo run -- --source-peers 4 profile
    Listening for 4 connections on 127.0.0.1:51317
    Trace sources connected
    Press enter to stop collecting profile data (this will crash the source computation if it hasn't terminated).
    [scope]	Dataflow	(id=0, addr=[0]):	1.7661819e-2 s
    [scope]	Dataflow	(id=120, addr=[2]):	8.231392e-3 s
    [scope]	Dataflow	(id=82, addr=[1]):	6.199795e-3 s
    [scope]	Region	(id=116, addr=[1, 3]):	3.12838e-3 s
    	Reduce	(id=50, addr=[0, 26]):	4.49143e-4 s
    	Reduce	(id=60, addr=[0, 31]):	4.41773e-4 s
    	Join	(id=147, addr=[2, 15]):	4.17792e-4 s
    	Arrange	(id=8, addr=[0, 6]):	3.81e-4 s
    	Concatenate	(id=66, addr=[0, 34]):	3.76647e-4 s
    	Reduce	(id=10, addr=[0, 7]):	3.4767e-4 s
    	Join	(id=99, addr=[1, 3, 7]):	3.43986e-4 s
    	Arrange	(id=21, addr=[0, 12]):	3.39887e-4 s
    	Concatenate	(id=16, addr=[0, 10]):	3.30095e-4 s
    	Concatenate	(id=37, addr=[0, 20]):	3.1384e-4 s
    	Arrange	(id=154, addr=[2, 18]):	2.75343e-4 s
    	Arrange	(id=42, addr=[0, 22]):	2.70947e-4 s
    	Map	(id=19, addr=[0, 11]):	2.66121e-4 s
    [snip]
    
    opened by utaal 3
  • Functions to construct EventReaders for diagnostic streams


    This is a subset of #1. It contains the functions to create EventReaders and a Replay operator that support early shutdown. make_readers supports both files and sockets as sources, borrowing from https://github.com/li1/snailtrail.

    @li1 please give these a try! We may/will publish these as a crate, but for now you can use a git dependency.

    opened by utaal 3
  • Correct filtering of retractions


    Retractions were being filtered while still assessing the number of records, causing the calculations to not reflect progress made by merges. At the same time, the filtering was not applied to count outputs, so retractions there (when counts increased) were presented as positive occurrences (though they should all have been immediately followed by new larger values, as the previous filtering ensured no counts ever decreased).

    opened by frankmcsherry 2
  • Extend README


    A first draft of extending the README with usage information.

    (easiest to look at directly: https://github.com/TimelyDataflow/diagnostics/tree/readme)

    opened by comnik 2
  • Upgrade to 0.11 and support drop events


    This PR upgrades the Timely and DD dependencies to 0.11, fixing a breaking change to the builder interface, and adds support for batch drop events, as mentioned in #12.

    (@frankmcsherry )

    opened by comnik 1
  • [differential arrangements] Support batch drop events


    Differential just got https://github.com/TimelyDataflow/differential-dataflow/pull/201, which makes it possible to track batches created in transient traces. Seems like differential arrangements could do something useful with that.

    enhancement 
    opened by comnik 1
  • bump timely & differential versions to 0.10


    This fixes #8 .

    Currently, tdiag won't build since it relies on the crates.io 0.1 release which still uses 0.9 dependencies. If you'd like me to, I can also bump the crate versions & connect dependency from 0.1.1-pre to e.g. 0.1.2-pre?

    opened by li1 1
  • Fix arrangement profiling's --output-interval


    The help text for --output-interval was being omitted because the auto-generated documentation was manually overridden; I changed it to append the manual part to the auto-generated part. Sub-second intervals were also not supported; I fixed this, so millisecond-granularity windows are now supported, instead of just whole seconds.

    opened by namibj 0
  • [graph] visualization to indicate where a dataflow is stuck


    @frankmcsherry:

    It is hard to diagnose a "stuck" timely dataflow computation, where for some reason there is a capability (or perhaps message) in the system that prevents forward progress. In the system there is fairly clear information (in the progress tracking) about which pointstamps have non-zero accumulation, and although perhaps not strictly speaking a "visualization" we could imagine extracting and presenting this information.

    @antiguru recently had a similar issue, in which he wanted to "complete" a dataflow without simply exiting the worker (to take some measurements), but when he attempted this the dataflow never reported completion. The root cause was ultimately that a forgotten input was left un-closed.

    One idiom that seemed helpful here was to imagine a version of the dataflow graph that reports e.g. whether operators have been tombstoned or not (closed completely, memory reclaimed). This would reveal who was keeping a dataflow open, which is a rougher version of what is holding a dataflow back. We might also look for similar idioms that allow people to ask, for a given timestamp/frontier, which operators have moved past that frontier and which have not, revealing where in the dataflow graph a time is "stuck".

    enhancement 
    opened by utaal 3
  • [graph] Support filtering/collapsing of data flow regions to make huge graphs readable


    This may be needed for complex graphs like Frank's epic doop graph.

    Various snippets from #1:


    @frankmcsherry

    I've been thinking a bit about how to present these, and one thought was: maybe it makes sense to have two nodes for the feedback node, and to not connect them other than visually. This maybe allows the graph to dangle a bit better, and reveals the acyclic definitions.

    @comnik

    I think this could be a good use case for a touch of interactivity, e.g. draw the nodes somewhat differently to indicate an outgoing / incoming feedback edge, and then highlight the pair when the user hovers on either of the two nodes.

    As an experiment, I built an extra script for adding DataScript into the mix. This is intended to be completely opt-in, without changing anything about the current representation.

    I also added a hook to re-render the whole thing reactively.

    This should give us a low-overhead (no React!) way to experiment with a few more dynamic features, such as highlighting feedback edges.

    It would be helpful to have scopes be exported as well, which would allow us to do things such as collapsing / expanding scopes.

    enhancement 
    opened by utaal 1