Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus


This repo contains a prototype of Narwhal and Tusk. It supplements the paper Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus.

Overview

We propose separating the task of transaction dissemination from transaction ordering to enable high-performance Byzantine fault-tolerant consensus in a permissioned setting. To this end, we design and evaluate a mempool protocol, Narwhal, specializing in high-throughput reliable dissemination and storage of causal histories of transactions. Narwhal tolerates an asynchronous network and maintains its performance despite failures. We demonstrate that composing Narwhal with a partially synchronous consensus protocol (HotStuff) yields significantly better throughput even in the presence of faults; however, loss of liveness during view-changes can result in high latency. To achieve good overall performance when faults occur, we propose Tusk, a zero-message-overhead asynchronous consensus protocol embedded within Narwhal. We demonstrate its high performance under a variety of configurations and faults. Further, Narwhal is designed to scale out easily using multiple workers at each validator, and we demonstrate that there is no foreseeable limit to the throughput we can achieve for consensus, with a few seconds of latency.
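To make the DAG structure concrete, below is a minimal, illustrative Rust sketch of the two objects Narwhal builds the mempool around; the type and field names are assumptions for exposition, not the crate's actual definitions.

// Illustrative sketch only: hypothetical types approximating the design above,
// not the actual definitions in this repository.
type Digest = [u8; 32];
type PublicKey = [u8; 32]; // placeholder key type
type Signature = Vec<u8>;  // placeholder signature type

// A header produced by a validator in a round: it references the transaction
// batches its workers disseminated and the certificates of the previous round,
// linking causal histories into a DAG.
struct Header {
    author: PublicKey,
    round: u64,
    payload: Vec<Digest>, // digests of worker batches (dissemination)
    parents: Vec<Digest>, // certificates of round - 1 (causal links)
}

// A certificate over a header: a quorum of signatures attesting that the
// header and the batches it references are available. Consensus (Tusk or
// HotStuff) then orders certificates without re-shipping transaction data.
struct Certificate {
    header: Header,
    votes: Vec<(PublicKey, Signature)>,
}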

As a summary of results, on a Wide Area Network (WAN), HotStuff over Narwhal achieves 170,000 tx/s with 2.5 s latency, compared to 1,800 tx/s with 1 s latency for HotStuff alone. Additional workers increase throughput linearly to 600,000 tx/s without any latency increase. Tusk achieves 140,000 tx/s with 4 s latency, about 20x better than the state-of-the-art asynchronous protocol. Under faults, both Narwhal-based protocols maintain high throughput, but the HotStuff variant suffers from slightly higher latency.

Getting Started

The core protocols are written in Rust, but all benchmarking scripts are written in Python and run with Fabric. To deploy and benchmark a testbed of 4 nodes on your local machine, clone the repo and compile it in release mode:

$ git clone https://github.com/facebookresearch/narwhal.git
$ cd narwhal/rust
$ cargo build --release

Then install the Python dependencies:

$ cd ../scripts
$ pip install -r requirements.txt

You also need to install Clang (required by rocksdb) and tmux (which runs all nodes and clients in the background). Finally, run a local benchmark using fabric:

$ fab local

This command may take a long time the first time you run it (compiling Rust code in release mode may be slow), and you can customize a number of benchmark parameters in fabfile.py. When the benchmark terminates, it displays a summary of the execution similar to the one below.

-----------------------------------------
 SUMMARY:
-----------------------------------------
 Committee size: 4 nodes
 Number of workers: 1 worker(s) per node
 Faults: 0 nodes
 Transaction size: 512 B
 Max batch size: 1,000 txs
 Transaction rate: 60,000 tx/s

 Dag Results:
 + Total certified bytes: 799,468,544 B
 + Execution time: 29,646 ms
 + Estimated BPS: 26,967,619 B/s
 + Estimated TPS: 52,671 txs/s
 + Block Latency: 6 ms
 + Client Latency: 93 ms

 Consensus Results:
 + Total committed bytes: 786,986,496 B
 + Execution time: 29,542 ms
 + Estimated BPS: 26,639,130 B/s
 + Estimated TPS: 52,030 txs/s
 + Block Latency: 395 ms
 + Client Latency: 482 ms
-----------------------------------------

License

This software is licensed under Apache 2.0.

Comments
  • Redundant Dag Traverse

    Hello there,

    I am new to this project and have one question about the consensus crate.

    I feel that 'Consensus::order_leaders' is redundant. The DAG traversal is repeated twice (or more, if some leader has no link to later blocks) across order_leaders and order_dag (see the sketch after this item). Is there any purpose for doing that?

    opened by LKY-stephen 1
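
    The following is only a rough sketch of the two passes the question above describes, using simplified, hypothetical types (Digest, Vertex, Dag); it is not the crate's actual code, but it shows why both the leader-ordering pass and the later causal-history flattening walk the DAG.

    use std::collections::{HashMap, HashSet};

    type Digest = u64; // simplified stand-in for a certificate digest

    struct Vertex {
        round: u64,
        parents: Vec<Digest>,
    }

    type Dag = HashMap<Digest, Vertex>;

    // First pass: walk backwards from the newest committed leader and keep an
    // earlier leader only if a path in the DAG links it to the leader above it.
    fn order_leaders(dag: &Dag, leaders: &[Digest]) -> Vec<Digest> {
        let mut ordered = Vec::new();
        let mut anchor = match leaders.last() {
            Some(l) => *l,
            None => return ordered,
        };
        ordered.push(anchor);
        for &leader in leaders.iter().rev().skip(1) {
            if linked(dag, anchor, leader) {
                ordered.push(leader);
                anchor = leader;
            }
        }
        ordered.reverse();
        ordered
    }

    // Checking for a path already traverses part of the DAG; a second,
    // order_dag-style pass then flattens each leader's causal history.
    fn linked(dag: &Dag, from: Digest, to: Digest) -> bool {
        let target_round = dag[&to].round;
        let mut frontier = vec![from];
        let mut visited = HashSet::new();
        while let Some(d) = frontier.pop() {
            if d == to {
                return true;
            }
            if !visited.insert(d) {
                continue;
            }
            if let Some(v) = dag.get(&d) {
                if v.round > target_round {
                    frontier.extend(&v.parents);
                }
            }
        }
        false
    }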
  • Cleanup worker

    What this PR does

    • [x] Clean up the worker while keeping the same overall structure
    • [x] Simplify the structure while keeping the same number of tasks (avoid code repetition)
    • [x] Write good unit tests
    • [x] Facilities to allow the primary to clean up the worker (we didn't have that before)
    • [x] Fallback worker sync in case the target recipient is dishonest (we didn't have that before)
    • [x] Pipe the new worker into the existing primary (untouched)
    • [x] Adapt and improve the benchmark scripts (make them print errors if anything is fishy)
    • [x] Add meaningful comments
    • [x] Run a few measurements (on AWS) to ensure we do not lose performance

    Tips for review

    1. Start with worker.rs. The worker has three main jobs: handle primary messages, process client transactions, and listen to other workers' messages.
    2. Check the function handle_primary_messages, then handle_clients_transactions, then handle_workers_messages.
    3. Dive into each component used by these functions: Synchronizer, BatchMaker, QuorumWaiter, Processor, and Helper.
    4. Each time we call a component's spawn function, we are starting a new tokio task (see the sketch after this item).
    5. Check the network crate.
    6. Pay special attention to ReliableSender (it is delicate and easy to get wrong): https://github.com/facebookresearch/narwhal/blob/cleanup-worker/rust/network/src/reliable_sender.rs
    7. Check the config crate (nothing special here).
    CLA Signed 
    opened by asonnino 1
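
    As a minimal sketch of the spawn pattern mentioned in point 4, assuming a hypothetical component (the real Synchronizer, BatchMaker, etc. carry more state): calling spawn moves the component into its own tokio task that loops over an input channel.

    // Minimal sketch, assuming a hypothetical component; not the repo's actual code.
    use tokio::sync::mpsc::{Receiver, Sender};

    struct Processor {
        rx: Receiver<Vec<u8>>, // serialized batches from upstream
        tx: Sender<[u8; 32]>,  // digests forwarded downstream
    }

    impl Processor {
        // Calling `spawn` starts the component as an independent tokio task.
        fn spawn(rx: Receiver<Vec<u8>>, tx: Sender<[u8; 32]>) {
            tokio::spawn(async move {
                Self { rx, tx }.run().await;
            });
        }

        async fn run(mut self) {
            while let Some(batch) = self.rx.recv().await {
                // Hash the batch (placeholder digest) and hand it to the next stage.
                let digest = placeholder_digest(&batch);
                if self.tx.send(digest).await.is_err() {
                    break; // downstream dropped; shut this task down
                }
            }
        }
    }

    // Stand-in for the real cryptographic digest used by the worker.
    fn placeholder_digest(data: &[u8]) -> [u8; 32] {
        let mut d = [0u8; 32];
        for (i, b) in data.iter().enumerate() {
            d[i % 32] ^= *b;
        }
        d
    }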
  • How do the replicas respond to a client?

    Hello! I am working on a secondary development of this project.

    Reading the code, I see that the client (from benchmark_client.rs) sends data to the replicas here. A replica accepts it while making and committing a new block. During this call, the client does not wait for a reply from the replica and thus receives no data back from it.

    Consider a scenario where the client needs to fetch the data stored by the state machine replication layer. Could you please give me some hints on how I can modify your code so that replicas can respond to client requests? (A generic sketch of one possible reply path follows this item.)

    Thanks a lot!

    opened by maghsk 0
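
    A generic sketch of one possible direction, assuming nothing about this codebase's actual networking layer (which uses its own framing): the node could keep the client's TCP connection open and write a small reply once the transaction has been processed or committed.

    // Generic illustration only; not tied to the repo's actual receiver code.
    use tokio::io::{AsyncReadExt, AsyncWriteExt};
    use tokio::net::TcpListener;

    async fn serve(addr: &str) -> std::io::Result<()> {
        let listener = TcpListener::bind(addr).await?;
        loop {
            let (mut socket, _) = listener.accept().await?;
            tokio::spawn(async move {
                let mut buf = vec![0u8; 512];
                while let Ok(n) = socket.read(&mut buf).await {
                    if n == 0 {
                        break; // client closed the connection
                    }
                    // ... hand the transaction to the worker / wait for commit ...
                    // Then write a small reply so the client can observe the outcome.
                    if socket.write_all(b"ack").await.is_err() {
                        break;
                    }
                }
            });
        }
    }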
  • configuration and logs of test results

    Hello, we conducted a 100-node test on the WAN, and the test configuration and results are as follows:

    + CONFIG:
    Faults: 33 node(s)
    Committee size: 100 node(s)
    Worker(s) per node: 1 worker(s)
    Collocate primary and workers: True
    Input rate: 234,500 tx/s
    Transaction size: 200 B
    Execution time: 41 s
    Header size: 1,000 B
    Max header delay: 200 ms
    GC depth: 50 round(s)
    Sync retry delay: 10,000 ms
    Sync retry nodes: 33 node(s)
    batch size: 200,000 B
    Max batch delay: 200 ms
    + RESULTS:
    Consensus TPS: 213,986 tx/s
    Consensus BPS: 42,797,208 B/s
    Consensus latency: 4,771 ms
    End-to-end TPS: 207,890 tx/s
    End-to-end BPS: 41,577,920 B/s
    End-to-end latency: 7,852 ms
    

        We have some questions about the test log and configuration.

    1. First, according to the configuration, the sending rate of each client is 3,500 tx/s and our test time is 30 s, but according to the client's log, each client sends only about 800 tx in this period of time.
    2. According to the worker's log, every 4 sample txs make up a batch, but the log reports that the batch contains 140,000 B, which doesn't seem to match the configured tx_size = 200 B.
    Batch jOiahFVevxMc4+RQEIlZfEjFHha/oBesYqcEHBKSZiU= contains sample tx 786
    Batch jOiahFVevxMc4+RQEIlZfEjFHha/oBesYqcEHBKSZiU= contains sample tx 787
    Batch jOiahFVevxMc4+RQEIlZfEjFHha/oBesYqcEHBKSZiU= contains sample tx 788
    Batch jOiahFVevxMc4+RQEIlZfEjFHha/oBesYqcEHBKSZiU= contains sample tx 789
    Batch jOiahFVevxMc4+RQEIlZfEjFHha/oBesYqcEHBKSZiU= contains 140000 B
    
    3. I would also like to ask about the meaning of the same number after each "Committed B" in the primary log.
    Committed B97(mZFTSr1a8XJoClO4) -> WK0oFGTH44pm3PYAtejZ05EDysdIuDJ1MZuphZQe3m4=
    Committed B97(mZFTSr1a8XJoClO4) -> isS9EtiKzZ2qm3DfL37mt8o02TPS519+/aEDggnzTTE=
    Committed B97(mZFTSr1a8XJoClO4) -> vI69CmxG2PlTMDc5GYZreMZVUIFIhS8zSIQCQmBxAlk=
    
    opened by xuyi9 0
  • Cleanup primary

    • [x] Clean up the primary (without changing its structure)
    • [x] Clean up consensus
    • [x] Write meaningful tests
    • [x] Run a few measurements to ensure we don't lose performance

    Could you pay special attention to the consensus?

    CLA Signed 
    opened by asonnino 0
  • Reliable sender's connection replies has a potential ordering issue

    Hello.

    I think there is a bug in the reliable sender's connection pending-replies queue ordering, but please let me know if it is my understanding that is lacking.

    If we send multiple messages to a peer, there is no guarantee that the acknowledgements will be received in the same order. Yet the code seems to assume that, because it delivers each received ACK message to the first handler in the queue (pop_front). (A sketch of this pattern follows this item.)

    opened by samlaf 0
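
    A minimal sketch, not the crate's actual code, of the pattern the issue describes: each outgoing message enqueues a waiter, and each incoming ACK is handed to the oldest waiter, which is only correct if the peer acknowledges messages strictly in send order.

    // Illustration of the FIFO-matching assumption; hypothetical, simplified types.
    use std::collections::VecDeque;
    use tokio::sync::oneshot;

    struct Connection {
        // One waiter per message written to the socket, oldest first.
        pending_replies: VecDeque<oneshot::Sender<Vec<u8>>>,
    }

    impl Connection {
        fn on_send(&mut self, waiter: oneshot::Sender<Vec<u8>>) {
            self.pending_replies.push_back(waiter);
        }

        fn on_ack(&mut self, ack: Vec<u8>) {
            // The oldest waiter receives the next ACK. If the peer can
            // acknowledge out of order, an ACK may reach the wrong handler;
            // tagging messages and ACKs with an id would avoid the ambiguity.
            if let Some(waiter) = self.pending_replies.pop_front() {
                let _ = waiter.send(ack);
            }
        }
    }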
  • aws ec2 bandwidth between different regions

    It seems that a peering connection needs to be established between any two regions to achieve high bandwidth on AWS EC2, but I didn't find this in the benchmark scripts.

    opened by shan-chen 15
  • Committee update

    What is the best way to:

    • Update the committee (change of authorities)
    • Change the authorities' network info (not the authorities per se, but for instance their IP addresses)
    enhancement question 
    opened by asonnino 0
  • Remove redundant serialize-deserialize on the Primary

    The Primary serializes and deserializes data many times for different reasons (storage, networking, replying to sync requests). It would be nice to use the same trick as the worker to avoid/limit that (a sketch of the idea follows this item).

    enhancement 
    opened by asonnino 0
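
    A hedged sketch of what that trick could look like, with assumed names (Sealed, from_wire) rather than the worker's actual types: keep the original serialized bytes next to the deserialized value so storage, networking, and sync replies can reuse them without re-serializing.

    // Illustrative only; bincode is assumed as the serialization format.
    use serde::de::DeserializeOwned;

    struct Sealed<T> {
        bytes: Vec<u8>, // exactly what arrived on the wire; reuse for store/net
        value: T,       // deserialized once, read many times
    }

    impl<T: DeserializeOwned> Sealed<T> {
        fn from_wire(bytes: Vec<u8>) -> Result<Self, bincode::Error> {
            let value = bincode::deserialize(&bytes)?;
            Ok(Self { bytes, value })
        }
    }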
Releases: v0.1.1
Owner: Facebook Research