Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus


This repo contains a prototype of Narwhal and Tusk. It supplements the paper Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus.

Overview

We propose separating the task of transaction dissemination from transaction ordering to enable high-performance Byzantine fault-tolerant consensus in a permissioned setting. To this end, we design and evaluate a mempool protocol, Narwhal, specializing in high-throughput reliable dissemination and storage of causal histories of transactions. Narwhal tolerates an asynchronous network and maintains its performance despite failures. We demonstrate that composing Narwhal with a partially synchronous consensus protocol (HotStuff) yields significantly better throughput even in the presence of faults; however, loss of liveness during view-changes can result in high latency. To achieve good overall performance when faults occur, we propose Tusk, a zero-message-overhead asynchronous consensus protocol embedded within Narwhal. We demonstrate its high performance under a variety of configurations and faults. Further, Narwhal is designed to scale out easily using multiple workers at each validator, and we demonstrate that there is no foreseeable limit to the throughput we can achieve for consensus, with a few seconds of latency.
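To make the DAG structure concrete, below is a minimal, illustrative Rust sketch of the two objects Narwhal builds the mempool around; the type and field names are assumptions for exposition, not the crate's actual definitions.

// Illustrative sketch only: hypothetical types approximating the design above,
// not the actual definitions in this repository.
type Digest = [u8; 32];
type PublicKey = [u8; 32]; // placeholder key type
type Signature = Vec<u8>;  // placeholder signature type

// A header produced by a validator in a round: it references the transaction
// batches its workers disseminated and the certificates of the previous round,
// linking causal histories into a DAG.
struct Header {
    author: PublicKey,
    round: u64,
    payload: Vec<Digest>, // digests of worker batches (dissemination)
    parents: Vec<Digest>, // certificates of round - 1 (causal links)
}

// A certificate over a header: a quorum of signatures attesting that the
// header and the batches it references are available. Consensus (Tusk or
// HotStuff) then orders certificates without re-shipping transaction data.
struct Certificate {
    header: Header,
    votes: Vec<(PublicKey, Signature)>,
}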

As a summary of results, on a Wide Area Network (WAN), HotStuff over Narwhal achieves 170,000 tx/s with 2.5 s latency, compared to 1,800 tx/s with 1 s latency for HotStuff alone. Additional workers increase throughput linearly to 600,000 tx/s without any latency increase. Tusk achieves 140,000 tx/s with 4 s latency, about 20x better than the state-of-the-art asynchronous protocol. Under faults, both Narwhal-based protocols maintain high throughput, but the HotStuff variant suffers from slightly higher latency.

Getting Started

The core protocols are written in Rust, but all benchmarking scripts are written in Python and run with Fabric. To deploy and benchmark a testbed of 4 nodes on your local machine, clone the repo and compile it in release mode:

$ git clone https://github.com/facebookresearch/narwhal.git
$ cd narwhal/rust
$ cargo build --release

Then install the Python dependencies:

$ cd ../scripts
$ pip install -r requirements.txt

You also need to install Clang (required by rocksdb) and tmux (which runs all nodes and clients in the background). Finally, run a local benchmark using fabric:

$ fab local

This command may take a long time the first time you run it (compiling Rust code in release mode may be slow), and you can customize a number of benchmark parameters in fabfile.py. When the benchmark terminates, it displays a summary of the execution similar to the one below.

-----------------------------------------
 SUMMARY:
-----------------------------------------
 Committee size: 4 nodes
 Number of workers: 1 worker(s) per node
 Faults: 0 nodes
 Transaction size: 512 B
 Max batch size: 1,000 txs
 Transaction rate: 60,000 tx/s

 Dag Results:
 + Total certified bytes: 799,468,544 B
 + Execution time: 29,646 ms
 + Estimated BPS: 26,967,619 B/s
 + Estimated TPS: 52,671 txs/s
 + Block Latency: 6 ms
 + Client Latency: 93 ms

 Consensus Results:
 + Total committed bytes: 786,986,496 B
 + Execution time: 29,542 ms
 + Estimated BPS: 26,639,130 B/s
 + Estimated TPS: 52,030 txs/s
 + Block Latency: 395 ms
 + Client Latency: 482 ms
-----------------------------------------

License

This software is licensed under Apache 2.0.

Comments
  • Redundant Dag Traverse

    Hello there,

    I am new to this project and have one question about the consensus crate.

    I feel that 'Consensus::order_leaders' is redundant. The DAG traversal is repeated twice (or more, if some leader has no link to later blocks) across order_leaders and order_dag (see the sketch after this item). Is there any purpose for doing that?

    opened by LKY-stephen 1
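
    The following is only a rough sketch of the two passes the question above describes, using simplified, hypothetical types (Digest, Vertex, Dag); it is not the crate's actual code, but it shows why both the leader-ordering pass and the later causal-history flattening walk the DAG.

    use std::collections::{HashMap, HashSet};

    type Digest = u64; // simplified stand-in for a certificate digest

    struct Vertex {
        round: u64,
        parents: Vec<Digest>,
    }

    type Dag = HashMap<Digest, Vertex>;

    // First pass: walk backwards from the newest committed leader and keep an
    // earlier leader only if a path in the DAG links it to the leader above it.
    fn order_leaders(dag: &Dag, leaders: &[Digest]) -> Vec<Digest> {
        let mut ordered = Vec::new();
        let mut anchor = match leaders.last() {
            Some(l) => *l,
            None => return ordered,
        };
        ordered.push(anchor);
        for &leader in leaders.iter().rev().skip(1) {
            if linked(dag, anchor, leader) {
                ordered.push(leader);
                anchor = leader;
            }
        }
        ordered.reverse();
        ordered
    }

    // Checking for a path already traverses part of the DAG; a second,
    // order_dag-style pass then flattens each leader's causal history.
    fn linked(dag: &Dag, from: Digest, to: Digest) -> bool {
        let target_round = dag[&to].round;
        let mut frontier = vec![from];
        let mut visited = HashSet::new();
        while let Some(d) = frontier.pop() {
            if d == to {
                return true;
            }
            if !visited.insert(d) {
                continue;
            }
            if let Some(v) = dag.get(&d) {
                if v.round > target_round {
                    frontier.extend(&v.parents);
                }
            }
        }
        false
    }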
  • Cleanup worker

    What this PR does

    • [x] Clean up the worker while keeping the same overall structure
    • [x] Simplify the structure while keeping the same number of tasks (avoid code repetition)
    • [x] Write good unit tests
    • [x] Facilities to allow the primary to clean up the worker (we didn't have that before)
    • [x] Fallback worker sync in case the target recipient is dishonest (we didn't have that before)
    • [x] Pipe the new worker into the existing primary (untouched)
    • [x] Adapt and improve the benchmark scripts (make them print errors if anything is fishy)
    • [x] Add meaningful comments
    • [x] Run a few measurements (on AWS) to ensure we do not lose performance

    Tips for review

    1. Start with worker.rs. The worker has three main jobs: handle primary messages, process client transactions, and listen to other workers' messages.
    2. Check the function handle_primary_messages, then handle_clients_transactions, then handle_workers_messages.
    3. Dive into each component used by these functions: Synchronizer, BatchMaker, QuorumWaiter, Processor, and Helper.
    4. Each time we call a component's spawn function, we are starting a new tokio task (see the sketch after this item).
    5. Check the network crate.
    6. Pay special attention to ReliableSender (it is delicate and easy to get wrong): https://github.com/facebookresearch/narwhal/blob/cleanup-worker/rust/network/src/reliable_sender.rs
    7. Check the config crate (nothing special here).
    CLA Signed 
    opened by asonnino 1
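
    As a minimal sketch of the spawn pattern mentioned in point 4, assuming a hypothetical component (the real Synchronizer, BatchMaker, etc. carry more state): calling spawn moves the component into its own tokio task that loops over an input channel.

    // Minimal sketch, assuming a hypothetical component; not the repo's actual code.
    use tokio::sync::mpsc::{Receiver, Sender};

    struct Processor {
        rx: Receiver<Vec<u8>>, // serialized batches from upstream
        tx: Sender<[u8; 32]>,  // digests forwarded downstream
    }

    impl Processor {
        // Calling `spawn` starts the component as an independent tokio task.
        fn spawn(rx: Receiver<Vec<u8>>, tx: Sender<[u8; 32]>) {
            tokio::spawn(async move {
                Self { rx, tx }.run().await;
            });
        }

        async fn run(mut self) {
            while let Some(batch) = self.rx.recv().await {
                // Hash the batch (placeholder digest) and hand it to the next stage.
                let digest = placeholder_digest(&batch);
                if self.tx.send(digest).await.is_err() {
                    break; // downstream dropped; shut this task down
                }
            }
        }
    }

    // Stand-in for the real cryptographic digest used by the worker.
    fn placeholder_digest(data: &[u8]) -> [u8; 32] {
        let mut d = [0u8; 32];
        for (i, b) in data.iter().enumerate() {
            d[i % 32] ^= *b;
        }
        d
    }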
  • How do the replicas respond to a client?

    Hello! I am working on a secondary development of this project.

    Reading the code, I see that the client (from benchmark_client.rs) sends data to the replicas here. A replica accepts it while making and committing a new block. During this call, the client does not wait for a reply from the replica and thus receives no data back from it.

    Consider a scenario where the client needs to fetch the data stored by the state machine replication layer. Could you please give me some hints on how I can modify your code so that replicas can respond to client requests? (A generic sketch of one possible reply path follows this item.)

    Thanks a lot!

    opened by maghsk 0
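
    A generic sketch of one possible direction, assuming nothing about this codebase's actual networking layer (which uses its own framing): the node could keep the client's TCP connection open and write a small reply once the transaction has been processed or committed.

    // Generic illustration only; not tied to the repo's actual receiver code.
    use tokio::io::{AsyncReadExt, AsyncWriteExt};
    use tokio::net::TcpListener;

    async fn serve(addr: &str) -> std::io::Result<()> {
        let listener = TcpListener::bind(addr).await?;
        loop {
            let (mut socket, _) = listener.accept().await?;
            tokio::spawn(async move {
                let mut buf = vec![0u8; 512];
                while let Ok(n) = socket.read(&mut buf).await {
                    if n == 0 {
                        break; // client closed the connection
                    }
                    // ... hand the transaction to the worker / wait for commit ...
                    // Then write a small reply so the client can observe the outcome.
                    if socket.write_all(b"ack").await.is_err() {
                        break;
                    }
                }
            });
        }
    }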
  • configuration and logs of test results

    Hello, we conducted a 100-node test on the WAN, and the test configuration and results are as follows:

    + CONFIG:
    Faults: 33 node(s)
    Committee size: 100 node(s)
    Worker(s) per node: 1 worker(s)
    Collocate primary and workers: True
    Input rate: 234,500 tx/s
    Transaction size: 200 B
    Execution time: 41 s
    Header size: 1,000 B
    Max header delay: 200 ms
    GC depth: 50 round(s)
    Sync retry delay: 10,000 ms
    Sync retry nodes: 33 node(s)
    batch size: 200,000 B
    Max batch delay: 200 ms
    + RESULTS:
    Consensus TPS: 213,986 tx/s
    Consensus BPS: 42,797,208 B/s
    Consensus latency: 4,771 ms
    End-to-end TPS: 207,890 tx/s
    End-to-end BPS: 41,577,920 B/s
    End-to-end latency: 7,852 ms
    

        We have some questions about the test log and configuration.

    1. First, according to the configuration, the sending rate of each client is 3,500 tx/s and our test time is 30 s, but according to the client's log, each client sends only about 800 tx in this period of time.
    2. According to the worker's log, every 4 sample txs make up a batch, but the log reports that the batch contains 140,000 B, which doesn't seem to match the configured tx_size = 200 B.
    Batch jOiahFVevxMc4+RQEIlZfEjFHha/oBesYqcEHBKSZiU= contains sample tx 786
    Batch jOiahFVevxMc4+RQEIlZfEjFHha/oBesYqcEHBKSZiU= contains sample tx 787
    Batch jOiahFVevxMc4+RQEIlZfEjFHha/oBesYqcEHBKSZiU= contains sample tx 788
    Batch jOiahFVevxMc4+RQEIlZfEjFHha/oBesYqcEHBKSZiU= contains sample tx 789
    Batch jOiahFVevxMc4+RQEIlZfEjFHha/oBesYqcEHBKSZiU= contains 140000 B
    
    3. I would also like to ask about the meaning of the same number after each "Committed B" in the primary log.
    Committed B97(mZFTSr1a8XJoClO4) -> WK0oFGTH44pm3PYAtejZ05EDysdIuDJ1MZuphZQe3m4=
    Committed B97(mZFTSr1a8XJoClO4) -> isS9EtiKzZ2qm3DfL37mt8o02TPS519+/aEDggnzTTE=
    Committed B97(mZFTSr1a8XJoClO4) -> vI69CmxG2PlTMDc5GYZreMZVUIFIhS8zSIQCQmBxAlk=
    
    opened by xuyi9 0
  • Cleanup primary

    • [x] Clean up the primary (without changing its structure)
    • [x] Clean up consensus
    • [x] Write meaningful tests
    • [x] Run a few measurements to ensure we don't lose performance

    Could you pay special attention to the consensus?

    CLA Signed 
    opened by asonnino 0
  • Reliable sender's connection replies has a potential ordering issue

    Hello.

    I think there is a bug in the reliable sender's connection pending-replies queue ordering, but please let me know if it is my understanding that is lacking.

    If we send multiple messages to a peer, there is no guarantee that the acknowledgements will be received in the same order. Yet the code seems to assume that, because it delivers each received ACK message to the first handler in the queue (pop_front). (A sketch of this pattern follows this item.)

    opened by samlaf 0
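
    A minimal sketch, not the crate's actual code, of the pattern the issue describes: each outgoing message enqueues a waiter, and each incoming ACK is handed to the oldest waiter, which is only correct if the peer acknowledges messages strictly in send order.

    // Illustration of the FIFO-matching assumption; hypothetical, simplified types.
    use std::collections::VecDeque;
    use tokio::sync::oneshot;

    struct Connection {
        // One waiter per message written to the socket, oldest first.
        pending_replies: VecDeque<oneshot::Sender<Vec<u8>>>,
    }

    impl Connection {
        fn on_send(&mut self, waiter: oneshot::Sender<Vec<u8>>) {
            self.pending_replies.push_back(waiter);
        }

        fn on_ack(&mut self, ack: Vec<u8>) {
            // The oldest waiter receives the next ACK. If the peer can
            // acknowledge out of order, an ACK may reach the wrong handler;
            // tagging messages and ACKs with an id would avoid the ambiguity.
            if let Some(waiter) = self.pending_replies.pop_front() {
                let _ = waiter.send(ack);
            }
        }
    }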
  • aws ec2 bandwidth between different regions

    It seems that a peering connection needs to be established between any two regions to achieve high bandwidth on AWS EC2, but I didn't find this in the benchmark scripts.

    opened by shan-chen 15
  • Committee update

    What is the best way to:

    • Update the committee (change of authorities)
    • Change the authorities' network info (not the authorities per se, but for instance their IP addresses)
    enhancement question 
    opened by asonnino 0
  • Remove redundant serialize-deserialize on the Primary

    The Primary serializes and deserializes data many times for different reasons (storage, networking, replying to sync requests). It would be nice to use the same trick as the worker to avoid/limit that (a sketch of the idea follows this item).

    enhancement 
    opened by asonnino 0
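
    A hedged sketch of what that trick could look like, with assumed names (Sealed, from_wire) rather than the worker's actual types: keep the original serialized bytes next to the deserialized value so storage, networking, and sync replies can reuse them without re-serializing.

    // Illustrative only; bincode is assumed as the serialization format.
    use serde::de::DeserializeOwned;

    struct Sealed<T> {
        bytes: Vec<u8>, // exactly what arrived on the wire; reuse for store/net
        value: T,       // deserialized once, read many times
    }

    impl<T: DeserializeOwned> Sealed<T> {
        fn from_wire(bytes: Vec<u8>) -> Result<Self, bincode::Error> {
            let value = bincode::deserialize(&bytes)?;
            Ok(Self { bytes, value })
        }
    }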
Releases: v0.1.1
Owner: Facebook Research