Byzantine-fault-tolerant time synchronization

Overview

Byztime

Byztime is a Byzantine-fault-tolerant protocol for synchronizing time among a group of peers, without reliance on any external time authority. The time kept by Byztime is simply a counter that advances at a rate of something very close to one unit per second, such that all nodes are in close agreement as to its current value. Byztime timestamps have no well-defined epoch. If all nodes have correctly-set system clocks when first initialized, then Byztime will initially match POSIX time, but will eventually drift away from it since 1. there is no external source keeping it in sync, and 2. Byztime's timescale lacks leap seconds.

Byztime's algorithm is focused on keeping its worst-case error --- the absolute distance between any two correct nodes' estimate of the current time --- as small as possible. It achieves this somewhat at the expense of typical-case error, using only the single highest-quality time sample from each peer rather than combining many samples to smooth out network jitter. In the worst case, the difference between two correct nodes' clocks will asymptotically converge toward 4δ + 4ερ, where δ is the one-way network latency between the two farthest-spaced peers, ε is the (dimensionless) drift rate of correct nodes' hardware clocks, and ρ is the polling interval. If all nodes behave honestly, the bound improves to 2δ + 2ερ and will be reached after a single round of the protocol rather than converging asymptotically.

Byztimed runs completely independently of NTP, and a bad NTP time source will not disrupt Byztime. This comes with a minor caveat: just before the daemon shuts down it records the the current offset between Byztime time and system time, and uses this offset to re-initialize its estimate following a reboot. The only time this particularly matters is if many nodes reboot simultaneously and the network loses quorum. What happens in this case depends somewhat on NTP and what order things start up in at boot time. If Byztime starts before NTP starts and shuts down only after NTP shuts down, then the continuity of the Byztime timescale will be as good as the RTC and the CMOS battery of the restarting nodes, but no better. On the other hand if NTP is allowed to stabilize the system clock before Byztime starts up, then the continuity of the Byztime scales will be as good as its NTP sources — which is probably a lot better than your RTC, but could be arbitrarily bad if the NTP source is faulty. Again, this only becomes an issue if Byztime loses quorum, meaning ⅓ or more of the network reboots at once.

Byztime also currently relies on the system time for determining whether an X.509 certificate is expired. Once Roughtime matures a bit we may consider integrating a Roughtime client into byztimed for certificate validation purposes.

Build

Byztime is built like any standard Rust crate. The easiest way to install byztimed is to get it from crates.io via cargo:

cargo install byztimed

If you prefer to check out and build this repo, note that byztimed includes libbyztime as a submodule, so be sure to clone with git clone --recurse-submodules, or run git submodule update --init --recursive if you have already cloned without the --recurse-submodules option.

Byztime is tested against Rust's stable channel, but compilers significantly older than the current stable will probably work. The most recent version known not to work is 1.38 (because we rely on async/await, which stabilized in 1.39).

Byztimed currently runs only on Linux, and is well-tested only on AMD64. Other CPU architectures should work; please file a bug ticket if you encounter any issues. We hope to eventually support more operating systems, but this will be an uphill battle because Byztime depends on timekeeping facilities that currently only Linux provides. The effort to improve portability will likely require contributing some new functionality to other OS kernels.

Usage

Run byztimed as byztimed <path-to-config-file>. See CONFIG.md for configuration file syntax.

libbyztime is the C client library for consuming time from byztimed. See its byztime.h file for API documentation. The byztime crate within this repo provides idiomatic Rust bindings to libbyztime and its documentation can be read on docs.rs.

Protocol overview

Although it is fundamentally a peer-to-peer protocol, Byztime uses a client-server communication pattern, with each node acting as both a client and a server to each other node. A client-only operation mode, wherein a node synchronizes itself to the consensus but does not vote in it, is also supported.

Byztime uses Network Time Security (NTS) for cryptographic link protection. Communication from each client to each server begins by the client initiating a TLS handshake and then using NTS-KE to negotiate shared keys and obtain NTS cookies. After NTS-KE is complete, the TLS connection closes and the remainder of the protocol runs over UDP. NTS provides message-level authenticated encryption. It provides replay protection for the client, but not for the server. The server never updates any state in response to a message from a client, so processing replays is harmless. For the remainder of this overview, we'll take the abstraction of authenticated encryption for granted and omit NTS-related machinery from our descriptions.

Each node is assumed to be equipped with a local clock which counts the time elapsed since some arbitrary epoch such as when the system last booted. One node's local clock has no a priori known relationship to another's. Rather, this relationship is discovered through the execution of the protocol. The shared time that nodes seek to synchronize to is called the global clock. Nodes maintain an estimate of their global offset, which is the difference between the global clock and their local clock. The local clock never receives any adjustments; only the global offset does.

The protocol proceeds by each node periodically sending a query to each of its peers, and the peer sending a response which includes a snapshot of its local clock and its current estimate of its global offset. Each query/response volley is called a measurement.

The protocol uses the following global parameters:

  1. N: the number of nodes participating in consensus.

  2. f: the number of faulty nodes that can be tolerated. f = floor((N-1)/3).

  3. drift: a dimensionless number giving an upper bound on how fast or slow a correct node's local clock might be. For example if drift is 50e-6 then the clock might drift by up to 50µs per second.

Each node keeps the following state:

  1. The era of its local clock. This is a randomly-generated identifier which changes if the local clock loses its state, e.g. after a reboot.

  2. global_offset: The node's estimate of the offset between the global clock and its local clock: local_clock() + global_offset == estimate of global clock.

  3. error: The maximum absolute difference between the above estimate of the global clock and its true value.

  4. last_update: The local clock time at which global_offset and error were last updated.

And then for each of its peers:

  1. peer.era: The peer's clock era as of the last time it communicated.

  2. peer.local_offset: The node's estimate of the offset between its local clock and the peer's local clock: local_clock() + peer.local_offset == estimate of peer's local clock.

  3. peer.global_offset: The peer's estimate of the offset between its own local clock and the global clock, as of the last time it communicated.

  4. peer.rtt: The round trip time of the current "best" measurement of the peer's clock. This measurement is the one on which peer.local_offset is based.

  5. peer.origin_time: The local clock time at which the query which led to the current best measurement was sent.

  6. peer.inflight_id: The random unique identifier associated with a query, if any, that is currently awaiting a response.

  7. peer.inflight_origin_time: The local clock time at which the current in-flight query (if any) was sent.

There is some additional state related to NTS — a cache of cookies and shared keys — which works basically the same way as it does for NTP and we'll disregard it for the purposes of this explanation.

At first run, nodes initialize global_offset such that the global clock matches their real-time clock. They periodically check the offset between the two and persist this offset to disk. This persisted value is used to recover a rough value with which to reinitialize global_offset after a reboot, but the error bounds on offsets recovered in this manner are considered infinite.

Once per configured polling interval, clients send a query message to each of their peers, containing just a randomly-generated unique identifier. The sender updates peer.inflight_id and peer.inflight_origin_time to reflect the content of the packet and the time at which it was sent. If there was already another query in flight, the old query is assumed to have been dropped by the network and the new inflight_id and inflight_origin_time values overwrite the old ones.

Servers respond immediately to any query they receive. The response contains:

  1. response_id: A copy of the query's unique identifier.

  2. response_local: A snapshot of the server's local clock.

  3. response_era: The server's era.

  4. response_global_offset: The server's global_offset.

When the client receives the response, it processes it as follows:

  1. Set now to a snapshot of the local clock at the moment the response was received.

  2. Verify that peer.inflight_id is non-null and matches response_id. If not, discard the response. Otherwise, set peer.inflight_id to null and continue.

  3. Set peer.global_offset to response_global_offset.

  4. Compute rtt as now - peer.inflight_origin_time.

  5. If this is the first response seen so far from this peer, or if peer.era does not match the era contained in the response, skip to step 8.

  6. Compute the following lower-is-better quality metric for the current best measurement we have from this peer: Q = peer.rtt/2 + 2 * drift * (now - peer.origin_time). This represents the worst-case error in estimating the offset between this node's local clock and the peer's local clock, taking into account network asymmetry and clock drift. Drift is multiplied by 2 because the two clocks could each be drifting in opposite directions.

  7. Compute this quality metric for the new measurement: Q' = rtt/2 + 2 * drift * (now - peer.inflight_origin_time). If Q' > Q, then the old measurement is better than the new one, so keep it and return without further processing.

  8. Set peer.rtt to rtt, peer.origin_time to peer.inflight_origin_time, and peer.era to response_era.

  9. Set peer.local_offset to response_clock + rtt/2 - now.

Now with the newly updated clock values from the peer, recompute global_offset and error:

  1. For each peer p, compute an estimate est = p.local_offset + p.global_offset and error err = p.rtt/2 + 2 * drift * (now - p.origin_time), giving an interval (est - err, est + err). Create lists of all resulting minima and maxima.

  2. If we ourselves are a participant in consensus, insert global_offset into the list of minima and the list of maxima.

  3. Sort both lists. Discard the f lowest minima and the f highest maxima. Let min' equal the lowest remaining minimum and max' equal the highest remaining maximum. Let global_offset' = (max' + min')/2. Let error' = (max' - min')/2. This averaging method — discarding the f highest and lowest and taking the midpoint of the remaining range — is due to "A New Fault-Tolerant Algorithm for Clock Synchronization" (Welch and Lynch 1988) and is crucial for achieving Byzantine fault tolerance.

  4. Determine whether the new global offset and error are consistent with the old one. Let age = now - last_update. Let min = global_offset - error and let max = global_offset + error. Let drift_limit = 2 * age * drift. Now check that min' > min - drift_limit and max' < max + drift_limit. If this check fails, return without further processing. (This step is not necessary for ensuring synchronization, but without it, a MitM adversary could cause time throughout the network to advance too quickly or too slowly, by delaying query messages but not response messages or vice versa.)

  5. Set last_update to now, global_offset to global_offset', and error to error'.

This completes our description of the protocol. Applications consuming time from Byztime query the current values of global_offset, error, and last_update. The global time is local_time + global_offset, with error bounds of ±(error + 2*drift*(local_time - last_update)).

Estimates of global time are not frequency-stable: they jump discontinuously with each update and can move backward. It's up to the application how to deal with this. libbyztime includes support for clamping the results of successive calls to get_global_time() to make them consistent with each other.

Caveats

Akamai has been using Byztime in business-critical applications since early 2020 and it has been very stable for us. However, until two specific issues are resolved, this software should be considered beta:

  1. We are likely to make some backward-incompatible changes to Byztime's wire protocol. Byztime currently uses NTS-KE codepoints in the Experimental Use range; we plan to obtain and use permanent allocations from IANA. We also will likely change the format of Byztime's time packets, currently based on Protobufs, to a bespoke fixed-field format, in order to make parsing more predictable and make it easier to ensure to that request size matches response size. We plan to have a single flag-day release that makes all these changes at once, and then commit to backward-compatibility thereafter.

  2. Some of Byztime's statistics-and-health reporting capabilities have have been removed for this open-source release because they depend on Akamai-internal infrastructure to function. We plan to redesign and reimplement this functionality around open standards.

You might also like...
a simple rust service for Scheduling commands execution on time basis, an easy alternative to cron

Tasker A Simple crate which provides a service and a configuration API for genrating commands based tasks ,on time basis. Installation build from sour

Rust implementation of the PTHash perfect hash function for static compile-time generated hash tables

QuickPHF QuickPHF is a Rust implementation of the PTHash minimal perfect hash function algorithm. It consists of two crates: quickphf - runtime code f

Automated Solana tool for quick arbitrage, customizable, with real-time data and wallet integration. Trade responsibly.
Automated Solana tool for quick arbitrage, customizable, with real-time data and wallet integration. Trade responsibly.

Solana Arbitrage Trading Tool The Solana Arbitrage Trading Tool is an automated solution crafted to spot and capitalize on arbitrage opportunities wit

My utils for long-lived, fault-tolerant rust tasks

Sisyphus Utilities for long-running, resilient tasks. This library contains code I wrote, found useful, and want to keep using. It aims to provide sys

A fault-tolerant, recursive-descent parser for Ara Programming Language 🌲

Ara Parser A fault-tolerant, recursive-descent parser for Ara Programming Language 🌲 Note: This project is a hard-fork of php-rust-tools/parser Speci

A handwritten fault-tolerant, recursive-descent parser for PHP written in Rust.

PHP-Parser A handwritten fault-tolerant, recursive-descent parser for PHP written in Rust. Warning - this is still alpha software and the public API i

Fault-tolerant Async Actors Built on Tokio
Fault-tolerant Async Actors Built on Tokio

Kameo 🧚🏻 Fault-tolerant Async Actors Built on Tokio Async: Built on tokio, actors run asyncronously in their own isolated spawned tasks. Supervision

Pure Rust Fault Proof Program that runs the rollup state-transition to verify an L2 output from L1 inputs.

palmtop palmtop is a fault proof program that runs the rollup state transition to verify an L2 output from L1 inputs. The verifiable L2 output can the

Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine
Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine

MeiliSearch Website | Roadmap | Blog | LinkedIn | Twitter | Documentation | FAQ ⚡ Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine 🔍 M

🍋: A General Lock following paper
🍋: A General Lock following paper "Optimistic Lock Coupling: A Scalable and Efficient General-Purpose Synchronization Method"

Optimistic Lock Coupling from paper "Optimistic Lock Coupling: A Scalable and Efficient General-Purpose Synchronization Method" In actual projects, th

A simple utility for multithreading a/synchronization

Flowync Quick Example use flowync::Flower; fn main() { let flower = Flower::i32, String::new(1); std::thread::spawn({ let handle =

Remedy is a multi-threaded rust-imap-maildir synchronization program

remedy Remedy is a multi-threaded rust-imap-maildir synchronization program. Please note that remedy is under heavy development. Current features: IMA

The most fundamental type for async synchronization: an intrusive linked list of futures

wait-list This crate provides WaitList, the most fundamental type for async synchronization. WaitList is implemented as an intrusive linked list of fu

Fuzzy Index for Python, written in Rust. Works like error-tolerant dict, keyed by a human input.

FuzzDex FuzzDex is a fast Python library, written in Rust. It implements an in-memory fuzzy index that works like an error-tolerant dictionary keyed b

Synchronization primitives for kernels.

hermit-sync hermit-sync provides synchronization primitives targeted at operating system kernels. For API documentation see the docs. License Licensed

Rust library for concurrent data access, using memory-mapped files, zero-copy deserialization, and wait-free synchronization.

mmap-sync mmap-sync is a Rust crate designed to manage high-performance, concurrent data access between a single writer process and multiple reader pr

Fully-typed, async, reusable state management and synchronization for Dioxus 🧬

dioxus-query 🦀 ⚡ Fully-typed, async, reusable state management and synchronization for Dioxus 🧬. Inspired by TanStack Query. ⚠️ Work in progress ⚠️

Synchronization primitives for both web and native.

wasm_sync wasm_sync offers synchronization primitives that work in both browser and native contexts. In web browsers, use of atomic wait instructions

Delightfully simple delay-tolerant networking

Blizzard: Delightfully simple delay-tolerant networking NOTE: I am in the process of updating this implementation to match a new spec. If you found th

Owner
Akamai Unofficial
Collection of unofficial Akamai tools and non-supported community projects.
Akamai Unofficial
🐴 RusTOTPony — CLI manager of one-time password generators aka Google Authenticator

?? RusTOTPony CLI manager of time-based one-time password generators. It is a desktop alternative for Google Authenticator. Installation Arch Linux Pa

German Lashevich 23 Jan 5, 2023
Cryptography-oriented big integer library with constant-time, stack-allocated (no_std-friendly) implementations of modern formulas

RustCrypto: Cryptographic Big Integers Pure Rust implementation of a big integer library which has been designed from the ground-up for use in cryptog

Rust Crypto 88 Dec 31, 2022
Pure-Rust traits and utilities for constant-time cryptographic implementations.

subtle Pure-Rust traits and utilities for constant-time cryptographic implementations. It consists of a Choice type, and a collection of traits using

dalek cryptography 196 Dec 13, 2022
BTM - Blockchain Time Machine.

BTM Blockchain Time Machine. BTM is an incremental data backup mechanism that does not require downtime. Why would you need this? btm will give you th

漢 3 Dec 15, 2022
Monorepo of our ahead-of-time implementation of Secure ECMAScript

Ahead-of-time Secure EcmaScript The monorepo contains a set of packages that helps adopt SES in a pre-compiled way. Security Assumptions This project

Dimension 13 Dec 21, 2022
Minimal compile-time Rust template engine

boilerplate boilerplate is a minimal compile-time Rust text template engine. Quick Start Add boilerplate to your project's Cargo.toml: [dependencies]

Casey Rodarmor 15 Dec 30, 2022
Rust compile-time type information experiment

Compile-Time Type Information This crate is an experimental standard library side implementation of potential ctti language feature. The idea is to pr

Auri 12 Jan 20, 2023
Rust library for practical time-lock encryption using `drand` threshold network

tlock-rs: Practical Timelock Encryption/Decryption in Rust This repo contains pure Rust implementation of drand/tlock scheme. It provides time-based e

Timofey 32 Jan 8, 2023
Rust encryption library for practical time-lock encryption.

tlock_age: Hybrid Timelock Encryption/Decryption in Rust tlock_age is a library to encrypt and decrypt age filekey using tlock scheme. It provides an

Thibault 5 Mar 29, 2023
Arkworks circuits for verifiable time-lock encryption

zk-timelock This repo contains arithmetic circuits for verifiable time-lock encryption made using arkworks-rs toolkit. For more details on such an enc

Timofey 68 Apr 5, 2023