Raft distributed consensus algorithm implemented in Rust.

Overview

Raft

Build Status Documentation Crates.io dependency status

Problem and Importance

When building a distributed system one principal goal is often to build in fault-tolerance. That is, if one particular node in a network goes down, or if there is a network partition, the entire cluster does not fall over. The cluster of nodes taking part in a distributed consensus protocol must come to agreement regarding values, and once that decision is reached, that choice is final.

Distributed Consensus Algorithms often take the form of a replicated state machine and log. Each state machine accepts inputs from its log, and represents the value(s) to be replicated, for example, a hash table. They allow a collection of machines to work as a coherent group that can survive the failures of some of its members.

Two well known Distributed Consensus Algorithms are Paxos and Raft. Paxos is used in systems like Chubby by Google, and Raft is used in things like tikv or etcd. Raft is generally seen as a more understandable and simpler to implement than Paxos.

Design

Raft replicates the state machine through logs. If you can ensure that all the machines have the same sequence of logs, after applying all logs in order, the state machine will reach a consistent state.

A complete Raft model contains 4 essential parts:

  1. Consensus Module, the core consensus algorithm module;

  2. Log, the place to keep the Raft logs;

  3. State Machine, the place to save the user data;

  4. Transport, the network layer for communication.

The design of the Raft crate

Note: This Raft implementation in Rust includes the core Consensus Module only, not the other parts. The core Consensus Module in the Raft crate is customizable, flexible, and resilient. You can directly use the Raft crate, but you will need to build your own Log, State Machine and Transport components.

Using the raft crate

You can use raft with either rust-protobuf or Prost to encode/decode gRPC messages. We use rust-protobuf by default. To use Prost, build (or depend on) Raft using the prost-codec feature and without default features.

Developing the Raft crate

Raft is built using the latest version of stable Rust, using the 2018 edition. Minimum supported version is 1.44.0.

Using rustup you can get started this way:

rustup component add clippy
rustup component add rustfmt

In order to have your PR merged running the following must finish without error:

cargo test --all && \
cargo clippy --all --all-targets -- -D clippy::all   && \
cargo fmt --all -- --check

You may optionally want to install cargo-watch to allow for automated rebuilding while editing:

cargo watch -s "cargo check"

Modifying Protobufs

See instructions in the proto subdirectory.

Benchmarks

We use Criterion for benchmarking.

It's currently an ongoing effort to build an appropriate benchmarking suite. If you'd like to help out please let us know! Interested?

You can run the benchmarks by installing gnuplot then running:

cargo bench

You can check target/criterion/report/index.html for plots and charts relating to the benchmarks.

You can check the performance between two branches:

git checkout master
cargo bench --bench benches -- --save-baseline master
git checkout other
cargo bench --bench benches -- --baseline master

This will report relative increases or decreased for each benchmark.

Acknowledgments

Thanks etcd for providing the amazing Go implementation!

Projects using the Raft crate

  • TiKV, a distributed transactional key value database powered by Rust and Raft.

Links for Further Research

Comments
  • Replace script with build script, use features

    Replace script with build script, use features

    See https://github.com/pingcap/kvproto/pull/349

    Part of https://github.com/pingcap/raft-rs/issues/177

    Signed-off-by: ice1000 [email protected]

    opened by ice1000 37
  • Porting prost migration to raft 0.4

    Porting prost migration to raft 0.4

    For TiKV usages. This is opened as a PR only for code review, please don't merge.

    Currently there's one assertion error in tests but the production code are (seem to) working.

    opened by ice1000 32
  • Add free() for raft node

    Add free() for raft node

    Close #419 and https://github.com/tikv/raft-rs/pull/448 .

    Add two API

    1. free_resources(id: u64)
    2. free_resources_for_all()

    Signed-off-by: jayzhan [email protected]

    opened by jayzhan211 19
  • batch MsgAppend entries

    batch MsgAppend entries

    #18 entries will be appended if there is an existing receiver MsgAppend message. let me know if anything needs to be added. Signed-off-by: balaji [email protected]

    Feature 
    opened by poonai 16
  • Initialize the env_logger in tests for tracing.

    Initialize the env_logger in tests for tracing.

    Previously it required some manual intervention to get the logs of a test.

    This makes it so, for example, RUST_LOG=raft=debug cargo test $SOME_TEST_NAME produces a full log.

    Enhancement 
    opened by Hoverbear 15
  • Optimize memory footprint of raft state machine

    Optimize memory footprint of raft state machine

    After introducing memory tracing in TiKV, we found that the memory footprint could be very large if there were too many raft state machine. Most of them came from the inflight vector. Supposing max_inflight_messages was set to 4096, and there were 3 peers, leader would need to allocate 3 * 4096 * 8 = 96K memory. If there were 100k regions, then the virt memory became 9.1 GiB. And most of the memory were idle and perhaps never visited again.

    There are many possible optimizations for this situation. On the one hand, if we can build a global flow control for all raft state machines, then the memory usage can be reduced; on the other hand, if we can reduce the total number of regions, the impact of flow control can also be omitted. But for the short term, I think it's easy to allocate lazily for inflight vector. We can use a linked list to allocate block by block and free them when it's idle. Because most raft state machines should be idle when it comes to 100k, so there will be only one block for all inflight vector. If each block is allocated as 512, then the memory footprint can be reduced to 572MiB, which is reasonably small enough.

    Enhancement Help Wanted 
    opened by BusyJay 14
  • Use a single set of Progresses for ProgressSet.

    Use a single set of Progresses for ProgressSet.

    The motivation for this PR may be unclear until you consider future work. But, shortly, this is partially in preparation for Joint Consensus.

    This is a step to allowing us to use different configurations (eg voter/learner topology) on top of the same Progress states. This will mean we can have ProgressSet be aware of its peer set changing, and able to have calls like self.prs().voters() or self.prs().learner_nodes() always remain accurate.

    Rationale

    Currently we store Progress data, which are of a non-trivial size, in two FxHashMaps in ProgressSet. They are holding voters and learners respectively. During add_node(), add_learner(), remove_node(), and promote_learner() we currently check sometimes multiple HashMaps, and at times need to move Progress values between maps. We don't need to do this.

    Typical use cases for ProgressSet include iterating over these peers (such as self.prs().voters()), fetching individual peers (such as self.prs().voters().get(id)), checking membership (self.prs().voters().contains_key()), removing nodes (either role), adding voters, adding learners, promoting learners, or removing learners.

    Scanning the entire set is done by chaining iterators over the two hashmaps already, so storing them in a single HashMap has no penalty. Returning Iterators where possible instead of realized structures means we can possibly save on allocations (you can do self.prs().voters().any(_) faster now).

    Scanning over a specific subset (eg voter, learner) is slightly slower since the iterator returned is internally filtering out non-voter/non-learners. Since most node sets are fairly small, and the check is a simple bool, this performance change is not dramatic. (This can be possibly optimized, but I'm not sure it's worth the complexity).

    Fetching individual nodes was optimized since it is always a lookup now, not possibly two lookups (in the case of getting a node without the specifier).

    Removing nodes should also be faster (single lookup).

    Promoting Learners now is just changing a boolean (but this will likely change later for Joint Consensus).

    Future Work

    In the next PR my thought is to introduce poll and quorum logic to ProgressSet, instead of being in Raft (Raft can still proxy to them). Why? When we enter a joint consensus (#101 ) state we no longer can just check if # of votes >= quorum. The cluster needs to account for the votes of nodes in two separate configurations. So the check for elections is # of votes >= quorum or (# votes_old >= quorum_old) && (# votes_new >= quorum_new) depending on the state of the system.

    I'd like to follow with a PR to have ProgressSet hold a ConfState (a vec of voters, and a vec of learners in a protobuf), and this FxHashMap<u64, Progress> that ProgressSet already holds. This will allow us to in the future use an enum inside the ProgressSet:

    enum Configuration {
        Stable(ConfState),
        Joint { old: ConfState, new: ConfState },
    }
    

    It may be wise to use a different datatype for the ConfStates than the protobuf though. This needs to be explored. Perhaps a FxHashMap<u64, Weak<Progress>> pointing back into the actual Progress, so we can just follow the pointer.

    This will allow us to quickly pass lists of IDs to callers and compose iterators of progress with chains, mapping into the HashMap. (If we do this, the Weak/Rc idea above is more compelling perhaps).

    Performance Impact

    There were no major performance impacts. A few benchmarks were slightly slower or slightly faster, but no remarkable impacts. Note many tests are a small % slower due to the additional allocation involved since we're now holding a bit of extra data.

       Compiling raft v0.3.1 (file:///home/hoverbear/git/raft-rs)
        Finished release [optimized] target(s) in 7.01s
         Running target/release/deps/benches-b3f4a3ccb54b1af9
    Raft::new (0, 0)        time:   [654.68 ns 658.33 ns 663.60 ns]                              
                            change: [+2.4424% +4.1041% +5.5835%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 9 outliers among 100 measurements (9.00%)
      2 (2.00%) high mild
      7 (7.00%) high severe
    
    Raft::new (3, 1)        time:   [1.0552 us 1.0571 us 1.0592 us]                              
                            change: [+5.3398% +7.4368% +9.4033%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 13 outliers among 100 measurements (13.00%)
      3 (3.00%) high mild
      10 (10.00%) high severe
    
    Raft::new (5, 2)        time:   [1.2448 us 1.2469 us 1.2492 us]                              
                            change: [+3.0618% +5.4548% +7.7207%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 18 outliers among 100 measurements (18.00%)
      9 (9.00%) low severe
      2 (2.00%) low mild
      2 (2.00%) high mild
      5 (5.00%) high severe
    
    Raft::new (7, 3)        time:   [1.5091 us 1.5169 us 1.5285 us]                              
                            change: [+6.9430% +8.0638% +9.3565%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 6 outliers among 100 measurements (6.00%)
      1 (1.00%) high mild
      5 (5.00%) high severe
    
    Raft::campaign (3, 1, CampaignPreElection)                                                                             
                            time:   [1.6698 us 1.6796 us 1.6914 us]
                            change: [+7.1247% +8.9953% +11.082%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 3 outliers among 100 measurements (3.00%)
      3 (3.00%) high severe
    
    Raft::campaign (3, 1, CampaignElection)                                                                             
                            time:   [1.8421 us 1.8543 us 1.8701 us]
                            change: [-0.7334% +0.9614% +2.9925%] (p = 0.36 > 0.05)
                            No change in performance detected.
    Found 3 outliers among 100 measurements (3.00%)
      1 (1.00%) high mild
      2 (2.00%) high severe
    
    Raft::campaign (3, 1, CampaignTransfer)                                                                             
                            time:   [1.9347 us 1.9791 us 2.0401 us]
                            change: [+4.6167% +6.8336% +9.7574%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 6 outliers among 100 measurements (6.00%)
      1 (1.00%) high mild
      5 (5.00%) high severe
    
    Raft::campaign (5, 2, CampaignPreElection)                                                                             
                            time:   [2.3698 us 2.3725 us 2.3759 us]
                            change: [-0.5268% +0.8138% +2.1554%] (p = 0.24 > 0.05)
                            No change in performance detected.
    Found 5 outliers among 100 measurements (5.00%)
      4 (4.00%) high mild
      1 (1.00%) high severe
    
    Raft::campaign (5, 2, CampaignElection)                                                                             
                            time:   [2.5981 us 2.6072 us 2.6228 us]
                            change: [+5.1047% +6.6554% +8.2343%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 5 outliers among 100 measurements (5.00%)
      1 (1.00%) high mild
      4 (4.00%) high severe
    
    Raft::campaign (5, 2, CampaignTransfer)                                                                             
                            time:   [2.7336 us 2.7424 us 2.7546 us]
                            change: [-2.8336% -1.6015% -0.2993%] (p = 0.01 < 0.05)
                            Change within noise threshold.
    Found 4 outliers among 100 measurements (4.00%)
      1 (1.00%) high mild
      3 (3.00%) high severe
    
    Raft::campaign (7, 3, CampaignPreElection)                                                                             
                            time:   [3.0390 us 3.0577 us 3.0808 us]
                            change: [+6.2413% +8.4249% +10.575%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 2 outliers among 100 measurements (2.00%)
      1 (1.00%) high mild
      1 (1.00%) high severe
    
    Raft::campaign (7, 3, CampaignElection)                                                                             
                            time:   [3.3627 us 3.3815 us 3.4033 us]
                            change: [+3.5173% +5.6968% +7.8047%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 2 outliers among 100 measurements (2.00%)
      2 (2.00%) high severe
    
    Raft::campaign (7, 3, CampaignTransfer)                                                                             
                            time:   [3.5182 us 3.5326 us 3.5498 us]
                            change: [+0.1086% +1.5806% +3.0729%] (p = 0.04 < 0.05)
                            Change within noise threshold.
    Found 3 outliers among 100 measurements (3.00%)
      1 (1.00%) high mild
      2 (2.00%) high severe
    
    RawNode::new            time:   [957.90 ns 962.74 ns 968.28 ns]                          
                            change: [+0.4930% +2.1651% +3.6894%] (p = 0.01 < 0.05)
                            Change within noise threshold.
    Found 5 outliers among 100 measurements (5.00%)
      3 (3.00%) high mild
      2 (2.00%) high severe
    
    Progress::default       time:   [2.7608 ns 2.7694 ns 2.7820 ns]                               
                            change: [+3.4533% +4.0793% +4.7229%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 2 outliers among 100 measurements (2.00%)
      2 (2.00%) high severe
    
    ProgressSet::new (0, 0) time:   [24.331 ns 24.464 ns 24.622 ns]                                    
                            change: [+12.726% +13.755% +14.741%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    ProgressSet::new (3, 1) time:   [77.597 ns 77.870 ns 78.208 ns]                                    
                            change: [+4.2794% +5.0575% +5.8202%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 5 outliers among 100 measurements (5.00%)
      3 (3.00%) high mild
      2 (2.00%) high severe
    
    ProgressSet::new (5, 2) time:   [78.236 ns 78.473 ns 78.770 ns]                                    
                            change: [+4.1984% +5.2535% +6.2644%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) high mild
    
    ProgressSet::new (7, 3) time:   [78.297 ns 78.467 ns 78.690 ns]                                    
                            change: [+3.2772% +4.1558% +5.3449%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 4 outliers among 100 measurements (4.00%)
      1 (1.00%) high mild
      3 (3.00%) high severe
    
    ProgressSet::insert_voter (0, 0)                                                                            
                            time:   [82.880 ns 83.059 ns 83.289 ns]
                            change: [-2.3601% -1.5005% -0.6649%] (p = 0.00 < 0.05)
                            Change within noise threshold.
    Found 3 outliers among 100 measurements (3.00%)
      3 (3.00%) high mild
    
    ProgressSet::insert_voter (3, 1)                                                                             
                            time:   [195.68 ns 196.88 ns 198.43 ns]
                            change: [+3.8072% +5.1527% +6.6821%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 4 outliers among 100 measurements (4.00%)
      3 (3.00%) high mild
      1 (1.00%) high severe
    
    ProgressSet::insert_voter (5, 2)                                                                             
                            time:   [238.59 ns 240.82 ns 243.56 ns]
                            change: [+9.6677% +11.744% +13.744%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 3 outliers among 100 measurements (3.00%)
      1 (1.00%) high mild
      2 (2.00%) high severe
    
    ProgressSet::insert_voter (7, 3)                                                                             
                            time:   [250.93 ns 252.21 ns 253.76 ns]
                            change: [+9.5544% +10.656% +11.673%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    ProgressSet::insert_learner (0, 0)                                                                             
                            time:   [84.346 ns 84.675 ns 85.032 ns]
                            change: [+4.0566% +5.3939% +6.6914%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 4 outliers among 100 measurements (4.00%)
      2 (2.00%) high mild
      2 (2.00%) high severe
    
    ProgressSet::insert_learner (3, 1)                                                                             
                            time:   [194.62 ns 195.77 ns 197.10 ns]
                            change: [+8.5062% +9.9784% +11.317%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    ProgressSet::insert_learner (5, 2)                                                                             
                            time:   [216.15 ns 217.55 ns 219.23 ns]
                            change: [+9.6463% +11.074% +12.443%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 3 outliers among 100 measurements (3.00%)
      2 (2.00%) high mild
      1 (1.00%) high severe
    
    ProgressSet::insert_learner (7, 3)                                                                             
                            time:   [236.09 ns 237.22 ns 238.59 ns]
                            change: [+6.0486% +7.3821% +8.6658%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 2 outliers among 100 measurements (2.00%)
      1 (1.00%) high mild
      1 (1.00%) high severe
    
    ProgressSet::promote (0, 0)                                                                            
                            time:   [30.150 ns 30.321 ns 30.517 ns]
                            change: [-0.8956% +0.9434% +2.8369%] (p = 0.32 > 0.05)
                            No change in performance detected.
    Found 2 outliers among 100 measurements (2.00%)
      2 (2.00%) high mild
    
    ProgressSet::promote (3, 1)                                                                             
                            time:   [174.10 ns 174.50 ns 175.02 ns]
                            change: [-1.6114% +0.7134% +2.4685%] (p = 0.57 > 0.05)
                            No change in performance detected.
    Found 4 outliers among 100 measurements (4.00%)
      1 (1.00%) high mild
      3 (3.00%) high severe
    
    ProgressSet::promote (5, 2)                                                                             
                            time:   [206.81 ns 207.74 ns 208.90 ns]
                            change: [+12.212% +13.224% +14.202%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    ProgressSet::promote (7, 3)                                                                             
                            time:   [225.56 ns 227.15 ns 229.03 ns]
                            change: [-0.7211% +1.0632% +2.8948%] (p = 0.25 > 0.05)
                            No change in performance detected.
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) high mild
    
    ProgressSet::remove (0, 0)                                                                            
                            time:   [65.105 ns 65.554 ns 66.085 ns]
                            change: [+8.8515% +10.210% +11.544%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    ProgressSet::remove (3, 1)                                                                             
                            time:   [210.29 ns 211.60 ns 213.16 ns]
                            change: [+7.3192% +8.9684% +10.540%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    ProgressSet::remove (5, 2)                                                                             
                            time:   [312.97 ns 319.75 ns 327.84 ns]
                            change: [+40.410% +43.287% +45.910%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    ProgressSet::remove (7, 3)                                                                             
                            time:   [292.14 ns 294.15 ns 296.60 ns]
                            change: [+22.410% +23.951% +25.465%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 2 outliers among 100 measurements (2.00%)
      1 (1.00%) high mild
      1 (1.00%) high severe
    
    ProgressSet::iter (0, 0)                                                                            
                            time:   [32.426 ns 32.682 ns 32.986 ns]
                            change: [+19.914% +21.344% +22.678%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    ProgressSet::iter (3, 1)                                                                             
                            time:   [215.07 ns 217.18 ns 219.64 ns]
                            change: [+24.669% +26.020% +27.425%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 3 outliers among 100 measurements (3.00%)
      3 (3.00%) high mild
    
    ProgressSet::iter (5, 2)                                                                             
                            time:   [232.52 ns 234.99 ns 237.96 ns]
                            change: [+18.472% +20.308% +21.988%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 5 outliers among 100 measurements (5.00%)
      5 (5.00%) high mild
    
    ProgressSet::iter (7, 3)                                                                             
                            time:   [255.43 ns 262.34 ns 272.19 ns]
                            change: [+14.089% +16.026% +18.440%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 4 outliers among 100 measurements (4.00%)
      1 (1.00%) high mild
      3 (3.00%) high severe
    
    ProgressSet::get (0, 0) time:   [24.563 ns 24.764 ns 25.007 ns]                                    
                            change: [+19.712% +21.106% +22.553%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    ProgressSet::get (3, 1) time:   [178.88 ns 180.24 ns 181.89 ns]                                     
                            change: [+17.606% +19.109% +20.784%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) high mild
    
    ProgressSet::get (5, 2) time:   [198.47 ns 199.85 ns 201.50 ns]                                     
                            change: [+17.484% +18.857% +20.212%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    ProgressSet::get (7, 3) time:   [222.11 ns 223.86 ns 226.02 ns]                                     
                            change: [+8.8628% +10.119% +11.473%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    ProgressSet::nodes (0, 0)                                                                            
                            time:   [30.588 ns 30.741 ns 30.935 ns]
                            change: [+12.690% +14.065% +15.416%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    ProgressSet::nodes (3, 1)                                                                             
                            time:   [199.91 ns 201.45 ns 203.25 ns]
                            change: [+12.052% +13.444% +14.856%] (p = 0.00 < 0.05)
                            Performance has regressed.
    
    ProgressSet::nodes (5, 2)                                                                             
                            time:   [231.50 ns 233.10 ns 235.04 ns]
                            change: [+9.1243% +10.754% +12.436%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) high severe
    
    ProgressSet::nodes (7, 3)                                                                             
                            time:   [250.84 ns 252.33 ns 254.05 ns]
                            change: [+2.3537% +3.7544% +5.0726%] (p = 0.00 < 0.05)
                            Performance has regressed.
    Found 1 outliers among 100 measurements (1.00%)
      1 (1.00%) high mild
    
    
    Enhancement Feature 
    opened by Hoverbear 14
  • Convert `Storage::entries`'s `max_size` argument to `Option<u64>`

    Convert `Storage::entries`'s `max_size` argument to `Option`

    Currently the value is just a u64 with u64::MAX defined as NO_LIMIT. This no limit is not mentioned in the documentation. Something like the following enum would suite better:

    enum Size {
        Fixed(u64),
        NoLimit,
    }; 
    

    I could provide a PR, if wanted.

    Edit (brson): per discussion, let's change this to Option<u64> with a doc-comment explaining that None means "no limit". The Storage trait is in src/storage.rs.

    Help Wanted Good First Issue Documentation 
    opened by bkchr 14
  • make configuration change effective on received

    make configuration change effective on received

    Make configuration changes effective on received.

    Some test cases about membership change are removed:

    • tests in test_membership_change::api. Because now these API are removed.
    • pending_delete_fails_after_begin, pending_create_with_quorum_fails_after_begin, pending_create_and_destroy_both_fail. They are covered by minority_followers_halt.
    opened by hicqu 13
  • Consider priority during election

    Consider priority during election

    There may be differences for several nodes of a cluster, like some nodes are in a far away data center, or some nodes have poor hardware configurations. In such case we would want to make them have less chance to become a leader.

    After pingcap/raft-rs#63, we support configuring different randomized timeout range for different nodes, which can make it hard to become leader with a larger timeout. However it can still cause noises. Because as long as the nodes have enough logs, it can still become leader even if other preferred nodes have the same logs.

    So I propose to add a priority for every node. A follower votes for a candidate only when one of following prerequisites is met:

    1. Candidate has more logs;
    2. Candidate has the same logs and its priority is not less than the follower.

    This policy is expected to work well when the nodes that are preferred to become leaders can form a quorum. In such case a non-preferred node can become leader only when it has more logs. However a node is not preferred usually means that it takes more time to catch up log for high network latency or poor performance as described above.

    Enhancement Request for Comment 
    opened by BusyJay 13
  • *: add protection against unlimited log growth(#131)

    *: add protection against unlimited log growth(#131)

    • Add protection against uncommitted log growth of leader. This protection is done by estimate size of entry data rather than count of entry. So entry with no data will never be dropped
    • Add tests for raft and raw_node
    • fix wrong clippy command in CONTRIBUTING.md
    • close #131

    Signed-off-by: lyzongyuan [email protected]

    opened by c0x0o 12
  • `handle_committed_entries` after saving hard state in examples

    `handle_committed_entries` after saving hard state in examples

    Changes

    handle_committed_entries after saving hard state in examples.

    Reason

    Projects might base their consensus thread structure looking at these examples. In production environment most of the projects will not use in memory storage, but will save states on disk. And it is very important to save applied index increase only after the committed index (in hard state) was saved. Otherwise on restart this precondition might be broken, if program stopped in between the lines saving applied index and hard state. In that case Raft lib will fail with panic.

    opened by e-ivkov 4
  • How do I use the `ConfChangeType::AddLearnerNode` in `propose_conf_change`?

    How do I use the `ConfChangeType::AddLearnerNode` in `propose_conf_change`?

    Specifically, I want to add a node to the raft group first. But I first add a learner node, and then when the log of the learner node is up-to-date, I make it become the voter.

    I see the ConfChangeType::AddLearnerNode in this library, but I'm so sorry because I'm not sure how to use it.

    Or the library itself implement the steps I've described? As long as I use ConfChangeType::AddNode. 😂

    opened by pidb 4
  • Add new message type MsgGroupBroadcast and corresponding handler

    Add new message type MsgGroupBroadcast and corresponding handler

    Signed-off-by: LintianShi [email protected]

    Modify raft-rs to support follower replication in TiKV. Main change:

    • Add option of follower replication for RawNode and Config
    • Add new message type: MsgGroupBroadcast and an extra field forwards in Message, which contains information that needs to be forwarded.
    • Add handler for MsgGroupBroadcast, which appends entries to local log and forward MsgAppend to other peers.
    opened by LintianShi 4
  • Fix a broken anchor link

    Fix a broken anchor link

    The origin link(https://docs.rs/raft/latest/raft/examples/single_mem_node/main.rs#L113-L179) is broken.

    The path of the corresponding code in docs.rs is (https://docs.rs/crate/raft/0.6.0/source/examples/single_mem_node/main.rs), but the line number range cannot be specified.

    I changed the link to point to the corresponding code on Github.

    opened by IDJack 0
  • raft:Fix one log append reject bug about MsgUnreachable msg

    raft:Fix one log append reject bug about MsgUnreachable msg

    Problem Description: close tikv/tikv#11371

    1. When follower reject the appendlog requests, it will send a MsgUnreachable msg to leader
    2. But the MsgUnreachable is a local type which means only be used in inner raftcore, cannot transfered by net, which leads to some logic check failed.

    Solution:

    1. Let the MsgUnreachable be a response message, and when leader receives, call the raft core step directly.
    opened by tier-cap 2
  • Using PROST in `Error`

    Using PROST in `Error`

    Seems like if the crate is built with prost-codec feature - then prost error should be there https://github.com/tikv/raft-rs/blob/2357cb22760719bcd107a90d1e64ef505bdb1e15/src/errors.rs#L27

    opened by e-ivkov 1
Releases(v0.6.0)
  • v0.6.0(Jun 16, 2021)

    • Joint Consensus became a stable feature (#379, #380, #382, #383, #385, #386, #411)
    • Ported aggresive flow control from etcd/raft (#354)
    • Introduced group commit to force geo replication safety (#359)
    • Harden read index (#355, #363)
    • Support limiting uncommited logs (#398)
    • Support asynchronous ready (#403, #410, #417, #433)
    • Fast log append rejection (#367)
    • bytes::Bytes is used for protos if feature protobuf-codec is enabled (by default) (#438)
    • Switched to thiserror (#435)
    • Implemented committed entries pagination (#440)
    Source code(tar.gz)
    Source code(zip)
  • v0.4.3(May 8, 2019)

  • v0.4.2(Apr 29, 2019)

    • Fix potential two leaders at the same term when transferring leader. (https://github.com/pingcap/raft-rs/pull/225)
    • Support StatusRef that doesn't clone. (https://github.com/pingcap/raft-rs/pull/227)
    Source code(tar.gz)
    Source code(zip)
  • v0.4.1(Apr 23, 2019)

    • Migrate from Rust-protobuf to Prost (https://github.com/pingcap/raft-rs/pull/204)

    Please note: This is a point release intended for TiKV. It's not intended for general usage. We recommend you use 0.5.0.

    Source code(tar.gz)
    Source code(zip)
  • v0.5.0(Apr 23, 2019)

    • Introduced an experimental Joint Consensus based arbitrary membership change feature. (https://github.com/pingcap/raft-rs/pull/101)
    • Harmonized protobuf dependency to match important downstreams. (https://github.com/pingcap/raft-rs/pull/181)
    • Unified the Progress collections inside ProgressSet. (https://github.com/pingcap/raft-rs/pull/108)
    • Raft::new() now returns a Result. (https://github.com/pingcap/raft-rs/pull/122)
    • Removed the Progress.is_learner field. Check via function in ProgressSet instead. (https://github.com/pingcap/raft-rs/pull/119)
    • Added Appvevor. Added then removed bors. (https://github.com/pingcap/raft-rs/pull/137, https://github.com/pingcap/raft-rs/pull/134)
    • Introduced getters and setters for various Ready fields. (https://github.com/pingcap/raft-rs/pull/120)
    • Reduced memory allocation on reset. (https://github.com/pingcap/raft-rs/pull/130)
    • Added issue templates, more links. (https://github.com/pingcap/raft-rs/pull/133, https://github.com/pingcap/raft-rs/pull/126)
    • Moved poll and quorum checking functionality into ProgressSet. (https://github.com/pingcap/raft-rs/pull/121)
    • The leader is now trivially in the replicate state. (https://github.com/pingcap/raft-rs/pull/146)
    • Fixed a problem with lease based read-only requests interacting with check_quorum wrong. (https://github.com/pingcap/raft-rs/pull/141)
    • Corrected the single_mem_node example. (https://github.com/pingcap/raft-rs/pull/162)
    • Fixed typos. (https://github.com/pingcap/raft-rs/pull/159)
    • Adopted Hashbrown over FxHash. (https://github.com/pingcap/raft-rs/pull/160)
    • Corrected learner checking in handle_transfer_leader. (https://github.com/pingcap/raft-rs/pull/165)
    • Resolved some lints (https://github.com/pingcap/raft-rs/pull/174, https://github.com/pingcap/raft-rs/pull/168, https://github.com/pingcap/raft-rs/pull/142, https://github.com/pingcap/raft-rs/pull/124)
    • Fixed uses of #[feature(_)] so that we can build on stable cleanly. (https://github.com/pingcap/raft-rs/pull/180)
    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Sep 21, 2018)

    • No longer scan the raft log when becoming a leader. (https://github.com/pingcap/raft-rs/pull/100)
    • Added the ability to skip broadcast commit at runtime. (https://github.com/pingcap/raft-rs/pull/115)
    • Documented all public API. (https://github.com/pingcap/raft-rs/pull/87)
    • Refined a few points in the API in preparation for more work. (https://github.com/pingcap/raft-rs/pull/102)
    • Configuration logic was moved into its own module. (https://github.com/pingcap/raft-rs/pull/91)
    • Added fail-rs based tests. (https://github.com/pingcap/raft-rs/pull/114)
    • Added benchmarking using criterion. (https://github.com/pingcap/raft-rs/pull/110)
    • Expanded tested examples. (https://github.com/pingcap/raft-rs/pull/118)
    • Improved documentation. (https://github.com/pingcap/raft-rs/pull/106)
    • Refined the CI scripts to ensure strict linting. (https://github.com/pingcap/raft-rs/pull/117)
    • Tests now output logs. Configure it with RUST_LOG=raft=info. (https://github.com/pingcap/raft-rs/pull/103)
    • Eased the log dependency. (https://github.com/pingcap/raft-rs/pull/116)
    • Formatting updates. (https://github.com/pingcap/raft-rs/pull/104)
    • Updated some dependencies. (https://github.com/pingcap/raft-rs/pull/97)
    • Use the clippy preview from Rustup. (https://github.com/pingcap/raft-rs/pull/95)
    • Adopted a Code of Conduct. (https://github.com/pingcap/raft-rs/pull/107)
    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Jul 12, 2018)

    • Bugfix: Reset leader_id when becoming precandidate to resolve prevote and check_quorum compatability (https://github.com/pingcap/raft-rs/pull/84)
    • Bugfix: Becoming a precandidate should reset votes (https://github.com/pingcap/raft-rs/pull/83)
    • Fix some typos, improve variable naming, and other small documentation fixes (https://github.com/pingcap/raft-rs/pull/77, https://github.com/pingcap/raft-rs/pull/79, https://github.com/pingcap/raft-rs/pull/78, https://github.com/pingcap/raft-rs/pull/80)
    • Implement Default for Config and fleshed out an example (https://github.com/pingcap/raft-rs/pull/81)
    • Improve our changelog format (https://github.com/pingcap/raft-rs/pull/85)
    • Remove custom Rustfmt configuration (https://github.com/pingcap/raft-rs/pull/86)
    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Jun 11, 2018)

    • Support configuring the election timeout range (https://github.com/pingcap/raft-rs/pull/63).
    • Keep compatible with rust-protobuf 2.0 (https://github.com/pingcap/raft-rs/pull/64, https://github.com/pingcap/raft-rs/pull/75)
    • Made Raft now Send (https://github.com/pingcap/raft-rs/pull/67)
    • Added documentation examples (https://github.com/pingcap/raft-rs/pull/69)
    • Fixed a deadlock in the prevote migration process (https://github.com/pingcap/raft-rs/pull/42)
    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Jun 8, 2018)

    • Deprecated sync-log and add context (https://github.com/pingcap/raft-rs/pull/59)
    • Fix learner isolation bug (https://github.com/pingcap/raft-rs/pull/58)
    • Port several tests (https://github.com/pingcap/raft-rs/pull/54, https://github.com/pingcap/raft-rs/pull/41)
    • Add examples (https://github.com/pingcap/raft-rs/pull/44)
    • Use fxhash (https://github.com/pingcap/raft-rs/pull/48)
    Source code(tar.gz)
    Source code(zip)
Owner
TiKV Project
TiKV Project
open source training courses about distributed database and distributed systemes

Welcome to learn Talent Plan Courses! Talent Plan is an open source training program initiated by PingCAP. It aims to create or combine some open sour

PingCAP 8.3k Dec 30, 2022
Another minimal Raft implementation in Rust.

raft-rs Not my first time implementing Raft. I wrote about another implementation in Go I did. But you don't learn a concept well until you've impleme

Phil Eaton 43 Dec 15, 2023
A raft framework, for regular people

RmqttRaft - A raft framework, for regular people This is an attempt to create a layer on top of tikv/raft-rs, that is easier to use and implement. Thi

null 16 Dec 21, 2022
Distributed SQL database in Rust, written as a learning project

toyDB Distributed SQL database in Rust, written as a learning project. Most components are built from scratch, including: Raft-based distributed conse

Erik Grinaker 4.6k Jan 8, 2023
The rust client for CeresDB. CeresDB is a high-performance, distributed, schema-less, cloud native time-series database that can handle both time-series and analytics workloads.

The rust client for CeresDB. CeresDB is a high-performance, distributed, schema-less, cloud native time-series database that can handle both time-series and analytics workloads.

null 12 Nov 18, 2022
This is a small demo of how to transform a simple single-server RocksDB service written in Rust into a distributed version using OmniPaxos.

OmniPaxos Demo This is a small demo of how to transform a simple single-server RocksDB service into a distributed version using OmniPaxos. Related res

Harald Ng 6 Jun 28, 2023
High performance and distributed KV store w/ REST API. 🦀

About Lucid KV High performance and distributed KV store w/ REST API. ?? Introduction Lucid is an high performance, secure and distributed key-value s

Lucid ᵏᵛ 306 Dec 28, 2022
Distributed transactional key-value database, originally created to complement TiDB

Website | Documentation | Community Chat TiKV is an open-source, distributed, and transactional key-value database. Unlike other traditional NoSQL sys

TiKV Project 12.4k Jan 3, 2023
small distributed database protocol

clepsydra Overview This is a work-in-progress implementation of a core protocol for a minimalist distributed database. It strives to be as small and s

Graydon Hoare 19 Dec 2, 2021
Aggregatable Distributed Key Generation

Aggregatable DKG and VUF WARNING: this code should not be used in production! Implementation of Aggregatable Distributed Key Generation, a distributed

Kobi Gurkan 38 Nov 30, 2022
Garage is a lightweight S3-compatible distributed object store

Garage [ Website and documentation | Binary releases | Git repository | Matrix channel ] Garage is a lightweight S3-compatible distributed object stor

Deuxfleurs 156 Dec 30, 2022
A scalable, distributed, collaborative, document-graph database, for the realtime web

is the ultimate cloud database for tomorrow's applications Develop easier. Build faster. Scale quicker. What is SurrealDB? SurrealDB is an end-to-end

SurrealDB 16.9k Jan 8, 2023
Canary - Distributed systems library for making communications through the network easier, while keeping minimalism and flexibility.

Canary Canary is a distributed systems and communications framework, focusing on minimalism, ease of use and performance. Development of Canary utiliz

null 28 Nov 3, 2022
Distributed, version controlled, SQL database with cryptographically verifiable storage, queries and results. Think git for postgres.

SDB - SignatureDB Distributed, version controlled, SQL database with cryptographically verifiable storage, queries and results. Think git for postgres

Fremantle Industries 5 Apr 26, 2022
Embedded Distributed Encrypted Database (Research).

EDED Embedded Distributed Encrypted Database. Research projects to support ESSE. WIP Distributed design features Adapt to personal distributed usecase

Sun 2 Jan 6, 2022
A high-performance, distributed, schema-less, cloud native time-series database

CeresDB is a high-performance, distributed, schema-less, cloud native time-series database that can handle both time-series and analytics workloads.

null 1.8k Dec 30, 2022
Distributed, MVCC SQLite that runs on FoundationDB.

mvSQLite Distributed, MVCC SQLite that runs on top of FoundationDB. Documentation mvSQLite Features Releases Quick reference Try it Contributing Featu

Heyang Zhou 1.1k Jul 3, 2023
A Distributed SQL Database - Building the Database in the Public to Learn Database Internals

Table of Contents Overview Usage TODO MVCC in entangleDB SQL Query Execution in entangleDB entangleDB Raft Consensus Engine What I am trying to build

Sarthak Dalabehera 38 Jan 2, 2024
implementation of an incremental compilation algorithm similar to rustc's

Incremental Query This library wasn't ever supposed to be a library. It started with me wanting to learn how rustc's query system worked, and I decide

Jonathan Dönszelmann 7 Aug 29, 2024