
[State Sync] Add a new Storage Service implementation (server-side) #9199

Merged
merged 2 commits on Sep 17, 2021

Conversation

JoshLind (Contributor)

Motivation

This PR adds a new server-side implementation for the Storage Service. The Storage Service is an inter-node service allowing Diem nodes to request storage data (e.g., transactions and proofs) directly from other nodes over the Diem network. This PR sets up the framework (template) for the service by implementing the server-side request/response processing plumbing and outlining the required interface between the server and the local Diem DB.

To achieve this, the PR offers 2 commits:

  1. Add the new server implementation (as a library to later be exposed over the network)
  2. Add several (simple) tests for the server implementation. These will need to be improved and expanded in the future.

Notes:

  • The implementation offered here is by no means complete (and you'll notice many TODOs). Instead, this PR is meant to set up the framework so that we can start building on the server-side implementation and make the Storage Service a living thing 😄
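To make the framework concrete, here is a minimal sketch of the server-side plumbing this PR sets up. The request/response variants and the StorageReaderInterface method shown are illustrative assumptions; only the overall shape (and the handle_request signature discussed below) follows the PR.

// Illustrative request/response/error types; the real definitions live in the
// new storage-service types crate (the exact variants here are assumptions).
pub enum StorageServiceRequest {
    GetTransactions { start_version: u64, num_transactions: u64 },
}

pub enum StorageServiceResponse {
    Transactions(Vec<u8>), // placeholder payload
}

#[derive(Debug)]
pub enum Error {
    UnexpectedErrorEncountered(String),
}

// A stand-in for the StorageReaderInterface the server expects from local storage.
pub trait StorageReaderInterface {
    fn get_transactions(&self, start_version: u64, num_transactions: u64) -> Result<Vec<u8>, Error>;
}

// The server plumbs each inbound request to the storage reader and wraps the
// result in a response.
pub struct StorageServiceServer<T: StorageReaderInterface> {
    storage: T,
}

impl<T: StorageReaderInterface> StorageServiceServer<T> {
    pub fn new(storage: T) -> Self {
        Self { storage }
    }

    pub fn handle_request(
        &self,
        request: StorageServiceRequest,
    ) -> Result<StorageServiceResponse, Error> {
        match request {
            StorageServiceRequest::GetTransactions { start_version, num_transactions } => {
                let transactions = self.storage.get_transactions(start_version, num_transactions)?;
                Ok(StorageServiceResponse::Transactions(transactions))
            }
        }
    }
}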

Have you read the Contributing Guidelines on pull requests?

Yes.

Test Plan

A set of simple unit tests has been added for this functionality.

Related PRs

None, but this PR relates to: #8906

@JoshLind JoshLind requested a review from a team as a code owner September 16, 2021 16:54
@bors-libra bors-libra added this to In Review in bors Sep 16, 2021
/// The underlying implementation of the StorageReaderInterface, used by the
/// storage server.
pub struct StorageReader {
    storage: Arc<RwLock<DbReaderWriter>>,
@msmouse msmouse Sep 16, 2021

Does it ever write to the DB? Why not an Arc<dyn DbReader>?

And why is the RwLock needed?

@JoshLind JoshLind (Contributor Author) Sep 16, 2021

@msmouse, it doesn't ever write to the DB. However, I'm trying to be consistent in the state sync and storage service code: everything will get an Arc<RwLock<DbReaderWriter>>. The reason is that we'll have several state sync (and storage service) threads reading and writing to the DB concurrently, and it's simpler to give them all a shared RwLock object than to try to do some magic around reader-only and writer-only access. IIUC, the DiemDB struct today does not have a global lock (e.g., a mutex), correct? If so, I want to avoid any scary situations.

Contributor

it looks like the DbReaderWriter is just

pub struct DbReaderWriter {
    pub reader: Arc<dyn DbReader>,
    pub writer: Arc<dyn DbWriter>,
}

so whoever is constructing the StorageReader could just clone the inner Arc<dyn DbReader> and hand it down.

As for concurrency, part of the DbReader API contract is to be thread safe, so you don't need any locks/synchronization to call it. After all, we're effectively doing read-only calls to an MVCC database here, which supports unlimited read concurrency.
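As a rough illustration of that suggestion (the DbReaderWriter shape is taken from the snippet above; the StorageReader constructor is hypothetical):

use std::sync::Arc;

// Stand-ins for the real storage-interface traits referenced above.
pub trait DbReader: Send + Sync {}
pub trait DbWriter: Send + Sync {}

pub struct DbReaderWriter {
    pub reader: Arc<dyn DbReader>,
    pub writer: Arc<dyn DbWriter>,
}

// The storage server only needs read access, so it could hold just the reader.
pub struct StorageReader {
    storage: Arc<dyn DbReader>,
}

impl StorageReader {
    // Whoever constructs the StorageReader clones the inner reader and hands it down.
    pub fn new(db_rw: &DbReaderWriter) -> Self {
        Self { storage: Arc::clone(&db_rw.reader) }
    }
}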

@JoshLind JoshLind (Contributor Author) Sep 17, 2021

What happens if a single DbReader call (e.g., get_transactions()) makes several independent calls to the database (e.g., get_transactions, get_transaction_info and get_events_by_version)? Each of these calls is going to read different objects in the database. MVCC might guarantee that each read works, but there are no guarantees across several independent reads, correct? A writer might write between them and violate the higher-level assumptions (at the DbReader API). Likewise with interleaved writes. Note: the get_transactions() example is a little simple (you could argue it's bound by versions, which shouldn't change once written), but I'm not comfortable making that assumption for everything. Perhaps this has all been thought about before, though? With a single read/write lock around the entire API, it's much easier to reason about: each API call can be thought of as executing as a single "database transaction" or atomic operation. Happy to be convinced, but I'm still not there yet 😄
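A minimal sketch of the property being argued for, using illustrative types and a plain std::sync::RwLock: because both reads happen under one read guard, no writer can interleave between them.

use std::sync::{Arc, RwLock};

// Illustrative stand-in for the database handle guarded by the lock.
struct Db {
    transactions: Vec<String>,
    transaction_infos: Vec<String>,
}

// Both reads happen under the same read guard, so a writer (which must take
// the write guard) cannot modify the database between them.
fn get_transactions_and_infos(
    db: &Arc<RwLock<Db>>,
    start: usize,
    limit: usize,
) -> (Vec<String>, Vec<String>) {
    let guard = db.read().unwrap(); // held until the end of this function
    let txns = guard.transactions[start..start + limit].to_vec();
    let infos = guard.transaction_infos[start..start + limit].to_vec();
    (txns, infos)
}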

@msmouse msmouse Sep 17, 2021

The DB interfaces always take a ledger_version parameter where necessary, to guarantee a consistent view of the blockchain across reads. Unless there's a flaw, it never assumes the user is asking for "the latest".
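In other words, something like the following hedged sketch (a hypothetical VersionedReader trait, not the real DbReader signatures): reads pinned to an explicit ledger_version stay self-consistent, since data at or below an already-committed version never changes.

// Hypothetical read-only interface; the real DbReader methods differ, but the
// idea is the same: every read is pinned to an explicit ledger_version.
trait VersionedReader {
    fn get_transactions(&self, start_version: u64, limit: u64, ledger_version: u64) -> Vec<String>;
    fn get_events(&self, version: u64, ledger_version: u64) -> Vec<String>;
}

// No lock is needed: both calls read data at or below `ledger_version`, which
// never changes once committed, so concurrent writers (appending newer
// versions) cannot invalidate this view.
fn read_consistent_view<R: VersionedReader>(reader: &R, ledger_version: u64) -> (Vec<String>, Vec<String>) {
    let txns = reader.get_transactions(0, 10, ledger_version);
    let events = reader.get_events(5, ledger_version);
    (txns, events)
}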

@JoshLind JoshLind (Contributor Author) Sep 17, 2021

@msmouse, yeah, my feeling is that in most cases we might be alright. But, race conditions can be very sneaky and I'm still not 100% sure we won't have issues. We're also moving to a world where state sync is going to be multi-threaded and writing to the DB concurrently (e.g., writing different epochs, account states, transactions in parallel, etc.). Thus, I want to do this simple thing now (i.e., wrap all uses of state sync and storage code in a single RwLock). If we find that the single RwLock becomes a point of contention, I'll happily optimize it, but I'd like to make sure that everything is safe and simple for now. Does that make sense?

msmouse

As long as we are on the same page.

@JoshLind (Contributor Author)

Absolutely, thanks @msmouse (and @phlip9)! I've noted this down as something to dig into in the future (once we have an initial end-to-end prototype). 😄

msmouse

But still, we shouldn't need to write to the DB concurrently. It's not worth it. 😛

@JoshLind (Contributor Author)

@msmouse, I think we'll see. I just don't want to rule this out now, especially as our current design is going to be processing data chunks in parallel 😄

// Returns the frozen subtrees at the specified version.
//fn get_frozen_subtrees(
// version,
//) -> Result<Vec<HashValue>, Error>
Contributor

What is this API for? bootstrapping the TransactionAccumulatorSummary?

@JoshLind JoshLind (Contributor Author) Sep 16, 2021

It's for the case where a node grabs all account states and needs to bootstrap its view of the transaction accumulator. However, I'll remove it for now until I'm absolutely sure we need it. I think there's a way around it.

#[error("Unexpected error encountered: {0}")]
UnexpectedErrorEncountered(String),
}

Contributor

add a pub type Result<T, E = Error> = ::std::result::Result<T, E>;

@JoshLind (Contributor Author)

Can you elaborate? Why would we want this? These errors should be internal to the storage service only.

Contributor

I think this is just more convenience along with the Error type, so you don't have to keep typing Error. It's optional.
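For reference, a minimal sketch of the suggested alias (the alias itself is taken from the comment above; get_transactions here is just a hypothetical signature to show the before/after):

#[derive(Debug)]
pub enum Error {
    UnexpectedErrorEncountered(String),
}

// The suggested alias: defaults the error type so callers can write Result<T>.
pub type Result<T, E = Error> = ::std::result::Result<T, E>;

// Without the alias: fn get_transactions(start: u64) -> std::result::Result<Vec<u8>, Error>
// With the alias, the error type is implied:
fn get_transactions(_start: u64) -> Result<Vec<u8>> {
    Ok(vec![])
}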

@JoshLind JoshLind (Contributor Author) Sep 17, 2021

As weird as it sounds, I prefer typing Error everywhere, especially since so much of our code uses anyhow (which I'm not a big fan of). This way, I know exactly what Error type I'm getting 😄. It's also consistent with the rest of our code elsewhere.

pub fn handle_request(
    &self,
    request: StorageServiceRequest,
) -> Result<StorageServiceResponse, Error> {
Contributor

I guess we probably want handle_request to return just StorageServiceResponse and convert any errors into StorageServiceErrors? Even if DiemDB returns an error, it should probably just return an "Internal Error" variant

@JoshLind (Contributor Author)

I've done this for now, but we'll need to give it more attention in the future. For example, if some errors are really bad (e.g., internal invariant violations), we may want them to bubble up to main to cause the node to die. But we can deal with that later 😄
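A rough sketch of the shape being described, assuming a StorageServiceError enum with an InternalError variant on the wire (those names follow the comment above and are not necessarily the exact types in the PR):

// Internal errors stay inside the server; the wire-level response carries a
// sanitized error instead.
#[derive(Debug)]
pub enum Error {
    UnexpectedErrorEncountered(String),
}

#[derive(Debug)]
pub enum StorageServiceResponse {
    Transactions(Vec<u8>),
    Error(StorageServiceError),
}

#[derive(Debug)]
pub enum StorageServiceError {
    InternalError, // anything that goes wrong inside the node, including DiemDB errors
}

fn process(request_result: Result<Vec<u8>, Error>) -> StorageServiceResponse {
    match request_result {
        Ok(transactions) => StorageServiceResponse::Transactions(transactions),
        // Convert any internal error into a generic wire-level error variant.
        Err(_error) => StorageServiceResponse::Error(StorageServiceError::InternalError),
    }
}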

Comment on lines +162 to +164
"state-sync/storage-service/server",
"state-sync/storage-service/types",
Contributor

So, something I hadn't been thinking about before now: think about how this works out from the logging perspective, and how you find the logs with all of these crates. You'll have to check how the modules work out and hopefully they're consistent. For example, I found out yesterday that there were 3 different ones for storage, and only one for, say, state sync.

@JoshLind (Contributor Author)

Yeah, it's not great. But I'd argue that this is a limitation with the logger and setup, not with our code structure. For example, it's important to split up the code here because of circular dependencies and not wanting to tie the server and client together (which is what state sync currently does). When I come to add the logger I'll try to dig in a little.

gregnazario previously approved these changes Sep 17, 2021
@JoshLind (Contributor Author)

Thanks, all! 😄

/land

@bors-libra bors-libra moved this from In Review to Queued in bors Sep 17, 2021
@bors-libra bors-libra moved this from Queued to Testing in bors Sep 17, 2021
@github-actions

Cluster Test Result

Test runner setup time spent 281 secs
Compatibility test results for land_a9975df1 ==> land_6122cb78 (PR)
1. All instances running land_a9975df1, generating some traffic on network
2. First full node land_a9975df1 ==> land_6122cb78, to validate new full node to old validator node traffic
3. First Validator node land_a9975df1 ==> land_6122cb78, to validate storage compatibility
4. First batch validators (14) land_a9975df1 ==> land_6122cb78, to test consensus and traffic between old full nodes and new validator node
5. First batch full nodes (14) land_a9975df1 ==> land_6122cb78
6. Second batch validators (15) land_a9975df1 ==> land_6122cb78, to upgrade rest of the validators
7. Second batch of full nodes (15) land_a9975df1 ==> land_6122cb78, to finish the network upgrade, time spent 669 secs
all up: 1183 TPS, 3838 ms latency, 4400 ms p99 latency, no expired txns, time spent 250 secs
Logs: http://kibana.ct-0-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-09-17T23:19:28Z',to:'2021-09-17T23:42:21Z'))
Dashboard: http://grafana.ct-0-k8s-testnet.aws.hlw3truzy4ls.com/d/performance/performance?from=1631920768000&to=1631922141000
Validator 1 logs: http://kibana.ct-0-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-09-17T23:19:28Z',to:'2021-09-17T23:42:21Z'))&_a=(columns:!(log),query:(language:kuery,query:'kubernetes.pod_name:"val-1"'),sort:!(!('@timestamp',desc)))

Repro cmd:

./scripts/cti --tag land_a9975df1 --cluster-test-tag land_6122cb78 -E BATCH_SIZE=15 -E UPDATE_TO_TAG=land_6122cb78 --report report.json --suite land_blocking_compat

🎉 Land-blocking cluster test passed! 👌

@bors-libra bors-libra temporarily deployed to Sccache September 17, 2021 23:42
@bors-libra bors-libra temporarily deployed to Docker September 17, 2021 23:43
@bors-libra bors-libra temporarily deployed to Sccache September 17, 2021 23:43