[forge] change health check order for k8s nodes #9202

Merged
bors-libra merged 1 commit into diem:main on Sep 30, 2021

Conversation

zihaoccc
Contributor

rustielin
rustielin previously approved these changes Sep 16, 2021
@zihaoccc
Contributor Author

/land

@bors-libra bors-libra moved this from In Review to Queued in bors Sep 16, 2021
bors-libra pushed a commit that referenced this pull request Sep 17, 2021
@bors-libra bors-libra moved this from Queued to Testing in bors Sep 17, 2021
@github-actions

Cluster Test Result

Test runner setup time spent 271 secs
Compatibility test results for land_f13aa94e ==> land_7fd23703 (PR)
1. All instances running land_f13aa94e, generating some traffic on network
2. First full node land_f13aa94e ==> land_7fd23703, to validate new full node to old validator node traffic
3. First Validator node land_f13aa94e ==> land_7fd23703, to validate storage compatibility
4. First batch validators (14) land_f13aa94e ==> land_7fd23703, to test consensus and traffic between old full nodes and new validator node
5. First batch full nodes (14) land_f13aa94e ==> land_7fd23703
6. Second batch validators (15) land_f13aa94e ==> land_7fd23703, to upgrade rest of the validators
7. Second batch of full nodes (15) land_f13aa94e ==> land_7fd23703, to finish the network upgrade, time spent 648 secs
all up : 1170 TPS, 3884 ms latency, 4400 ms p99 latency, no expired txns, time spent 250 secs
Logs: http://kibana.ct-2-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-09-17T00:23:36Z',to:'2021-09-17T00:46:00Z'))
Dashboard: http://grafana.ct-2-k8s-testnet.aws.hlw3truzy4ls.com/d/performance/performance?from=1631838216000&to=1631839560000
Validator 1 logs: http://kibana.ct-2-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-09-17T00:23:36Z',to:'2021-09-17T00:46:00Z'))&_a=(columns:!(log),query:(language:kuery,query:'kubernetes.pod_name:"val-1"'),sort:!(!('@timestamp',desc)))

Repro cmd:

./scripts/cti --tag land_f13aa94e --cluster-test-tag land_7fd23703 -E BATCH_SIZE=15 -E UPDATE_TO_TAG=land_7fd23703 --report report.json --suite land_blocking_compat

🎉 Land-blocking cluster test passed! 👌
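
The seven steps in the report above describe a staged rolling upgrade: one full node, then one validator, then the remaining nodes in two batches. A minimal Python sketch of that ordering follows. It is not the cluster test's actual implementation; the function name `rolling_upgrade_plan`, the totals of 30 validators and 30 full nodes (inferred from 1 + 14 + 15), and the reading that the first batch is `BATCH_SIZE - 1` because one node was already upgraded are all assumptions made for illustration.

```python
from typing import List, Tuple


def rolling_upgrade_plan(
    num_validators: int, num_full_nodes: int, batch_size: int
) -> List[Tuple[str, int]]:
    """Return (node kind, count) phases in the order the report lists them.

    Hypothetical helper: the phase order is taken from the report, the
    batch arithmetic is an assumption.
    """
    # Assumption: the first batch excludes the single node upgraded earlier,
    # so it holds batch_size - 1 nodes.
    first_batch = batch_size - 1
    return [
        ("full node", 1),    # step 2: new full node against old validators
        ("validator", 1),    # step 3: storage compatibility
        ("validator", first_batch),                       # step 4
        ("full node", first_batch),                       # step 5
        ("validator", num_validators - 1 - first_batch),  # step 6
        ("full node", num_full_nodes - 1 - first_batch),  # step 7
    ]


if __name__ == "__main__":
    for kind, count in rolling_upgrade_plan(30, 30, 15):
        print(f"upgrade {count} {kind}(s)")
```

With `BATCH_SIZE=15`, as passed in the repro command, this yields the 1, 1, 14, 14, 15, 15 sequence shown in the report.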

@bors-libra
Contributor

💔 Test Failed - ci-test

@bors-libra bors-libra moved this from Testing to In Review in bors Sep 17, 2021
@zihaoccc
Contributor Author

/land

@bors-libra bors-libra moved this from In Review to Queued in bors Sep 17, 2021
bors-libra pushed a commit that referenced this pull request Sep 17, 2021
@bors-libra bors-libra moved this from Queued to Testing in bors Sep 17, 2021
@github-actions

Cluster Test Result

Test runner setup time spent 268 secs
Compatibility test results for land_235bbd9b ==> land_b48f3ffc (PR)
1. All instances running land_235bbd9b, generating some traffic on network
2. First full node land_235bbd9b ==> land_b48f3ffc, to validate new full node to old validator node traffic
3. First Validator node land_235bbd9b ==> land_b48f3ffc, to validate storage compatibility
4. First batch validators (14) land_235bbd9b ==> land_b48f3ffc, to test consensus and traffic between old full nodes and new validator node
5. First batch full nodes (14) land_235bbd9b ==> land_b48f3ffc
6. Second batch validators (15) land_235bbd9b ==> land_b48f3ffc, to upgrade rest of the validators
7. Second batch of full nodes (15) land_235bbd9b ==> land_b48f3ffc, to finish the network upgrade, time spent 683 secs
all up : 1191 TPS, 3807 ms latency, 4300 ms p99 latency, no expired txns, time spent 249 secs
Logs: http://kibana.ct-1-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-09-17T05:46:58Z',to:'2021-09-17T06:08:43Z'))
Dashboard: http://grafana.ct-1-k8s-testnet.aws.hlw3truzy4ls.com/d/performance/performance?from=1631857618000&to=1631858923000
Validator 1 logs: http://kibana.ct-1-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-09-17T05:46:58Z',to:'2021-09-17T06:08:43Z'))&_a=(columns:!(log),query:(language:kuery,query:'kubernetes.pod_name:"val-1"'),sort:!(!('@timestamp',desc)))

Repro cmd:

./scripts/cti --tag land_235bbd9b --cluster-test-tag land_b48f3ffc -E BATCH_SIZE=15 -E UPDATE_TO_TAG=land_b48f3ffc --report report.json --suite land_blocking_compat

🎉 Land-blocking cluster test passed! 👌
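
The Logs and Dashboard links in the report above cover the same time window: the Kibana URL carries it as UTC timestamps, while the Grafana URL carries it as epoch milliseconds in `from` and `to`. The short Python sketch below shows the conversion, which can help when matching a dashboard view to a log query. The helper name `kibana_to_grafana_ms` is hypothetical and not part of either tool.

```python
from datetime import datetime, timezone


def kibana_to_grafana_ms(iso_utc: str) -> int:
    """Convert a 'YYYY-MM-DDTHH:MM:SSZ' UTC timestamp to epoch milliseconds."""
    parsed = datetime.strptime(iso_utc, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing 'Z' means UTC, so attach that timezone explicitly.
    parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp()) * 1000


if __name__ == "__main__":
    # Window taken from the Logs link in the report above.
    window = ("2021-09-17T05:46:58Z", "2021-09-17T06:08:43Z")
    print([kibana_to_grafana_ms(t) for t in window])
    # prints [1631857618000, 1631858923000], the from/to values in the Dashboard link
```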

@bors-libra
Contributor

💔 Test Failed - ci-test

@zihaoccc
Contributor Author

/canary

bors-libra pushed a commit that referenced this pull request Sep 17, 2021
@bors-libra bors-libra moved this from In Review to Canary in bors Sep 17, 2021
@bmwill
Contributor

bmwill commented Sep 17, 2021

This change seems to differ from what the PR title indicates: it changes how the health check works in a specific test. I'm also not sure what you're trying to solve, but this change may not be sufficient.

@github-actions

Cluster Test Result

Test runner setup time spent 263 secs
Compatibility test results for land_98ec6a24 ==> land_3ff88698 (PR)
1. All instances running land_98ec6a24, generating some traffic on network
2. First full node land_98ec6a24 ==> land_3ff88698, to validate new full node to old validator node traffic
3. First Validator node land_98ec6a24 ==> land_3ff88698, to validate storage compatibility
4. First batch validators (14) land_98ec6a24 ==> land_3ff88698, to test consensus and traffic between old full nodes and new validator node
5. First batch full nodes (14) land_98ec6a24 ==> land_3ff88698
6. Second batch validators (15) land_98ec6a24 ==> land_3ff88698, to upgrade rest of the validators
7. Second batch of full nodes (15) land_98ec6a24 ==> land_3ff88698, to finish the network upgrade, time spent 631 secs
all up : 1178 TPS, 3856 ms latency, 4400 ms p99 latency, no expired txns, time spent 250 secs
Logs: http://kibana.ct-2-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-09-17T18:42:31Z',to:'2021-09-17T19:04:31Z'))
Dashboard: http://grafana.ct-2-k8s-testnet.aws.hlw3truzy4ls.com/d/performance/performance?from=1631904151000&to=1631905471000
Validator 1 logs: http://kibana.ct-2-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-09-17T18:42:31Z',to:'2021-09-17T19:04:31Z'))&_a=(columns:!(log),query:(language:kuery,query:'kubernetes.pod_name:"val-1"'),sort:!(!('@timestamp',desc)))

Repro cmd:

./scripts/cti --tag land_98ec6a24 --cluster-test-tag land_3ff88698 -E BATCH_SIZE=15 -E UPDATE_TO_TAG=land_3ff88698 --report report.json --suite land_blocking_compat

🎉 Land-blocking cluster test passed! 👌

@bors-libra
Contributor

💔 Test Failed - ci-test

@bors-libra bors-libra moved this from Canary to In Review in bors Sep 17, 2021
@zihaoccc zihaoccc marked this pull request as draft September 22, 2021 03:16
@zihaoccc zihaoccc marked this pull request as ready for review September 29, 2021 22:26
bmwill
bmwill previously approved these changes Sep 29, 2021
Contributor

@bmwill bmwill left a comment

/land

@bors-libra bors-libra moved this from In Review to Queued in bors Sep 29, 2021
bors-libra pushed a commit that referenced this pull request Sep 29, 2021
@bors-libra bors-libra moved this from Queued to Testing in bors Sep 29, 2021
@bors-libra
Contributor

💔 Test Failed - ci-test

@bors-libra bors-libra moved this from Testing to In Review in bors Sep 29, 2021
@bmwill
Contributor

bmwill commented Sep 29, 2021

/land

@bors-libra bors-libra moved this from In Review to Queued in bors Sep 29, 2021
@bors-libra bors-libra moved this from Queued to Testing in bors Sep 30, 2021
@github-actions

Cluster Test Result

Test runner setup time spent 258 secs
Compatibility test results for land_0a476189 ==> land_bfd30363 (PR)
1. All instances running land_0a476189, generating some traffic on network
2. First full node land_0a476189 ==> land_bfd30363, to validate new full node to old validator node traffic
3. First Validator node land_0a476189 ==> land_bfd30363, to validate storage compatibility
4. First batch validators (14) land_0a476189 ==> land_bfd30363, to test consensus and traffic between old full nodes and new validator node
5. First batch full nodes (14) land_0a476189 ==> land_bfd30363
6. Second batch validators (15) land_0a476189 ==> land_bfd30363, to upgrade rest of the validators
7. Second batch of full nodes (15) land_0a476189 ==> land_bfd30363, to finish the network upgrade, time spent 688 secs
all up : 1199 TPS, 3783 ms latency, 4300 ms p99 latency, no expired txns, time spent 250 secs
Logs: http://kibana.ct-0-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-09-30T00:25:39Z',to:'2021-09-30T00:48:18Z'))
Dashboard: http://grafana.ct-0-k8s-testnet.aws.hlw3truzy4ls.com/d/performance/performance?from=1632961539000&to=1632962898000
Validator 1 logs: http://kibana.ct-0-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-09-30T00:25:39Z',to:'2021-09-30T00:48:18Z'))&_a=(columns:!(log),query:(language:kuery,query:'kubernetes.pod_name:"val-1"'),sort:!(!('@timestamp',desc)))

Repro cmd:

./scripts/cti --tag land_0a476189 --cluster-test-tag land_bfd30363 -E BATCH_SIZE=15 -E UPDATE_TO_TAG=land_bfd30363 --report report.json --suite land_blocking_compat

🎉 Land-blocking cluster test passed! 👌

@bors-libra bors-libra removed this from Testing in bors Sep 30, 2021
@bors-libra bors-libra merged commit bfd3036 into diem:main Sep 30, 2021
@bors-libra bors-libra temporarily deployed to Sccache September 30, 2021 00:49 Inactive
@bors-libra bors-libra temporarily deployed to Docker September 30, 2021 00:50 Inactive
@bors-libra bors-libra temporarily deployed to Sccache September 30, 2021 00:50 Inactive
4 participants