Extend gray failure recentHealthTriggeredRecoveryTime to reflect any recovery trigger #11877

Draft: wants to merge 2 commits into main

Collaborator

@spraza spraza commented Jan 11, 2025

Description

Extend the gray failure recentHealthTriggeredRecoveryTime state to reflect any recovery, including recoveries that were not triggered by gray failure. This helps reduce the number of false-positive gray failure recoveries.

Consider an example in which CC_MAX_HEALTH_RECOVERY_COUNT is 1 and CC_TRACKING_HEALTH_RECOVERY_INTERVAL is 120 (seconds).

Previously, gray failure could trigger another recovery right after a non-gray failure recovery, regardless of the knob values above.

Now, the knobs account for non-gray failure recoveries as well. So in this example, gray failure will not trigger another recovery for at least 2 minutes after a non-gray failure recovery. After 2 minutes, if the complaints have not expired and the workers have not recovered, gray failure can decide to trigger a recovery.
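
To make the example concrete, here is a minimal sketch of the bookkeeping described above. It is a toy model written in Python, not the cluster controller code: the class and method names are hypothetical, the two constructor arguments stand in for CC_MAX_HEALTH_RECOVERY_COUNT and CC_TRACKING_HEALTH_RECOVERY_INTERVAL, and the count_all_recoveries flag is an illustration device for comparing the old and new behavior. It assumes that recent recovery timestamps are kept in a queue and dropped once they are older than the tracking interval, and it does not model complaint expiry or worker health.

```python
from collections import deque


class RecoveryTracker:
    """Toy model of the recent-recovery bookkeeping (all names are hypothetical)."""

    def __init__(self, max_recovery_count, tracking_interval, count_all_recoveries):
        self.max_recovery_count = max_recovery_count      # stands in for CC_MAX_HEALTH_RECOVERY_COUNT
        self.tracking_interval = tracking_interval        # stands in for CC_TRACKING_HEALTH_RECOVERY_INTERVAL (seconds)
        self.count_all_recoveries = count_all_recoveries  # True models the behavior proposed in this PR
        self.recent_recovery_times = deque()              # stands in for recentHealthTriggeredRecoveryTime

    def _recent_count(self, now):
        # Drop timestamps that are older than the tracking interval.
        while self.recent_recovery_times and now - self.recent_recovery_times[0] > self.tracking_interval:
            self.recent_recovery_times.popleft()
        return len(self.recent_recovery_times)

    def record_other_recovery(self, now):
        """Record a recovery that was not triggered by gray failure."""
        if self.count_all_recoveries:
            self.recent_recovery_times.append(now)

    def try_gray_failure_recovery(self, now):
        """Return True if gray failure is allowed to trigger a recovery at time `now`."""
        if self._recent_count(now) < self.max_recovery_count:
            self.recent_recovery_times.append(now)
            return True
        return False


old = RecoveryTracker(1, 120, count_all_recoveries=False)
new = RecoveryTracker(1, 120, count_all_recoveries=True)
for tracker in (old, new):
    tracker.record_other_recovery(now=0)  # a non-gray failure recovery at t = 0

print(old.try_gray_failure_recovery(now=30))   # True: the earlier recovery is not counted
print(new.try_gray_failure_recovery(now=30))   # False: still inside the 120 second window
print(new.try_gray_failure_recovery(now=121))  # True: the window has passed
```

In this model the only difference between the two modes is whether a non-gray failure recovery pushes a timestamp into the queue, which is the change described above.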

Once this is accepted, I will also create a backport PR for 7.3.

Testing

  1. A 100K-run test is currently running: 20250111-000712-praza-5160fe051c0f2a507d0339b79b934665fb5403 compressed=True data_size=36362848 fail_fast=10 max_runs=100000 priority=100 sanity=False submitted=20250111-000712 timeout=5400 username=praza-5160fe051c0f2a507d0339b79b934665fb54030e. I will update the results when it is done.
  2. Also explicitly ran fdbserver -r unittests -f /fdbserver/clustercontroller/, which includes the gray failure unit tests; all pass.
  3. Ran multiple tests in a Kubernetes cluster to ensure that gray failure recovery is not triggered soon after a non-gray failure recovery. Also tested that gray failure recovery is still triggered when there has been no non-gray failure recovery.

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

…t any recovery, including non-gray failure triggered ones
@spraza spraza changed the title Extend gray failure recentHealthTriggeredRecoveryTime state to reflect any recovery Extend gray failure recentHealthTriggeredRecoveryTime to reflect any recovery trigger Jan 11, 2025
@spraza spraza requested review from johscheuer and jzhou77 January 11, 2025 00:09
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 5160fe0
  • Duration 0:22:08
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

  • Commit ID: 49bcf0d
  • Duration 0:22:14
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 5160fe0
  • Duration 0:39:42
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 5160fe0
  • Duration 0:55:02
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 5160fe0
  • Duration 1:00:14
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 5160fe0
  • Duration 1:00:16
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 49bcf0d
  • Duration 0:54:25
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 49bcf0d
  • Duration 0:56:29
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 5160fe0
  • Duration 1:14:28
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 5160fe0
  • Duration 1:19:50
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: 49bcf0d
  • Duration 1:11:55
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: 49bcf0d
  • Duration 1:12:47
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@spraza spraza marked this pull request as draft January 11, 2025 03:01