DAOS-7485 control: Implement system reint to act on all pools #15551

Open
wants to merge 26 commits into
base: master

tanabarr
Contributor

@tanabarr tanabarr commented Dec 3, 2024

Add a dmg system reint command that reintegrates a set of storage nodes or
ranks into all the pools they belong to. The command takes --ranks or
--rank-hosts in ranged format.

  • Shorten variable naming from Reintegrate to Reint in C code
  • Don't export variables unnecessarily in cmd/dmg
  • Improve reporting of protobuf unmarshal errors
  • Add system reintegrate command
  • Implement reint with system drain request flag
  • Add unit test coverage for new code

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@tanabarr tanabarr added the control-plane work on the management infrastructure of the DAOS Control Plane label Dec 3, 2024
@tanabarr tanabarr self-assigned this Dec 3, 2024

github-actions bot commented Dec 3, 2024

Ticket title is 'dmg command to drain and reintegrate nodes from all pools'
Status is 'In Review'
Labels: 'triaged'
https://daosio.atlassian.net/browse/DAOS-7485

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/1/execution/node/357/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/1/execution/node/354/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/1/execution/node/273/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/1/execution/node/304/log

@daosbuild1
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/1/execution/node/519/log

Base automatically changed from tanabarr/control-drainpools-pernode to master December 3, 2024 18:56
@daosbuild1
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/2/execution/node/375/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/2/execution/node/360/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/2/execution/node/369/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/2/execution/node/359/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/2/execution/node/364/log

@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15551/5/testReport/

tanabarr and others added 8 commits December 11, 2024 18:18
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr force-pushed the tanabarr/control-reintpools-pernode branch from 45f3e40 to d96bbfd Compare December 11, 2024 18:18
@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15551/6/testReport/

Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
…-stack/daos into tanabarr/control-reintpools-pernode

Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@daosbuild1
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15551/16/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15551/16/display/redirect

@daosbuild1
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15551/16/display/redirect

@daosbuild1
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15551/16/display/redirect

Features: control pool
Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr
Contributor Author

tanabarr commented Jan 7, 2025

This PR is failing the copyright check but I didn't want to rerun all the code through the new hooks as it would probably require a force push and inconvenience reviewers. @daltonbohning does that sound reasonable or should I work out how to update the copyright notices?

@tanabarr
Contributor Author

tanabarr commented Jan 7, 2025

gatekeeper please use PR title and description as commit message when landing

@daltonbohning
Contributor

> This PR is failing the copyright check but I didn't want to rerun all the code through the new hooks as it would probably require a force push and inconvenience reviewers. @daltonbohning does that sound reasonable or should I work out how to update the copyright notices?

Yeah, the copyright hook is not foolproof and it's tricky because the copyright is more about when the work was done, not when the commit was merged to master. I don't know the right answer, but if you do want to update them as if all this work was done in 2025, here is a commit for that (I had to do some manual trickery to get this): 9885e01

The hook should update the copyright for any new changes on this PR, assuming you have the hooks set up.

knard38
knard38 previously approved these changes Jan 8, 2025
…intpools-pernode

Features: pool
Signed-off-by: Tom Nabarro <[email protected]>
Features: pool
Signed-off-by: Tom Nabarro <[email protected]>
@daosbuild1
Collaborator

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15551/20/display/redirect

@@ -134,7 +134,7 @@ func (m MgmtMethod) String() string {
 MethodPoolExclude: "PoolExclude",
 MethodPoolDrain: "PoolDrain",
 MethodPoolExtend: "PoolExtend",
-MethodPoolReintegrate: "PoolReintegrate",
+MethodPoolReint: "PoolReint",
Contributor

Just FYI, I think this line still needs to be fixed. These names and strings are totally internal to Go.

Comment on lines 56 to 61
DRPC_METHOD_MGMT_REINTEGRATE = 226,
DRPC_METHOD_MGMT_CONT_SET_OWNER = 227,
DRPC_METHOD_MGMT_EXCLUDE = 228,
DRPC_METHOD_MGMT_EXTEND = 229,
DRPC_METHOD_MGMT_POOL_EVICT = 230,
DRPC_METHOD_MGMT_DRAIN = 231,
Contributor

What's the motivation for renumbering these? It shouldn't cause any problems since daos_server and daos_engine must be the same version, but... why? If you're just standardizing the naming without changing the meaning, IMO it's better to rename in place, with the same numbers.

Doc-only: false

Signed-off-by: Tom Nabarro <[email protected]>
@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/21/execution/node/1453/log

@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15551/21/execution/node/1470/log

@tanabarr
Contributor Author

@kjacque I would like to get this PR landed, and I don't think we should block on variable-renaming issues. I have reverted most of the Go name changes, and I don't think reverting the dRPC method names warrants another run through CI. If it's okay with you, can we go ahead with this version of the PR and move on to bigger fish? TIA

@tanabarr
Contributor Author

CI failures all attributable to DAOS-16921

@tanabarr tanabarr requested review from kjacque and knard38 January 10, 2025 12:42
@tanabarr tanabarr added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Jan 10, 2025
kjacque
kjacque previously approved these changes Jan 10, 2025
Contributor

@kjacque kjacque left a comment

The dRPC opcode renumbering is the only thing that really bothers me, but I can let it go because I can't see a way that it could break actual operation. daos_server and daos_engine must be the same version, and these opcodes only communicate between server and engine. That said, I think it's generally a bad idea to renumber these method IDs. We treat them as an API.

Contributor

@mjmac mjmac left a comment

I don't think that this should be landed without reverting the truncated names in the Go code. All of the renaming makes the patch larger than necessary, and it introduces inconsistency in the naming. If you want to have shorter names on the C side to conform to those conventions, that's fine, but there's no good reason to impose the C conventions on the Go code.

To be clear: "Reintegrate" -> "Reint" is a disimprovement. Please revert those changes specifically.

Allow-unstable-test: true
Features: pool
Signed-off-by: Tom Nabarro <[email protected]>
Labels
control-plane work on the management infrastructure of the DAOS Control Plane forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed.
6 participants