
ARO-13667 hypershift: support disabling image registry via capability #1729

Open · wants to merge 1 commit into base: master

Conversation

flavianmissi (Member)

No description provided.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 19, 2024
openshift-ci bot (Contributor) commented Dec 19, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from flavianmissi. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch from 3f45963 to dda1779 Compare December 19, 2024 16:07
@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 4 times, most recently from 09fcfa5 to a975798 Compare December 20, 2024 12:20
@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 9 times, most recently from 6da7e55 to 59ccb87 Compare January 8, 2025 13:55
@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 2 times, most recently from 2ed863b to 87ef4db Compare January 8, 2025 14:36
Comment on lines 107 to 131
The `CVO` reconciler needs to edit the `ClusterVersion` resource, setting
`spec.capabilities.baselineCapabilitySet` to `"None"`, then calculating the
difference between `ClusterVersionCapabilitySets[ClusterVersionCapabilitySetCurrent]`
and `HostedCluster.spec.disabledCapabilities`, assigning the resulting set to
field `spec.capabilities.additionalEnabledCapabilities`.
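The reconciliation described in the quoted lines boils down to a set difference followed by an assignment. A minimal Go sketch of that computation (the `Capability` type, the `difference` helper, and the capability names are illustrative assumptions, not the actual HyperShift controller code):

```go
package main

import "fmt"

// Capability stands in for configv1.ClusterVersionCapability; the sets used
// in main are illustrative, not the real vCurrent contents.
type Capability string

// difference returns the capabilities in current that are not disabled,
// i.e. the value the reconciler would assign to
// spec.capabilities.additionalEnabledCapabilities.
func difference(current, disabled []Capability) []Capability {
	disabledSet := make(map[Capability]bool, len(disabled))
	for _, c := range disabled {
		disabledSet[c] = true
	}
	var result []Capability
	for _, c := range current {
		if !disabledSet[c] {
			result = append(result, c)
		}
	}
	return result
}

func main() {
	vCurrent := []Capability{"ImageRegistry", "Build", "DeploymentConfig"}
	disabled := []Capability{"ImageRegistry"} // from HostedCluster.spec.disabledCapabilities
	fmt.Println(difference(vCurrent, disabled)) // [Build DeploymentConfig]
}
```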

this is not strictly related to this enhancement, but working through the caps I've noticed that the installer has a dedicated (non-API based) validation we might need to replicate:
https://github.com/openshift/installer/blob/main/pkg/types/validation/installconfig.go#L205-L253

Things like

```go
if c.Platform.BareMetal != nil && !enabledCaps.Has(configv1.ClusterVersionCapabilityBaremetal)
```

are obviously bogus for Hypershift, but the ClusterVersionCapabilitySetCurrent will still define BareMetal, so we're also setting this implicitly.

It seems to me that a decoupled and dedicated CapabilitySet for Hypershift would be a better idea, but I defer this to @deads2k - this is clearly more work

Member

it's not clear to me why the installer's baremetal logic wouldn't apply to HyperShift. Does HyperShift not run on baremetal? Or does it run on baremetal but somehow allow the baremetal component to be disabled? I wonder what things like bring-your-own-nodes workflows might look like on HyperShift, but I could see some future where folks want just the Kube API and don't need the MachineAPI (or some future ClusterAPI?) component enabled, to tell the HostedControlPlane controller that it can skip the Cluster-API Machine(Set), etc. reconcilers.

@tjungblu commented Jan 17, 2025


@wking what's the impact when we run a Hypershift control plane (e.g. on AWS) with enabled baremetal capability?

Member Author


[...] the installer has a dedicated (non-API based) validation we might need to replicate:

Added a section about this under "Risks and Mitigations": https://github.com/openshift/enhancements/pull/1729/files#diff-36fbf863a674d8de905f7aa65cdb6158ec8c27b70d2fe5eed4272ca0a964a730R278-R289

@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 4 times, most recently from 1017891 to 2ee2bed Compare January 10, 2025 12:54
Comment on lines 244 to 339
1. Does Hypershift's CVO need to support [implicit capabilities][implicit-capabilities]?

[implicit-capabilities]: https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#updates
@flavianmissi (Member Author) commented Jan 10, 2025


@sjenning can you help us answer this question? this came up on a slack discussion.
I don't fully understand the role of the CVO in Hypershift so I'm having difficulties understanding how potential changes in the resources belonging to a capability might affect upgrades.

Here's the relevant section from the linked enhancement:

Updates

During updates, resources may move from being associated with one capability, to another, or move from being part of the core (always installed) to being part of an optional capability. In such cases, the governing logic is that once a resource has been applied to a cluster, it must continue to be applied to the cluster (unless it is removed from the payload entirely, of course). Furthermore, if any resource from a given capability is applied to a cluster, all other resources from that capability must be applied to the cluster, even if the cluster configuration would otherwise filter out that capability. These are referred to as implicitly enabled capabilities and the purpose is to avoid scenarios that would break cluster functionality. Once a resource is applied to a cluster, it will always be reconciled in future versions, and that if any part of a capability is present, the entire capability will be present.

This means the CVO must apply the following logic:

1. For the outgoing payload, find all manifests that are [currently included](https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#manifest-annotations).
2. For the new payload, find all the manifests that would be included, in the absence of capability filtering.
3. Match manifests using group, kind, namespace, and name.
4. Union all capabilities from matched manifests with the set of previously-enabled capabilities to compute the new `enabledCapabilities` set.

As in [the general case](https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#capabilities-cannot-be-uninstalled), when the requested spec.capabilities diverge from enabledCapabilities, the CVO will set ImplicitlyEnabledCapabilities=True in ClusterVersion's status.conditions with a message pointing out the divergence.
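The four steps above could be sketched as follows (the `manifestKey` and `manifest` types and the example manifests are illustrative assumptions, not the actual CVO implementation):

```go
package main

import "fmt"

// manifestKey matches manifests across payloads by group, kind, namespace,
// and name, as the quoted steps describe. Types are illustrative, not the
// real CVO ones.
type manifestKey struct{ Group, Kind, Namespace, Name string }

type manifest struct {
	Key          manifestKey
	Capabilities []string
}

// newEnabled computes the new enabledCapabilities set: the union of the
// previously enabled capabilities with the capabilities of every new-payload
// manifest that matches a currently-included outgoing manifest.
func newEnabled(outgoingIncluded, incoming []manifest, enabled []string) map[string]bool {
	included := map[manifestKey]bool{}
	for _, m := range outgoingIncluded {
		included[m.Key] = true
	}
	result := map[string]bool{}
	for _, c := range enabled {
		result[c] = true
	}
	for _, m := range incoming {
		if included[m.Key] {
			for _, c := range m.Capabilities {
				result[c] = true
			}
		}
	}
	return result
}

func main() {
	// A manifest that was included without a capability in the outgoing
	// payload, and gains one in the new payload (hypothetical example).
	key := manifestKey{"apps.openshift.io", "DeploymentConfig", "", "example"}
	outgoing := []manifest{{Key: key}}
	incoming := []manifest{{Key: key, Capabilities: []string{"DeploymentConfig"}}}
	fmt.Println(newEnabled(outgoing, incoming, []string{"ImageRegistry"}))
	// map[DeploymentConfig:true ImageRegistry:true]
}
```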

cc @tjungblu

Contributor


Let me start by saying that I didn't know about ImplicitlyEnabledCapabilities until just now, but I think the ramifications are this:

Capabilities cannot be disabled once enabled. That we already know. The effect is that if a user creates a cluster with the registry enabled or enables it after initially disabling it at install time, that cluster is locked to having the image registry enabled forever. The only possible toggle is from disabled -> enabled and it is a one-way trip.

ImplicitlyEnabledCapabilities is for converting more components into capabilities over time.

For example, in the y-release before image registry became a capability, ClusterVersionCapabilitySet4_13, ImageRegistry would not have been in the enabled capabilities set because it didn't exist yet. But once you upgraded to the y-stream where it was a capability, CV spec.capabilities would still be ClusterVersionCapabilitySet4_13 but CVO will set ImplicitlyEnabledCapabilities=True reflecting that ImageRegistry was not in the spec.capabilities but is still enabled because it was present pre-upgrade.

All that being said, I don't think we have to worry about this from a Hypershift API perspective because we are not exposing the capabilities API directly on the HostedCluster resource. As long as the HostedCluster API reflects the limitations of the underlying capabilities API (i.e. that enabling is a one-way trip and disabling can only happen at cluster creation time) we should be fine. We can let the CVO deal with ImplicitlyEnabledCapabilities, and because we don't support any OCP versions before the ImageRegistry capability was created, we don't have to worry about the situation above.
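The one-way semantics described here could be enforced with admission-style validation along these lines (a sketch with hypothetical names and shapes, not HyperShift's actual HostedCluster validation):

```go
package main

import "fmt"

// validateCapabilityUpdate enforces the one-way trip described above:
// a capability may move from disabled to enabled (i.e. be removed from the
// disabled set), but never from enabled to disabled on an existing cluster.
// The function name and []string representation are hypothetical.
func validateCapabilityUpdate(oldDisabled, newDisabled []string) error {
	wasDisabled := map[string]bool{}
	for _, c := range oldDisabled {
		wasDisabled[c] = true
	}
	for _, c := range newDisabled {
		if !wasDisabled[c] {
			return fmt.Errorf("capability %q cannot be disabled after cluster creation", c)
		}
	}
	return nil
}

func main() {
	// Re-enabling (shrinking the disabled set) is allowed.
	fmt.Println(validateCapabilityUpdate([]string{"ImageRegistry"}, nil)) // <nil>
	// Disabling after creation is rejected.
	fmt.Println(validateCapabilityUpdate(nil, []string{"ImageRegistry"}))
}
```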

@wking (Member) commented Jan 14, 2025


Summary

You'll want to figure out how to handle at least some of implicit-enablement, to avoid dropping DeploymentConfig in clusters updating into OCPSTRAT-1465.

Details

It's taken us a while to get real-world examples, but 4.18's OperatorLifecycleManagerV1 is the first capability born out of the blue (for GA clusters). The functionality was tech-preview in 4.17, so a GA 4.17 cluster updating to 4.18 with the None capability set would not have had the OperatorLifecycleManagerV1 capability enabled. You don't currently allow anything like a None capability set, so this case isn't relevant for you yet.

But the other case, where a capability is removed from vCurrent, is relevant. We're expecting DeploymentConfig to be dropped soon (say in 4.20, just to simplify discussion, but that's not a commitment, I'm not sure if or when it will actually happen). Clusters updating from 4.19 to 4.20 in vCurrent will implicitly enable the capability. If you leave the cluster in vCurrent without implicitly enabling the capability, the OpenShift API server says "hmm, I guess I reject this DeploymentConfig API request..." and all the DeploymentConfig-based Pods in the cluster get garbage collected, or something. Which would probably upset the workload admins who'd been using DeploymentConfigs pre-update.

  • You could maybe narrowly dance around it by unioning ClusterVersion's status.capabilities.enabledCapabilities with the vCurrent - disabledCapabilities set, but that would be feeding hosted-cluster-side ClusterVersion status back into management-cluster-side HostedControlPlane controller activities, which we try to avoid.
  • The CVO avoids that kind of feedback loop at update-time by holding both the outgoing and incoming manifest sets in memory at once, and comparing them to see which capabilities need enabling. But the CVO is still pulling ClusterVersion status.capabilities.enabledCapabilities back into its control loop during CVO-container-restart, because we don't have an alternative location to store that state. And having status that cannot be reconstructed without a time machine to see what older clusters were reconciling can make some backup/recovery flows more difficult.
  • Another possible workaround is adding your own status property, e.g. to HostedControlPlane, to hold the implicitly-enabled capabilities on the management cluster without tainting them with hosted-cluster-side data. But this still suffers from the "control loop fed by status" backup/recovery exposure that the current standalone CVO is exposed to.
  • Another possible workaround would be to grow the enabledCapabilities property you float now, in this initial phase, and then you can hard-code logic like "when a cluster updates from 4.19, always add DeploymentConfig to enabledCapabilities" to the 4.20 HostedControlPlane controller. Or other permutations of that kind of hard-coding, which is tedious to maintain, but possibly less brittle than relying on the preservation of status properties.

Aside from moving whole capabilities like that, the current CVO code handles cases like:

  1. There was a component built into OCP, without an associated ClusterVersion capability knob in 4.y.
  2. The component maintainers try to create a new ClusterVersion capability for their component in 4.(y+1).
  3. But they missed a manifest! The PrometheusRule gets the capability annotation in 4.(y+2).

Folks with the capability disabled on 4.(y+1) will have it implicitly enabled upon updating to 4.(y+2), because we don't support having a half-enabled capability. This situation doesn't come up that often, but it has happened. If you're willing to have the occasional cluster ambiguously running a partial capability, you can probably ignore it.

Member Author


Thank you both for your thoughtful answers.

The functionality was tech-preview in 4.17. So a GA 4.17 cluster updating to 4.18 with the None capability set would have not had the OperatorLifecycleManagerV1 capability enabled. You don't currently allow anything like a None capability set, so this case isn't relevant for you yet.

This EP proposes we start setting BaselineCapabilitySet to None, so this would become relevant...

Clusters updating from 4.19 to 4.20 in vCurrent will implicitly enable the capability. If you leave the cluster in vCurrent without implicitly enabling the capability, the OpenShift API server says "hmm, I guess I reject this DeploymentConfig API request..." and all the DeploymentConfig-based Pods in the cluster get garbage collected, or something.

Given that the CVO in Hypershift (as well as Hypershift itself) currently doesn't handle implicitly enabled capabilities, doesn't that mean we would already be in the state you describe if the DeploymentConfig capability were removed from vCurrent today, even without the changes proposed in this EP?


I'm still processing the suggestions, will probably have more questions once I have fully digested each.

Member


Given that CVO in Hypershift (as well as Hypershift itself) currently don't handle implicitly enabled capabilities, doesn't it mean that we would already be in the state you describe if the DeploymentConfig capability gets removed from vCurrent today, even without the changes proposed in this EP?

Yikes, you're right! I've warned the workloads folks here and here.

Member Author


I've added a section about this under Risks and Mitigations: https://github.com/openshift/enhancements/pull/1729/files#diff-36fbf863a674d8de905f7aa65cdb6158ec8c27b70d2fe5eed4272ca0a964a730R233

I'm still uncertain about what the correct approach here is. Would be great if you could give us your input @deads2k.

Member


Just wiring threads together, David's response seems to be over here.

Comment on lines 73 to 96
* Cluster instance admins can create clusters with the Image Registry disabled
* Cluster instance admins can re-enable the Image Registry without recreating
the cluster
* Components in the hosted control plane are not impacted by the Image Registry
absence
* Image Registry related assets such as workloads, storage account and
pull-secrets are not provisioned in clusters without the Image Registry
Contributor


We should be clear about the distinction between the Image Registry Capability vs an Image Registry instance throughout this enhancement.

For example, for the goals:

Suggested change

```diff
-* Cluster instance admins can create clusters with the Image Registry disabled
-* Cluster instance admins can re-enable the Image Registry without recreating
-  the cluster
-* Components in the hosted control plane are not impacted by the Image Registry
-  absence
-* Image Registry related assets such as workloads, storage account and
-  pull-secrets are not provisioned in clusters without the Image Registry
+* Cluster instance admins can create clusters with the Image Registry Capability disabled
+* Cluster instance admins can enable the Image Registry Capability without recreating
+  the cluster
+* Components in the hosted control plane are not impacted by the absence of the Image Registry Capability.
+* Assets comprising the Image Registry Capability such as workloads, storage accounts, and
+  pull-secrets are not provisioned in clusters without the Image Registry Capability.
```

Member Author


I agree about being more clear about component vs capability, thanks for pointing that out. In some of your suggestions I did mean to refer to the component though, so I incorporated it partially. PTAL.

See [Hypershift docs][hypershift-personas] for a definition of the personas used below.

* As a cluster service consumer, I want to provision hosted control planes and
clusters without the Image Registry, so that my hosted clusters do not contain
Contributor


Suggested change

```diff
-clusters without the Image Registry, so that my hosted clusters do not contain
+clusters without the Image Registry Capability, so that my hosted clusters do not contain
```

Member Author


I really do mean the component here, not the capability. Is there anything else I can do to improve clarity here? 🤔

@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 4 times, most recently from a00d7cc to 277b33d Compare January 14, 2025 12:38
@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch from 277b33d to dfe3d8a Compare January 15, 2025 13:42

### API Extensions


FYI, I've already created the API changes in openshift/hypershift#5415

@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 7 times, most recently from e5cd838 to 2759a36 Compare January 22, 2025 09:08
@flavianmissi flavianmissi changed the title WIP ARO-13667 hypershift: support disabling image registry via capability ARO-13667 hypershift: support disabling image registry via capability Jan 22, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 22, 2025
@flavianmissi flavianmissi force-pushed the hypershift-image-registry-capability-ARO-13667 branch 3 times, most recently from 4ad9211 to ee7cf65 Compare January 22, 2025 13:30
openshift-ci bot (Contributor) commented Jan 22, 2025

@flavianmissi: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

6 participants