ARO-13667 hypershift: support disabling image registry via capability #1729
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
The `CVO` reconciler needs to edit the `ClusterVersion` resource, setting
`spec.capabilities.baselineCapabilitySet` to `"None"`, then calculating the
difference between `ClusterVersionCapabilitySets[ClusterVersionCapabilitySetCurrent]`
and `HostedCluster.spec.disabledCapabilities`, assigning the resulting set to
the field `spec.capabilities.additionalEnabledCapabilities`.
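For illustration, a minimal sketch of that set arithmetic as a HostedControlPlane-side reconciler might compute it. The `configv1` names come from openshift/api; the function name and the shape of the HyperShift-side input are assumptions, not the actual implementation.

```go
package capabilities

import (
	configv1 "github.com/openshift/api/config/v1"
)

// computeCapabilitiesSpec sketches the arithmetic described above: baseline
// "None", plus everything in vCurrent that the HostedCluster did not
// explicitly disable.
func computeCapabilitiesSpec(disabled []configv1.ClusterVersionCapability) *configv1.ClusterVersionCapabilitiesSpec {
	disabledSet := map[configv1.ClusterVersionCapability]bool{}
	for _, c := range disabled {
		disabledSet[c] = true
	}

	var additional []configv1.ClusterVersionCapability
	for _, c := range configv1.ClusterVersionCapabilitySets[configv1.ClusterVersionCapabilitySetCurrent] {
		if !disabledSet[c] {
			additional = append(additional, c)
		}
	}

	return &configv1.ClusterVersionCapabilitiesSpec{
		BaselineCapabilitySet:         configv1.ClusterVersionCapabilitySetNone,
		AdditionalEnabledCapabilities: additional,
	}
}
```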
this is not strictly related to this enhancement, but working through the caps I've noticed that the installer has a dedicated (non-API based) validation we might need to replicate:
https://github.com/openshift/installer/blob/main/pkg/types/validation/installconfig.go#L205-L253
Things like `if c.Platform.BareMetal != nil && !enabledCaps.Has(configv1.ClusterVersionCapabilityBaremetal)` are obviously bogus for Hypershift, but `ClusterVersionCapabilitySetCurrent` will still define BareMetal, so we're also setting this implicitly.
It seems to me that a decoupled and dedicated CapabilitySet for Hypershift would be a better idea, but I defer this to @deads2k - this is clearly more work
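For context, a hedged sketch of the kind of platform/capability cross-check the linked installer code performs, translated to standalone Go; the function name and inputs are invented for illustration and are not HyperShift code.

```go
package validation

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// validatePlatformCapabilities mirrors the installer's rule that a platform
// requiring a component must not have that component's capability disabled.
// Only the baremetal case from the linked snippet is shown.
func validatePlatformCapabilities(platformType string, enabled map[configv1.ClusterVersionCapability]bool) error {
	if platformType == "BareMetal" && !enabled[configv1.ClusterVersionCapabilityBaremetal] {
		return fmt.Errorf("the %q capability is required by the BareMetal platform and cannot be disabled", configv1.ClusterVersionCapabilityBaremetal)
	}
	return nil
}
```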
it's not clear to me why the installer's `baremetal` logic wouldn't apply to HyperShift. Does HyperShift not run on `baremetal`? Or does it run on `baremetal` but somehow allow the `baremetal` component to be disabled? I wonder what things like bring-your-own-nodes workflows might look like on HyperShift, but I could see some future where folks want just the Kube API and don't need the `MachineAPI` (or some future `ClusterAPI`?) component enabled, to tell the HostedControlPlane controller that it can skip the Cluster-API Machine(Set), etc. reconcilers.
@wking what's the impact when we run a Hypershift control plane (e.g. on AWS) with the `baremetal` capability enabled?
> [...] the installer has a dedicated (non-API based) validation we might need to replicate:
Added a section about this under "Risks and Mitigations": https://github.com/openshift/enhancements/pull/1729/files#diff-36fbf863a674d8de905f7aa65cdb6158ec8c27b70d2fe5eed4272ca0a964a730R278-R289
1. Does Hypershift's CVO need to support [implicit capabilities][implicit-capabilities]?

[implicit-capabilities]: https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#updates
@sjenning can you help us answer this question? This came up in a Slack discussion.
I don't fully understand the role of the CVO in Hypershift, so I'm having difficulty understanding how potential changes in the resources belonging to a capability might affect upgrades.
Here's the relevant section from the linked enhancement:
Updates
During updates, resources may move from being associated with one capability, to another, or move from being part of the core (always installed) to being part of an optional capability. In such cases, the governing logic is that once a resource has been applied to a cluster, it must continue to be applied to the cluster (unless it is removed from the payload entirely, of course). Furthermore, if any resource from a given capability is applied to a cluster, all other resources from that capability must be applied to the cluster, even if the cluster configuration would otherwise filter out that capability. These are referred to as implicitly enabled capabilities and the purpose is to avoid scenarios that would break cluster functionality. Once a resource is applied to a cluster, it will always be reconciled in future versions, and that if any part of a capability is present, the entire capability will be present.
This means the CVO must apply the following logic:
1. For the outgoing payload, find all manifests that are [currently included](https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#manifest-annotations).
2. For the new payload, find all the manifests that would be included, in the absence of capability filtering.
3. Match manifests using group, kind, namespace, and name.
4. Union all capabilities from matched manifests with the set of previously-enabled capabilities to compute the new enabledCapabilities set.
5. As in [the general case](https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#capabilities-cannot-be-uninstalled), when the requested spec.capabilities diverge from enabledCapabilities, the CVO will set ImplicitlyEnabledCapabilities=True in ClusterVersion's status.conditions with a message pointing out the divergence.
cc @tjungblu
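To make the quoted rule concrete, here is a rough sketch of the matching-and-union step in Go. The `manifest` type and the function are simplified placeholders, not the CVO's real data structures.

```go
package capabilities

import (
	configv1 "github.com/openshift/api/config/v1"
)

// manifest is a simplified stand-in for a payload manifest: a match key built
// from group, kind, namespace, and name, plus its capability annotations.
type manifest struct {
	Key          string // group/kind/namespace/name
	Capabilities []configv1.ClusterVersionCapability
}

// implicitlyEnabled unions the previously-enabled capabilities with the
// capabilities of every incoming manifest that matches a manifest the
// outgoing payload had already applied.
func implicitlyEnabled(outgoingApplied, incoming []manifest, previouslyEnabled []configv1.ClusterVersionCapability) map[configv1.ClusterVersionCapability]bool {
	enabled := map[configv1.ClusterVersionCapability]bool{}
	for _, c := range previouslyEnabled {
		enabled[c] = true
	}
	applied := map[string]bool{}
	for _, m := range outgoingApplied {
		applied[m.Key] = true
	}
	for _, m := range incoming {
		if !applied[m.Key] {
			continue // brand-new manifest: normal capability filtering applies
		}
		for _, c := range m.Capabilities {
			enabled[c] = true // already-applied resource keeps its capability enabled
		}
	}
	return enabled
}
```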
Let me start by saying that I didn't know about `ImplicitlyEnabledCapabilities` until just now, but I think the ramifications are this:
Capabilities cannot be disabled once enabled. That we already know. The effect is that if a user creates a cluster with the registry enabled, or enables it after initially disabling it at install time, that cluster is locked to having the image registry enabled forever. The only possible toggle is from disabled -> enabled and it is a one-way trip.
`ImplicitlyEnabledCapabilities` is for converting more components into capabilities over time. For example, in the y-release before the image registry became a capability, `ClusterVersionCapabilitySet4_13`, `ImageRegistry` would not have been in the enabled capabilities set because it didn't exist yet. But once you upgraded to the y-stream where it was a capability, CV `spec.capabilities` would still be `ClusterVersionCapabilitySet4_13`, but the CVO will set `ImplicitlyEnabledCapabilities=True`, reflecting that `ImageRegistry` was not in the `spec.capabilities` but is still enabled because it was present pre-upgrade.
All that being said, I don't think we have to worry about this from a HyperShift API perspective because we are not exposing the capabilities API directly on the HostedCluster resource. As long as the HostedCluster API reflects the limitations of the underlying capabilities API (i.e. that enabling is a one-way trip and disabling can only happen at cluster creation time) we should be fine. We can let the CVO deal with `ImplicitlyEnabledCapabilities`, and because we don't support any OCP versions before the `ImageRegistry` capability was created, we don't have to worry about the situation above.
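A small sketch of how that one-way rule could be enforced on HostedCluster updates, assuming a hypothetical `disabledCapabilities` list field; this is illustrative, not the actual HyperShift validation.

```go
package validation

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// validateCapabilityTransition rejects any update that adds a capability to
// the disabled list: a capability that was enabled (i.e. not previously
// disabled) must stay enabled, so the new disabled list must be a subset of
// the old one.
func validateCapabilityTransition(oldDisabled, newDisabled []configv1.ClusterVersionCapability) error {
	previouslyDisabled := map[configv1.ClusterVersionCapability]bool{}
	for _, c := range oldDisabled {
		previouslyDisabled[c] = true
	}
	for _, c := range newDisabled {
		if !previouslyDisabled[c] {
			return fmt.Errorf("capability %q is already enabled and cannot be disabled after cluster creation", c)
		}
	}
	return nil
}
```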
Summary
You'll want to figure out how to handle at least some of implicit-enablement, to avoid dropping `DeploymentConfig` in clusters updating into OCPSTRAT-1465.
Details
It's taken us a while to get real-world examples, but 4.18's `OperatorLifecycleManagerV1` is the first capability born out of the blue (for GA clusters). The functionality was tech-preview in 4.17. So a GA 4.17 cluster updating to 4.18 with the `None` capability set would have not had the `OperatorLifecycleManagerV1` capability enabled. You don't currently allow anything like a `None` capability set, so this case isn't relevant for you yet.
But the other case, where a capability is removed from `vCurrent`, is relevant. We're expecting `DeploymentConfig` to be dropped soon (say in 4.20, just to simplify discussion, but that's not a commitment, I'm not sure if or when it will actually happen). Clusters updating from 4.19 to 4.20 in `vCurrent` will implicitly enable the capability. If you leave the cluster in `vCurrent` without implicitly enabling the capability, the OpenShift API server says "hmm, I guess I reject this DeploymentConfig API request..." and all the DeploymentConfig-based Pods in the cluster get garbage collected, or something. Which would probably upset the workload admins who'd been using DeploymentConfigs pre-update.
- You could maybe narrowly dance around it by unioning ClusterVersion's `status.capabilities.enabledCapabilities` with the `vCurrent - disabledCapabilities` set, but that would be feeding hosted-cluster-side ClusterVersion `status` back into management-cluster-side HostedControlPlane controller activities, which we try to avoid.
- The CVO avoids that kind of feedback loop at update-time by holding both the outgoing and incoming manifest sets in memory at once, and comparing them to see which capabilities need enabling. But the CVO is still pulling ClusterVersion `status.capabilities.enabledCapabilities` back into its control loop during CVO-container-restart, because we don't have an alternative location to store that state. And having `status` that cannot be reconstructed without a time machine to see what older clusters were reconciling can make some backup/recovery flows more difficult.
- Another possible workaround is adding your own `status` property, e.g. to HostedControlPlane, to hold the implicitly-enabled capabilities on the management cluster without tainting them with hosted-cluster-side data. But this still suffers from the "control loop fed by `status`" backup/recovery exposure that the current standalone CVO is exposed to.
- Another possible workaround would be to grow the `enabledCapabilities` property you float now, in this initial phase, and then you can hard-code logic like "when a cluster updates from 4.19, always add `DeploymentConfig` to `enabledCapabilities`" to the 4.20 HostedControlPlane controller. Or other permutations of that kind of hard-coding, which is tedious to maintain, but possibly less brittle than relying on the preservation of `status` properties.
Aside from moving whole capabilities like that, the current CVO code handles cases like:
- There was a component built into OCP, without an associated ClusterVersion capability knob in 4.y.
- The component maintainers try to create a new ClusterVersion capability for their component in 4.(y+1).
- But they missed a manifest! The PrometheusRule gets the capability annotation in 4.(y+2).
Folks with the capability disabled on 4.(y+1) will have it implicitly enabled upon updating to 4.(y+2), because we don't support having a half-enabled capability. This situation doesn't come up that often, but it has happened. If you're willing to have the occasional cluster ambiguously running a partial capability, you can probably ignore it.
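To make the last workaround above concrete, a rough sketch of the hard-coded transition logic, under the same caveats given above (a hypothetical 4.20 removal of `DeploymentConfig` from `vCurrent`); version plumbing is simplified and the function is illustrative only.

```go
package capabilities

import (
	configv1 "github.com/openshift/api/config/v1"
)

// withImplicitDeploymentConfig hard-codes the "updating from 4.19 always keeps
// DeploymentConfig" rule: if the hosted cluster is moving across the
// hypothetical 4.19 -> 4.20 boundary, DeploymentConfig is appended to the
// additional enabled capabilities so it stays enabled even though it is
// assumed to have left vCurrent.
func withImplicitDeploymentConfig(fromMinor, toMinor int, additional []configv1.ClusterVersionCapability) []configv1.ClusterVersionCapability {
	if fromMinor > 19 || toMinor < 20 {
		return additional // not crossing the boundary, nothing to do
	}
	for _, c := range additional {
		if c == configv1.ClusterVersionCapabilityDeploymentConfig {
			return additional // already present
		}
	}
	return append(additional, configv1.ClusterVersionCapabilityDeploymentConfig)
}
```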
Thank you both for your thoughtful answers.
> The functionality was tech-preview in 4.17. So a GA 4.17 cluster updating to 4.18 with the None capability set would have not had the OperatorLifecycleManagerV1 capability enabled. You don't currently allow anything like a None capability set, so this case isn't relevant for you yet.
This EP proposes we start setting `BaselineCapabilitySet` to `None`, so this would become relevant...
> Clusters updating from 4.19 to 4.20 in vCurrent will implicitly enable the capability. If you leave the cluster in vCurrent without implicitly enabling the capability, the OpenShift API server says "hmm, I guess I reject this DeploymentConfig API request..." and all the DeploymentConfig-based Pods in the cluster get garbage collected, or something.
Given that the CVO in Hypershift (as well as Hypershift itself) currently doesn't handle implicitly enabled capabilities, doesn't it mean that we would already be in the state you describe if the DeploymentConfig capability gets removed from `vCurrent` today, even without the changes proposed in this EP?
I'm still processing the suggestions, will probably have more questions once I have fully digested each.
I'm still processing the suggestions, will probably have more questions once I have fully digested each.
> Given that the CVO in Hypershift (as well as Hypershift itself) currently doesn't handle implicitly enabled capabilities, doesn't it mean that we would already be in the state you describe if the DeploymentConfig capability gets removed from `vCurrent` today, even without the changes proposed in this EP?
Yikes, you're right! I've warned the workloads folks here and here.
I've added a section about this under "Risks and Mitigations": https://github.com/openshift/enhancements/pull/1729/files#diff-36fbf863a674d8de905f7aa65cdb6158ec8c27b70d2fe5eed4272ca0a964a730R233
I'm still uncertain about what the correct approach here is. Would be great if you could give us your input @deads2k.
Just wiring threads together, David's response seems to be over here.
* Cluster instance admins can create clusters with the Image Registry disabled
* Cluster instance admins can re-enable the Image Registry without recreating the cluster
* Components in the hosted control plane are not impacted by the Image Registry absence
* Image Registry related assets such as workloads, storage account and pull-secrets are not provisioned in clusters without the Image Registry
We should be clear about the distinction between the Image Registry Capability and an Image Registry instance throughout this enhancement.
For example, for the goals:
Suggested wording:

* Cluster instance admins can create clusters with the Image Registry Capability disabled
* Cluster instance admins can enable the Image Registry Capability without recreating the cluster
* Components in the hosted control plane are not impacted by the absence of the Image Registry Capability.
* Assets comprising the Image Registry Capability, such as workloads, storage accounts, and pull-secrets, are not provisioned in clusters without the Image Registry Capability.
I agree about being more clear about component vs capability, thanks for pointing that out. In some of your suggestions I did mean to refer to the component though, so I incorporated it partially. PTAL.
See [Hypershift docs][hypershift-personas] for a definition of the personas used below.

* As a cluster service consumer, I want to provision hosted control planes and clusters without the Image Registry, so that my hosted clusters do not contain
Suggested change: "clusters without the Image Registry, so that my hosted clusters do not contain" → "clusters without the Image Registry Capability, so that my hosted clusters do not contain"
I really do mean the component here, not the capability. Is there anything else I can do to improve clarity here? 🤔
> ### API Extensions
fyi, I've already created the API changes in openshift/hypershift#5415
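For readers without the linked PR open, a hypothetical sketch of what the HostedCluster addition could look like, mirroring the `HostedCluster.spec.disabledCapabilities` field referenced earlier in this enhancement; the authoritative definition lives in openshift/hypershift#5415 and may differ.

```go
import configv1 "github.com/openshift/api/config/v1"

// Hypothetical excerpt, not the merged HyperShift API; see openshift/hypershift#5415.
type HostedClusterSpec struct {
	// ... existing fields elided ...

	// DisabledCapabilities lists optional cluster capabilities that the hosted
	// cluster is created without. A capability can later be removed from this
	// list (re-enabling it), but entries cannot be added after creation.
	// +optional
	DisabledCapabilities []configv1.ClusterVersionCapability `json:"disabledCapabilities,omitempty"`
}
```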
@flavianmissi: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.