Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⚠ Fix custom defaulter: avoid deleting unknown fields (zero change patch) #2982

Merged

Conversation

trasc
Copy link
Contributor

@trasc trasc commented Oct 14, 2024

When creating a patch for the custom defaulter, don't return the removals generated by object decoding with the local scheme.

Otherwise, the patch will include a remove instruction for fields that the go type doesn't contain.

This could happen when building a webhook for a resource that belongs to another project (including k8s types).

This PR stops the defaulter from pruning the fields that are not recognized in the local scheme.
When doing this we are creating two patches one before and one after the custom defaulter call and only send back the remove operations that are not present in both patches.

The user can opt-out from this new behavior by specifying the DefaulterRemoveUnknownFields option when calling WithCustomDefaulter.

(This is a rework of #2931)

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 14, 2024
@trasc
Copy link
Contributor Author

trasc commented Oct 14, 2024

/cc @alculquicondor
/cc @sbueringer
/cc @alvaroaleman

@alculquicondor
Copy link
Contributor

For context to c-r maintainers, we have been discussing what alternatives we have for a generic defaulter that doesn't drop unknown fields. Here is the thread for additional context kubernetes-sigs/kueue#3194 (comment)

Taking @mimowo's words, we would be happy to first merge a solution in kueue and let it mature there before moving it here. But we would appreciate your input to the approaches.

@alvaroaleman
Copy link
Member

@alculquicondor @trasc @mimowo We (Stefan, Vince and I) had some discussion around #2932 vs this and we landed at this here being better but we aren't 100% sure if its compatible in all cases and currently don't have time to spend on that question, which is why we would like an opt-out for what is suggested here (but on by default).

Does that sound reasonable to you all?

@trasc
Copy link
Contributor Author

trasc commented Oct 23, 2024

@alculquicondor @trasc @mimowo We (Stefan, Vince and I) had some discussion around #2932 vs this and we landed at this here being better but we aren't 100% sure if its compatible in all cases and currently don't have time to spend on that question, which is why we would like an opt-out for what is suggested here (but on by default).

Does that sound reasonable to you all?

Sure, I'll ping you when I have something in that direction.

@alculquicondor
Copy link
Contributor

Thanks for considering the options!

Do you have any particular guideline on how the opt-out mechanism could look like? Should we have a different method in the factory? Or should the method WithDefaulter accept some options? Any other suggestion?

Other than that, I hope the overhead of the approach is not significant. Fortunately, a new encoding is coming to k/k CRDs https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4222-cbor-serializer#motivation

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 24, 2024
@trasc
Copy link
Contributor Author

trasc commented Oct 24, 2024

I think

... should the method WithDefaulter accept some options ...

it leave room to maybe add an additional option that removes the add zero values.

@trasc
Copy link
Contributor Author

trasc commented Oct 24, 2024

@alvaroaleman please have a look at my DefaulterPreserveUnknownFields proposal when you have some time.

Also please let me know if we can do something about the pull-controller-runtime-apidiff check,

changed from 
func(*k8s.io/apimachinery/pkg/runtime.Scheme, k8s.io/apimachinery/pkg/runtime.Object, CustomDefaulter) *Webhook to 
func(*k8s.io/apimachinery/pkg/runtime.Scheme, k8s.io/apimachinery/pkg/runtime.Object, CustomDefaulter, ...defaulterOption) *Webhook

is a flase positive in my opinion, as you don't need to actually do something to old code in order to use the new version.

@trasc trasc force-pushed the dont-drop-missing-in-scheme branch from 72ac9bd to e652815 Compare October 25, 2024 06:51
@trasc trasc force-pushed the dont-drop-missing-in-scheme branch from 4a124c8 to ab4e8c9 Compare October 25, 2024 18:34
@alculquicondor
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 25, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: dc8f633f86d5467de1824324a9004cf5c10d3d61

@alvaroaleman alvaroaleman added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 26, 2024
@trasc trasc force-pushed the dont-drop-missing-in-scheme branch from b5419bf to fb7b6cd Compare November 5, 2024 19:58
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 5, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 5, 2024
@trasc
Copy link
Contributor Author

trasc commented Nov 5, 2024

Rebased

@k8s-ci-robot
Copy link
Contributor

@trasc: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-controller-runtime-apidiff fb7b6cd link false /test pull-controller-runtime-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sbueringer
Copy link
Member

Thx!

/lgtm
/approve
/hold cancel

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 5, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 7302a762ff601db57984c8fa33072a6e68119c8f

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, sbueringer, trasc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alvaroaleman,sbueringer]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 48ec3b7 into kubernetes-sigs:main Nov 5, 2024
8 of 9 checks passed
@trasc trasc deleted the dont-drop-missing-in-scheme branch November 26, 2024 07:15
@sbueringer
Copy link
Member

sbueringer commented Jan 2, 2025

@trasc @alculquicondor Hey folks,

just an early heads up already. I bumped Cluster API to CR main branch and I'm now getting panics from the webhook: https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api/11633/pull-cluster-api-e2e-blocking-main/1874860029216034816/artifacts/clusters/bootstrap/logs/capd-system/capd-controller-manager/capd-controller-manager-5b59b9b76-4kmz2/manager.log

Looks like jsonpatch.Operation is not always hashable

E0102 17:01:30.424172       1 <autogenerated>:1] "Observed a panic" logger="admission" webhookGroup="infrastructure.cluster.x-k8s.io" webhookKind="DockerClusterTemplate" DockerClusterTemplate="quick-start-zwk5pp/quick-start-cluster" namespace="quick-start-zwk5pp" name="quick-start-cluster" resource={"group":"infrastructure.cluster.x-k8s.io","version":"v1beta1","resource":"dockerclustertemplates"} user="kubernetes-admin" requestID="39810ae0-c339-4d68-a8cf-c007f0215a5b" panic="runtime error: hash of unhashable type map[string]interface {}" panicGoValue="\"hash of unhashable type map[string]interface {}\"" stacktrace=<
	goroutine 515 [running]:
	k8s.io/apimachinery/pkg/util/runtime.logPanic({0x259abe0, 0xc000822690}, {0x1ee0d80, 0xc00086cb10})
		k8s.io/[email protected]/pkg/util/runtime/runtime.go:107 +0xbc
	sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle.func1()
		sigs.k8s.io/[email protected]/pkg/webhook/admission/webhook.go:163 +0x103
	panic({0x1ee0d80?, 0xc00086cb10?})
		runtime/panic.go:785 +0x132
	type:.hash.gomodules.xyz/jsonpatch/v2.Operation(0xc0007ea7c8, 0x5aa77e?)
		<autogenerated>:1 +0x45
	k8s.io/apimachinery/pkg/util/sets.Set[...].Has(...)
		k8s.io/[email protected]/pkg/util/sets/set.go:78
	sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*defaulterForType).dropSchemeRemovals.func3({{0x22158db, 0x3}, {0xc0003d34d0, 0x28}, {0x1ebeba0, 0xc000aa9500}})
		sigs.k8s.io/[email protected]/pkg/webhook/admission/defaulter_custom.go:158 +0x65
	slices.IndexFunc[...](...)
		slices/slices.go:109
	slices.DeleteFunc[...]({0xc000740a80?, 0x8, 0x8}, 0xc0007eab68)
		slices/slices.go:237 +0xe3
	sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*defaulterForType).dropSchemeRemovals(_, {{0xc000740a80, 0x8, 0x8}, {{0x0, 0x0}, 0x1, 0x0, {0x0, 0x0, ...}, ...}}, ...)
		sigs.k8s.io/[email protected]/pkg/webhook/admission/defaulter_custom.go:157 +0x4ac
	sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*defaulterForType).Handle(_, {_, _}, {{{0xc0003d2ab0, 0x24}, {{0xc000e0bc20, 0x1f}, {0xc000a75510, 0x7}, {0xc000c10780, ...}}, ...}})
		sigs.k8s.io/[email protected]/pkg/webhook/admission/defaulter_custom.go:129 +0x42e

@sbueringer
Copy link
Member

Not sure what the most efficient way is to compare jsonpatch.Operations (the straightforward option would be to call the Json method and compare that instead

@alculquicondor
Copy link
Contributor

I don't think we need to compare the whole thing, just the combination of operation and path.
I would be surprised if jsonpatch.CreatePatch generates more than one entry for a path, even.

Here is the type, for reference https://github.com/gomodules/jsonpatch/blob/e2d6410c0299d49b797482dc765a3b49ecc8a4b6/v3/jsonpatch.go#L17-L21

Should we revert for now?

cc @mbobrovskyi @mimowo

@mbobrovskyi
Copy link
Contributor

IIUC, the JSON Patch remove operation shouldn't include a value. I'm curious about the specific request needed to reproduce this scenario.

@mbobrovskyi
Copy link
Contributor

I think I've identified the problem. We're comparing all patches, which makes it possible to encounter a non-hashable value.

Created a fix #3056.

@sbueringer
Copy link
Member

Should we revert for now?

I think we're fine with the fix (#3056 should cover it)

I think I've identified the problem. We're comparing all patches, which makes it possible to encounter a non-hashable value.

Looks perfect, thank you!

@sbueringer
Copy link
Member

Quick feedback. The panic is fixed 🎉

I'm just hitting another issue in Cluster API. Not sure yet if that is something that we should care about in CR

API field:

	// imageTag allows to specify a tag for the image.
	// +optional
	ImageTag string `json:"imageTag,omitempty"`

Through a combination of unexpected behavior, we're trying to set this field to null via one of our CAPI controllers:

  "local" : {
    "imageTag" : null
  }

Before this change: our defaulting/mutating webhook generated this patch operation:

Operation: remove
Path: /spec/kubeadmConfigSpec/clusterConfiguration/dns/imageTag

=> The result was that the field was removed before CR validation

After this change: this patch operation is dropped by dropSchemeRemovals
=> The result is that imageTag stays null
=> Because the field is not nullable we end up with

failed to create KubeadmControlPlane.controlplane.cluster.x-k8s.io:
        FieldValueTypeInvalid: spec.kubeadmConfigSpec.clusterConfiguration.etcd.local.imageTag:
        Invalid value: "null": spec.kubeadmConfigSpec.clusterConfiguration.etcd.local.imageTag in body must be of type string

@alculquicondor
Copy link
Contributor

It looks like WAI, from the name of the defaulter option removeUnknownOrOmitableFields.

It's a bug in the CAPI controller that happened to be masked by the webhook implementation details.
I wonder why the CAPI controller sets null, tho. Doesn't it use the go types that are marked with omitempty?

@sbueringer
Copy link
Member

sbueringer commented Jan 4, 2025

I agree. The behavior of the defaulter is exactly as expected.

CAPI has a controller that has to work with unstructured and in some cases it sets the field (unintentionally) to nil instead of an empty string.

I'm only wondering how many folks we break that are in a similar situation as CAPI. CAPI itself is not a problem, I can deal with this there by setting the option. I think I have to set it because we can't make a change like this within an API version in CAPI (there are many tools/products built on top of CAPI and they shouldn't suddenly start failing with issues like this). If I don't set the option the behavior of our API changes. Values that have been (unintentionally) valid before are suddenly invalid.

@alculquicondor
Copy link
Contributor

We did make the change opt-out. I think it's reasonable to make it opt-in for a few releases, have a deprecation note and eventually remove it.

@sbueringer
Copy link
Member

sbueringer commented Jan 8, 2025

Talked to Alvaro. Let's keep it as opt-out. We'll document it as "breaking change" in the release notes, which should increase the chances that folks notice the change (folks that don't read release notes also won't notice if we change the opt-in/out behavior over a few releases, it would just hit them a bit later).

In general I would probably like to keep the option permanently. I don't want to force folks to fork the custom defaulter just because they want to keep the behavior of their API entirely stable.

@sbueringer sbueringer changed the title 🐛 Fix custom defaulter: avoid deleting unknown fields (zero change patch) ⚠ Fix custom defaulter: avoid deleting unknown fields (zero change patch) Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants