Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reconcilation problem after restore KamajiControlPlane from backup #83

Conversation

andreykont
Copy link

@andreykont andreykont commented Feb 22, 2024

My goal is restore managed cluster from backup.
I use Velero for restore.
First of all I recover etcd database, TenantControlPlane and other generic capi resources.
On last stage I have to recover KamajiControlPlane resource.

KamajiControlPlane gets "Ready" status, but I see error in kamaji control plane controller's log:

2024-02-21T13:54:48Z ERROR Reconciler error {"controller": "kamajicontrolplane", "controllerGroup": "controlplane.cluster.x-k8s.io", "controllerKind": "KamajiControlPlane", "KamajiControlPlane": {"name":"ocean-ocean-avkontya3-jvpinos","namespace":"ocean-ocean-avkontya3-jvpinos"}, "namespace": "ocean-ocean-avkontya3-jvpinos", "name": "ocean-ocean-avkontya3-jvpinos", "reconcileID": "2cece214-b52e-41fc-884a-cac3ebd28bd6", "error": "cannot create or update CA secret: missing Certificate value from *kamajiv1alpha1.TenantControlPlane CA", "errorVerbose": "missing Certificate value from *kamajiv1alpha1.TenantControlPlane CA\ngithub.com/clastix/cluster-api-control-plane-provider-kamaji/controllers.(*KamajiControlPlaneReconciler).createOrUpdateCertificateAuthor

Cause of problem:
In my case kamajiCA and capiCA secret names are equals. Also when I restore kcp with Velero, it doesn't restore secret OwnerReferences. Kamaji control plane detects it like it's own object and delete "ca.crt" and "ca.key" data in kamajiCA secret. On the second iteration it makes error, because ca.crt deleted.

Wanted behavior: KamajiControlPlane controller doesn't control kamajiCA secrets.

How I solved the error:
Our provider doesn't have to control KamajiCA secret.
I replaced condition that check who is owner of CA secret.
Now we control it by secret names.

@prometherion
Copy link
Member

I'm a bit confused @andreykont, isn't this a problem in Kamaji upstream rather than on the CAPI provider?

@andreykont
Copy link
Author

I'm a bit confused @andreykont, isn't this a problem in Kamaji upstream rather than on the CAPI provider?

Good day, @prometherion . Yes. I completely agree. Kamaji doesn't restore owner reference of it's CA secret. It's a mistake. But anyway it might take some period of time. I think capi provider has to control such situation (kamaji
and capiCA secret names are equal) to avoid racing. In addition, I think Kamaji project knows nothing about CAPI project and don't have to control it's CA secret name (to be not equals with CAPI project secret name).

@prometherion
Copy link
Member

Kamaji doesn't restore owner reference of it's CA secret

That's correct: https://github.com/clastix/kamaji/blob/eff68db336f3c161447e08a6733344260c10e12a/internal/resources/ca_certificate.go#L92-L110

This snippet shows we're just avoiding rewriting the secret content if the certificates are matching since Velero is not able to restore the Owner Reference.

But anyway it might take some period of time

It's Kubernetes, eventually consistent :) Jokes apart, we can wait for a few iterations before having everything reconciled properly, and I'd say Kamaji resources are our source of truth since the CAPI CP provider is almost dummy.

I think capi provider has to control such situation (kamaji
and capiCA secret names are equal) to avoid racing. In addition, I think Kamaji project knows nothing about CAPI project and don't have to control it's CA secret name (to be not equals with CAPI project secret name)

This is a bit problematic since we started Kamaji completely decoupled from Cluster API, and the CP provider is the one responsible for this, however, we stumbled upon a naming collision afterwards and had to patch in this way.


tl;dr; WDYT in fixing this in Kamaji, rather than in CAPI?

@andreykont
Copy link
Author

Hello, @prometherion.

Let me clarify the problem once more. After restoring kamaji tenant control plane’s secret with cert and key is restored, KamajiControlPlane Provider start reconcilation and adopts the secret changing keys in its map:

capiCA.Data = map[string][]byte{
corev1.TLSCertKey: crt,
corev1.TLSPrivateKeyKey: key,
}

Then it updates secret and returns to reconcile process and comes back to check secret once more.
As ownerReference is set by KamajiControlPlaneProvider line

return controllerutil.SetControllerReference(&kcp, capiCA, r.client.Scheme()) //nolint:wrapcheck

the secret is not valid for that check:

if len(capiCA.GetOwnerReferences()) > 0 && capiCA.OwnerReferences[0].Kind == "TenantControlPlane" {

and KamajiControlPlaneProvider tries to find ca.crt and ca.key in secret, which it patched during the previous reconcilation loop. As the result it logs errors as it is not able to find in TenantControlPlane secret ca.crt,

another solution is to add more checks there

if len(capiCA.GetOwnerReferences()) > 0 && capiCA.OwnerReferences[0].Kind == "TenantControlPlane" {

either by name equals or that secret is already patched for CAPI.

I think it should not be patched from the kamaji upstream as it is not aware of kamajiControlPlane provider. Moreover, i think it can be fixed in kamajiControlPlane provider.

If we come to CAPI Project to fix that it is going to break their agreement on how to prepare secrets i think. https://cluster-api.sigs.k8s.io/developer/architecture/controllers/cluster.html#secrets

@prometherion
Copy link
Member

I think it should not be patched from the kamaji upstream as it is not aware of kamajiControlPlane provider

True the fact Kamaji is not aware of the CAPI CP provider, but Owner/Controller References should be fixed upon first reconciliation, that's the reason why I'm suggesting fixing in Kamaji.

Once we have the OwnerReference in place, there's not bug in in the CP provider, if I understand correctly.

@andreykont
Copy link
Author

Hello, @prometherion.
Ok, I will open PR in Kamaji project to restore owner reference during reconciliation process.

@andreykont
Copy link
Author

Fixed in clastix/kamaji#427

@andreykont andreykont closed this Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants