From 54222b43e24a1f5317899c968424d7d9d6224cd1 Mon Sep 17 00:00:00 2001 From: craig Date: Tue, 17 Sep 2024 12:15:54 +0100 Subject: [PATCH 1/4] add dns record missing alert Signed-off-by: craig add orphan record mitigation doc fix lint improve alert query --- Makefile | 3 +- config/observability/kustomization.yaml | 3 +- .../rbac/ksm_clusterrole_patch.yaml | 1 + doc/user-guides/orphan-dns-records.md | 94 +++++++++++++++++++ examples/alerts/kustomization.yaml | 1 + examples/alerts/orphan_records.yaml | 22 +++++ 6 files changed, 121 insertions(+), 3 deletions(-) create mode 100644 doc/user-guides/orphan-dns-records.md create mode 100644 examples/alerts/orphan_records.yaml diff --git a/Makefile b/Makefile index 22c4f87be..6c237fcc7 100644 --- a/Makefile +++ b/Makefile @@ -353,11 +353,12 @@ run: generate fmt vet ## Run a controller from your host. docker-build: GIT_SHA=$(shell git rev-parse HEAD || echo "unknown") docker-build: DIRTY=$(shell $(PROJECT_PATH)/utils/check-git-dirty.sh || echo "unknown") docker-build: ## Build docker image with the manager. - $(CONTAINER_ENGINE) build \ + $(CONTAINER_ENGINE) build \ --build-arg QUAY_IMAGE_EXPIRY=$(QUAY_IMAGE_EXPIRY) \ --build-arg GIT_SHA=$(GIT_SHA) \ --build-arg DIRTY=$(DIRTY) \ --build-arg QUAY_IMAGE_EXPIRY=$(QUAY_IMAGE_EXPIRY) \ + --load \ -t $(IMG) . docker-push: ## Push docker image with the manager. diff --git a/config/observability/kustomization.yaml b/config/observability/kustomization.yaml index c1f894d18..daf19d080 100644 --- a/config/observability/kustomization.yaml +++ b/config/observability/kustomization.yaml @@ -3,8 +3,7 @@ kind: Kustomization resources: - github.com/prometheus-operator/kube-prometheus?ref=release-0.13 - - github.com/Kuadrant/gateway-api-state-metrics?ref=0.4.0 - - github.com/Kuadrant/gateway-api-state-metrics/config/examples/dashboards?ref=0.4.0 + - github.com/Kuadrant/gateway-api-state-metrics/config/kuadrant?ref=0.5.0 # To scrape istio metrics, 3 configurations are required: # 1. 
Envoy metrics directly from the istio ingress gateway pod - prometheus/monitors/pod-monitor-envoy.yaml diff --git a/config/observability/rbac/ksm_clusterrole_patch.yaml b/config/observability/rbac/ksm_clusterrole_patch.yaml index 8766bb16e..aa32b1206 100644 --- a/config/observability/rbac/ksm_clusterrole_patch.yaml +++ b/config/observability/rbac/ksm_clusterrole_patch.yaml @@ -34,6 +34,7 @@ - dnspolicies - ratelimitpolicies - authpolicies + - dnsrecords verbs: - list - watch diff --git a/doc/user-guides/orphan-dns-records.md b/doc/user-guides/orphan-dns-records.md new file mode 100644 index 000000000..8c654c699 --- /dev/null +++ b/doc/user-guides/orphan-dns-records.md @@ -0,0 +1,94 @@ +## Orphan DNS Records + +This document is focused around multi-cluster DNS where you have more than one instance of a gateway that shares a common hostname with other gateways and assumes you have the [observability](https://docs.kuadrant.io/0.10.0/kuadrant-operator/doc/observability/examples/) stack set up. + +### What is an orphan record? + +An orphan DNS record is a record or set of records that are owned by an instance of the DNS operator that no longer has a representation of those records on its cluster. + +### How do orphan records occur? + +Orphan records can occur when a `DNSRecord` resource (a resource that is created in response to a `DNSPolicy`) is deleted without allowing the owning controller time to clean up the associated records in the DNS provider. Generally in order for this to happen, you would need to force remove a `finalizer` from the `DNSRecord` resource, delete the kuadrant-system namespace directly or un-install kuadrant (delete the subscription if using OLM) without first cleaning up existing policies or delete a cluster entirely without first cleaning up the associated DNSPolicies. These are not common scenarios but when they do occur they can leave behind records in your DNS Provider which may point to IPs / Hosts that are no longer valid. 
+ + +### How do you spot an orphan record(s) exist? + +There is an a prometheus based based alert that we have created that uses some metrics exposed from the DNS components to spot this situation. If you have installed the alerts for Kuadrant under the examples folder, you will see in the alerts tab an alert called `PossibleOrphanedDNSRecords`. When this is firing it means there are likely to be orphaned records in your provider. + +### How do you get rid of an orphan record? + +To remove an Orphan Record we must first identify the owner of that record that is no longer aware of the record. To do this we need an existing DNSRecord in another cluster. + +Example: You have 2 clusters that each have a gateway and share a host `apps.example.com` and have setup a DNSPolicy for each gateway. On cluster 1 you remove the `kuadrant-system` namespace without first cleaning up existing DNSPolicies targeting the gateway in your `ingress-gateway` namespace. Now there are a set of records that were being managed for that gateway that have not been removed. +On cluster 2 the DNS Operator managing the existing DNSRecord in that cluster has a record of all owners of that dns name. +In prometheus alerts, it spots that the number of owners does not correlate to the number of DNSRecord resources and triggers an alert. +To remedy this rather than going to the DNS provider directly and trying to figure out which records to remove, you can instead follow the steps below. 
+ +1) Get the owner id of the DNSRecord on cluster 2 for the shared host + +``` +kubectl get dnsrecord somerecord -n my-gateway-ns -o=jsonpath='{.status.ownerID}' +``` + +2) Get all the owner ids + +``` +kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.status.domainOwners}' + +## output +["26aacm1z","49qn0wp7"] +``` + +3) Create a placeholder DNSRecord with a non-active ownerID + + +For each owner id returned, other than the owner id of the record we got earlier, whose records we want to remove, we need to create a DNSRecord resource and delete it. This will trigger the running operator in this cluster to clean up those records. + +``` +# this is one of the owner ids that is **not** the owner id of the existing dnsrecord on this cluster +export ownerID=26aacm1z + +export rootHost=$(kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.spec.rootHost}') + +# export a namespace with the aws credentials in it +export targetNS=kuadrant-system + +kubectl apply -f - < 0 + for: 5m + labels: + severity: warning + annotations: + summary: "The number of DNS Owners is greater than the number of records for root domain '{{ $labels.rootDomain }}'" + description: "This alert fires if the number of owners (controllers collaborating on a record set) is greater than the number of records.
This may mean a record has been left behind in the provider due to a failed delete" From 557415b0ae1c47c7a53a4ac4df2070280c781a26 Mon Sep 17 00:00:00 2001 From: Craig Brookes Date: Mon, 23 Sep 2024 10:27:52 +0100 Subject: [PATCH 2/4] Update doc/user-guides/orphan-dns-records.md Co-authored-by: Michael Nairn --- doc/user-guides/orphan-dns-records.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/user-guides/orphan-dns-records.md b/doc/user-guides/orphan-dns-records.md index 8c654c699..89446d111 100644 --- a/doc/user-guides/orphan-dns-records.md +++ b/doc/user-guides/orphan-dns-records.md @@ -13,7 +13,7 @@ Orphan records can occur when a `DNSRecord` resource (a resource that is created ### How do you spot an orphan record(s) exist? -There is an a prometheus based based alert that we have created that uses some metrics exposed from the DNS components to spot this situation. If you have installed the alerts for Kuadrant under the examples folder, you will see in the alerts tab an alert called `PossibleOrphanedDNSRecords`. When this is firing it means there are likely to be orphaned records in your provider. +There is a prometheus based alert that uses some metrics exposed from the DNS components to spot this situation. If you have installed the alerts for Kuadrant under the examples folder, you will see in the alerts tab an alert called `PossibleOrphanedDNSRecords`. When this is firing it means there are likely to be orphaned records in your provider. ### How do you get rid of an orphan record? 
From 55a0e51fd680c0b5ce6974934438ea79fff637d1 Mon Sep 17 00:00:00 2001 From: Craig Brookes Date: Mon, 23 Sep 2024 10:27:57 +0100 Subject: [PATCH 3/4] Update doc/user-guides/orphan-dns-records.md Co-authored-by: Michael Nairn --- doc/user-guides/orphan-dns-records.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/user-guides/orphan-dns-records.md b/doc/user-guides/orphan-dns-records.md index 89446d111..93042acd7 100644 --- a/doc/user-guides/orphan-dns-records.md +++ b/doc/user-guides/orphan-dns-records.md @@ -17,7 +17,7 @@ There is a prometheus based alert that uses some metrics exposed from the DNS co ### How do you get rid of an orphan record? -To remove an Orphan Record we must first identify the owner of that record that is no longer aware of the record. To do this we need an existing DNSRecord in another cluster. +To remove an Orphan Record we must first identify the owner that is no longer aware of the record. To do this we need an existing DNSRecord in another cluster. Example: You have 2 clusters that each have a gateway and share a host `apps.example.com` and have setup a DNSPolicy for each gateway. On cluster 1 you remove the `kuadrant-system` namespace without first cleaning up existing DNSPolicies targeting the gateway in your `ingress-gateway` namespace. Now there are a set of records that were being managed for that gateway that have not been removed. On cluster 2 the DNS Operator managing the existing DNSRecord in that cluster has a record of all owners of that dns name. 
From a7f8cdb8eeac75a4d710dc6a50a613a4e510b235 Mon Sep 17 00:00:00 2001 From: Craig Brookes Date: Mon, 23 Sep 2024 10:28:04 +0100 Subject: [PATCH 4/4] Update doc/user-guides/orphan-dns-records.md Co-authored-by: Michael Nairn --- doc/user-guides/orphan-dns-records.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/user-guides/orphan-dns-records.md b/doc/user-guides/orphan-dns-records.md index 93042acd7..9a44d2836 100644 --- a/doc/user-guides/orphan-dns-records.md +++ b/doc/user-guides/orphan-dns-records.md @@ -91,4 +91,4 @@ kubectl get dnsrecord.kuadrant.io somerecord -n my-gateway-ns -o=jsonpath='{.sta ``` -We should also see our alert eventually stop triggering also. +We should also see our alert eventually stop triggering.
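Steps 1 and 2 of the guide leave the reader to compare the two outputs by eye and pick out the owner ids that no longer have a live DNSRecord. A minimal sketch of that comparison follows. The helper name `stale_owner_ids` is hypothetical and not part of this patch, and the sample values are taken from the guide's example output, assuming step 1 returned `49qn0wp7` on cluster 2.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): print every owner id found in a
# domainOwners JSON array that differs from this cluster's own owner id.
#   $1: value of .status.ownerID      (step 1 of the guide)
#   $2: value of .status.domainOwners (step 2 of the guide)
stale_owner_ids() {
  # Strip brackets, quotes and spaces, put one id per line, then drop
  # the local owner id and any empty lines.
  printf '%s\n' "$2" | tr -d '[]" ' | tr ',' '\n' | grep -v -x -e "$1" -e ''
}

# Sample values from the guide; on a live cluster they would come from the
# two `kubectl get dnsrecord ... -o=jsonpath=...` commands in steps 1 and 2.
stale_owner_ids "49qn0wp7" '["26aacm1z","49qn0wp7"]'
# prints: 26aacm1z
```

Each id printed is a candidate value for `ownerID` in step 3. The sketch assumes owner ids contain no commas, quotes, or brackets, which holds for the ids shown in the guide but is not confirmed by the patch.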