etcd context deadline exceeded - sensu backend not connecting to etcd #9

Open

dcharleston opened this issue May 11, 2021 · 13 comments · May be fixed by #10

@dcharleston

I'm following the readme and using all default settings, running locally on minikube. The sensu-backend pod repeatedly fails because the readiness check for the backend's /health endpoint never passes. It returns:

{
    "Alarms": null,
    "ClusterHealth": [
        {
            "MemberID": 10276657743932975437,
            "MemberIDHex": "8e9e05c52164694d",
            "Name": "sensu-etcd-0",
            "Err": "context deadline exceeded",
            "Healthy": false
        }
    ],
    "Header": {
        "cluster_id": 14841639068965178418,
        "member_id": 10276657743932975437,
        "raft_term": 2
    }
}

etcd cluster health comes back as healthy from both the etcd and the sensu-backend containers:

/ # ETCDCTL_API=3 etcdctl --endpoints "http://sensu-etcd-0.sensu-etcd.sensu-example.svc.cluster.local:2379" endpoint health
http://sensu-etcd-0.sensu-etcd.sensu-example.svc.cluster.local:2379 is healthy: successfully committed proposal: took = 1.554986ms

The following errors appear in the sensu-backend logs:

{"level":"warn","ts":"2021-05-11T23:18:27.057Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-69922e1f-6460-409b-99ae-ede3c1ab2c80/sensu-etcd-0:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
....
....
....
{"component":"store","level":"info","members":[{"ID":10276657743932975437,"name":"sensu-etcd-0","peerURLs":["http://localhost:2380"],"clientURLs":["http://sensu-etcd-0:2379"]}],"msg":"retrieved cluster members","time":"2021-05-11T23:27:28Z"}
{"cache_version":"v1","component":"cache","level":"debug","msg":"rebuilding the cache for resource type *v2.Silenced","time":"2021-05-11T23:27:29Z"}
{"cache_version":"v1","component":"cache","level":"debug","msg":"rebuilding the cache for resource type *v2.Namespace","time":"2021-05-11T23:27:29Z"}
{"cache_version":"v2","component":"cache","level":"debug","msg":"rebuilding the cache for resource type *v3.EntityConfig","time":"2021-05-11T23:27:29Z"}
{"cache_version":"v2","component":"cache","level":"debug","msg":"rebuilding the cache for resource type *v3.EntityConfig","time":"2021-05-11T23:27:29Z"}
{"backend_id":"1d9a5640-eba9-4ee9-89a9-a8c25ff09831","component":"metricsd","level":"debug","msg":"metricsd heartbeat","name":"entity_metrics","time":"2021-05-11T23:27:29Z"}
{"backend_id":"1d9a5640-eba9-4ee9-89a9-a8c25ff09831","component":"metricsd","level":"debug","msg":"metricsd heartbeat","name":"cluster_metrics","time":"2021-05-11T23:27:30Z"}
{"component":"metricsd","level":"info","msg":"refreshing metrics suite on this backend","name":"entity_metrics","time":"2021-05-11T23:27:31Z"}
{"level":"warn","ts":"2021-05-11T23:27:31.672Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-df513939-a8df-41d5-a64b-ae6bb7bee084/sensu-etcd-0:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"component":"store","health":{"MemberID":10276657743932975437,"MemberIDHex":"8e9e05c52164694d","Name":"sensu-etcd-0","Err":"context deadline exceeded","Healthy":false},"level":"info","msg":"cluster member health","time":"2021-05-11T23:27:31Z"}
@tattwei46

Hi, is anyone taking a look at the above issue? I'm also having the same problem.

@rivlinpereira

same issue

@jspaleta
Contributor

jspaleta commented Jul 5, 2022

Okay, here's the underlying issue as I see it in my minikube environment running on my Fedora Linux system.
The sensu-backend readinessProbe is failing in a weird way because of not-yet-fully-supported IPv6 in minikube.

It looks like minikube is letting the sensu-backend bind its TCP API to IPv6 localhost port 8080 instead of IPv4 port 8080, and there doesn't seem to be an obvious way to prevent minikube from allowing this to happen.
Here's what it looks like from inside the sensu-backend-0 container running under my minikube:

$ netstat -tlpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:6060          0.0.0.0:*               LISTEN      1/sensu-backend
tcp        0      0 127.0.0.1:3030          0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:3031          0.0.0.0:*               LISTEN      -
tcp        0      0 :::8081                 :::*                    LISTEN      1/sensu-backend
tcp        0      0 :::8080                 :::*                    LISTEN      1/sensu-backend
tcp        0      0 :::3000                 :::*                    LISTEN      1/sensu-backend

Those last three services are listening on IPv6, and that's definitely not good.

The k8s configurations provided in this repo assume IPv4 will be used in the pods. The sensu-backend readinessProbe uses the busybox-provided wget in an Alpine container, which is not IPv6 compatible.

We need to either figure out a way to configure minikube so it doesn't let that happen, or figure out a way to tell the sensu-backend to explicitly bind on IPv4 localhost. A rough sketch of the second option follows.
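
For the IPv4-bind option, here's a minimal sketch of what the StatefulSet container spec could look like, assuming the documented sensu-backend api-listen-address, agent-host, and dashboard-host settings; the layout is illustrative, not this repo's actual manifest:

containers:
  - name: sensu-backend
    command: ["sensu-backend", "start"]
    args:
      - --api-listen-address=0.0.0.0:8080   # HTTP API on the IPv4 wildcard (covers 127.0.0.1 for the wget probe)
      - --agent-host=0.0.0.0                # agent websocket listener (port 8081) on IPv4
      - --dashboard-host=0.0.0.0            # web UI (port 3000) on IPv4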

@jspaleta
Contributor

jspaleta commented Jul 6, 2022

Turns out this is a problem with the sensu-backend readinessProbe settings. The settings were too aggressive for default minikube resource provisioning, and probes were being started faster than they were timing out, which caused the problem.

Please test PR #10 and comment there on the potential fix. The rough shape of the change is sketched below.
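
For context, this is roughly what a relaxed probe could look like; the concrete values in PR #10 may differ, and the exec command is only an illustration of the wget-based check described above:

readinessProbe:
  exec:
    command: ["wget", "-q", "-O-", "http://127.0.0.1:8080/health"]
  initialDelaySeconds: 30   # give the backend time to finish start-up on slow hosts
  periodSeconds: 10
  timeoutSeconds: 5         # keep the timeout shorter than the period so probes don't pile up
  failureThreshold: 6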

@jspaleta jspaleta linked a pull request Jul 6, 2022 that will close this issue
@mvthul

mvthul commented Aug 11, 2022

Did anyone get this working?

@jspaleta
Contributor

@mvthul
I believe I have a fix for this, and I have an open PR for it; see the previous comment. I just need someone experiencing the problem to test my proposed fix and make sure it works for them.

@mvthul

mvthul commented Aug 11, 2022

I applied your changes as far as I could tell; everything is green and running, but the context deadline error still appears in the logs. When I log in to Sensu there is a red bar popping up, and if I click details I see "context deadline exceeded" under ETCD. I've tried so many things to fix it, and tried so many other helm charts and scripts. Nothing seems to work with version 6+.

@jspaleta
Contributor

The specific changes needed to solve the problem may require system-specific changes to the configuration... let me explain.

There are timeouts configured for the readiness probes, and if the system running minikube is resource-poor, then those configurations will be too aggressive and the readiness probes will fall over because the underlying service didn't get enough CPU cycles to complete the start-up process.

The PR I put together changes these settings enough that it works on my laptop running minikube. But the nature of the problem is such that even though it works for me, it might fail for someone else with tighter system resources.

There might not be a one-size-fits-all solution here, because we definitely still want the readiness probes to give up at a reasonable point. For something like Google's or Amazon's managed service, that reasonable point of failure comes much sooner than for any local minikube deployment... because of the available resources.

If, as a minikube user, you're still having this specific problem, you may need to further adjust the readinessProbe settings to give your minikube deployment more time to provision everything. A quick way to confirm that the probe is what's failing is shown below.
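
These commands should surface the probe failures directly; the pod name and namespace are taken from earlier in this issue, so adjust them if your deployment differs:

kubectl -n sensu-example describe pod sensu-backend-0 | grep -A2 'Readiness probe failed'
kubectl -n sensu-example get events --field-selector reason=Unhealthy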

@mvthul

mvthul commented Aug 11, 2022 via email

@jspaleta
Contributor

jspaleta commented Aug 11, 2022

Okay, well this isn't confined to minikube... this needs to be reinvestigated.

Azure AKS isn't a service I've tested against yet, but I'll look into it.

@jspaleta jspaleta self-assigned this Aug 11, 2022
@jspaleta
Contributor

@mvthul
Okay, so for me on minikube, the context deadline exceeded error is most likely due to slow disk access to the virtualized volumes. etcd is sensitive to slow disk performance for its backing store.

For me, the context deadline exceeded messages are intermittent and aren't causing a problem for the intended purpose of kicking the tires in minikube; everything spins up and I'm able to use the Sensu dashboard. If you want to sanity-check etcd's disk performance, see the command below.
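
etcdctl has a built-in performance check you can run from inside the etcd container; the endpoint is the same one used earlier in this issue:

/ # ETCDCTL_API=3 etcdctl --endpoints "http://sensu-etcd-0.sensu-etcd.sensu-example.svc.cluster.local:2379" check perf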

For Azure AKS, you might need to change the storage class associated with the sensu-etcd persistent volume. I don't know what storageClass options AKS has out of the gate, but you'll want a dedicated SSD for the sensu-etcd volume. A sketch of what that could look like follows.
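
An illustrative volumeClaimTemplate for the sensu-etcd StatefulSet; the field names are standard Kubernetes, but the claim name and the managed-csi-premium class are assumptions about what your cluster exposes, so check kubectl get storageclass first:

volumeClaimTemplates:
  - metadata:
      name: sensu-etcd-data                      # hypothetical claim name; match your StatefulSet
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: managed-csi-premium      # premium SSD-backed class commonly available on AKS
      resources:
        requests:
          storage: 10Gi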

@WladyX

WladyX commented May 9, 2023

I was experiencing the same issue on Tanzu Kubernetes; the PR seems to work as expected. I think you should merge it.

@sensu-discourse

This issue has been mentioned on Sensu Community. There might be relevant details there:

https://discourse.sensu.io/t/issues-installing-sensu-6-10-on-eks/3137/2
