etcd context deadline exceeded - sensu backend not connecting to etcd #9

Open

dcharleston opened this issue May 11, 2021 · 13 comments · May be fixed by #10

@dcharleston

I'm following the readme and using all default settings, running locally on minikube. The sensu-backend pod repeatedly fails because the readiness check for the backend's /health endpoint never passes. It returns:

{
    "Alarms": null,
    "ClusterHealth": [
        {
            "MemberID": 10276657743932975437,
            "MemberIDHex": "8e9e05c52164694d",
            "Name": "sensu-etcd-0",
            "Err": "context deadline exceeded",
            "Healthy": false
        }
    ],
    "Header": {
        "cluster_id": 14841639068965178418,
        "member_id": 10276657743932975437,
        "raft_term": 2
    }
}

etcd cluster health comes back as healthy from both the etcd and the sensu-backend containers:

/ # ETCDCTL_API=3 etcdctl --endpoints "http://sensu-etcd-0.sensu-etcd.sensu-example.svc.cluster.local:2379" endpoint health
http://sensu-etcd-0.sensu-etcd.sensu-example.svc.cluster.local:2379 is healthy: successfully committed proposal: took = 1.554986ms

The following errors appear in the sensu-backend logs:

{"level":"warn","ts":"2021-05-11T23:18:27.057Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-69922e1f-6460-409b-99ae-ede3c1ab2c80/sensu-etcd-0:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
....
....
....
{"component":"store","level":"info","members":[{"ID":10276657743932975437,"name":"sensu-etcd-0","peerURLs":["http://localhost:2380"],"clientURLs":["http://sensu-etcd-0:2379"]}],"msg":"retrieved cluster members","time":"2021-05-11T23:27:28Z"}
{"cache_version":"v1","component":"cache","level":"debug","msg":"rebuilding the cache for resource type *v2.Silenced","time":"2021-05-11T23:27:29Z"}
{"cache_version":"v1","component":"cache","level":"debug","msg":"rebuilding the cache for resource type *v2.Namespace","time":"2021-05-11T23:27:29Z"}
{"cache_version":"v2","component":"cache","level":"debug","msg":"rebuilding the cache for resource type *v3.EntityConfig","time":"2021-05-11T23:27:29Z"}
{"cache_version":"v2","component":"cache","level":"debug","msg":"rebuilding the cache for resource type *v3.EntityConfig","time":"2021-05-11T23:27:29Z"}
{"backend_id":"1d9a5640-eba9-4ee9-89a9-a8c25ff09831","component":"metricsd","level":"debug","msg":"metricsd heartbeat","name":"entity_metrics","time":"2021-05-11T23:27:29Z"}
{"backend_id":"1d9a5640-eba9-4ee9-89a9-a8c25ff09831","component":"metricsd","level":"debug","msg":"metricsd heartbeat","name":"cluster_metrics","time":"2021-05-11T23:27:30Z"}
{"component":"metricsd","level":"info","msg":"refreshing metrics suite on this backend","name":"entity_metrics","time":"2021-05-11T23:27:31Z"}
{"level":"warn","ts":"2021-05-11T23:27:31.672Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-df513939-a8df-41d5-a64b-ae6bb7bee084/sensu-etcd-0:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"component":"store","health":{"MemberID":10276657743932975437,"MemberIDHex":"8e9e05c52164694d","Name":"sensu-etcd-0","Err":"context deadline exceeded","Healthy":false},"level":"info","msg":"cluster member health","time":"2021-05-11T23:27:31Z"}
@tattwei46

Hi, is anyone taking a look at the above issue? I'm also having the same problem.

@rivlinpereira

same issue

@jspaleta
Contributor

jspaleta commented Jul 5, 2022

Okay, here's the underlying issue as I see it in my minikube environment running on my Fedora Linux system.
The sensu-backend readinessProbe is failing in a weird way because of not-yet-fully-supported IPv6 in minikube.

It looks like minikube is letting the sensu-backend bind its TCP API to IPv6 localhost port 8080 instead of IPv4 port 8080, and there doesn't seem to be an obvious way to prevent minikube from allowing this to happen.
Here's what it looks like from inside the sensu-backend-0 container running under my minikube:

$ netstat -tlpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:6060          0.0.0.0:*               LISTEN      1/sensu-backend
tcp        0      0 127.0.0.1:3030          0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:3031          0.0.0.0:*               LISTEN      -
tcp        0      0 :::8081                 :::*                    LISTEN      1/sensu-backend
tcp        0      0 :::8080                 :::*                    LISTEN      1/sensu-backend
tcp        0      0 :::3000                 :::*                    LISTEN      1/sensu-backend

Those last three services are listening on IPv6, and that's definitely not good.

The k8s configurations provided in this repo assume IPv4 will be used in the pods. The sensu-backend readinessProbe uses the busybox-provided wget in an Alpine container, which is not IPv6 compatible.

We need to either figure out a way to configure minikube so it doesn't let that happen, or figure out a way to tell the sensu-backend to explicitly bind on IPv4 localhost. A rough sketch of the second option follows.
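
For the IPv4-bind option, here's a minimal sketch of what the StatefulSet container spec could look like, assuming the documented sensu-backend api-listen-address, agent-host, and dashboard-host settings; the layout is illustrative, not this repo's actual manifest:

containers:
  - name: sensu-backend
    command: ["sensu-backend", "start"]
    args:
      - --api-listen-address=0.0.0.0:8080   # HTTP API on the IPv4 wildcard (covers 127.0.0.1 for the wget probe)
      - --agent-host=0.0.0.0                # agent websocket listener (port 8081) on IPv4
      - --dashboard-host=0.0.0.0            # web UI (port 3000) on IPv4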

@jspaleta
Contributor

jspaleta commented Jul 6, 2022

Turns out this is a problem with the sensu-backend readinessProbe settings. The settings were too aggressive for default minikube resource provisioning, and probes were being started faster than they were timing out, which caused the problem.

Please test PR #10 and comment there on the potential fix. The rough shape of the change is sketched below.
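
For context, this is roughly what a relaxed probe could look like; the concrete values in PR #10 may differ, and the exec command is only an illustration of the wget-based check described above:

readinessProbe:
  exec:
    command: ["wget", "-q", "-O-", "http://127.0.0.1:8080/health"]
  initialDelaySeconds: 30   # give the backend time to finish start-up on slow hosts
  periodSeconds: 10
  timeoutSeconds: 5         # keep the timeout shorter than the period so probes don't pile up
  failureThreshold: 6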

@jspaleta jspaleta linked a pull request Jul 6, 2022 that will close this issue
@mvthul

mvthul commented Aug 11, 2022

Did anyone get this working?

@jspaleta
Contributor

@mvthul
I believe I have a fix for this, and I have an open PR for it; see the previous comment. I just need someone experiencing the problem to test my proposed fix and make sure it works for them.

@mvthul

mvthul commented Aug 11, 2022

I applied your changes as far as I could tell; everything is green and running, but the context deadline error still appears in the logs. When I log in to Sensu there is a red bar popping up, and if I click details I see "context deadline exceeded" under ETCD. I've tried so many things to fix it, and tried so many other helm charts and scripts. Nothing seems to work with version 6+.

@jspaleta
Contributor

The specific changes needed to solve the problem may require system-specific changes to the configuration... let me explain.

There are timeouts configured for the readiness probes, and if the system running minikube is resource-poor, then those configurations will be too aggressive and the readiness probes will fall over because the underlying service didn't get enough CPU cycles to complete the start-up process.

The PR I put together changes these settings enough that it works on my laptop running minikube. But the nature of the problem is such that even though it works for me, it might fail for someone else with tighter system resources.

There might not be a one-size-fits-all solution here, because we definitely still want the readiness probes to give up at a reasonable point. For something like Google's or Amazon's managed service, that reasonable point of failure comes much sooner than for any local minikube deployment... because of the available resources.

If, as a minikube user, you're still having this specific problem, you may need to further adjust the readinessProbe settings to give your minikube deployment more time to provision everything. A quick way to confirm that the probe is what's failing is shown below.
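
These commands should surface the probe failures directly; the pod name and namespace are taken from earlier in this issue, so adjust them if your deployment differs:

kubectl -n sensu-example describe pod sensu-backend-0 | grep -A2 'Readiness probe failed'
kubectl -n sensu-example get events --field-selector reason=Unhealthy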

@mvthul

mvthul commented Aug 11, 2022 via email

@jspaleta
Contributor

jspaleta commented Aug 11, 2022

Okay, well this isn't confined to minikube... this needs to be reinvestigated.

Azure AKS isn't a service I've tested against yet, but I'll look into it.

@jspaleta jspaleta self-assigned this Aug 11, 2022
@jspaleta
Contributor

@mvthul
Okay, so for me on minikube, the context deadline exceeded error is most likely due to slow disk access to the virtualized volumes. etcd is sensitive to slow disk performance for its backing store.

For me, the context deadline exceeded messages are intermittent and aren't causing a problem for the intended purpose of kicking the tires in minikube; everything spins up and I'm able to use the Sensu dashboard. If you want to sanity-check etcd's disk performance, see the command below.
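
etcdctl has a built-in performance check you can run from inside the etcd container; the endpoint is the same one used earlier in this issue:

/ # ETCDCTL_API=3 etcdctl --endpoints "http://sensu-etcd-0.sensu-etcd.sensu-example.svc.cluster.local:2379" check perf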

For Azure AKS, you might need to change the storage class associated with the sensu-etcd persistent volume. I don't know what storageClass options AKS has out of the gate, but you'll want a dedicated SSD for the sensu-etcd volume. A sketch of what that could look like follows.
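
An illustrative volumeClaimTemplate for the sensu-etcd StatefulSet; the field names are standard Kubernetes, but the claim name and the managed-csi-premium class are assumptions about what your cluster exposes, so check kubectl get storageclass first:

volumeClaimTemplates:
  - metadata:
      name: sensu-etcd-data                      # hypothetical claim name; match your StatefulSet
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: managed-csi-premium      # premium SSD-backed class commonly available on AKS
      resources:
        requests:
          storage: 10Gi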

@WladyX

WladyX commented May 9, 2023

I was experiencing the same issue on Tanzu Kubernetes; the PR seems to work as expected. I think you should merge it.

@sensu-discourse

This issue has been mentioned on Sensu Community. There might be relevant details there:

https://discourse.sensu.io/t/issues-installing-sensu-6-10-on-eks/3137/2
