Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional support for k8s liveness and rediness probes #30

Open
michaelyaakoby opened this issue Aug 31, 2020 · 12 comments
Open

Add optional support for k8s liveness and rediness probes #30

michaelyaakoby opened this issue Aug 31, 2020 · 12 comments

Comments

@michaelyaakoby
Copy link
Member

See https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-features.html#production-ready-kubernetes-probes for how its done in spring-boot's actuator.

In high level, pyctuator should support telling k8s when an application/service is ready to serve requests (i.e. readiness probe) and if the application is alive (this is the liveness probe) - this is different from the health status returned by /pyctuator/health for two reasons:

  1. it should return fast, so shouldn't include health checks of external resources
  2. it should reflect the k8s lifecycle as described in actuator's documentation (and in k8s docs of course)

While actuator allow to configure additional checks to these probes, it is suggested that initially we'll provide default implementation that users can choose to use.

@michaelyaakoby
Copy link
Member Author

Since if the Liveness State of an application is broken, Kubernetes will try to solve that problem by restarting the application instance, the “liveness” probe should not depend on health checks for external systems.
Otherwise, if an external system fails (e.g. a database, a Web API, an external cache) triggering the liveness to fail, Kubernetes might restart all application instances and create cascading failures.

Also, since Kubernetes will not route traffic to an instance of the application that it's "readiness state" is unready, checking external systems must be made carefully by the application developers.

The probes should return 200 if they the server is alive/ready and >=400 otherwise (or completely fail to repsonse).

For more details on these probes, see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/.

Therefore I suggest:

  1. Add two new endpoints under /pyctuator/health/probes', one for readinessand another forliveness`.
  2. By default all they do is returning 200 OK.
  3. Allow to register custom "readiness" checks which are functions that throw an exception to indicate not-ready.

WDYT?

@simoneggler
Copy link

there's another problem. when the registration with the admin server fails, pyctuator does not serve actuator endpoints at all. this behavior leads to cascading restarts since the readiness probe fails with connection refused.

@michaelyaakoby
Copy link
Member Author

I see,
Can you provide steps to reproduce?
Thanks

@simoneggler
Copy link

  1. point registration_url to an url that does not resolve (e.g. service in k8s cluster with no running pod)
  2. create instance of Pyctuator (with flask)
  3. flask only becomes ready after a timeout of more than two minutess (Failed registering with boot-admin, [Errno 110] Connection timed out)

expected behavoir: exception is thrown much earlier

@michaelyaakoby
Copy link
Member Author

Thanks, will look into this.

@michaelyaakoby
Copy link
Member Author

Actually, assuming you run your flask application in a k8s pod too, I'm thinking if this might be a behavior of k8s.
Can you try to curl to the same URL from the pod running the flask application? I wonder if you'll get an immediate answer.

@simoneggler
Copy link

Curl times out after two minutes...

@michaelyaakoby
Copy link
Member Author

Yes, I just tested it myself.
Not sure that in this case I can make pyctuator timeout earlier.

@simoneggler
Copy link

How about connecting in the background? The service may be healthy without admin registration....

@michaelyaakoby
Copy link
Member Author

I agree that you don't want to block an application from starting because the monitoring system is down.

Pyctuator is using the builtin http.client.HTTPConnection so it doesnt force you to use any library - whatever will be the solution, we need to maintain this.

Looking at http.client.HTTPConnection, I see there's an optional timeout that we may use, see https://docs.python.org/3/library/http.client.html

I can make this a config parameter with a default that's much lower, maybe 10s.

@simoneggler
Copy link

That‘d be great, thanks.

@michaelyaakoby
Copy link
Member Author

Oh, just noticed we are discussing this hanging within the issue asking k8s probes.
Moving the discussion to #51

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants