Pods and containers may be restarted and lose their state #35

Open
craigwalton-dsit opened this issue Dec 20, 2024 · 0 comments

craigwalton-dsit commented Dec 20, 2024

This is described in the docs: https://k8s-sandbox.ai-safety-institute.org.uk/design/limitations/#containers-may-restart. Please read that first to make sense of the information below.

This info was migrated from the old internal repo's issues.

  • Going from StatefulSet -> Job:
    • we gain the ability to set restartPolicy: Never and backoffLimit: 0, which together prevent containers and Pods from restarting
    • we have to work around any containers which currently rely on restart behaviour during startup (e.g. using a new dependsOn field)
    • we lose the --wait functionality we got from helm install. We'd have to implement something ourselves, perhaps as a post-install Helm hook
  • Going from StatefulSet -> bare Pod:
    • we gain the ability to set restartPolicy: Never, which prevents containers from restarting
    • we have to work around any containers which currently rely on restart behaviour during startup
    • the --wait functionality of helm install is suboptimal: it keeps waiting for all Pods to become "ready" until the timeout (which can be hours for large runs), even if some have already "failed"
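To make the Job option concrete, here is a minimal Python sketch that builds a Job manifest with the two settings discussed above. The function name, the container name, and the example image and command are hypothetical and not taken from the project's Helm chart; only the Kubernetes field names (backoffLimit, restartPolicy) are real.

```python
import json
from typing import Any


def build_job_manifest(name: str, image: str, command: list[str]) -> dict[str, Any]:
    """Build a batch/v1 Job manifest whose Pod and containers never restart.

    Hypothetical helper for illustration; not part of the project.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # 0 retries: the Job controller will not create a replacement Pod
            # after a failure.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    # Never: the kubelet will not restart a container in place.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Example values are placeholders.
    manifest = build_job_manifest("default", "python:3.12", ["sleep", "infinity"])
    print(json.dumps(manifest, indent=2))
```

A bare Pod would carry the same restartPolicy under its own spec, but has no backoffLimit field.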

We could support something like depends_on from compose.yaml, e.g.

services:
  web:
    depends_on:
      - db
      - redis
  redis:
    image: redis
  db:
    image: postgres

We'd implement this by adding an initContainer that waits until the required services are ready. The initContainer would need a service account to check readiness, so we would have to pass one to it. Questions remain as to what timeout behaviour to use.
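As a rough sketch of the wait logic such an initContainer might run, the Python below polls a readiness check until every dependency is ready or a timeout expires. Everything here is an assumption for illustration: the function name, the default timeout and poll interval, and the is_ready callback, which in practice would query the Kubernetes API using the service account. Raising on timeout is just one possible answer to the open timeout question.

```python
import time
from typing import Callable, Iterable


def wait_for_dependencies(
    dependencies: Iterable[str],
    is_ready: Callable[[str], bool],
    timeout_s: float = 300.0,
    poll_interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until every dependency is ready, or raise TimeoutError.

    is_ready is a hypothetical callback; a real one would ask the cluster
    whether the named service has ready endpoints.
    """
    pending = set(dependencies)
    deadline = clock() + timeout_s
    while True:
        # Re-check only the dependencies that were not ready last time.
        pending = {name for name in pending if not is_ready(name)}
        if not pending:
            return
        if clock() >= deadline:
            raise TimeoutError(
                f"dependencies not ready after {timeout_s}s: {sorted(pending)}"
            )
        sleep(poll_interval_s)


if __name__ == "__main__":
    # Stand-in readiness check that reports ready on the third poll.
    polls = {"db": 0, "redis": 0}

    def fake_is_ready(name: str) -> bool:
        polls[name] += 1
        return polls[name] >= 3

    wait_for_dependencies(["db", "redis"], fake_is_ready, poll_interval_s=0.1)
    print("all dependencies ready")
```

Because the initContainer exits non-zero on timeout, the Pod would fail rather than start the main container against a missing dependency.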

Other suggestions were:

  • Expose the restartPolicy to the user and let them control it (this would still entail using Jobs)
  • Write our own controller, though this is a large undertaking
  • Use a Helm hook (like test) which the Python code can call periodically to check that there have been no restarts
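For the periodic restart check, one possible approach (a sketch, not the project's implementation) is to inspect the restartCount field that Kubernetes reports for each container status. The function below assumes its input has the shape of `kubectl get pods -o json` output; the function name and the sample data are hypothetical.

```python
from typing import Any


def find_restarted_containers(pod_list: dict[str, Any]) -> list[tuple[str, str, int]]:
    """Return (pod name, container name, restart count) for restarted containers.

    Expects a dict shaped like the output of `kubectl get pods -o json`.
    """
    restarted = []
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "<unknown>")
        status = pod.get("status", {})
        # Check init containers as well as regular containers.
        for key in ("initContainerStatuses", "containerStatuses"):
            for container_status in status.get(key) or []:
                count = container_status.get("restartCount", 0)
                if count > 0:
                    restarted.append(
                        (pod_name, container_status.get("name", "<unknown>"), count)
                    )
    return restarted


if __name__ == "__main__":
    # Hand-written sample data, not real cluster output.
    sample = {
        "items": [
            {
                "metadata": {"name": "agent-env-default-0"},
                "status": {
                    "containerStatuses": [
                        {"name": "default", "restartCount": 2},
                        {"name": "sidecar", "restartCount": 0},
                    ]
                },
            }
        ]
    }
    print(find_restarted_containers(sample))
```

Note that restartCount only covers containers restarted in place. A Pod that was deleted and recreated starts again at zero, so detecting that case would need something extra, such as comparing the Pod's metadata.uid between checks.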

We noted that we don't have a good handle on how often this is an issue.
