
[CI] Jobs being killed during execution in new premerge #362

Open
boomanaiden154 opened this issue Jan 21, 2025 · 0 comments
@boomanaiden154 (Contributor) commented:
We are seeing some workflows being killed prematurely in the new premerge system, typically in the build and test stage:

  1. https://github.com/llvm/llvm-project/actions/runs/12719721823
  2. https://github.com/llvm/llvm-project/actions/runs/12718965069
  3. https://github.com/llvm/llvm-project/actions/runs/12718483685
  4. https://github.com/llvm/llvm-project/actions/runs/12717546963
  5. https://github.com/llvm/llvm-project/actions/runs/12712443995

Our original investigation pointed to autoscaling: the logs showed that nodes a pod was scheduled on were marked for downscaling while the job was still executing. This led to patches like 69508a6, which did not fix the problem; it was still occurring, e.g. in https://github.com/llvm/llvm-project/actions/runs/12746855571.

Spending a lot more time looking at the logs, I observed, with good consistency, that a konnectivity-agent pod would get killed about half a second before the job was prematurely killed/interrupted. The agents appeared to be going down on other nodes as those nodes autoscaled down. We disabled autoscaling for a couple of days to test this hypothesis and observed no failures. konnectivity being the culprit would make sense: it handles communication between pods and the control plane, which every exec on a k8s pod has to go through.
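The half-second correlation above was established by eyeballing logs; a toy sketch of automating that check (the timestamps, the `correlate` helper, and the one-second window are illustrative, not from our actual tooling):

```python
from datetime import datetime, timedelta

def correlate(agent_kills, job_kills, window=timedelta(seconds=1)):
    """Pair each job interruption with any konnectivity-agent
    termination that happened shortly before it."""
    pairs = []
    for job_ts in job_kills:
        for agent_ts in agent_kills:
            # Agent death must precede the job kill, within the window.
            if timedelta(0) <= job_ts - agent_ts <= window:
                pairs.append((agent_ts, job_ts))
    return pairs

# Illustrative timestamps, not real log data: the job is interrupted
# roughly 0.5s after the agent pod goes down.
agent_kills = [datetime(2025, 1, 21, 10, 0, 0, 0)]
job_kills = [datetime(2025, 1, 21, 10, 0, 0, 500000)]
print(correlate(agent_kills, job_kills))
```

Feeding it parsed timestamps from `kubectl logs`/workflow logs would flag suspicious pairs automatically instead of by hand.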

The plan is to disable the kubernetes execution mode. This takes away some flexibility but should fix the reliability problem. We also want to open an upstream issue against konnectivity so we can hopefully get the flexibility back.
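If the runners are deployed with actions-runner-controller's gha-runner-scale-set Helm chart (assumed here; the values below are illustrative, not our actual config), the switch amounts to dropping the `containerMode` block so steps run in the runner pod itself rather than via container hooks that exec through the control plane:

```yaml
# values.yaml for a gha-runner-scale-set deployment (illustrative).
# Before: kubernetes container mode, i.e. exec-style calls that traverse
# the API server and konnectivity mid-job.
# containerMode:
#   type: "kubernetes"
#
# After: no containerMode block; everything runs inside the runner
# container, so a dying konnectivity-agent cannot sever a running step.
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
```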

Crossref for the internal bug: b/389220221.

@boomanaiden154 boomanaiden154 self-assigned this Jan 21, 2025
boomanaiden154 added a commit that referenced this issue Jan 21, 2025
We are running into reliability problems with the kubernetes executor
mode, hypothesized currently to be due to a complex interaction with
konnectivity-agent pods dying and subsequently killing in-process execs
through the k8s control plane. The plan is to switch back to the
in-container executor mode for now while we sort out the issue upstream
with the konnectivity developers.

This is related to #362.
boomanaiden154 added a commit to llvm/llvm-project that referenced this issue Jan 21, 2025
This patch removes the container from the premerge job. We are moving
away from the kubernetes executor back to executing everything in the
same container due to reliability issues. This patch updates everything
in the premerge job to work.

This is part of a temp fix to
llvm/llvm-zorg#362.
github-actions bot pushed a commit to arm/arm-toolchain that referenced this issue Jan 21, 2025