We are seeing some workflows being killed prematurely in the new premerge system, typically in the build and test stage.

Our original investigation pointed us to autoscaling: logs showed that the node a pod was scheduled on would be marked for downscaling while the job was still executing. This led to patches like 69508a6, which did not fix the problem; it was still occurring, e.g. in https://github.com/llvm/llvm-project/actions/runs/12746855571.
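For illustration only (this is not necessarily what 69508a6 does), one common mitigation for this failure mode is to mark runner pods with the cluster autoscaler's standard `safe-to-evict` annotation so their nodes are not drained mid-job. A minimal sketch using the Kubernetes Python client, with hypothetical pod and namespace names:

```python
# Sketch: annotate a runner pod so the cluster autoscaler will not evict it
# (and therefore will not scale its node down) while the job is running.
# Pod and namespace names here are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() in-cluster
core = client.CoreV1Api()

core.patch_namespaced_pod(
    name="premerge-runner-abc123",  # hypothetical runner pod
    namespace="premerge",           # hypothetical namespace
    body={
        "metadata": {
            "annotations": {
                "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
            }
        }
    },
)
```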
Spending a lot more time looking at the logs, I observed that a konnectivity-agent pod would get killed about half a second before the job was prematurely killed/interrupted, with good consistency. The agents appeared to be going down on other nodes as those nodes autoscaled down. We disabled autoscaling for a couple of days to test this hypothesis and observed no failures. This would make sense as the root cause, since konnectivity handles communication between pods and the control plane, which we have to go through when running exec on a k8s pod.
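To make that dependency concrete: an exec into a pod is streamed through the API server, and on clusters that use Konnectivity the control-plane-to-node leg of that stream rides the konnectivity tunnel, so losing a konnectivity-agent mid-stream can tear down the exec. A minimal sketch with the Kubernetes Python client (pod name, namespace, and command are hypothetical):

```python
# Sketch: exec a command in a pod via the Kubernetes API. The stream goes
# client -> API server -> (konnectivity tunnel) -> kubelet -> container,
# which is why a dying konnectivity-agent can interrupt a running exec.
# Names and the command below are hypothetical.
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
core = client.CoreV1Api()

output = stream(
    core.connect_get_namespaced_pod_exec,
    name="premerge-runner-abc123",   # hypothetical runner pod
    namespace="premerge",            # hypothetical namespace
    command=["/bin/sh", "-c", "echo hello from the pod"],
    stdout=True,
    stderr=True,
    stdin=False,
    tty=False,
)
print(output)
```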
The plan is to disable the kubernetes execution mode. This takes away some flexibility, but should fix the reliability problem. We also want to open an upstream issue against konnectivity so we can hopefully get that flexibility back.
Crossref for the internal bug: b/389220221.
We are running into reliability problems with the kubernetes executor mode, currently hypothesized to be due to a complex interaction where konnectivity-agent pods die and subsequently kill in-progress execs going through the k8s control plane. The plan is to switch back to the in-container executor mode for now while we sort out the issue upstream with the konnectivity developers.

This is related to #362.
This patch removes the container from the premerge job. We are moving away from the kubernetes executor and back to executing everything in the same container due to reliability issues. This patch updates everything in the premerge job to work in that configuration.

This is part of a temporary fix for llvm/llvm-zorg#362.