We are seeing some workflows being killed prematurely in the new premerge system, typically in the build and test stage.

Our original investigation pointed us to autoscaling: logs showed that the node a pod was scheduled on would be marked for downscaling while the job was still executing. This led to patches like 69508a6, which did not fix the problem; it was still occurring, e.g. in https://github.com/llvm/llvm-project/actions/runs/12746855571.
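For illustration only (this is not necessarily what 69508a6 does), one common mitigation for this failure mode is to mark runner pods with the cluster autoscaler's standard `safe-to-evict` annotation so their nodes are not drained mid-job. A minimal sketch using the Kubernetes Python client, with hypothetical pod and namespace names:

```python
# Sketch: annotate a runner pod so the cluster autoscaler will not evict it
# (and therefore will not scale its node down) while the job is running.
# Pod and namespace names here are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() in-cluster
core = client.CoreV1Api()

core.patch_namespaced_pod(
    name="premerge-runner-abc123",  # hypothetical runner pod
    namespace="premerge",           # hypothetical namespace
    body={
        "metadata": {
            "annotations": {
                "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
            }
        }
    },
)
```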
Spending a lot more time looking at the logs, I observed that a konnectivity-agent pod would get killed about half a second before the job was prematurely killed/interrupted, with good consistency. The agents appeared to be going down on other nodes as those nodes autoscaled down. We disabled autoscaling for a couple of days to test this hypothesis and observed no failures. This would make sense as the root cause, since konnectivity handles communication between pods and the control plane, which we have to go through when running exec on a k8s pod.
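To make that dependency concrete: an exec into a pod is streamed through the API server, and on clusters that use Konnectivity the control-plane-to-node leg of that stream rides the konnectivity tunnel, so losing a konnectivity-agent mid-stream can tear down the exec. A minimal sketch with the Kubernetes Python client (pod name, namespace, and command are hypothetical):

```python
# Sketch: exec a command in a pod via the Kubernetes API. The stream goes
# client -> API server -> (konnectivity tunnel) -> kubelet -> container,
# which is why a dying konnectivity-agent can interrupt a running exec.
# Names and the command below are hypothetical.
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
core = client.CoreV1Api()

output = stream(
    core.connect_get_namespaced_pod_exec,
    name="premerge-runner-abc123",   # hypothetical runner pod
    namespace="premerge",            # hypothetical namespace
    command=["/bin/sh", "-c", "echo hello from the pod"],
    stdout=True,
    stderr=True,
    stdin=False,
    tty=False,
)
print(output)
```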
The plan is to disable the kubernetes execution mode. This takes away some flexibility, but should fix the reliability problem. We also want to open an upstream issue against konnectivity so we can hopefully get that flexibility back.
Crossref for the internal bug: b/389220221.
We are running into reliability problems with the kubernetes executor mode, currently hypothesized to be due to a complex interaction where konnectivity-agent pods die and subsequently kill in-progress execs going through the k8s control plane. The plan is to switch back to the in-container executor mode for now while we sort out the issue upstream with the konnectivity developers.

This is related to #362.
This patch removes the container from the premerge job. We are moving away from the kubernetes executor and back to executing everything in the same container due to reliability issues. This patch updates everything in the premerge job to work in that configuration.

This is part of a temporary fix for llvm/llvm-zorg#362.