maxErrorRetry does not seem to be used for K8sResponseException 500 (tunnel disconnect) #5604
Comments
If it's returning a 500 error code then Nextflow should be retrying it: nextflow/modules/nextflow/src/main/groovy/nextflow/k8s/client/K8sClient.groovy Lines 629 to 639 in 0aa76af
I wonder if it is retrying but the tunnel disconnect lasts longer than the duration of the 8 retries. Seems plausible if it is a regular outage. Can you see the "API request threw socket exception..." messages in your log?
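To make the "outage outlasts the retries" hypothesis concrete, here is a minimal Python sketch of a retry loop that retries only on HTTP 500 and logs each attempt. It is not Nextflow's K8sClient code: the `ApiError` class, the function names, the 1-second base delay and the doubling factor are all assumptions chosen for illustration. The point is that a fixed number of retries covers a bounded time window, and that an active retry loop leaves one log line per attempt.

```python
import time
from typing import Callable


class ApiError(Exception):
    """Hypothetical stand-in for an API error carrying an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"API responded with HTTP {status}")
        self.status = status


def retry_window_seconds(max_retries: int, base_delay: float = 1.0,
                         factor: float = 2.0) -> float:
    """Total sleep time across all retries (delay values are assumptions)."""
    return sum(base_delay * factor ** attempt for attempt in range(max_retries))


def call_with_retry(request: Callable[[], str], max_retries: int,
                    base_delay: float = 1.0, factor: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call request(), retrying on HTTP 500 up to max_retries times."""
    attempt = 0
    while True:
        try:
            return request()
        except ApiError as err:
            # Give up on non-500 errors or once the retry budget is spent.
            if err.status != 500 or attempt >= max_retries:
                raise
            delay = base_delay * factor ** attempt
            attempt += 1
            # One log line per retry: its absence suggests no retry happened.
            print(f"retrying after HTTP 500 (attempt={attempt}, delay={delay}s)")
            sleep(delay)


if __name__ == "__main__":
    for retries in (4, 8):
        print(retries, "retries cover", retry_window_seconds(retries), "seconds")
```

Under these assumed delays, 4 retries cover 15 seconds and 8 retries cover 255 seconds, so an outage lasting several minutes could exhaust the budget. In that case the log would still show one retry message per attempt.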
There are no such messages so it looks like Can I know whether nextflow is reading the In case it's significant, I am also starting nextflow with an explicit
It's the The |
Looking at the code, using This should be sufficient but it's clear the retries are not being used. And the documentation for the parameter says "Defines the Kubernetes API max request retries (default: 4)", but clearly the default isn't used either.
You can run You can also try setting the retry as a number: k8s.maxErrorRetry = 8 But I don't expect it to matter because there is some logic to handle strings: nextflow/modules/nextflow/src/main/groovy/nextflow/k8s/K8sConfig.groovy Lines 236 to 237 in 0aa76af
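As a rough illustration of why a quoted value such as '8' and a bare 8 should behave the same, here is a small Python sketch of string-tolerant integer coercion. It is not the K8sConfig implementation: the function name `as_int` is hypothetical, and the fallback of 4 is taken only from the documented default quoted earlier in this thread.

```python
from typing import Optional, Union


def as_int(value: Optional[Union[int, str]], default: int = 4) -> int:
    """Coerce a config value that may be an int or a numeric string.

    Hypothetical helper; default=4 mirrors the documented default only.
    """
    if value is None:
        return default  # setting absent: fall back to the default
    if isinstance(value, bool):
        raise TypeError("boolean is not a valid retry count")
    if isinstance(value, int):
        return value
    return int(str(value).strip())  # accepts '8' as well as ' 8 '


if __name__ == "__main__":
    print(as_int("8"), as_int(8), as_int(None))
```

If the real coercion works along these lines, the quoting in the config should not be the cause, which is consistent with the expectation stated above.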
OK, thanks. It does look like it's picking up my configuration...
# nextflow config
process {
pod = [nodeSelector:'informaticsmatters.com/purpose-fragmentor=yes', imagePullPolicy:'Always']
}
executor {
name = 'k8s'
queueSize = 600
}
k8s {
httpConnectTimeout = '120s'
httpReadTimeout = '120s'
maxErrorRetry = '8'
serviceAccount = 'fragmentor'
storageClaimName = 'work'
storageMountPath = '/work'
}
The other thing that might be of interest is that the 'base' nextflow (as I am running it in a container) may not be Question...If
Instead the code is behaving as though the value is |
Bug report
Expected behavior and actual behavior
Nextflow receives a 500 error from the Kubernetes API, and it would be nice if it retried the request (as I think the error is transient). We are using maxErrorRetry and NF 24.10.2 but this does not appear to help.
The Pod was running and actually finished successfully about 15 minutes later (as shown by the k8s Pod information of the Pod left undeleted): -
...but the response to the (in our case a short-term) 500 error appears catastrophic.
Steps to reproduce the problem
We run our workflow and it fails with this error regularly, at almost exactly the same time each day ... 00:33:31. In the exception last night it occurred again at 00:33:31. Although we are trying to determine the cause of the underlying disconnect it would be nice if NF could comply with maxErrorRetry. If it is, there's no evidence in the log.
Program output
Environment
Nextflow version: 24.10.2
Java version: openjdk 11.0.25 2024-10-15
Operating system: Debian GNU/Linux 11 (bullseye)
Bash version: GNU bash, version 5.1.4(1)-release (x86_64-pc-linux-gnu)
Additional context
Running in an OpenStack kubernetes cluster (v1.30) and the following nextflow config: -