Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resume from crashed head job caused process and its downstream dependency to run at the same time #5595

Closed
oalmelid-gel opened this issue Dec 10, 2024 · 8 comments · Fixed by #5682

Comments

@oalmelid-gel
Copy link

oalmelid-gel commented Dec 10, 2024

Bug report

Due to an LSF compute cluster outage, two of our pipelines ended up in a state where the nextflow process was unable to submit jobs to the cluster and crashed. However, the running processes completed successfully.

When we resumed the pipeline after the cluster had recovered from the crash, we saw nextflow

  • launch the processes that had completed successfully
  • immediately conclude the process had completed and launch a process that depended on its output

The first launched process remained running for another ten minutes and wrote output files while its dependent process ran, leading to truncated output and a false negative result.

After some digging, I believe that what happened is related to the behaviour of the GridTaskHandler, specifically this line:

https://github.com/nextflow-io/nextflow/blob/18f7de132c0342102f7160ce97b2e68dc6b67[…]xtflow/src/main/groovy/nextflow/executor/GridTaskHandler.groovy

The function readExitStatus() looks for a file called .exitfile and reads the exit code of the completed process from there. Since nextflow resumed a process that had successfully completed after the head job crashed, that file was already present in the work directory when the pipeline resumed and this caused a premature launch of the downstream process.

I think an easy fix would be to make sure the .exitfile is deleted before submitting a process, possibly just with an rm -f .exitfile

Expected behavior and actual behavior

Expected behaviour: when resuming a pipeline, nextflow should wait for the resumed process to complete executing on the cluster.
Actual behaviour: nextflow immedately launches a downstream process while still running.

Steps to reproduce the problem

The following script illustrates the problem. It has a different output on a grid scheduler than when run locally.
A similar result can be achieved by removing the line echo 0 > .exitcode, launching nextflow on a grid scheduler, killing nextflow when the process has started, and then relaunching with -resume.

workflow {
    messWithExitCode | view
}


process messWithExitCode {
    beforeScript 'module load cluster/test'

    output:
    stdout

    script:
    """
    echo On a Grid scheduler, this line will print
    # the line below is just for demo purposes. On our grid cluster, this was achieved by
    # - letting nextflow launch the child process
    # - killing the nextflow executor
    # - resuming the workflow after the child process had completed
    echo 0 > .exitcode
    sleep 60s
    echo On a Grid scheduler, this line will not print
    """
}

Program output

On a grid scheduler, the above will only print

echo On a Grid scheduler, this line will print

Environment

  • Nextflow version: 22.04.5
  • Java version: 1.8.0
  • Operating system: linux
  • Bash version: 4.3.48(1)-release

Additional context

N/A

@robsyme
Copy link
Collaborator

robsyme commented Dec 10, 2024

Hi Øyvind

When you say

When we resumed the pipeline after the cluster had recovered from the crash, we saw nextflow launch the processes that had completed successfully

Are you sure that it was Nextflow that relaunched the task? If a task directory already exists, Nextflow will not resubmit a task to that same directory.

Is it possible that LSF re-submitted the task when it came back online?

@oalmelid-gel
Copy link
Author

oalmelid-gel commented Dec 10, 2024

Hi Robert,

Is it possible that LSF re-submitted the task when it came back online?

I don't think that's possible. The resume happened many hours after the cluster came back up, and the resume was submitted through seqera platform.

I've got traces from nextflow showing that it thought the resumed process had completed, and logs from the process showing it continued running.

The logs in the process working directory also show two separate runs - one on december 4th and one on the 5th.

I've verified through seqera platform that both in the original and resumed runs, the process in question has the same working directory.

@robsyme
Copy link
Collaborator

robsyme commented Dec 10, 2024

Thanks for the clarification, Øyvind.

Nextflow should not re-use the same task working directory on a resumed task, so it's very unexpected to see the same directory being re-used on resume. In the checkCachedOrLaunchTask method, Nextflow checks to see if the work directory exists and if so, avoids submitting to that directory. In pseudocode:

​tries = 0
loop:
    cache = session.getCache(taskHash)
    taskWorkDirHash = Hash(taskHash, tries)
    taskWorkdir = getWorkDir(taskWorkDirHash)
    cached = taskWorkdir.exists() && cache.isCompleted() && checkCachedOutput(resumeDir)
    if cached:
        break
    if taskWorkdir.exists():
        tries += 1
        continue
    submitTask(task, resumeDir)
    break

Essentially, Nextflow spins around this loop, generating a deterministic series of hashes looking for a non-empty directory in which to submit the job.

@bentsherman - do you have any idea how a task workDir could possible be re-used on resume?

@bentsherman
Copy link
Member

Indeed, Nextflow should never submit a new task to an existing directory. It should only reuse the directory if the task is completely cached and doesn't need to be re-executed. At a first glance, I'm not sure how this situation would happen.

  • The scheduler goes down and the pipeline crashes,
  • Assuming the compute nodes still work, the submitted jobs keep running to completion
  • Later, when the scheduler is back up and the run is resumed, Nextflow either
    1. if the job completed successfully, resume the task, or
    2. if the job failed or is still running, re-execute the task in a new directory (in which case the previous job is ignored)

@robsyme
Copy link
Collaborator

robsyme commented Dec 10, 2024

I'm having trouble reproducing the example. When you say

killing the nextflow executor

Does this require forcibly terminating the scheduler (LSF), or simply killing the Nextflow process?

@oalmelid-gel
Copy link
Author

oalmelid-gel commented Dec 11, 2024

Thanks both! And apologies for not having a clean repro.

I've reviewed the logs from our cluster, and I think the situation is more complex than I initially realised.
Here's the timeline of what happened:

  • 12:01 - nextflow job is submitted to the cluster and starts running
  • 13:14 - nextflow job starts running on the cluster
  • 15:04 - network outage prevents the cluster from talking to the node on which nextflow is running. Cluster automatically requeues the nextflow job
  • 16:28 - nextflow job is automatically re-run by the cluster but immediately crashes when attempting to fetch ephemeral config from seqera platform, since the endpoint no longer exists.
  • next day at 12:02 - workflow is manually resumed from seqera platform and exhibits the symptoms above.

Looking at the code you linked, I could see nextflow launching a job in the same work directory if one or more of the two variables below end up as null.

final entry = session.cache.getTaskEntry(hash, this)
resumeDir = entry ? FileHelper.asPath(entry.trace.getWorkDir()) : null

That would mean that the first try block completes fine with no existence check on the directory. The block below does check existence, but does not abort a launch if the directory exists.

final workDir = task.getWorkDirFor(hash)
try {
    if( resumeDir != workDir )
        exists = workDir.exists()
    if( !exists && !workDir.mkdirs() )
        throw new IOException("Unable to create directory=$workDir -- check file system permissions")
}

My best guess is that the failed relaunch relaunch (not resume) did something to the cache. Though I'm not sure if that's possible? The resume definitely skipped over all the tasks that had completed before the outage, but then somehow decided to re-launch the two processes that were running at the time of the network outage.

@bentsherman
Copy link
Member

bentsherman commented Dec 11, 2024

Indeed, if there is no cache entry for a given task hash, Nextflow will proceed to execute the task in that directory, without checking whether the directory already exists. It is assuming that if there is no cache entry then there must be no directory either.

Meanwhile, the cache entry for a task is not saved until Nextflow sees that the task is complete:

* @param handler
*/
void notifyTaskComplete( TaskHandler handler ) {
// save the completed task in the cache DB

Here is a plausible scenario:

  1. nextflow submits a job
  2. nextflow crashes
  3. the job completes successfully
  4. nextflow is resumed
  5. there is no cache entry for the job, so nextflow re-executes the job in the same directory as before, overwriting the previous job's outputs
  6. on the next status poll, nextflow sees the .exitcode file from the previous job and thinks the current job is done
  7. nextflow collects the outputs in the job directory, which will be a mish-mash of the previous and current jobs depending on how the job saves its outputs -- likely all of them will be the in-progress outputs of the current job
  8. nextflow submits a downstream job based on the upstream outputs, which are now incomplete. if you're lucky, the downstream completes, otherwise, your results are incomplete

So you might be right after all. Let me see if I can reproduce this with the local executor. I think I know what we need to change but I will feel more confident if we also have a red-green test case.

@bentsherman
Copy link
Member

bentsherman commented Dec 11, 2024

There is also this minor failsafe to make the job script delete any existing input files:

if( len<100 )
delete << "rm -f ${Escape.path(stageName)}"

But it's only intended for debugging purposes. I think during normal execution, Nextflow would see the .exitcode file and mark the job as completed before the job has a chance to reset the work directory. Especially if the job stays in the submit queue for more than a few minutes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants