
workdir on s3 failing with slurm HPC #5673

Open
kweisscure51 opened this issue Jan 15, 2025 · 1 comment

Comments

@kweisscure51

kweisscure51 commented Jan 15, 2025

Bug report

Launching the rnaseq pipeline on a Slurm HPC (backed by AWS ParallelCluster) fails when the work directory is on S3, even with Fusion enabled.

Expected behavior and actual behavior

Expected behavior: the pipeline runs on a Slurm HPC and writes its temporary files to an S3 bucket with the Fusion file system enabled.
Actual behavior: all jobs fail without .command.out or .command.err files. Only .command.sh and .command.run are generated in the specified S3 work directory.
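
As a quick check, the contents of a failing task's work directory can be listed directly with the AWS CLI (a minimal sketch; the task prefix is the one reported in the error output further down, and valid AWS credentials are assumed):

# List one failing task's S3 work directory to see which launcher files
# actually reached the bucket (.command.sh / .command.run, but no
# .command.out, .command.err or .exitcode)
aws s3 ls s3://fsx-s3-cure51-non-production/s3_benchmark_results/results/4b/3115d11697cfb337d6ba0cae4e1eae/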

Steps to reproduce the problem

Use version 3.18 of the nf-core/rnaseq pipeline.
Launch the pipeline on AWS ParallelCluster with a Slurm scheduler using the following sbatch script:

#!/bin/bash
#SBATCH --job-name=nextflow-rnaseq      # Job name
#SBATCH --output=nextflow-rnaseq-%j.out # Standard output log (%j expands to jobID)
#SBATCH --error=nextflow-rnaseq-%j.err  # Standard error log (%j expands to jobID)
pwd; hostname; date

# Run Nextflow with the specified parameters
nextflow -trace nextflow -c s3_wave_test.config run rnaseq/main.nf -profile test,docker -w 's3://fsx-s3-cure51-non-production/s3_benchmark_results/results' \
--outdir 's3://fsx-s3-cure51-non-production/s3_benchmark_results/results' \
--skip_qualimap --skip_dupradar --skip_deseq2_qc --skip_bigwig --skip_biotype_qc --skip_stringtie --skip_markduplicates \
--fasta 's3://fsx-s3-cure51-non-production/data/rnaseq/Ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz' \
--gtf 's3://fsx-s3-cure51-non-production/data/rnaseq/Ensembl/Homo_sapiens.GRCh38.112.gtf.gz' 

The s3_wave_test.config contains only:

fusion.enabled = true
wave.enabled = true
aws {
    region = 'eu-west-3'
    client.protocol = 'HTTPS'
    accessKey = "$AWS_ACCESS_KEY_ID"
    secretKey = "$AWS_SECRET_ACCESS_KEY"
}

params {
    max_memory      = 60.GB
    max_cpus        = 8
    max_time        = 2.d
    publish_dir_mode= 'copy'
}

process {
    executor         = 'slurm'
    queue            = 'nf-standard-mem'  // Your default queue or partition
    containerOptions = '--user $(id -u):$(id -g)'
    resourceLimits   = [
        cpus: 8,
        memory: 60.GB,
        time: 2.d
    ]
}

Program output

Jan-15 16:31:59.668 [TaskFinalizer-1] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_RNASEQ:RNASEQ:FASTQ_QC_TRIM_FILTER_SETSTRANDEDNESS:FASTQ_FASTQC_UMITOOLS_TRIMGALORE:TRIMGALORE (RAP1_UNINDUCED_REP1)'

Caused by:
  Process `NFCORE_RNASEQ:RNASEQ:FASTQ_QC_TRIM_FILTER_SETSTRANDEDNESS:FASTQ_FASTQC_UMITOOLS_TRIMGALORE:TRIMGALORE (RAP1_UNINDUCED_REP1)` terminated for an unknown reason -- Likely it has been terminated by the external system


Command executed:

  [ ! -f  RAP1_UNINDUCED_REP1_trimmed.fastq.gz ] && ln -s SRR6357073_1.fastq.gz RAP1_UNINDUCED_REP1_trimmed.fastq.gz
  trim_galore \
      --fastqc_args '-t 8' \
      --cores 5 \
      --gzip \
      RAP1_UNINDUCED_REP1_trimmed.fastq.gz
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:FASTQ_QC_TRIM_FILTER_SETSTRANDEDNESS:FASTQ_FASTQC_UMITOOLS_TRIMGALORE:TRIMGALORE":
      trimgalore: $(echo $(trim_galore --version 2>&1) | sed 's/^.*version //; s/Last.*$//')
      cutadapt: $(cutadapt --version)
  END_VERSIONS

Command exit status:
  -

Command output:
  (empty)

Work dir:
  s3://fsx-s3-cure51-non-production/s3_benchmark_results/results/4b/3115d11697cfb337d6ba0cae4e1eae

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
nextflow.exception.ProcessFailedException: Process `NFCORE_RNASEQ:RNASEQ:FASTQ_QC_TRIM_FILTER_SETSTRANDEDNESS:FASTQ_FASTQC_UMITOOLS_TRIMGALORE:TRIMGALORE (RAP1_UNINDUCED_REP1)` terminated for an unknown reason -- Likely it has been terminated by the external system
	at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
	at nextflow.processor.TaskProcessor.finalizeTask(TaskProcessor.groovy:2377)
	at nextflow.processor.TaskPollingMonitor.finalizeTask(TaskPollingMonitor.groovy:686)
	at nextflow.processor.TaskPollingMonitor.safeFinalizeTask(TaskPollingMonitor.groovy:676)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
	at nextflow.processor.TaskPollingMonitor$_checkTaskStatus_lambda8.doCall(TaskPollingMonitor.groovy:666)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.lang.Thread.run(Thread.java:1570)
Jan-15 16:31:59.671 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /fsx-s3-cure51-non-production/s3_benchmark_results/results/87/6a8f1d1f01ad6d5ed26afa4f50c8f1/.exitcode - cause: s3://fsx-s3-cure51-non-production/s3_benchmark_results/results/87/6a8f1d1f01ad6d5ed26afa4f50c8f1/.exitcode
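
Note that since the work directory is on S3, the tip above about changing into the process work dir does not apply directly; the launcher scripts can instead be pulled down for inspection (a sketch; the local destination is arbitrary):

# Copy the task launcher scripts out of the S3 work dir reported above
# so they can be read locally
aws s3 cp s3://fsx-s3-cure51-non-production/s3_benchmark_results/results/4b/3115d11697cfb337d6ba0cae4e1eae/.command.sh .
aws s3 cp s3://fsx-s3-cure51-non-production/s3_benchmark_results/results/4b/3115d11697cfb337d6ba0cae4e1eae/.command.run .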

Environment

  • Nextflow version: 24.04.3.5916
  • Java version: openjdk 22.0.1 2024-04-16
    OpenJDK Runtime Environment Corretto-22.0.1.8.1 (build 22.0.1+8-FR)
    OpenJDK 64-Bit Server VM Corretto-22.0.1.8.1 (build 22.0.1+8-FR, mixed mode, sharing)
  • Operating system: Linux
  • Bash version: GNU bash, version 5.2.15(1)-release (x86_64-amazon-linux-gnu)


@pditommaso
Member

pditommaso commented Jan 15, 2025

Can you please try using the latest version and include the .nextflow.log file?

(make sure to remove all sensitive info before sharing the log file)
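
For example (a sketch; the log file name is arbitrary):

# Update the Nextflow launcher to the latest release and confirm the version
nextflow self-update
nextflow -version

# Re-run with an explicit log path (by default the log is written to
# .nextflow.log in the launch directory) and attach that file here
nextflow -log rnaseq-debug.log -c s3_wave_test.config run rnaseq/main.nf -profile test,docker ...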
