
workdir on s3 failing with slurm HPC #5673

Open
kweisscure51 opened this issue Jan 15, 2025 · 1 comment

Comments

@kweisscure51

kweisscure51 commented Jan 15, 2025

Bug report

Launching the rnaseq pipeline on a Slurm HPC (backed by AWS ParallelCluster) fails when the work directory is on S3, even with Fusion enabled.

Expected behavior and actual behavior

Expected behavior: the pipeline runs on a Slurm HPC and writes its temporary files to an S3 bucket with the Fusion file system enabled.
Actual behavior: all jobs fail without .command.out or .command.err files. Only .command.sh and .command.run are generated in the specified S3 work directory.
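
As a quick check, the contents of a failing task's work directory can be listed directly with the AWS CLI (a minimal sketch; the task prefix is the one reported in the error output further down, and valid AWS credentials are assumed):

# List one failing task's S3 work directory to see which launcher files
# actually reached the bucket (.command.sh / .command.run, but no
# .command.out, .command.err or .exitcode)
aws s3 ls s3://fsx-s3-cure51-non-production/s3_benchmark_results/results/4b/3115d11697cfb337d6ba0cae4e1eae/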

Steps to reproduce the problem

Use version 3.18 of the nf-core/rnaseq pipeline.
Launch the pipeline on AWS ParallelCluster with a Slurm scheduler using the following sbatch script:

#!/bin/bash
#SBATCH --job-name=nextflow-rnaseq      # Job name
#SBATCH --output=nextflow-rnaseq-%j.out # Standard output log (%j expands to jobID)
#SBATCH --error=nextflow-rnaseq-%j.err  # Standard error log (%j expands to jobID)
pwd; hostname; date

# Run Nextflow with the specified parameters
nextflow -trace nextflow -c s3_wave_test.config run rnaseq/main.nf -profile test,docker -w 's3://fsx-s3-cure51-non-production/s3_benchmark_results/results' \
--outdir 's3://fsx-s3-cure51-non-production/s3_benchmark_results/results' \
--skip_qualimap --skip_dupradar --skip_deseq2_qc --skip_bigwig --skip_biotype_qc --skip_stringtie --skip_markduplicates \
--fasta 's3://fsx-s3-cure51-non-production/data/rnaseq/Ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz' \
--gtf 's3://fsx-s3-cure51-non-production/data/rnaseq/Ensembl/Homo_sapiens.GRCh38.112.gtf.gz' 

The s3_wave_test.config contains only:

fusion.enabled = true
wave.enabled = true
aws {
    region = 'eu-west-3'
    client.protocol = 'HTTPS'
    accessKey = "$AWS_ACCESS_KEY_ID"
    secretKey = "$AWS_SECRET_ACCESS_KEY"
}

params {
    max_memory      = 60.GB
    max_cpus        = 8
    max_time        = 2.d
    publish_dir_mode= 'copy'
}

process {
    executor         = 'slurm'
    queue            = 'nf-standard-mem'  // Your default queue or partition
    containerOptions = '--user $(id -u):$(id -g)'
    resourceLimits   = [
        cpus: 8,
        memory: 60.GB,
        time: 2.d
    ]
}

Program output

Jan-15 16:31:59.668 [TaskFinalizer-1] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_RNASEQ:RNASEQ:FASTQ_QC_TRIM_FILTER_SETSTRANDEDNESS:FASTQ_FASTQC_UMITOOLS_TRIMGALORE:TRIMGALORE (RAP1_UNINDUCED_REP1)'

Caused by:
  Process `NFCORE_RNASEQ:RNASEQ:FASTQ_QC_TRIM_FILTER_SETSTRANDEDNESS:FASTQ_FASTQC_UMITOOLS_TRIMGALORE:TRIMGALORE (RAP1_UNINDUCED_REP1)` terminated for an unknown reason -- Likely it has been terminated by the external system


Command executed:

  [ ! -f  RAP1_UNINDUCED_REP1_trimmed.fastq.gz ] && ln -s SRR6357073_1.fastq.gz RAP1_UNINDUCED_REP1_trimmed.fastq.gz
  trim_galore \
      --fastqc_args '-t 8' \
      --cores 5 \
      --gzip \
      RAP1_UNINDUCED_REP1_trimmed.fastq.gz
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:FASTQ_QC_TRIM_FILTER_SETSTRANDEDNESS:FASTQ_FASTQC_UMITOOLS_TRIMGALORE:TRIMGALORE":
      trimgalore: $(echo $(trim_galore --version 2>&1) | sed 's/^.*version //; s/Last.*$//')
      cutadapt: $(cutadapt --version)
  END_VERSIONS

Command exit status:
  -

Command output:
  (empty)

Work dir:
  s3://fsx-s3-cure51-non-production/s3_benchmark_results/results/4b/3115d11697cfb337d6ba0cae4e1eae

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
nextflow.exception.ProcessFailedException: Process `NFCORE_RNASEQ:RNASEQ:FASTQ_QC_TRIM_FILTER_SETSTRANDEDNESS:FASTQ_FASTQC_UMITOOLS_TRIMGALORE:TRIMGALORE (RAP1_UNINDUCED_REP1)` terminated for an unknown reason -- Likely it has been terminated by the external system
	at org.codehaus.groovy.vmplugin.v8.IndyInterface.fromCache(IndyInterface.java:321)
	at nextflow.processor.TaskProcessor.finalizeTask(TaskProcessor.groovy:2377)
	at nextflow.processor.TaskPollingMonitor.finalizeTask(TaskPollingMonitor.groovy:686)
	at nextflow.processor.TaskPollingMonitor.safeFinalizeTask(TaskPollingMonitor.groovy:676)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
	at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
	at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
	at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
	at nextflow.processor.TaskPollingMonitor$_checkTaskStatus_lambda8.doCall(TaskPollingMonitor.groovy:666)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.lang.Thread.run(Thread.java:1570)
Jan-15 16:31:59.671 [Task monitor] TRACE nextflow.file.FileHelper - Unable to read attributes for file: /fsx-s3-cure51-non-production/s3_benchmark_results/results/87/6a8f1d1f01ad6d5ed26afa4f50c8f1/.exitcode - cause: s3://fsx-s3-cure51-non-production/s3_benchmark_results/results/87/6a8f1d1f01ad6d5ed26afa4f50c8f1/.exitcode
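
Note that since the work directory is on S3, the tip above about changing into the process work dir does not apply directly; the launcher scripts can instead be pulled down for inspection (a sketch; the local destination is arbitrary):

# Copy the task launcher scripts out of the S3 work dir reported above
# so they can be read locally
aws s3 cp s3://fsx-s3-cure51-non-production/s3_benchmark_results/results/4b/3115d11697cfb337d6ba0cae4e1eae/.command.sh .
aws s3 cp s3://fsx-s3-cure51-non-production/s3_benchmark_results/results/4b/3115d11697cfb337d6ba0cae4e1eae/.command.run .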

Environment

  • Nextflow version: 24.04.3.5916
  • Java version: openjdk 22.0.1 2024-04-16
    OpenJDK Runtime Environment Corretto-22.0.1.8.1 (build 22.0.1+8-FR)
    OpenJDK 64-Bit Server VM Corretto-22.0.1.8.1 (build 22.0.1+8-FR, mixed mode, sharing)
  • Operating system: Linux
  • Bash version: GNU bash, version 5.2.15(1)-release (x86_64-amazon-linux-gnu)


@pditommaso
Member

pditommaso commented Jan 15, 2025

Can you please try using the latest version and include the .nextflow.log file?

(make sure to remove all sensitive info before sharing the log file)
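
For example (a sketch; the log file name is arbitrary):

# Update the Nextflow launcher to the latest release and confirm the version
nextflow self-update
nextflow -version

# Re-run with an explicit log path (by default the log is written to
# .nextflow.log in the launch directory) and attach that file here
nextflow -log rnaseq-debug.log -c s3_wave_test.config run rnaseq/main.nf -profile test,docker ...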
