Poor File Staging/Unstaging Performance with Google Batch Executor #5653

Open
KatieAtGordian opened this issue Jan 7, 2025 · 0 comments
Open

Comments

@KatieAtGordian
Copy link

Bug report

I am documenting this problem as a known issue, with discussion here in Slack.

There is no plan to implement a fix for this issue (per the Slack discussion, the proposed solution is "too clunky"), but I would like to document it so that others who encounter it can chime in and/or save time debugging why their Google Batch pipelines take so long to complete compared to other executors.

Expected behavior and actual behavior

Expected behavior:
File staging and unstaging using the default, built-in support for the Google Batch executor should complete in a reasonable amount of time. "Reasonable" means transfer speeds comparable to standard GCP file transfer achieved by using, e.g., gcloud storage or gsutil.

Actual behavior:
File unstaging/delocalization using the built-in support in the Google Batch executor is extremely slow, e.g., ~11 hours for 600GB of data. This is because it uses gcsfuse, which is significantly slower than gsutil, the tool used by the Google LifeSciences executor (via the nxf_gs_upload function in .command.run).
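
For reference, the gap can be reproduced outside Nextflow by timing the same transfer through a gcsfuse mount and through gsutil directly. This is only an illustrative sketch; the bucket name, mount point, and data paths below are placeholders, not values from my workflow.

# Copy through a gcsfuse mount (roughly the path Google Batch delocalization takes):
mkdir -p /mnt/my-bucket
gcsfuse my-bucket /mnt/my-bucket
time cp -r results/ /mnt/my-bucket/work/

# Copy with parallel gsutil uploads (roughly what the LifeSciences executor does via nxf_gs_upload):
time gsutil -m cp -r results/ gs://my-bucket/work/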

The officially recommended solution is to use Wave/Fusion; however, this solution is not appropriate for all users, especially those who cannot support the injection of 3rd party services into their workflows.

This is especially problematic because (1) poor performance is essentially the default behavior for the executor (Wave/Fusion are the recommended solution but are not enabled by default), and (2) it reflects a disparity with the LifeSciences executor, which is deprecated by GCP and will not be available after July 8, 2025.

Steps to reproduce the problem

  • Config (see the sketch below)
    • process.executor = 'google-batch'
    • fusion.enabled = false or omit from config
    • wave.enabled = false or omit from config
  • Run any process that generates at least a few GB of data.
    • See the example process below. Running it with 20GB of data takes ~27 mins with the defaults and ~12 mins with a gsutil-enabled solution I wrote in a fork (~3-4 mins of that is spent generating the file).
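
For clarity, that configuration amounts to something like the following nextflow.config. The project, location, and bucket values are placeholders for illustration only.

// nextflow.config -- minimal setup that hits the default (gcsfuse) staging path
process.executor = 'google-batch'
workDir          = 'gs://my-bucket/work'   // placeholder bucket

google {
    project  = 'my-gcp-project'            // placeholder project
    location = 'us-central1'               // placeholder location
}

// Leave Wave/Fusion disabled (or omit these lines entirely)
wave.enabled   = false
fusion.enabled = false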

Here's an example process for testing purposes that just writes a file of the given size.

process DUMMY_WRITE {
    label 'process_single'
    // Request twice the output file size as scratch disk.
    disk { (2 * file_size_gb).GB }
    publishDir path: "${params.publish_dir}/", mode: 'copy'

    input:
    val file_size_gb

    output:
    path 'dummy_dir/dummy.txt', emit: ch_dummy

    script:
    """
    # Write a zero-filled file of the requested size in GB.
    echo "Writing a file of size ${file_size_gb}GB."
    mkdir dummy_dir/
    dd if=/dev/zero of=dummy_dir/dummy.txt bs=1G count=${file_size_gb}
    echo "Done writing file."
    """
}
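
A minimal workflow block to drive it might look like this (the value 20 matches the 20GB test mentioned above, and params.publish_dir is assumed to be set in the config):

workflow {
    // Write a 20GB dummy file, then let Nextflow unstage/publish it.
    DUMMY_WRITE(20)
}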

Program output

Here are the performance characteristics that caused me to look into this issue. I have a workflow that performs the following steps:

  1. Localizes data:
     • Method: Manually staged using gsutil due to the known bucket underscore issue (see related issues #3619, #1069, #1527, and the sketch after this list).
     • Size: ~300 GB.
     • Duration: ~8 minutes.
  2. Runs some code:
     • Duration: ~2 hours.
  3. Delocalizes data:
     • Method: Nextflow's built-in gcsfuse support. Files are moved to the workdir only (no publishing).
     • Size: ~600 GB.
     • Duration: ~11 hours. A comparable upload using gsutil takes ~15 minutes.
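
For reference, "manually staged" in step 1 means copying the inputs inside the process script with gsutil rather than declaring them as path inputs. A hypothetical sketch of that approach (bucket and paths are placeholders, not my actual workflow):

process MANUAL_STAGE {
    input:
    val gcs_input_dir   // e.g. 'gs://my_bucket_with_underscores/inputs' (placeholder)

    output:
    path 'staged'

    script:
    """
    # Copy inputs with gsutil instead of relying on Nextflow's built-in staging,
    # which is problematic for bucket names containing underscores.
    mkdir -p staged
    gsutil -m cp -r ${gcs_input_dir}/* staged/
    """
}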

Environment

  • Nextflow version: 24.10.0
  • Java version: openjdk 11.0.25 2024-10-15
  • Operating system: Linux
  • Bash version: zsh 5.8.1 (x86_64-ubuntu-linux-gnu)

Additional context

See this Slack discussion for additional context.
