Bug report

I am documenting this problem as a known issue, with discussion here in Slack.
There is no plan to fix this issue (per the Slack discussion, the proposed solution is "too clunky"), but I would like to document it so that others who encounter it can chime in and/or save time debugging why their Google Batch pipelines take so long to complete compared to other executors.
Expected behavior and actual behavior
Expected behavior:
File staging and unstaging using the default, built-in support for the Google Batch executor should complete in a reasonable amount of time. "Reasonable" means transfer speeds comparable to standard GCP file transfer achieved by using, e.g., gcloud storage or gsutil.
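As a point of reference, that baseline can be measured with a direct copy; the bucket and paths below are placeholders:

# Parallel recursive copy with gsutil (-m enables parallel transfers)
gsutil -m cp -r dummy_dir/ gs://my-bucket/scratch/
# Equivalent copy with the newer gcloud storage CLI (parallel by default)
gcloud storage cp -r dummy_dir/ gs://my-bucket/scratch/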
Actual behavior:
File unstaging/delocalization using the built-in support in the Google Batch executor is extremely slow, e.g., taking ~11 hours for 600 GB of data. This is because it uses gcsfuse, which is significantly slower than the gsutil-based approach used by the Google Life Sciences executor (via the nxf_gs_upload function in .command.run).
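For comparison, the Life Sciences executor's upload path amounts to a parallel gsutil copy. A minimal sketch of that approach (not the verbatim function from .command.run) looks like this:

# Simplified sketch of a gsutil-based upload helper, in the spirit of
# nxf_gs_upload; the real function in .command.run differs in its details.
nxf_gs_upload_sketch() {
    local name=$1      # local file or directory produced by the task
    local target=$2    # destination work dir, e.g. gs://bucket/work/ab/12345
    gsutil -m -q cp -R "$name" "$target/$name"
}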
The officially recommended solution is to use Wave/Fusion; however, that solution is not appropriate for all users, especially those who cannot allow the injection of third-party services into their workflows.
This is especially problematic because (1) poor performance is effectively the default behavior for the executor (Wave/Fusion are the recommended solution but are not enabled by default), and (2) it represents a regression relative to the Life Sciences executor, which GCP has deprecated and which will no longer be available after July 8, 2025.
Steps to reproduce the problem
Config
process.executor = 'google-batch'
fusion.enabled = false   // or omit from config
wave.enabled = false     // or omit from config
Run any process that generates at least a few GB of data.
See the example process below. Running it with 20 GB of data takes ~27 minutes with the defaults and ~12 minutes with a gsutil-enabled solution I wrote in a fork (~3-4 minutes of which is spent generating the file).
Here's an example process for testing purposes that just writes a file of the given size.
process DUMMY_WRITE {
    label 'process_single'

    // Request twice the file size as disk; note that directives must
    // precede the input/output blocks.
    disk { "${2 * file_size_gb} GB" }

    publishDir(
        path: "${params.publish_dir}/",
        mode: 'copy',
    )

    input:
    val file_size_gb

    output:
    path 'dummy_dir/dummy.txt', emit: ch_dummy

    script:
    """
    # Write a file of size file_size_gb
    echo "Writing a file of size ${file_size_gb}GB."
    mkdir dummy_dir/
    dd if=/dev/zero of=dummy_dir/dummy.txt bs=1G count=${file_size_gb}
    echo "Done writing file."
    """
}
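A minimal sketch of driving this test, assuming the process above is wired into a main.nf that passes a file size parameter to it and that the config above is saved as batch.config (all file names and the bucket path here are hypothetical):

# Hypothetical invocation; bucket path and params are placeholders.
nextflow run main.nf -c batch.config \
    --file_size_gb 20 \
    --publish_dir gs://my-bucket/results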
Program output
Here are the performance characteristics that caused me to look into this issue. I have a workflow that performs the following steps:
Localizes data:
- Method: manually staged using gsutil due to the known bucket underscore issue (see related issues: #3619, #1069, #1527); a sketch of this step follows the list.
- Size: ~300 GB.
- Duration: ~8 minutes.
Runs some code:
- Duration: ~2 hours.
Delocalizes data:
- Method: Nextflow's built-in gcsfuse support. Files are moved to the workdir only (no publishing).
- Duration: for comparison, the equivalent transfer using gsutil takes ~15 minutes.
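A minimal sketch of the manual staging step above, assuming a bucket whose name contains an underscore (bucket and paths are placeholders):

# Hypothetical manual staging from inside the task, used instead of
# Nextflow's normal input staging because of the bucket underscore issue.
gsutil -m cp -r gs://my_bucket_name/inputs/ ./inputs/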
Additional context
See this Slack discussion for additional context.