(fusion-page)=
:::{versionadded} 22.10.0
:::

:::{versionadded} 23.02.0-edge
Support for Google Cloud Storage.
:::
Fusion is a distributed virtual file system for cloud-native data pipelines, optimised for Nextflow workloads.

It bridges the gap between cloud-native storage and data analysis workflows by implementing a thin client that allows any existing application to access object storage using the standard POSIX interface, thus simplifying and speeding up most operations. It currently supports AWS S3, Google Cloud Storage and Azure Blob containers.
The Fusion file system implements a lazy download and upload algorithm that runs in the background to transfer files in parallel to and from object storage into a container-local temporary folder. This means that the performance of the disk volume used to carry out your computation is key to achieving maximum performance.
By default, Fusion uses the container `/tmp` directory as a temporary cache, so the size of the volume can be much lower than the actual needs of your pipeline processes. Fusion has a built-in garbage collector that constantly monitors remaining disk space on the temporary folder and immediately evicts old cached entries when necessary.
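
The eviction behaviour described above can be pictured with a short Python sketch. It is not Fusion's actual implementation, which is internal to the Fusion client. The function name `evict_oldest`, the `min_free_bytes` threshold, and the oldest-access-first policy are all assumptions chosen for illustration.

```python
import shutil
from pathlib import Path


def evict_oldest(cache_dir, min_free_bytes, free_bytes=None):
    """Delete least-recently-accessed files until enough space is free.

    Hypothetical illustration only: the real Fusion garbage collector
    may use a different policy. `free_bytes` lets a caller inject the
    current free space; by default it is read from the file system.
    """
    cache_dir = Path(cache_dir)
    if free_bytes is None:
        free_bytes = shutil.disk_usage(cache_dir).free

    # Oldest access time first, so stale entries go before recent ones.
    entries = sorted(
        (p for p in cache_dir.rglob("*") if p.is_file()),
        key=lambda p: p.stat().st_atime,
    )

    evicted = []
    for path in entries:
        if free_bytes >= min_free_bytes:
            break  # enough space reclaimed, keep the remaining entries
        size = path.stat().st_size
        path.unlink()
        free_bytes += size
        evicted.append(path.name)
    return evicted
```

Because only the oldest entries are removed, files that tasks are still reading or writing tend to stay in the cache, which is why the cache volume can be smaller than the total data a pipeline touches.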
Fusion file system is designed to work with containerised workloads. It therefore requires a container engine such as Docker, or a container-native platform such as AWS Batch or Kubernetes, for the execution of your pipeline. It also requires the use of {ref}`Wave containers<wave-page>`.
Fusion provides built-in support for Azure Blob Storage when running in Azure Cloud.
The support for Azure does not require any specific setting other than enabling Wave and Fusion in your Nextflow configuration. For example:
fusion.enabled = true
wave.enabled = true
process.executor = 'azure-batch'
tower.accessToken = '<your platform access token>' // optional
Then run your pipeline using the usual command:
nextflow run <your pipeline> -work-dir az://<your blob container>/scratch
Azure machines come with fast SSDs attached, so no additional storage configuration is required. However, it is recommended to use machine types with larger data disks attached, denoted by the suffix `d` after the core number (e.g. `Standard_E32d_v5`). These will increase the throughput of Fusion and reduce the chance of overloading the machine.
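
As a rough aid for picking machine types, the following Python sketch checks whether a VM size name carries the `d` suffix after the core number. The function name `has_d_suffix` is hypothetical, and the regular expression encodes an assumed name layout (`Standard_`, family letters, core count, optional feature letters, optional `_version`), not an official Azure parser.

```python
import re

# Assumed layout: Standard_<FAMILY><cores>[-<constrained cores>]<features>[_<rest>]
_VM_SIZE = re.compile(r"^Standard_([A-Z]+)(\d+)(?:-\d+)?([a-z]*)(?:_.+)?$")


def has_d_suffix(vm_size):
    """Return True if the feature letters after the core number include 'd'."""
    match = _VM_SIZE.match(vm_size)
    if match is None:
        raise ValueError(f"Unrecognised VM size name: {vm_size!r}")
    return "d" in match.group(3)
```

For instance, `has_d_suffix("Standard_E32d_v5")` is true while `has_d_suffix("Standard_E32_v5")` is false. Names that do not follow the assumed layout raise a `ValueError` rather than being guessed at.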
Fusion file system allows the use of an S3 bucket as a pipeline work directory with the AWS Batch executor.
Fusion removes the need to create and configure a custom AMI that includes the `aws` command line tool when setting up the AWS Batch compute environment.
The configuration for this deployment scenario looks like the following:
fusion.enabled = true
wave.enabled = true
process.executor = 'awsbatch'
process.queue = '<YOUR BATCH QUEUE>'
aws.region = '<YOUR AWS REGION>'
tower.accessToken = '<your platform access token>' // optional
Then you can run your pipeline using the following command:
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
For best performance, use instance types that provide an NVMe disk as instance storage. If you are creating the AWS Batch compute environment yourself, you will need to make sure the NVMe disk is properly formatted (see below).

The recommended setup to get maximum performance is to mount an NVMe disk as the temporary folder and run the pipeline with the {ref}`scratch <process-scratch>` directive set to `false` to also avoid stage-out transfer time.
Example configuration for using AWS Batch with NVMe disks to maximize performance:
aws.batch.volumes = '/path/to/ec2/nvme:/tmp'
process.scratch = false
:::{tip}
Seqera Platform can automatically format and configure the NVMe instance storage when the "Use Fast storage" option is enabled while creating the Batch compute environment.
:::
:::{tip}
As an alternative to configuring NVMe storage on your compute node, you can use an EBS `gp3` volume with a throughput of 325 MiB/s (or more) and a size of 100 GiB (or larger). While slower than NVMe storage, this configuration provides sufficient performance for many workloads.
:::
The AWS S3 bucket should be configured with the following IAM permissions:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-BUCKET-NAME"
            ]
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectTagging",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-BUCKET-NAME/*"
            ],
            "Effect": "Allow"
        }
    ]
}
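
If you manage several buckets, it can help to generate this policy rather than edit it by hand. The Python sketch below builds the same two statements for a given bucket name. The function name `fusion_bucket_policy` and the example bucket name are hypothetical; the actions and ARN patterns are copied from the policy above.

```python
import json


def fusion_bucket_policy(bucket_name):
    """Return the IAM policy shown above for one bucket, as a dict."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level permission: applies to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object-level permissions: apply to every key in the bucket.
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:PutObjectTagging",
                    "s3:DeleteObject",
                ],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "my-nextflow-bucket" is a placeholder bucket name.
    print(json.dumps(fusion_bucket_policy("my-nextflow-bucket"), indent=4))
```

Note the split between the two resources: `s3:ListBucket` targets the bucket ARN, while the object actions target the `/*` pattern beneath it.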
Fusion provides built-in support for Google Storage when running in Google Cloud.
The support for Google does not require any specific setting other than enabling Wave and Fusion in your Nextflow configuration. For example:
fusion.enabled = true
wave.enabled = true
process.executor = 'google-batch'
tower.accessToken = '<your platform access token>' // optional
Then run your pipeline using the usual command:
nextflow run <your pipeline> -work-dir gs://<your google bucket>/scratch
When using Fusion, if the `process.disk` directive is not set, Nextflow will attach a single local SSD disk to the machine. The size of this disk can be much lower than the actual needs of your pipeline processes because Fusion uses it only as a temporary cache. Fusion is also compatible with other types of `process.disk`, but better performance is achieved when using local SSD disks.
Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Kubernetes executor.
Fusion removes the need to create and manage a separate persistent volume and shared file system in the Kubernetes cluster.
The configuration for this deployment scenario looks like the following:
fusion.enabled = true
wave.enabled = true
process.executor = 'k8s'
k8s.context = '<YOUR K8S CONFIGURATION CONTEXT>'
k8s.namespace = '<YOUR K8S NAMESPACE>'
k8s.serviceAccount = '<YOUR K8S SERVICE ACCOUNT>'
tower.accessToken = '<your platform access token>' // optional
The `k8s.context` setting represents the Kubernetes configuration context to be used for the pipeline execution. This setting can be omitted if Nextflow itself is running as a pod in the Kubernetes cluster.

The `k8s.namespace` setting represents the Kubernetes namespace where the jobs submitted by the pipeline execution should be executed.

The `k8s.serviceAccount` setting represents the Kubernetes service account that should be used to grant the execution permission to jobs launched by Nextflow. You can find more details on how to configure it at the following link.
Having the above configuration in place, you can run your pipeline using the following command:
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
:::{note}
You can also use Fusion and Kubernetes with Azure Blob Storage and Google Storage using the same deployment approach.
:::
Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Nextflow local executor. This configuration requires the use of Docker (or similar container engine) for the execution of your pipeline tasks.
The AWS S3 bucket credentials should be made accessible via the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.
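
Before launching a local run, a quick pre-flight check can confirm that both variables are set. The following Python sketch is one way to do that; the helper name `missing_aws_credentials` is hypothetical, and the check only tests for presence without printing or validating the secret values.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env=None):
    """Return the names of required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_credentials()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("AWS credential variables are set.")
```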
The following configuration should be added in your Nextflow configuration file:
docker.enabled = true
fusion.enabled = true
fusion.exportStorageCredentials = true
wave.enabled = true
tower.accessToken = '<your platform access token>' // optional
Then you can run your pipeline using the following command:
nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch
Replace `<YOUR PIPELINE>` and `<YOUR BUCKET>` with a pipeline script and bucket of your choice, for example:
nextflow run https://github.com/nextflow-io/rnaseq-nf -work-dir s3://nextflow-ci/scratch
:::{warning}
The option `fusion.exportStorageCredentials` leaks the AWS credentials to the task launcher script created by Nextflow.
This option should only be used for testing and development purposes.
:::
Minio is an open-source, enterprise-grade object storage service compatible with AWS S3. Nextflow and Fusion can use Minio (or other S3-compatible object storage) as an alternative to AWS S3 in some deployment scenarios.

This configuration requires the use of Nextflow local execution and Docker (or a similar container engine) for the execution of your pipeline tasks.

For the sake of this example, run a local instance of Minio using this command:
docker run -p 9000:9000 \
--rm -d -p 9001:9001 \
-e "MINIO_ROOT_USER=admin" \
-e "MINIO_ROOT_PASSWORD=secret" \
quay.io/minio/minio server /data --console-address ":9001"
Open the Minio console in your browser at http://localhost:9001, then create a credentials pair and a bucket. For the sake of this example, the bucket name `foobar` will be used.
The following configuration should be added in your Nextflow configuration file:
aws.accessKey = '<YOUR MINIO ACCESS KEY>'
aws.secretKey = '<YOUR MINIO SECRET KEY>'
aws.client.endpoint = 'http://localhost:9000'
aws.client.s3PathStyleAccess = true
wave.enabled = true
fusion.enabled = true
fusion.exportStorageCredentials = true
docker.enabled = true
tower.accessToken = '<your platform access token>' // optional
Then you can run your pipeline using the following command:
nextflow run <YOUR PIPELINE> -work-dir s3://foobar/scratch
Replace `<YOUR PIPELINE>` with a pipeline script of your choice.
:::{warning}
The option `fusion.exportStorageCredentials` leaks the AWS credentials to the task launcher script created by Nextflow.
This option should only be used for testing and development purposes.
:::
Fusion file system and Nextflow are compatible with Oracle Object Storage.
:::{note}
This capability relies on the S3-compatible API provided by Oracle storage and not on native support in Nextflow and Fusion. As such, it may not work fully or support all Nextflow and Fusion features.
:::
This configuration requires the execution of your pipeline tasks using Docker or a similar container engine.
The following should be included in your Nextflow configuration file:
aws.region = '<YOUR_REGION>'
aws.accessKey = '<YOUR_ACCESS_KEY>'
aws.secretKey = '<YOUR_SECRET_KEY>'
aws.client.endpoint = 'https://<YOUR_BUCKET_NAMESPACE>.compat.objectstorage.<YOUR_REGION>.oraclecloud.com'
aws.client.s3PathStyleAccess = true
aws.client.protocol = 'https'
aws.client.signerOverride = 'AWSS3V4SignerType'
docker.enabled = true
docker.containerOptions = '-e FUSION_AWS_REGION=<YOUR_REGION>'
fusion.enabled = true
fusion.exportStorageCredentials = true
wave.enabled = true
tower.accessToken = '<YOUR_PLATFORM_ACCESS_TOKEN>' // optional
Then you can run your pipeline using the following command:
nextflow run <YOUR_PIPELINE> -work-dir s3://<YOUR_BUCKET>/scratch
In the above snippet, replace the placeholders `<YOUR_ACCESS_KEY>` and `<YOUR_SECRET_KEY>` with your Oracle Customer Secret Key, and the placeholders `<YOUR_BUCKET_NAMESPACE>` and `<YOUR_REGION>` with the namespace and region of your Oracle bucket.
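
The region appears in three places in that configuration, so deriving the values from one source avoids mismatches. The Python sketch below composes the region-dependent settings from a namespace and a region. The function names and the sample namespace are hypothetical; the endpoint pattern and setting names are taken from the configuration above.

```python
def oracle_s3_endpoint(namespace, region):
    """Build the S3-compatible endpoint URL for an Oracle bucket namespace."""
    namespace, region = namespace.strip(), region.strip()
    if not namespace or not region:
        raise ValueError("Both namespace and region are required")
    return f"https://{namespace}.compat.objectstorage.{region}.oraclecloud.com"


def oracle_region_settings(namespace, region):
    """Return the settings from the snippet above that depend on the region."""
    region = region.strip()
    return {
        "aws.region": region,
        "aws.client.endpoint": oracle_s3_endpoint(namespace, region),
        "docker.containerOptions": f"-e FUSION_AWS_REGION={region}",
    }


if __name__ == "__main__":
    # "mynamespace" and "eu-frankfurt-1" are sample values for illustration.
    for key, value in oracle_region_settings("mynamespace", "eu-frankfurt-1").items():
        print(f"{key} = '{value}'")
```

The access key and secret key are deliberately left out of the helper so that credentials never pass through generated output.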
:::{warning}
The `fusion.exportStorageCredentials` option leaks the Oracle credentials to the Nextflow task launcher script and should only be used for testing and development purposes.
:::
Fusion advanced configuration settings are described in the {ref}`Fusion <config-fusion>` section on the Nextflow configuration page.