(fusion-page)=

Fusion file system

:::{versionadded} 22.10.0 :::

:::{versionadded} 23.02.0-edge Support for Google Cloud Storage. :::

Introduction

Fusion is a distributed virtual file system for cloud-native data pipelines, optimised for Nextflow workloads.

It bridges the gap between cloud-native storage and data analysis workflows by implementing a thin client that allows any existing application to access object storage using the standard POSIX interface, thus simplifying and speeding up most operations. Currently it supports AWS S3, Google Cloud Storage and Azure Blob containers.

Getting started

The Fusion file system implements a lazy download and upload algorithm that runs in the background to transfer files in parallel to and from object storage into a container-local temporary folder. This means that the performance of the disk volume used to carry out your computation is key to achieving maximum performance.

By default, Fusion uses the container /tmp directory as a temporary cache, so the size of the volume can be much smaller than the actual needs of your pipeline processes. Fusion has a built-in garbage collector that constantly monitors the remaining disk space in the temporary folder and immediately evicts old cached entries when necessary.

Requirements

Fusion file system is designed to work with containerised workloads. It therefore requires a container engine such as Docker, or a container-native platform such as AWS Batch or Kubernetes, for the execution of your pipeline. It also requires the use of {ref}Wave containers<wave-page>.

Azure Cloud

Fusion provides built-in support for Azure Blob Storage when running in Azure Cloud.

The support for Azure does not require any specific setting other than enabling Wave and Fusion in your Nextflow configuration. For example:

fusion.enabled = true
wave.enabled = true
process.executor = 'azure-batch'
tower.accessToken = '<your platform access token>' // optional

Then run your pipeline using the usual command:

nextflow run <your pipeline> -work-dir az://<your blob container>/scratch

Azure machines come with fast SSDs attached, so no additional storage configuration is required. However, it is recommended to use machine types with larger data disks attached, denoted by the suffix d after the core number (e.g. Standard_E32*d*_v5). These increase the throughput of Fusion and reduce the chance of overloading the machine.
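As an illustration of the recommendation above, the following sketch selects a d-suffixed machine type for an automatically created Azure Batch pool. It assumes that Nextflow is allowed to create the pool (`autoPoolMode` and `allowPoolCreation`) and that the Azure Batch and Storage account credentials are configured elsewhere. The VM size shown is only an example, so pick one that is available in your region and quota:

```groovy
fusion.enabled = true
wave.enabled = true
process.executor = 'azure-batch'

azure {
    batch {
        // Assumption: Nextflow creates and manages the pool for this run
        autoPoolMode = true
        allowPoolCreation = true
        pools {
            auto {
                // Example VM size with a local data disk (note the "d" suffix)
                vmType = 'Standard_E32d_v5'
            }
        }
    }
}
```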

AWS Cloud

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the AWS Batch executor. Using Fusion removes the need to create and configure a custom AMI that includes the aws command line tool when setting up the AWS Batch compute environment.

The configuration for this deployment scenario looks like the following:

fusion.enabled = true
wave.enabled = true
process.executor = 'awsbatch'
process.queue = '<YOUR BATCH QUEUE>'
aws.region = '<YOUR AWS REGION>'
tower.accessToken = '<your platform access token>' // optional

Then you can run your pipeline using the following command:

nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch

For best performance, make sure to use instance types that provide an NVMe disk as instance storage. If you are creating the AWS Batch compute environment yourself, you will need to make sure the NVMe disk is properly formatted (see below).

NVMe storage

The recommended setup to get maximum performance is to mount an NVMe disk as the temporary folder and run the pipeline with the {ref}scratch <process-scratch> directive set to false to also avoid stage-out transfer time.

Example configuration for using AWS Batch with NVMe disks to maximize performance:

aws.batch.volumes = '/path/to/ec2/nvme:/tmp'
process.scratch = false

:::{tip} Seqera Platform is able to automatically format and configure the NVMe instance storage by enabling the option "Use Fast storage" when creating the Batch compute environment. :::

:::{tip} As an alternative to configuring NVMe storage on your compute node, you can use an EBS gp3 volume with a throughput of 325 MiB/s (or more) and a size of 100 GiB (or larger). While slower than NVMe storage, this configuration provides sufficient performance for many workloads. :::

AWS IAM permissions

The AWS S3 bucket should be configured with the following IAM permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-BUCKET-NAME"
            ]
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectTagging",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR-BUCKET-NAME/*"
            ],
            "Effect": "Allow"
        }
    ]
}

Google Cloud

Fusion provides built-in support for Google Storage when running in Google Cloud.

The support for Google does not require any specific setting other than enabling Wave and Fusion in your Nextflow configuration. For example:

fusion.enabled = true
wave.enabled = true
process.executor = 'google-batch'
tower.accessToken = '<your platform access token>' // optional

Then run your pipeline using the usual command:

nextflow run <your pipeline> -work-dir gs://<your google bucket>/scratch

When using Fusion, if process.disk is not set, Nextflow will attach a single local SSD disk to the machine. The size of this disk can be much smaller than the actual needs of your pipeline processes because Fusion uses it only as a temporary cache. Fusion is also compatible with other types of process.disk, but better performance is achieved when using local SSD disks.
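If you prefer to request the local SSD explicitly rather than rely on the default, a configuration along the following lines may be used. This is a sketch, not a required setting: the map form of `process.disk` and the 375 GB size (the size of a single Google Cloud local SSD) are assumptions to verify against the Nextflow version you are running:

```groovy
fusion.enabled = true
wave.enabled = true
process.executor = 'google-batch'

// Assumption: request one local SSD explicitly; Fusion uses it only as a cache
process.disk = [request: 375.GB, type: 'local-ssd']
```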

Kubernetes

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Kubernetes executor.

Using Fusion removes the need to create and manage a separate persistent volume and shared file system in the Kubernetes cluster.

The configuration for this deployment scenario looks like the following:

fusion.enabled = true
wave.enabled = true
process.executor = 'k8s'
k8s.context = '<YOUR K8S CONFIGURATION CONTEXT>'
k8s.namespace = '<YOUR K8S NAMESPACE>'
k8s.serviceAccount = '<YOUR K8S SERVICE ACCOUNT>'
tower.accessToken = '<your platform access token>' // optional

The k8s.context represents the Kubernetes configuration context to be used for the pipeline execution. This setting can be omitted if Nextflow itself is running as a pod in the Kubernetes cluster.

The k8s.namespace represents the Kubernetes namespace where the jobs submitted by the pipeline execution should be executed.

The k8s.serviceAccount represents the Kubernetes service account that should be used to grant the execution permission to jobs launched by Nextflow. You can find more details on how to configure it at the following link.

Having the above configuration in place, you can run your pipeline using the following command:

nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch

:::{note} You can also use Fusion and Kubernetes with Azure Blob Storage and Google Storage using the same deployment approach. :::

Local execution with AWS S3

Fusion file system allows the use of an S3 bucket as a pipeline work directory with the Nextflow local executor. This configuration requires the use of Docker (or similar container engine) for the execution of your pipeline tasks.

The AWS S3 bucket credentials should be made accessible via the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

The following configuration should be added in your Nextflow configuration file:

docker.enabled = true
fusion.enabled = true
fusion.exportStorageCredentials = true
wave.enabled = true
tower.accessToken = '<your platform access token>' // optional

Then you can run your pipeline using the following command:

nextflow run <YOUR PIPELINE> -work-dir s3://<YOUR BUCKET>/scratch

Replace <YOUR PIPELINE> and <YOUR BUCKET> with a pipeline script and bucket of your choice, for example:

nextflow run https://github.com/nextflow-io/rnaseq-nf -work-dir s3://nextflow-ci/scratch

:::{warning} The option fusion.exportStorageCredentials leaks the AWS credentials to the task launcher script created by Nextflow. This option should only be used for testing and development purposes. :::
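One way to limit the exposure described in the warning above is to confine the option to a dedicated configuration profile, so that it only applies when explicitly requested. The following is a minimal sketch; the profile name `localdev` is hypothetical:

```groovy
// Settings shared by every run
docker.enabled = true
fusion.enabled = true
wave.enabled = true

profiles {
    // Hypothetical profile, intended for local testing only
    localdev {
        fusion.exportStorageCredentials = true
    }
}
```

With this layout, credentials are exported only when the pipeline is launched with `-profile localdev`.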

Local execution with Minio

Minio is an open-source, enterprise-grade object storage system compatible with AWS S3. Nextflow and Fusion can use Minio (or other S3-compatible object storage services) as an alternative to AWS S3 in some deployment scenarios.

This configuration requires the use of Nextflow local execution and Docker (or a similar container engine) for the execution of your pipeline tasks.

For the sake of this example, run a local instance of Minio using this command:

docker run -p 9000:9000 \
    --rm -d -p 9001:9001 \
    -e "MINIO_ROOT_USER=admin" \
    -e "MINIO_ROOT_PASSWORD=secret" \
    quay.io/minio/minio server /data --console-address ":9001"

Open the Minio console in your browser at http://localhost:9001, then create a credentials pair and a bucket. For the sake of this example, the bucket name foobar will be used.

The following configuration should be added in your Nextflow configuration file:

aws.accessKey = '<YOUR MINIO ACCESS KEY>'
aws.secretKey = '<YOUR MINIO SECRET KEY>'
aws.client.endpoint = 'http://localhost:9000'
aws.client.s3PathStyleAccess = true
wave.enabled = true
fusion.enabled = true
fusion.exportStorageCredentials = true
docker.enabled = true
tower.accessToken = '<your platform access token>' // optional

Then you can run your pipeline using the following command:

nextflow run <YOUR PIPELINE> -work-dir s3://foobar/scratch

Replace <YOUR PIPELINE> with a pipeline script of your choice.

:::{warning} The option fusion.exportStorageCredentials leaks the storage credentials to the task launcher script created by Nextflow. This option should only be used for testing and development purposes. :::

Local execution with Oracle Object Storage

Fusion file system and Nextflow are compatible with Oracle Object Storage.

:::{note} This capability relies on the S3-compatible API provided by Oracle storage rather than on native support in Nextflow and Fusion. As such, it may not work fully or support all Nextflow and Fusion features. :::

This configuration requires the execution of your pipeline tasks using Docker or a similar container engine.

The following should be included in your Nextflow configuration file:

aws.region = '<YOUR_REGION>'
aws.accessKey = '<YOUR_ACCESS_KEY>'
aws.secretKey = '<YOUR_SECRET_KEY>'
aws.client.endpoint = 'https://<YOUR_BUCKET_NAMESPACE>.compat.objectstorage.<YOUR_REGION>.oraclecloud.com'
aws.client.s3PathStyleAccess = true
aws.client.protocol = 'https'
aws.client.signerOverride = 'AWSS3V4SignerType'
docker.enabled = true
docker.containerOptions = '-e FUSION_AWS_REGION=<YOUR_REGION>'
fusion.enabled = true
fusion.exportStorageCredentials = true
wave.enabled = true
tower.accessToken = '<YOUR_PLATFORM_ACCESS_TOKEN>' // optional

Then you can run your pipeline using the following command:

nextflow run <YOUR_PIPELINE> -work-dir s3://<YOUR_BUCKET>/scratch

In the above snippet replace the placeholders <YOUR_ACCESS_KEY> and <YOUR_SECRET_KEY> with your Oracle Customer Secret Key, and the placeholders <YOUR_BUCKET_NAMESPACE> and <YOUR_REGION> with the namespace and region of your Oracle bucket.

:::{warning} The fusion.exportStorageCredentials option leaks the Oracle credentials to the Nextflow task launcher script and should only be used for testing and development purposes. :::

Advanced settings

Fusion advanced configuration settings are described in the {ref}Fusion <config-fusion> section on the Nextflow configuration page.
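As a small illustration of how such settings are written, the sketch below raises the verbosity of the Fusion client log for troubleshooting. The option name `fusion.logLevel` and the value `'debug'` are assumptions here, so check them against the configuration page for your Nextflow version:

```groovy
fusion {
    enabled = true
    // Assumption: increases the detail of the Fusion client log
    logLevel = 'debug'
}
```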