(working-with-files)=
To access and work with files, use the `file()` method, which returns a file system object given a file path string:
myFile = file('some/path/to/my_file.file')
The `file()` method can reference both files and directories, depending on what the string path refers to in the file system.
When using the wildcard characters `*`, `?`, `[]` and `{}`, the argument is interpreted as a glob path matcher and the `file()` method returns a list object holding the paths of files whose names match the specified pattern, or an empty list if no match is found:
listOfFiles = file('some/path/*.fa')
:::{note}
The `file()` method does not return a list if only one file is matched. Use the `files()` method to always return a list.
:::
:::{note}
A double asterisk (`**`) in a glob pattern works like `*` but also searches through subdirectories.
:::
By default, wildcard characters do not match directories or hidden files. For example, to include hidden files in the result list, enable the `hidden` option:
listWithHidden = file('some/path/*.fa', hidden: true)
:::{note}
To compose paths, use the `resolve()` method or the `/` operator instead of string interpolation:
def dir = file('s3://bucket/some/data/path')
def sample1 = dir.resolve('sample.bam')  // correct
def sample2 = dir / 'sample.bam'         // correct
def sample3 = file("$dir/sample.bam")    // correct (but verbose)
def sample4 = "$dir/sample.bam"          // incorrect: yields a string, not a Path
:::
The `file()` method returns a `Path`, which has several methods for retrieving metadata about the file:
def path = file('/some/path/file.txt')
assert path.baseName == 'file'
assert path.extension == 'txt'
assert path.name == 'file.txt'
assert path.parent.toString() == '/some/path'
:::{tip}
When calling an object method, any method that looks like `get*()` can also be accessed as a field. For example, `path.getName()` is equivalent to `path.name`, `path.getBaseName()` is equivalent to `path.baseName`, and so on.
:::
See the {ref}`stdlib-types-path` reference for the list of available methods.
Given a file variable created with the `file()` method as shown previously, you can read the file by getting its `text` property, which returns the file content as a string:
print myFile.text
Similarly, you can save a string to a file by assigning it to the file's `text` property:
myFile.text = 'Hello world!'
Binary data can be managed in the same way, using the file property `bytes` instead of `text`. The following example reads the file and returns its content as a byte array:
binaryContent = myFile.bytes
Or you can save a byte array to a file:
myFile.bytes = binaryContent
:::{note}
The above assignment overwrites any existing file contents, and implicitly creates the file if it doesn't exist.
:::
:::{warning}
The above methods read and write the entire file contents at once, in a single variable or buffer. When dealing with large files, use a more memory-efficient approach, such as reading or writing the file line by line or using a fixed-size buffer.
:::
To append a string value to a file without erasing existing content, use the `append()` method:
myFile.append('Add this line\n')
Or use the left shift operator, a more idiomatic way to append text content to a file:
myFile << 'Add a line more\n'
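The write and append operations above can be combined into a small round trip. The following is a minimal sketch in plain Groovy, assuming the file object behaves like a standard `java.nio.file.Path` with Groovy's NIO extensions. The temporary file is hypothetical and only stands in for a path returned by `file()`:

```groovy
import java.nio.file.Files

// Hypothetical temporary file standing in for a path returned by file()
def myFile = Files.createTempFile('greeting', '.txt')

// Assigning to `text` overwrites any existing content
myFile.text = 'Hello world!\n'

// Both forms append without erasing what is already there
myFile.append('Add this line\n')
myFile << 'Add a line more\n'

print myFile.text
```

Note that neither `append()` nor `<<` adds a line terminator, so each appended string carries its own `\n`.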
To read a text file line by line, use the `readLines()` method provided by the file object, which returns the file content as a list of strings:
myFile = file('some/my_file.txt')
allLines = myFile.readLines()
for( line : allLines ) {
    println line
}
This can also be written in a more idiomatic syntax:
file('some/my_file.txt')
    .readLines()
    .each { println it }
:::{warning}
The `readLines()` method reads the entire file at once and returns a list containing all the lines. For this reason, do not use it to read big files.
:::
To process a big file, use the `eachLine()` method, which reads only a single line at a time into memory:
count = 0
myFile.eachLine { str ->
    println "line ${count++}: $str"
}
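Because `eachLine()` hands over one line at a time, it suits tasks such as counting or filtering records in a large file. The following is a minimal sketch in plain Groovy, assuming the file object behaves like a standard `java.nio.file.Path` with Groovy's NIO extensions. The temporary file and its FASTA-like content are hypothetical:

```groovy
import java.nio.file.Files

// Hypothetical temporary file standing in for a path returned by file()
def myFile = Files.createTempFile('records', '.txt')
myFile.text = '>seq1\nACGT\n>seq2\nGGCC\n'

// Count header lines; only the current line is held in memory
def headers = 0
myFile.eachLine { str ->
    if( str.startsWith('>') ) headers++
}

println "found ${headers} header lines"
```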
The classes `Reader` and `InputStream` provide fine-grained control for reading text and binary files, respectively.
The `newReader()` method creates a `Reader` object for the given file that allows you to read the content as single characters, lines or arrays of characters:
myReader = myFile.newReader()
String line
// compare against null explicitly so that an empty line does not end the loop
while( (line = myReader.readLine()) != null ) {
    println line
}
myReader.close()
The `withReader()` method works similarly, but automatically calls the `close()` method for you when you have finished processing the file. So, the previous example can be written more simply as:
myFile.withReader {
    String line
    // compare against null explicitly so that an empty line does not end the loop
    while( (line = it.readLine()) != null ) {
        println line
    }
}
The `newInputStream()` and `withInputStream()` methods work similarly. The main difference is that they create an `InputStream` object, which is useful for reading binary data.
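As an illustration of the fixed-size buffer approach for binary data, the following is a minimal sketch in plain Groovy, assuming the file object behaves like a standard `java.nio.file.Path` with Groovy's NIO extensions. The temporary file, its size, and the 4096-byte buffer are arbitrary choices for the example:

```groovy
import java.nio.file.Files

// Hypothetical temporary file standing in for a path returned by file()
def myFile = Files.createTempFile('payload', '.bin')
myFile.bytes = new byte[10000]

long total = 0
myFile.withInputStream { stream ->
    // Reuse one small buffer instead of loading the whole file
    byte[] buffer = new byte[4096]
    int n
    // read() returns the number of bytes read, or -1 at end of stream
    while( (n = stream.read(buffer)) != -1 ) {
        total += n
    }
}

println "read ${total} bytes"
```

The stream is closed automatically when the closure returns, so no explicit `close()` call is needed.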
See the {ref}`stdlib-types-path` reference for the list of available methods.
The `Writer` and `OutputStream` classes provide fine-grained control for writing text and binary files, respectively, including low-level operations for single characters or bytes, and support for big files.
For example, given two file objects `sourceFile` and `targetFile`, the following code copies the first file's content into the second file, replacing all `U` characters with `X`:
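sourceFile.withReader { source ->
    targetFile.withWriter { target ->
        String line
        // compare against null explicitly so that an empty line does not end the loop
        while( (line = source.readLine()) != null ) {
            // readLine() strips the line terminator, so write it back
            target << line.replaceAll('U', 'X') << '\n'
        }
    }
}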
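To make the copy-and-replace pattern concrete, the following is a self-contained sketch in plain Groovy, assuming the file objects behave like standard `java.nio.file.Path` values with Groovy's NIO extensions. The temporary files and sample content are hypothetical. The loop compares against `null` explicitly so that a blank line does not end it early, and it writes back the line terminator that `readLine()` strips:

```groovy
import java.nio.file.Files

// Hypothetical temporary files standing in for sourceFile and targetFile
def sourceFile = Files.createTempFile('source', '.txt')
def targetFile = Files.createTempFile('target', '.txt')
sourceFile.text = 'AUGC\n\nUUAA\n'

sourceFile.withReader { source ->
    targetFile.withWriter { target ->
        String line
        while( (line = source.readLine()) != null ) {
            target << line.replaceAll('U', 'X') << '\n'
        }
    }
}

print targetFile.text
```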
See the {ref}`stdlib-types-path` reference for the list of available methods.
Methods for performing filesystem operations such as copying, deleting, and directory listing are documented in the {ref}`stdlib-types-path` reference.
The simplest way to list a directory is to use `list()` or `listFiles()`, which return a collection of first-level elements (files and directories) of a directory:
for( def item : file('any/path').list() ) {
    println item
}
Additionally, the `eachFile()` method allows you to iterate through the first-level elements only (just like `listFiles()`). As with other `each*()` methods, `eachFile()` takes a closure as a parameter:
myDir.eachFile { item ->
    if( item.isFile() ) {
        println "${item.getName()} - size: ${item.size()}"
    }
    else if( item.isDirectory() ) {
        println "${item.getName()} - DIR"
    }
}
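The first-level-only behavior of `eachFile()` can be checked with a small directory tree. The following is a minimal sketch in plain Groovy, assuming `myDir` behaves like a standard `java.nio.file.Path` with Groovy's NIO extensions. The directory layout is hypothetical, and the sketch uses `java.nio.file.Files.isDirectory()` rather than the `isDirectory()` method shown above so that it runs outside Nextflow:

```groovy
import java.nio.file.Files

// Hypothetical directory tree: one file, one subdirectory with a nested file
def myDir = Files.createTempDirectory('listing')
Files.createFile(myDir.resolve('a.txt'))
Files.createDirectory(myDir.resolve('sub'))
Files.createFile(myDir.resolve('sub').resolve('nested.txt'))

// Collect first-level entries, marking directories with a trailing slash
def names = []
myDir.eachFile { item ->
    def suffix = Files.isDirectory(item) ? '/' : ''
    names << (item.fileName.toString() + suffix)
}

println names.sort()
```

The nested file does not appear in the result because `eachFile()` does not descend into subdirectories.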
In general, you should not need to manually copy files, because Nextflow will automatically stage files in and out of the task environment based on the definition of process inputs and outputs. Ideally, any operation which transforms files should be encapsulated in a process, in order to leverage Nextflow's staging capabilities as much as possible.
Nextflow works with many types of remote files and objects using the same interface as for local files. The following protocols are supported:
- HTTP(S)/FTP (`http://`, `https://`, `ftp://`)
- Amazon S3 (`s3://`)
- Azure Blob Storage (`az://`)
- Google Cloud Storage (`gs://`)
To reference a remote file, simply specify the URL when opening the file:
pdb = file('http://files.rcsb.org/header/5FID.pdb')
It can then be used in the same way as a local file:
println pdb.text
:::{note}
Not all operations are supported for all protocols. For example, writing and directory listing are not supported for HTTP(S) and FTP paths.
:::
:::{note}
Additional configuration may be necessary for cloud object storage, such as authenticating with a private bucket. See the documentation for each cloud storage provider for further details.
:::
When a process input file resides on a different file system than the work directory, Nextflow copies the file into the work directory using an appropriate Java SDK.
Remote files are staged in a subdirectory of the work directory with the form `stage-<session-id>/<hash>/<filename>`, where `<hash>` is determined by the remote file path. If multiple tasks request the same remote file, the file will be downloaded once and reused by each task. These files can be reused by resumed runs with the same session ID.
:::{note}
Remote file staging can be a bottleneck during large-scale runs, particularly when input files are stored in object storage but need to be staged in a shared filesystem work directory. This bottleneck occurs because Nextflow handles all of these file transfers.
To mitigate this, you can implement a custom process to download the required files, allowing you to stage multiple files efficiently through parallel jobs. Files should be given as a `val` input instead of a `path` input to bypass Nextflow's built-in remote file staging.
Alternatively, use {ref}`fusion-page` with the work directory set to object storage. In this case, tasks can access remote files directly without any prior staging, eliminating the bottleneck.
:::