Merge branch 'gh-pages' into Episode5-6
mr-c authored Jan 19, 2022
2 parents 5ab060f + b27575a commit 7f67da8
Showing 4 changed files with 83 additions and 63 deletions.
37 changes: 20 additions & 17 deletions _episodes/01-introduction.md
@@ -8,41 +8,44 @@ questions:
- "How do CWL workflows compare to shell workflows?"
- "What are the advantages of using CWL workflows?"
objectives:
- "Understand why you might use CWL instead of a shell script"
keypoints:
- "CWL is a standard for describing workflows based on command-line tools"
- "CWL workflows are written in a subset of YAML"
- "A CWL workflow is more portable than a shell script"
- "CWL supports software containers, supporting reproducibility on different machines"
---

# Common Workflow Language

Computational workflows are widely used for data analysis, enabling rapid innovation and decision making.
_Workflow thinking_ is a form of "conceptualizing processes as recipes and protocols, structured as workflow or dataflow graphs with computational steps,
and subsequently developing tools and approaches for formalizing, analyzing and communicating these process descriptions" ([Gryk & Ludascher, 2017](https://doi.org/10.1353/lib.2017.0018)).

However, the rise in popularity of workflows has been matched by a rise in the number of disparate workflow managers,
each with its own standards for describing tools and workflows, which reduces the portability and interoperability of these workflows.

CWL is a free and open standard for describing command-line tool based workflows[^1].
These standards provide a common, but reduced, set of abstractions that are both used in practice and implemented in many popular workflow systems.
The CWL language is declarative, enabling computational workflows to be constructed from diverse software tools, executing each through their command-line interface.

Previously, researchers might write shell scripts to link together these command-line tools.
Although these scripts might provide a direct means of accessing the tools, writing and maintaining them requires specific knowledge of the system that they will be used on.
Shell scripts are not easily portable, and so researchers can easily end up spending more time maintaining the scripts than carrying out their research.
The aim of CWL is to lower this barrier to using these tools for researchers.

CWL workflows are written in a subset of YAML, with a syntax that does not restrict the amount of detail provided for a tool or workflow.
The execution model is explicit: all required elements of a tool's runtime environment must be specified by the CWL tool-description author.
On top of these basic requirements, authors can also add hints or requirements to the tool description, helping to guide users (and workflow engines) on what resources are needed for a tool.

The CWL standards explicitly support the use of software container technologies, helping ensure that the execution of tools is reproducible.
Data locations are explicitly defined, and working directories are kept separate for each tool invocation.
These standards ensure the portability of tools and workflows, allowing the same workflows to be run on your local machine, or in an HPC or cloud environment, with minimal changes required.
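
As a minimal sketch of how hints and container support might look in a tool description, the example below describes a simple `echo` tool with a `DockerRequirement` and a `ResourceRequirement`. The container image and the resource values are assumptions chosen for illustration; they are not taken from this tutorial.

```yaml
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo

# Hints are advisory: a workflow engine may ignore the ones it cannot satisfy.
# Listing the same entries under `requirements` would make them mandatory.
hints:
  DockerRequirement:
    dockerPull: debian:stable-slim   # assumed image, for illustration only
  ResourceRequirement:
    coresMin: 1                      # assumed value: at least one CPU core
    ramMin: 256                      # assumed value: minimum RAM in mebibytes

inputs:
  message_text:
    type: string
    inputBinding:
      position: 1

outputs: []
```

Because the container and resource needs are written down in the description itself, the same file can be handed to a different machine without editing it.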

# RNA sequencing example

In this tutorial a bioinformatics RNA-sequencing analysis is used as an example. However, no specific domain knowledge is needed to follow along.
RNA-sequencing is a technique which examines the quantity and sequences of RNA in a sample using next-generation sequencing.
The RNA reads are analyzed to measure the relative numbers of different RNA molecules in the sample. This analysis is called differential gene expression.

The process looks like this:
@@ -59,4 +62,4 @@ The different tools necessary for this analysis are already available. In this t

{% include links.md %}

[^1]: M. R. Crusoe, S. Abeln, A. Iosup, P. Amstutz, J. Chilton, N. Tijanić, H. Ménager, S. Soiland-Reyes, B. Gavrilović, C. Goble, The CWL Community (2021): Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language. Communications of the ACM. https://doi.org/10.1145/3486897
95 changes: 56 additions & 39 deletions _episodes/02-shell_to_cwl.md
@@ -33,7 +33,7 @@ requirements for running that workflow. All CWL documents should start with two

~~~
cwlVersion: v1.2
class:
~~~
{: .language-yaml}

@@ -60,8 +60,8 @@ baseCommand: echo
inputs:
  message_text:
    type: string
    inputBinding:
      position: 1
outputs: []
~~~
@@ -93,25 +93,25 @@ INFO Final process status is success
~~~
{: .output}

The output displayed above shows that the program ran successfully and printed its output, `Hello world!`.

Let's take a look at the `echo.cwl` script in more detail.

As explained above, the first two lines are always the same: they define the CWL version and the class of the script.
In this example the class is `CommandLineTool`, and the tool being described is the `echo` command.
The next line, `baseCommand`, contains the command that will be run (`echo`).

~~~
inputs:
  message_text:
    type: string
    inputBinding:
      position: 1
~~~
{: .language-yaml}

This block of code contains the `inputs` section of the tool description. This section provides all the inputs that are needed for running this specific tool.
To run this example we will need to provide a string which will be included on the command line. Each of the inputs has a name, to help us tell them apart; this first input is named `message_text`.
The field `inputBinding` is one way to specify how the input should appear on the command line.
Here the `position` field indicates at which position the input will be on the command line; in this case the `message_text` value will be the first thing added to the command line (after the `baseCommand`, `echo`).
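
To illustrate how `position` orders arguments, here is a sketch of a tool description with two inputs. The input names `greeting` and `recipient` are hypothetical and do not appear elsewhere in this tutorial.

```yaml
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
inputs:
  recipient:               # hypothetical input, declared first on purpose
    type: string
    inputBinding:
      position: 2          # placed second on the command line
  greeting:                # hypothetical input
    type: string
    inputBinding:
      position: 1          # placed first, directly after `echo`
outputs: []
```

With `greeting: Hello` and `recipient: world` in the input file, the command line would be built as `echo Hello world`: the `position` values decide the order, not the order in which the inputs are declared.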

@@ -120,11 +120,11 @@ outputs: []
~~~
{: .language-yaml}
Lastly, the tool description defines its `outputs`. This example doesn't have a formal output:
the text is printed directly in the terminal, so an empty YAML list (`[]`) is used as the output.
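
If you did want to keep the printed text as a formal output, one option is to capture the standard output stream in a file. The sketch below assumes the file name `message.txt` and the output name `message_file`, both chosen for illustration.

```yaml
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
stdout: message.txt        # assumed file name for the captured text
inputs:
  message_text:
    type: string
    inputBinding:
      position: 1
outputs:
  message_file:            # hypothetical output name
    type: stdout           # shorthand for a File holding the captured stdout
```

This is only one way to declare an output; the tutorial's own example keeps the empty list.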

> ## Script order
> To make the script more readable the `inputs` field is put in front of the `outputs` field.
> However, CWL syntax requires only that each field is properly defined; it does not require them to be in a particular order.
{: .callout}


@@ -135,7 +135,7 @@ The text is printed directly in the terminal. So an empty YAML list (`[]`) is us
> > ## Solution
> >
> > To change the text on the command line, you only have to change the text in the `hello_world.yml` file.
> >
> > For example:
> > ~~~
> > message_text: Good job!
@@ -147,7 +147,7 @@ The text is printed directly in the terminal. So an empty YAML list (`[]`) is us
## CWL single step workflow
The RNA-seq data from the introduction episode will be used for the first CWL workflow.
The first step of RNA-sequencing analysis is a quality control of the RNA reads using the `fastqc` tool.
This tool is already available to use so there is no need to write a new CWL tool description.
@@ -160,23 +160,23 @@ class: Workflow

inputs:
  rna_reads_human: File

steps:
  quality_control:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file: rna_reads_human
    out: [html_file]

outputs:
  qc_html:
    type: File
    outputSource: quality_control/html_file
~~~
{: .language-yaml}
In a __workflow__ the `steps` field must always be present. The workflow tasks or steps that you want to run are listed in this field.
At the moment the workflow only contains one step: `quality_control`. In the next episodes more steps will be added to the workflow.
Let's take a closer look at the workflow. First the `inputs` field will be explained.
@@ -186,45 +186,45 @@ inputs:
~~~
{: .language-yaml}
Looking at the CWL script of the `fastqc` tool, it needs a fastq file as its input. In this example the fastq file consists of human RNA reads.
So we call the variable `rna_reads_human` and it has `File` as its type.
To make this workflow interpretable for other researchers, self-explanatory and sensible variable names are used.
> ## Input and output names
> It is very important to give inputs and outputs a sensible name. Try not to use variable names like `inputA` or `inputB` because others might not understand what is meant by it.
{: .callout}
The next part of the script is the `steps` field.
~~~
steps:
  quality_control:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file: rna_reads_human
    out: [html_file]
~~~
{: .language-yaml}
Every step of a workflow needs a name; the first step of the workflow is called `quality_control`. Each step needs a `run` field, an `in` field and an `out` field.
The `run` field contains the location of the CWL file of the tool to be run. The `in` field connects the `inputs` field to the `fastqc` tool.
The `fastqc` tool has an input parameter called `reads_file`, so the step needs to connect `reads_file` to `rna_reads_human`.
Lastly, the `out` field is a list of output parameters from the tool to be used. In this example, the `fastqc` tool produces an output file called `html_file`.
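
The `in` mapping above uses a shorthand. As a sketch of the same step written in the longer form, the `source` field names the workflow input explicitly; the behaviour is intended to be identical.

```yaml
steps:
  quality_control:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file:
        source: rna_reads_human   # workflow input feeding the tool's `reads_file`
    out: [html_file]              # tool outputs made available to the workflow
```

The longer form is mainly useful once a step input needs extra fields besides its source.
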
The last part of the script is the `outputs` field.
~~~
outputs:
  qc_html:
    type: File
    outputSource: quality_control/html_file
~~~
{: .language-yaml}
Each output in the `outputs` field needs its own name. In this example the output is called `qc_html`.
Inside `qc_html` the type of output is defined. The output of the `quality_control` step is a file, so the `qc_html` type is `File`.
The `outputSource` field refers to where the output is located; in this example it comes from the step `quality_control` and it is called `html_file`.
When you want to run this workflow, you need to provide a file with the inputs the workflow needs. This file is similar to the `hello_world.yml` file in the previous section.
The input file is called `workflow_input.yml`.
__workflow_input.yml__
@@ -236,7 +236,7 @@ rna_reads_human:
~~~
{: .language-yaml}
In the input file the values for the inputs that are declared in the `inputs` section of the workflow are provided.
The workflow takes `rna_reads_human` as an input parameter, so we use the same variable name in the input file.
When setting inputs, the class of the object needs to be defined, for example `class: File` or `class: Directory`. The `location` field contains the location of the input file.
In this example the last line is needed to provide a format for the fastq file.
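
As a sketch of what such an input file can look like, the example below is written for a hypothetical workflow with a `File` input named `reads` and a `Directory` input named `reference_dir`. The input names and paths are placeholders, not the files used in this tutorial.

```yaml
reads:                                          # hypothetical input name
  class: File
  location: path/to/reads.fq                    # placeholder path
  format: http://edamontology.org/format_1930   # EDAM identifier for FASTQ
reference_dir:                                  # hypothetical input name
  class: Directory
  location: path/to/reference                   # placeholder path
```

Each top-level key has to match the name of an input declared in the workflow's `inputs` section.
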
@@ -248,13 +248,30 @@ cwltool rna_seq_workflow.cwl workflow_input.yml
~~~
{: .language-bash}
~~~
...
Analysis complete for Mov10_oe_1.subset.fq
INFO [job quality_control] Max memory used: 193MiB
INFO [job quality_control] completed success
INFO [step quality_control] completed success
INFO [workflow ] completed success
{
    "qc_html": {
        "location": "file://.../novice-tutorial-exercises/Mov10_oe_1.subset_fastqc.html",
        "basename": "Mov10_oe_1.subset_fastqc.html",
        "class": "File",
        "checksum": "sha1$46417ab64dd657ec50d86f7d23b2859bee74199f",
        "size": 383589,
        "path": ".../novice-tutorial-exercises/Mov10_oe_1.subset_fastqc.html"
    }
}
INFO Final process status is success
~~~
{: .output}
### Exercise
Needs some exercises
{% include links.md %}
12 changes: 6 additions & 6 deletions _episodes/03-dependency_graphs.md
@@ -27,18 +27,18 @@ keypoints:
## Multi-Step Workflow
In the previous episode a single step workflow was shown. To make a multi-step workflow, you add more entries to the `steps` field.
In this episode, the workflow is extended with the next two steps of the RNA-sequencing analysis.
The next two steps are alignment of the reads and indexing the alignments.
We will be using the [`STAR`](https://bio.tools/star) and [`samtools`](https://bio.tools/samtools) tools for these tasks.

__rna_seq_workflow.cwl__
~~~
cwlVersion: v1.2
class: Workflow
inputs:
  rna_reads_human: File
  ref_genome: Directory
steps:
  quality_control:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
@@ -119,11 +119,11 @@ ref_genome:


> ## Iterative working
> Working on a workflow is often not something that happens all at once.
> Sometimes you already have a shell script ready that can be converted to a CWL workflow.
> Other times it is similar to this tutorial: you start with a single-step workflow and extend it to a multi-step workflow.
> This is all iterative working, a continuous work in progress.
{: .callout}

## Visualising a workflow

2 changes: 1 addition & 1 deletion _episodes/04-reusing_tools.md
@@ -134,7 +134,7 @@ rna_reads_human:
  format: http://edamontology.org/format_1930
ref_genome:
  class: Directory
  location: rnaseq/hg19-chr1-STAR-index
annotations:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf