Pipeline to process SARS-CoV-2 sequences and metadata, clean up irregularities, align and call variants, then publish matched subsets of FASTA sequences and metadata for groups with different access to sensitive data.
Runs weekly on global sequences downloaded from GISAID.
Runs daily on COG-UK sequences, and combines with non-UK GISAID sequences.
git clone --recurse-submodules https://github.com/COG-UK/grapevine_nextflow.git
cd grapevine_nextflow
conda env create -f environment.yml
conda activate grapevine_nextflow
NXF_VER=20.10.0 nextflow run workflows/process_cog_uk.nf <params>
- Parse GISAID dump (`export.json`) and extract FASTA of sequences and associated metadata.
  - Excludes known problematic sequences listed in `gisaid_omissions.txt`
  - Excludes sequences where `covv_host.lower() != 'human'`
  - Excludes sequences with a malformed (not `YYYY-MM-DD`) or impossible (earlier than `2019-11-30` or later than today) date in `covv_collection_date`
  - Reformat FASTA header
  - Add `epi-week` and `epi-day` columns to metadata
- Run `pangolin` (https://github.com/cov-lineages/pangolin) on all new sequences. If there is a new release of `pangolin`, run on all sequences.
- Calculate the `unmapped_genome_completeness` as the proportion of sequence length which is unambiguous (not `N`)
- Deduplicate by date, keeping the earliest example
- Align to the reference (`Wuhan/WH04/2020`) with `minimap2`
- Variant call using `gofasta` and type specific mutations of interest listed in `AAs.csv` and `dels.csv`
- Filter out low quality sequences with mapped completeness < 93%, and trim and pad alignment outside of reference coordinates `265:29674`
- Calculate distance to reference and exclude outlier sequences whose distance is off by more than 4.0 epi-week standard deviations
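
The GISAID exclusion rules and the completeness calculation listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's own code: the function names `parse_collection_date`, `keep_gisaid_record` and the constant `EARLIEST_DATE` are hypothetical, and only the rules themselves (host must be human, date must be `YYYY-MM-DD`, no earlier than `2019-11-30` and no later than today, completeness is the non-`N` proportion) come from the list.

```python
from datetime import date, datetime

# Earliest plausible collection date, taken from the exclusion rule above.
EARLIEST_DATE = date(2019, 11, 30)


def parse_collection_date(value):
    """Return a date if value is strictly YYYY-MM-DD, otherwise None."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None
    # strptime tolerates unpadded fields such as 2020-3-5, so also require
    # that the value round-trips to the canonical zero-padded form.
    return parsed if parsed.isoformat() == value else None


def keep_gisaid_record(host, collection_date, today=None):
    """Apply the host and date exclusion rules (hypothetical helper)."""
    today = today or date.today()
    if host.lower() != "human":
        return False
    parsed = parse_collection_date(collection_date)
    if parsed is None:
        return False  # malformed date
    return EARLIEST_DATE <= parsed <= today  # False for impossible dates


def unmapped_genome_completeness(sequence):
    """Proportion of the sequence length that is not N."""
    if not sequence:
        return 0.0
    n_count = sequence.upper().count("N")
    return (len(sequence) - n_count) / len(sequence)
```

Passing `today` explicitly keeps the date check reproducible, for example `keep_gisaid_record("Human", "2020-03-05", today=date(2021, 1, 1))`. Whether the pipeline treats lowercase `n` as ambiguous is not stated above; the sketch assumes it does.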
- Parse matched FASTA and metadata TSV output by Elan/Majora
  - Reformats header and unaligns sequences which have already been aligned to the reference
  - Manual date correction for samples listed in `date_corrections.csv`
  - Excludes early sequences which have been resequenced as listed in `resequencing_omissions.txt`
  - Adds GISAID accession if recently submitted
  - Excludes sequences with a malformed (not `YYYY-MM-DD`) or impossible (earlier than `2019-11-30` or later than today) date in `covv_collection_date`
  - Add `epi-week`, `epi-day`, `source_id` and `pillar_2` columns to metadata
- Run `pangolin` (https://github.com/cov-lineages/pangolin) on all new sequences. If there is a new release of `pangolin`, run on all sequences.
- Calculate the `unmapped_genome_completeness` as the proportion of sequence length which is unambiguous (not `N`)
- Deduplicate COG-ID by completeness and label samples with duplicate `source_id`
- Align to the reference (`Wuhan/WH04/2020`) with `minimap2`
- Variant call using `gofasta` and type specific mutations of interest listed in `AAs.csv` and `dels.csv`
- Filter out low quality sequences with mapped completeness < 93%, and trim and pad alignment outside of reference coordinates `265:29674`
- Clean up geographical metadata (https://github.com/COG-UK/geography_cleaning)
- Combine COG-UK sequences and metadata with non-UK GISAID sequences and metadata
- Publish subsets of the data as described in `publish_cog_global_recipes.json`
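
As a rough illustration of the COG-UK deduplication step above (keep one record per COG-ID by completeness, and label samples that share a `source_id`), here is a minimal Python sketch. The record keys `cog_id` and `completeness`, the output flag `duplicate_source`, and the tie-breaking behaviour are all assumptions made for the example; the steps above do not say how the pipeline names these fields or resolves ties.

```python
from collections import Counter


def deduplicate_by_completeness(records):
    """Keep the most complete record per cog_id and flag shared source_ids.

    `records` is a list of dicts with the hypothetical keys
    'cog_id', 'source_id' and 'completeness'.
    """
    best = {}
    for record in records:
        current = best.get(record["cog_id"])
        # Strict '>' keeps the first record seen on a tie (an assumption).
        if current is None or record["completeness"] > current["completeness"]:
            best[record["cog_id"]] = record
    kept = list(best.values())

    # Count how many surviving records share each non-empty source_id.
    source_counts = Counter(r["source_id"] for r in kept if r["source_id"])
    return [
        {**r, "duplicate_source": source_counts[r["source_id"]] > 1}
        for r in kept
    ]
```

Records are flagged rather than dropped when they share a `source_id`, matching the wording "label samples" in the step; what downstream steps do with that label is not covered here.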
`grapevine` (https://github.com/COG-UK/grapevine) was the name of the original pipeline, which did all of the above, made phylogenetic trees and more. As the number of sequences has grown, the tree building steps have taken increasingly long to complete. Because the majority of users only interact with the alignments and cleaned metadata, it was decided that a robust implementation of the alignment and metadata processing steps, run daily, would be more useful. That is what is provided here.