Input files for Sarek can be specified using a TSV file given to the --input option.
The TSV file is a tab-separated values file with one of the following column layouts:
subject gender status sample lane fastq1 fastq2 (for step mapping with paired-end FASTQs)
subject gender status sample lane bam (for step mapping with unmapped BAMs)
subject gender status sample bam bai recaltable (for step recalibrate with BAMs)
subject gender status sample bam bai (for step variantcalling with BAMs)
The content of these columns is straightforward:
subject: designates the subject. It should be the ID of the patient, and it must identify only one patient.
gender: the gender of the patient (XX or XY).
status: the status of the sample (0 for Normal or 1 for Tumor).
sample: designates the sample. It should be the ID of the sample, and it must identify only one sample. A patient can have more than one tumor sample, e.g. a tumor and a relapse.
lane: used when the sample is multiplexed on several lanes. It must be unique for each lane in the same sample.
fastq1: the path to the first file of the FASTQ pair.
fastq2: the path to the second file of the FASTQ pair.
bam: the path to the BAM file.
bai: the path to the BAM index file.
recaltable: the path to the recalibration table.
Absolute paths to the files are recommended, but relative paths should also work.
Note that the delimiter is the tab (\t) character.
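Because the delimiter and the column counts are easy to get wrong, a quick check before launching a run can help. The following is a minimal sketch, not part of Sarek: `check_row` and `EXPECTED_COLUMNS` are hypothetical names, and the checks only cover what is described above (tab delimiter, number of columns per step, allowed gender and status values).

```python
# Expected column counts per step, taken from the column layouts described above.
EXPECTED_COLUMNS = {
    "mapping": {6, 7},       # unmapped BAMs (6) or paired-end FASTQs (7)
    "recalibrate": {7},      # bam, bai, recaltable
    "variantcalling": {6},   # bam, bai
}


def check_row(line, step):
    """Return a list of problems found in one TSV line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    allowed = EXPECTED_COLUMNS[step]
    if len(fields) not in allowed:
        # Typically caused by spaces being used instead of tabs.
        return [f"expected {sorted(allowed)} tab-separated columns, got {len(fields)}"]
    problems = []
    gender, status = fields[1], fields[2]
    if gender not in ("XX", "XY"):
        problems.append(f"gender must be XX or XY, got {gender!r}")
    if status not in ("0", "1"):
        problems.append(f"status must be 0 or 1, got {status!r}")
    return problems
```

Calling `check_row(line, "mapping")` on every line of the file and reporting any non-empty result is enough to catch a space-delimited file before it reaches the pipeline.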
All examples are given for a normal/tumor pair. If no tumors are listed in the TSV file, then the workflow will proceed as if it is a normal sample instead of a normal/tumor pair.
Sarek will output results in a different directory for each sample. If multiple samples are specified in the TSV file, Sarek will consider all files to be from different samples. Multiple TSV files can be specified if the path is enclosed in quotes.
Somatic variant calling output will be in a specific directory for each normal/tumor pair.
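To see which normal/tumor pairs a TSV file describes, the rows can be grouped by subject and status. This is only an illustrative sketch: `tumor_normal_pairs` is a hypothetical helper, and pairing every tumor sample with every normal sample of the same subject is an assumption made here, not a statement about how Sarek pairs samples internally.

```python
from collections import defaultdict


def tumor_normal_pairs(rows):
    """List (subject, normal_sample, tumor_sample) tuples.

    rows: iterable of sequences starting with subject, gender, status, sample.
    """
    normals = defaultdict(list)
    tumors = defaultdict(list)
    for subject, _gender, status, sample, *_rest in rows:
        target = tumors if str(status) == "1" else normals
        # A sample split over several lanes appears on several rows; keep it once.
        if sample not in target[subject]:
            target[subject].append(sample)
    return [
        (subject, normal, tumor)
        for subject in normals
        for normal in normals[subject]
        for tumor in tumors.get(subject, [])
    ]
```

A subject with no tumor rows produces no pairs, which matches the case described above where the workflow treats the input as a normal sample only.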
In this example, there are 3 read groups for the normal sample and 2 for the tumor sample.
G15511 XX 0 C09DFN C09DF_1 pathToFiles/C09DFACXX111207.1_1.fastq.gz pathToFiles/C09DFACXX111207.1_2.fastq.gz
G15511 XX 0 C09DFN C09DF_2 pathToFiles/C09DFACXX111207.2_1.fastq.gz pathToFiles/C09DFACXX111207.2_2.fastq.gz
G15511 XX 0 C09DFN C09DF_3 pathToFiles/C09DFACXX111207.3_1.fastq.gz pathToFiles/C09DFACXX111207.3_2.fastq.gz
G15511 XX 1 D0ENMT D0ENM_1 pathToFiles/D0ENMACXX111207.1_1.fastq.gz pathToFiles/D0ENMACXX111207.1_2.fastq.gz
G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMACXX111207.2_1.fastq.gz pathToFiles/D0ENMACXX111207.2_2.fastq.gz
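A file like the one above can be generated programmatically, which avoids accidentally typing spaces instead of tabs. The sketch below uses Python's standard `csv` module with a tab delimiter; `write_mapping_tsv` is a hypothetical helper name, and the two rows are copied from the example above.

```python
import csv
import io


def write_mapping_tsv(rows, handle):
    """Write rows (subject, gender, status, sample, lane, fastq1, fastq2) as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(row)


rows = [
    ("G15511", "XX", 0, "C09DFN", "C09DF_1",
     "pathToFiles/C09DFACXX111207.1_1.fastq.gz",
     "pathToFiles/C09DFACXX111207.1_2.fastq.gz"),
    ("G15511", "XX", 1, "D0ENMT", "D0ENM_1",
     "pathToFiles/D0ENMACXX111207.1_1.fastq.gz",
     "pathToFiles/D0ENMACXX111207.1_2.fastq.gz"),
]

# An in-memory buffer stands in for a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_mapping_tsv(rows, buffer)
print(buffer.getvalue(), end="")
```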
Input files for Sarek can be specified using the path to a FASTQ directory given to the --input option, only with the mapping step.
nextflow run nf-core/sarek --input pathToDirectory ...
The input folder, containing the FASTQ files for one individual (ID), should be organized into one subfolder per sample. All FASTQ files for a sample should be collected in its subfolder.
ID
+--sample1
+------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample2
+------sample2_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample2_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample3
+------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
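To verify that a directory follows this layout before running the pipeline, the sample subfolders can be walked and the R1/R2 files paired. This is a minimal sketch with a hypothetical helper `fastq_pairs`; it assumes that the R2 file name is obtained from the R1 file name by swapping the last `_R1_` for `_R2_`, as in the layout shown above.

```python
from pathlib import Path


def fastq_pairs(input_dir):
    """Map each sample subfolder name to a sorted list of (R1, R2) path pairs."""
    pairs = {}
    subfolders = sorted(p for p in Path(input_dir).iterdir() if p.is_dir())
    for sample_dir in subfolders:
        found = []
        for r1 in sorted(sample_dir.glob("*_R1_*.fastq.gz")):
            # Swap only the last "_R1_" so a sample ID containing it is left alone.
            r2 = r1.with_name("_R2_".join(r1.name.rsplit("_R1_", 1)))
            if r2.exists():
                found.append((r1, r2))
        pairs[sample_dir.name] = found
    return pairs
```

An R1 file without a matching R2 file is silently skipped here; a stricter check could raise an error instead.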
FASTQ filename structure:
sample_lib_flowcell-index_lane_R1_1000.fastq.gz and
sample_lib_flowcell-index_lane_R2_1000.fastq.gz
Where:
sample = sample ID
lib = identifier of the library preparation
flowcell = identifier of the flow cell for the sequencing run
lane = identifier of the lane of the sequencing run
Read group information will be parsed from FASTQ file names as follows:
RGID = "sample_lib_flowcell_index_lane"
RGPL = "Illumina"
PU = sample
RGLB = lib
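The mapping from file name to read group fields can be illustrated with a short sketch. `read_group_from_fastq` is a hypothetical helper, not Sarek's own parser. It assumes that none of the name components contains an underscore, and that the hyphen in `flowcell-index` becomes an underscore in RGID, as the RGID pattern above suggests.

```python
def read_group_from_fastq(filename):
    """Derive read group fields from a name like sample_lib_flowcell-index_lane_R1_1000.fastq.gz."""
    name = filename.rsplit("/", 1)[-1]  # drop any leading directories
    parts = name.split("_")
    if len(parts) < 6:
        raise ValueError(f"unexpected FASTQ file name: {name!r}")
    sample, lib, flowcell_index, lane = parts[:4]
    return {
        # Assumption: "flowcell-index" is written with an underscore inside RGID.
        "RGID": "_".join([sample, lib, flowcell_index.replace("-", "_"), lane]),
        "RGPL": "Illumina",
        "PU": sample,
        "RGLB": lib,
    }
```

Both files of a pair yield the same read group, since the R1/R2 part of the name is not used.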
In this example with unmapped BAMs, there are 3 read groups for the normal sample and 2 for the tumor sample.
G15511 XX 0 C09DFN C09DF_1 pathToFiles/C09DFAC_1.bam
G15511 XX 0 C09DFN C09DF_2 pathToFiles/C09DFAC_2.bam
G15511 XX 0 C09DFN C09DF_3 pathToFiles/C09DFAC_3.bam
G15511 XX 1 D0ENMT D0ENM_1 pathToFiles/D0ENMAC_1.bam
G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMAC_2.bam
Similarly, if you have non-recalibrated BAMs, their indexes, and their recalibration tables, use a structure like:
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.bam pathToFiles/G15511.C09DFN.md.bai pathToFiles/G15511.C09DFN.md.recal.table
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.bam pathToFiles/G15511.D0ENMT.md.bai pathToFiles/G15511.D0ENMT.md.recal.table
Similarly, if you have recalibrated BAMs and their indexes, use a structure like:
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.recal.bam pathToFiles/G15511.C09DFN.md.recal.bai
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.recal.bam pathToFiles/G15511.D0ENMT.md.recal.bai
Input files for Sarek can be specified using the path to a VCF directory given to the --input option, only with the annotate step.
Multiple VCF files can be specified if the path is enclosed in quotes.
As Sarek will use bgzip and tabix to compress and index the annotated VCF files, it expects VCF files to be sorted.
nextflow run nf-core/sarek --step annotate --input "results/VariantCalling/*/.vcf.gz" ...
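Since the VCF files are expected to be sorted, it can be useful to check this before running the annotate step. The sketch below is one possible check, with a hypothetical helper `vcf_is_sorted`. It interprets "sorted" as: records of the same contig are contiguous and their positions never decrease, which is what tabix indexing requires. It does not check the contig order against any reference.

```python
import gzip


def vcf_is_sorted(path):
    """Return True if records are grouped by contig with non-decreasing positions."""
    opener = gzip.open if str(path).endswith(".gz") else open
    seen = set()              # contigs already encountered
    current, last_pos = None, 0
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue      # skip header and blank lines
            chrom, pos = line.split("\t", 2)[:2]
            pos = int(pos)
            if chrom != current:
                if chrom in seen:
                    return False  # contig reappears after another contig
                seen.add(chrom)
                current, last_pos = chrom, pos
            elif pos < last_pos:
                return False      # position goes backwards within a contig
            else:
                last_pos = pos
    return True
```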