Skip to content

Latest commit

 

History

History
324 lines (249 loc) · 14.2 KB

deepvariant-details.md

File metadata and controls

324 lines (249 loc) · 14.2 KB

DeepVariant usage guide

Overview

DeepVariant is a set of programs used to transform aligned sequencing reads into variant calls. At the highest level, a user needs to provide three inputs:

  1. A reference genome in FASTA format and its corresponding .fai index file generated using the samtools faidx command.

  2. An aligned reads file in BAM format and its corresponding index file (.bai). The reads must be aligned to the reference genome described above.

  3. A model checkpoint for DeepVariant.

The output of DeepVariant is a list of all variant calls in VCF format.

DeepVariant can be run in a variety of ways. The simplest method is to use run_deepvariant, which will configure DeepVariant for a given model type (--model_type), and run all subprograms.

However, under the hood, DeepVariant is composed of three programs: make_examples, call_variants, and postprocess_variants. More details about each program are described in detail in the Inputs and outputs section.

Inputs and outputs

General notes

  • Sharded files are a single logical collection of files with a common naming convention. For example, we talk about filename@10 as a single 10-way sharded file named filename. On most filesystems this actually looks like 10 distinct files filename-00000-of-00010, ..., filename-00009-of-00010. DeepVariant can write sharded files using their filename@10-style name and can read sharded files using both that style as well as the glob form, such as filename-* or filename-*-of-00010.
  • Files with the .gz suffix are interpreted as being compressed with gzip and are read/written accordingly.

make_examples

make_examples consumes reads and the reference genome to create TensorFlow examples for evaluation with our deep learning models. The tf.Example protos are written out in TFRecord format. To learn more about tf.Example and TFRecord, see the Using TFRecords and tf.Example Colab.

make_examples is a single-threaded program. Since the process of generating examples is embarrassingly parallel across the genome, make_examples supports sharding of its input and output via the --task argument with a sharded output specification. For example, if the output is specified as --examples [email protected] and --task 0, the input to the program will be 10% of the regions and the output will be written to examples.tfrecord-00000-of-00010.gz. Memory usage per instance is detailed below. These values do not consider the small model, or the fast-pipeline modes of make examples, and should be considered as upper limits of memory usage.

Memory Usage

These values were calculated by looking at memory usage for 1/25th of chr20.

Model Type version mean_mem_mb median_mem_mb max_mem_mb
PacBio 1.6.1 464.6 426.4 590.7
PacBio 1.8.0 487.0 457.1 630.3
Model Type version mean_mem_mb median_mem_mb max_mem_mb
Illumina (WGS) 1.6.1 404.7 403.9 418.3
Illumina (WGS) 1.8.0 435.1 434.7 449.3

Input assumptions

make_examples requires its input files to satisfy a few basic requirements to be processed correctly.

First, the reference genome FASTA, passed in using the --ref flag, must be indexed and can either be uncompressed or compressed with bgzip.

Second, the BAM file provided to --reads should be aligned to a "compatible" version of the genome reference provided as the --ref. By compatible here we mean the BAM and FASTA share at least a reasonable set of common contigs, as DeepVariant will only process contigs shared by both the BAM and reference. As an example, suppose you have a BAM file mapped to b37 + decoy FASTA and you provide just the vanilla b37 fasta to make_examples. DeepVariant will only process variants on the shared contigs, effectively excluding the hs37d5 contig present in the BAM but not in the reference.

The BAM file must be also sorted and indexed. It must exist on disk, so you cannot pipe it into DeepVariant. Duplicate marking may be performed, in our analyses there is almost no difference in accuracy except at lower (<20x) coverages. Finally, we recommend that you do not perform BQSR. Running BQSR results in a small decrease in accuracy. It is not necessary to do any form of indel realignment, though there is not a difference in DeepVariant accuracy either way.

Third, if you are providing --regions or other similar arguments these should refer to contigs present in the reference genome. These arguments accept space-separated lists, so all of the follow examples are valid arguments for --regions or similar arguments:

  • --regions chr20 => only process all of chromosome 20
  • --regions chr20:10,000,000-11,000,000 => only process 10-11mb of chr20
  • --regions "chr20 chr21" => only process chromosomes 20 and 21

Fourth and finally, if running in training mode the truth_vcf and confident_regions arguments should point to VCF and BED files containing the true variants and regions where we are confident in our calls (i.e., calls within these regions and not in the truth_vcf are considered false positives). These should be bgzipped and tabix indexed and be on a reference consistent with the one provided with the --ref argument.

call_variants

call_variants consumes TFRecord file(s) of tf.Examples protos created by make_examples and a deep learning model checkpoint and evaluates the model on each example in the input TFRecord. The output here is a TFRecord of CallVariantsOutput protos. call_variants doesn't directly support sharding its outputs, but accepts a glob or shard-pattern for its inputs.

call_variants uses around 4 GB per process and uses TensorFlow for evaluation. When evaluating a model in CPU mode, TensorFlow can make use of multiple cores, but scaling is sub-linear. In other words, call_variants on a 64 core machine is less than 8x faster than running on a 8 core machine.

When using a GPU, call_variants is both faster, more efficient, and needs fewer CPUs. Based on a small number of experiments, currently the most efficient configuration for call_variants on a GPU instance is 4-8 CPUs and 1 GPU. Compared to our setting in the whole genome case study, we noticed a 2.5x speedup on the call_variants step using a single P100 GPU and 8 CPUs. Note that currently call_variants can only use one GPU at most. So it doesn't improve the speed if you get a multiple-GPU machine.

postprocess_variants

postprocess_variants reads all of the output TFRecord files from call_variants, sorts them, combines multi-allelic records, and writes out a VCF file. When gVCF output is requested, it also outputs a gVCF file which merges the VCF with the non-variant sites.

Because postprocess_variants combines and sorts the output of call_variants, it needs to see all of the outputs from call_variants for a single sample to merge into a final VCF. postprocess_variants is single-threaded and needs a non-trivial amount of memory to run (20-30 GB), so it is best run on a single/dual core machine with sufficient memory.

Updates on DeepVariant since precisionFDA truth challenge and bioRxiv preprint

The DeepVariant team has been hard at work since we first presented the method. Key changes and improvements include:

  • Rearchitected with open source release in mind
  • Built on TensorFlow
  • Increased variant calling accuracy, especially for indels
  • Vastly faster with reduced memory usage

We have made a number of improvements to the methodology as well. The biggest change was to move away from RGB-encoded (3-channel) pileup images and instead represent the aligned read data using a multi-channel tensor data layout. Channels represent sequencing features. All of our models currently have a set of 6 'base channels':

  • The read base (A, C, G, T)
  • The base's quality score
  • The read's mapping quality score
  • The read's strand (positive or negative)
  • Does the read support the allele being evaluated?
  • Does the base match the reference genome at this position?

base channels

We can add additional channels to this base channel set to tailor the model input to a particular sequencing platform or technology to maximize accuracy. For example, for our Illumina models (wgs, exome) we add an additional insert_size channel. Because long-read data can be phased, we add a haplotype channel to our pacbio and ont models.

Additional modeling changes were to move to the inception-v3 architecture and to train on many more independent sequencing replicates of the ground truth training samples, including 50% downsampled versions of each of those read sets. In our testing this allowed the model to better generalize to other data types.

In the end these changes reduced our error rate by more than 50% on the held out evaluation sample (NA24385 / HG002) as compared to our results in the PrecisionFDA Truth Challenge:

DeepVariant April 2016 (HG002, GIAB v3.2.2, b37):

Type # FN # FP Recall Precision F1_Score
INDEL 4175 2839 0.987882 0.991728 0.989802
SNP 1689 832 0.999447 0.999728 0.999587

DeepVariant December 2017 (HG002, GIAB v3.2.2, b37):

Type # FN # FP Recall Precision F1_Score
INDEL 2384 1811 0.993081 0.994954 0.994017
SNP 735 363 0.999759 0.999881 0.999820

See the whole genome case study, which we update with each release of DeepVariant, for the latest results.

You can also see the Colab example to see how you can visualize the pileup images.

Training data over time

For the models we've released over time, you can find more details about the training data in DeepVariant training data.

CRAM support

As of v0.7, DeepVariant accepts CRAM files as input in addition to BAM files.

As of v0.9.0, we changed the default to use the reference file specified by the --ref flag, instead of the path to the original reference in the CRAM file (encoded in the file's "UR" tag).

For more information about CRAM, see the samtools documentation in general but particularly the sections on Global Options and reference sequences in CRAM.

htslib also hosts a nice page benchmarking CRAM with information on the effect of different CRAM options on file size and encoding/decoding performance.

Here are some basic file size and runtime numbers for running a single make_examples job on a 30x whole genome sample in BAM and CRAM format.

Filetype Size (Gb) Runtime (min)
BAM 66.99 79m47.37307s
CRAM 37.85 96m53.477s
Ratio 56.50% 121.43%
  • BAM file: gs://deepvariant/performance-testdata/HG002_NIST_150bp_downsampled_30x.bam
  • CRAM file: gs://deepvariant/performance-testdata/HG002_NIST_150bp_downsampled_30x.cram

Runtime was measured on n1-standard-64 machines.

Starting from v1.2.0, we include samtools and bcftools.

Based on user feedback (GitHub issue #414), we added samtools and bcftools in our Docker image:

docker run google/deepvariant:"${BIN_VERSION}" samtools

and

docker run google/deepvariant:"${BIN_VERSION}" bcftools

You can read more about samtools and bcftools here: http://www.htslib.org/doc/.

Commands for requesting machines used in case studies

We report runtime in our case studies documentation. In order to make sure the results we report are reproducible without too much variation, we provide the commands we used here to show you what kind of machines we ran the case studies on. This is NOT the fastest or cheapest configuration.

Command for a CPU-only machine on Google Cloud Platform.

We used a 96-core (vCPU) machine with 384GiB of memory and no GPU, on the Google Cloud Platform. Specifying the CPU platform also allows us to report the runtime more consistently.

gcloud compute instances create "${USER}-cpu"  \
  --scopes "compute-rw,storage-full,cloud-platform" \
  --image-family "ubuntu-2204-lts" \
  --image-project "ubuntu-os-cloud" \
  --machine-type "n2-standard-96" \
  --boot-disk-size "300" \
  --zone "us-west1-b"

Command for a GPU machine on Google Cloud Platform

We tested with P100 GPU, which is currently only available on N1 machines:

gcloud compute instances create "${USER}-gpu" \
  --scopes "compute-rw,storage-full,cloud-platform" \
  --maintenance-policy "TERMINATE" \
  --accelerator=type=nvidia-tesla-p100,count=1 \
  --image-family "ubuntu-2204-lts" \
  --image-project "ubuntu-os-cloud" \
  --machine-type "n1-standard-16" \
  --boot-disk-size "300" \
  --zone "us-west1-b" \
  --min-cpu-platform "Intel Skylake"

NOTE: Be sure to manage instances efficiently. Remember to delete the instances you're not using. You can find the instances at: https://console.cloud.google.com/compute/instances?project=YOUR_PROJECT