Human Genome Variation Map (HGVM) Pilot Project
The HGVM Pilot project aims to create a draft reference structure that represents all “common” genetic variation, providing a means to stably name and canonically identify each variant. It aims to demonstrate that such a structure can be used to improve upon current standard methods in genomics and to create new ones. It is being run by members of the reference-variation GA4GH task team, and is discussed on their regular biweekly calls. To join, please contact @benedictpaten or @skeenan.
The project is starting with 6 pilot regions, which will be used to test approaches at a scale more tractable than the complete human genome. The test regions are:
- Major Histocompatibility Complex (8 alt haps)
- Killer Cell Immunoglobulin-Like Receptor (KIR) Gene Cluster (35 alt haps)
- Spinal Muscular Atrophy (SMA) locus (2 alt haps)
- BRCA1 locus (reference, CHM1 mole, and LRG sequences)
- BRCA2 locus (reference, CHM1 mole, and LRG sequences)
- X chromosome centromere (CENX) reference repeat unit and reads
We are using the GRCh38 Human assembly, and are including available alternative haplotype sequences.
Full reference and alt haplotype sequences are available from Adam Novak here for all regions. These are the sequences that should be used in building test graphs.
For the first 5 regions, each region corresponds to a directory, and within that directory there is one `ref.fa` with the clipped-out GRCh38.p2 primary path sequence for the region (assigned FASTA ID "ref"), and a number of FASTAs with filenames and record IDs bearing the GI numbers of alternate sequences.
MHC, SMA, and KIR sequences were extracted using this script. BRCA1 and BRCA2 sequences (including the LRG sequences) were obtained using Nancy Ouyang's script below, and formatted using this script.
The CENX data has been provided by Karen Miga, and is in a slightly different format: `ref.fa` contains the reference repeat unit, "DXZ1", with its FASTA ID set to `ref`, while `reads.fa` contains several independent read sequences from repeats in the region in question. The format is different because there are a few thousand reads, and they could not realistically each be presented in their own file.
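The directory layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the pilot tooling: it assumes every non-reference FASTA in a region directory ends in `.fa`, and the function names `read_fasta` and `load_region` are hypothetical.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a dict mapping record ID -> sequence."""
    records = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The record ID is the first whitespace-delimited token of the header.
                tokens = line[1:].split()
                current_id = tokens[0] if tokens else ""
                records[current_id] = []
            elif current_id is not None:
                records[current_id].append(line)
    return {rid: "".join(chunks) for rid, chunks in records.items()}


def load_region(region_dir):
    """Return (reference sequence, {record ID: sequence}) for one region directory.

    Assumption: the directory holds one ref.fa with a record named "ref",
    and every other *.fa file holds alternate sequences or reads.
    """
    region_dir = Path(region_dir)
    ref_records = read_fasta(region_dir / "ref.fa")
    if "ref" not in ref_records:
        raise ValueError("ref.fa should contain a record with FASTA ID 'ref'")
    others = {}
    for fasta_path in sorted(region_dir.glob("*.fa")):
        if fasta_path.name == "ref.fa":
            continue
        others.update(read_fasta(fasta_path))
    return ref_records["ref"], others
```

Under these assumptions the same function covers both layouts: for the first 5 regions the second return value holds one entry per alternate-sequence file, while for CENX it holds every read found in `reads.fa`.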
Per-gene sequences for the first 5 regions are available from Nancy Ouyang at Curoverse here, collected using the scripts and methodology described here. FASTA files are named by gene name and NCBI gene ID, e.g. BRCA1-672.fa. She did not mirror the IMGT HLA contents (which can be obtained here), even though that was requested in the minutes from the DWG meeting, because of their policy.
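The per-gene file naming convention can be unpacked programmatically. This is a small sketch that assumes the gene ID is always the numeric part after the last hyphen; `parse_gene_fasta_name` is a hypothetical helper, not one of the scripts mentioned above.

```python
from pathlib import Path


def parse_gene_fasta_name(filename):
    """Split a name like 'BRCA1-672.fa' into (gene name, NCBI gene ID)."""
    stem = Path(filename).stem  # drops the '.fa' extension
    # Split on the LAST hyphen so gene names that contain hyphens stay intact.
    gene_name, sep, gene_id = stem.rpartition("-")
    if not sep or not gene_name or not gene_id.isdigit():
        raise ValueError(f"unexpected per-gene FASTA name: {filename!r}")
    return gene_name, int(gene_id)
```

Splitting on the last hyphen rather than the first is a deliberate choice, since some gene names (HLA-A, for example) contain a hyphen themselves.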
The GA4GH API now supports a graph model of the reference in which variations can be described. A description of the reference model is contained in the common.avdl file within the schemas.
Currently the plan for the pilot has three parts. Signup for contributions is below.
1. An implementation of the GA4GH API incorporating the graph model. A minimal graph server is being developed as a branch of the GA4GH reference server. This is being led by the reference server team (see below). The implementation effort should complete all necessary end-points for the evaluation by the end of May 2015.
2. The construction of a set of graph implementations, each represented by the GA4GH API. These will be provided by community members. Either the implementor can create their own implementation of the GA4GH API serving their graph, or they can use the reference server implementation developed by (1). The data format to create a graph genome within the reference server is described below (see Graph Format). For groups unable to host their own server, UCSC will host the server upon request.
3. The construction of a set of analyses using the GA4GH APIs. This will be provided by members of the group. These may lead to further extension of the APIs.
At the end of the pilot people who have made a contribution to any of these three areas will be included as authors on a marker paper describing the graphs, the implementations and assessments. We are seeking provisional commitments to provide contributions to these three aspects (please add your name/group below with a brief description).
##Graph Format
To represent a graph in the GA4GH reference server it must be converted into a SQLite-based format, from which the server serves. The format is described here. It closely reflects the Avro schema present in the GA4GH API. An example graph in this format is shown [here](https://github.com/ga4gh/server/blob/graph/tests/data/graphs/graphData_v023.sql).
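As a rough illustration of working with such a file, the sketch below loads a SQL script into an in-memory SQLite database using Python's standard `sqlite3` module and lists the tables it defines. It assumes the graph file is a plain SQL script that SQLite can execute. The toy tables and columns are invented for the example and are not the schema used by the reference server.

```python
import sqlite3

# Hypothetical toy script standing in for a real graph .sql file.
TOY_SQL = """
CREATE TABLE ToySequence (id INTEGER PRIMARY KEY, length INTEGER NOT NULL);
CREATE TABLE ToyJoin (side1SequenceId INTEGER, side2SequenceId INTEGER);
INSERT INTO ToySequence VALUES (1, 100);
INSERT INTO ToySequence VALUES (2, 40);
INSERT INTO ToyJoin VALUES (1, 2);
"""


def load_graph_sql(sql_text):
    """Execute a SQL script in a fresh in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(sql_text)
    return conn


def list_tables(conn):
    """Return the names of all tables in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def count_rows(conn, table):
    """Count rows in one table; the name must come from list_tables()."""
    if table not in list_tables(conn):
        raise ValueError(f"unknown table: {table!r}")
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


if __name__ == "__main__":
    conn = load_graph_sql(TOY_SQL)
    for name in list_tables(conn):
        print(name, count_rows(conn, name))
```

To inspect a real graph, the text of a locally saved copy of the example file could be passed to `load_graph_sql` in place of `TOY_SQL`.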
Note that for the purposes of the pilot, all sequences and joins can be reported as part of a single ReferenceSet or VariantSet. All declared CallSets are part of that one VariantSet, each CallSet representing a single original sequence used to generate the graph. Then, we can map that CallSet to its corresponding Allele by looking for the unique AlleleCall with ploidy equal to 1.
The provided example dataset demonstrates this kind of setup.
Thus, the following will not be expected of datasets provided for the pilot:
- multiple reference/variant sets
- allele calls with ploidy > 1
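The CallSet-to-Allele lookup described above can be sketched in plain Python. The record layout here (dicts with `call_set_id`, `allele_id`, and `ploidy` keys) is a hypothetical simplification for illustration and does not mirror the actual GA4GH schema field names.

```python
from collections import defaultdict


def map_call_sets_to_alleles(allele_calls):
    """Map each CallSet ID to the Allele ID of its unique ploidy-1 AlleleCall.

    allele_calls: iterable of dicts with the hypothetical keys
    'call_set_id', 'allele_id', and 'ploidy'.
    """
    candidates = defaultdict(list)
    for call in allele_calls:
        if call["ploidy"] == 1:
            candidates[call["call_set_id"]].append(call["allele_id"])

    mapping = {}
    for call_set_id, allele_ids in candidates.items():
        # The pilot setup expects exactly one such call per CallSet.
        if len(allele_ids) != 1:
            raise ValueError(
                f"CallSet {call_set_id!r} has {len(allele_ids)} ploidy-1 "
                "AlleleCalls; expected exactly one"
            )
        mapping[call_set_id] = allele_ids[0]
    return mapping
```

Raising an error on duplicates, rather than silently picking one, makes a dataset that breaks the single-call expectation visible early.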
##Time-line
Due to the exploratory and ambitious nature of the pilot, we propose to have two rounds of evaluation. In the first round, prototypes will be submitted and evaluated informally by the group without wider sharing, so there should be no problem in submitting experimental graphs. In the second round, the submitted graphs will be evaluated for publication. The timeline is as follows:
- Submission of 1st round prototype graphs - 1st of June 2015
- Completion of evaluations of 1st round of prototype graphs - 22nd of June 2015 (Monday before call).
- Submission of 2nd round prototype graphs - 15th of July 2015
After the 2nd round of submissions we anticipate working iteratively toward a publication targeted for September 2015.
##Contributors to the pilot (signup below!)
###Graph Contributors
- Team-UCSC (Adam Novak, Maciek Smuga-Otto, Glenn Hickey, Benedict Paten, Karen Miga, David Haussler). Will provide implementations for all pilot regions. Plans to provide two different implementations for five of the regions: one based upon the Cactus multiple sequence aligner, and one based upon the context-based mapping scheme that we call Camel (sticking with the desert theme). For the CENX region, will provide a graph built by Karen Miga.
- Team-Hinx - (Erik Garrison @Sanger) Will provide implementations for all pilot regions using the variation graph assembler/aligner vg.
- Team-BDG - (Frank Nothaft @ Berkeley) Will provide implementations for parts 2 and 3 for all pilot regions using the avocado assembler.
- Team-Oxford - (Phelim Bradley, Alexander Dilthey, Zamin Iqbal, Jerome Kelleher, Sorina Maciuca, Gil McVean). Will provide implementations for some pilot regions, plus some non-human examples, e.g. the MSP3.4 gene in P. falciparum.
###API Implementation Contributions
- Reference-server team (Jerome Kelleher, Danny Colligan, Maciek Smuga-Otto, et al.)
###Analysis Contributors
- Team-UCSC - a comparison of the underlying alignments of the sequences composed within the reference graph.
- Team-UCSC - a side graph comparison methodology devised by David Haussler, described here.
##Relevant Publications
Members of the group have been developing some theory and prototypes relevant to the pilot. These can be listed here (feel free to add).
Paten, Novak, Haussler paper describing approaches to constructing a reference structure
[Novak, Rosen, Haussler, Paten paper describing mapping to a reference structure (only deals with string to string case, but concept generalises)](http://arxiv.org/pdf/1501.04128v1.pdf)
[Dilthey, Cox, Iqbal, Nelson, McVean paper describing applications of a graph-based reference structure to inference in the MHC](http://biorxiv.org/content/early/2014/07/08/006973)