Human Genome Variation Map (HGVM) Pilot Project

The HGVM Pilot project aims to create a draft reference structure that represents all “common” genetic variation, providing a means to stably name and canonically identify each variant. It also aims to demonstrate that such a structure can be used to improve upon current standard methods in genomics and to create new ones. It is being run by members of the reference-variation GA4GH task team, and is discussed on their regular biweekly calls. To join, please contact @benedictpaten or @skeenan.

Pilot Test Data

The project is starting with 6 pilot regions, which will be used to test approaches at a scale more tractable than the complete human genome. The test regions are:

Major Histocompatibility Complex (8 alt haps)
Killer Cell Immunoglobulin-Like Receptor (KIR) Gene Cluster (35 alt haps)
Spinal Muscular Atrophy (SMA) locus (2 alt haps)
BRCA1 locus (reference, CHM1 mole, and LRG sequences)
BRCA2 locus (reference, CHM1 mole, and LRG sequences)
X chromosome centromere (CENX) reference repeat unit and reads

We are using the GRCh38 Human assembly, and are including available alternative haplotype sequences.

Full reference and alt haplotype sequences are available from Adam Novak here for all regions. These are the sequences that should be used in building test graphs.

Pilot Test Data Details

For the first 5 regions, each region corresponds to a directory, and within that directory there is one ref.fa with the clipped-out GRCh38.p2 primary path sequence for the region (assigned FASTA ID "ref"), and a number of FASTAs with filenames and record IDs bearing the GI numbers of alternate sequences.
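The directory layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the pilot tooling: the function names `read_fasta` and `load_region` are hypothetical, and it assumes the alternate-sequence files use a `.fa` extension like `ref.fa`, which the description does not state.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a dict mapping record ID -> sequence."""
    records = {}
    current_id = None
    chunks = []
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                if current_id is not None:
                    records[current_id] = "".join(chunks)
                # The record ID is the first whitespace-delimited token after '>'.
                fields = line[1:].split()
                current_id = fields[0] if fields else ""
                chunks = []
            elif current_id is not None:
                chunks.append(line)
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


def load_region(region_dir):
    """Return (primary path sequence, {alt record ID: sequence}) for one region.

    Assumes (hypothetically) that every alternate FASTA ends in '.fa'.
    """
    region_dir = Path(region_dir)
    ref_records = read_fasta(region_dir / "ref.fa")
    if "ref" not in ref_records:
        raise ValueError(f"{region_dir / 'ref.fa'} has no record with ID 'ref'")
    alts = {}
    for fasta_path in sorted(region_dir.glob("*.fa")):
        if fasta_path.name == "ref.fa":
            continue
        alts.update(read_fasta(fasta_path))
    return ref_records["ref"], alts
```

Calling `load_region` on a region directory returns the primary path sequence and a dictionary of alternate sequences keyed by their record IDs, which is one convenient starting point for building a test graph.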

MHC, SMA, and KIR sequences were extracted using this script. BRCA1 and BRCA2 sequences (including the LRG sequences) were obtained using Nancy Ouyang's script below, and formatted using this script.

The CENX data has been provided by Karen Miga, and is in a slightly different format: ref.fa contains the reference repeat unit, "DXZ1", with its FASTA ID set to ref, while reads.fa contains several independent read sequences from repeats in the region in question. The format is different because there are a few thousand reads, and presenting each in its own file would not be practical.
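Because reads.fa holds many records in one file, it can be convenient to stream it rather than load it whole. The following is a minimal sketch under that assumption; `iter_fasta` and `summarize_cenx` are hypothetical names, and only the file names and the `ref` record ID come from the description above.

```python
from pathlib import Path


def iter_fasta(path):
    """Yield (record ID, sequence) pairs from a FASTA file, one record at a time."""
    record_id, chunks = None, []
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if line.startswith(">"):
                if record_id is not None:
                    yield record_id, "".join(chunks)
                fields = line[1:].split()
                record_id = fields[0] if fields else ""
                chunks = []
            elif line and record_id is not None:
                chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def summarize_cenx(cenx_dir):
    """Report the reference repeat unit length and basic read statistics."""
    cenx_dir = Path(cenx_dir)
    # ref.fa is expected to hold a single record whose FASTA ID is "ref".
    ref_unit = dict(iter_fasta(cenx_dir / "ref.fa"))["ref"]
    read_lengths = [len(seq) for _, seq in iter_fasta(cenx_dir / "reads.fa")]
    return {
        "ref_unit_length": len(ref_unit),
        "read_count": len(read_lengths),
        "total_read_bases": sum(read_lengths),
    }
```

A summary like this is a quick sanity check that the download is complete before the reads are used in graph construction.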

Per-gene sequences for the first 5 regions are available from Nancy Ouyang at Curoverse here, collected using the scripts and methodology described here. FASTA files are named by gene name and NCBI gene ID, e.g. BRCA1-672.fa. The IMGT HLA contents (which can be obtained here) were not mirrored, even though this was requested in the minutes from the DWG meeting, due to their policy.
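The per-gene naming convention can be unpacked programmatically. The sketch below is an illustration only: `parse_gene_fasta_name` is a hypothetical helper, and splitting on the last hyphen is an assumption made so that gene names that themselves contain hyphens (as HLA gene symbols do) are kept intact.

```python
from pathlib import Path


def parse_gene_fasta_name(filename):
    """Split a per-gene FASTA file name into (gene name, NCBI gene ID).

    Assumes the pattern '<gene name>-<numeric gene id>.fa'. The split is made
    on the last hyphen so that hyphenated gene names survive.
    """
    name = Path(filename).name
    if not name.endswith(".fa"):
        raise ValueError(f"not a .fa file: {filename}")
    gene_name, sep, gene_id = name[: -len(".fa")].rpartition("-")
    if not sep or not gene_name or not gene_id.isdigit():
        raise ValueError(f"unexpected file name: {filename}")
    return gene_name, int(gene_id)


print(parse_gene_fasta_name("BRCA1-672.fa"))  # ('BRCA1', 672)
```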

Structure of the pilot

The GA4GH API now supports a graph model of the reference in which variations can be described. A description of the reference model is contained in the common.avdl file within the schemas.

Currently the plan for the pilot has three parts. Signup for contributions is below.

1. An implementation of the GA4GH API incorporating the graph model. This has been developed on the reference server implementation and is being led by the reference server team (see below). The implementation effort should complete all necessary end-points for the evaluation by the end of May 2015.
2. The construction of a set of graph implementations, each represented by the GA4GH API. These will be provided by community members. Implementors can either create their own implementation of the GA4GH API serving their graph, or use the reference server implementation developed in (1). The data format to create a graph genome within the reference server is described below (see Graph Format). For groups unable to host their own server, UCSC will host the server upon request.
3. The construction of a set of analyses using the GA4GH APIs. These will be provided by members of the group, and may lead to further extension of the APIs.

At the end of the pilot, people who have made a contribution to any of these three areas will be included as authors on a marker paper describing the graphs, the implementations and the assessments. We are seeking provisional commitments to contribute to these three aspects (please add your name/group below with a brief description).

##Graph Format

Maciek to insert here

##Time-line

Due to the exploratory and ambitious nature of the pilot, we propose to have two rounds of evaluation. In the first round, prototypes will be submitted and evaluated informally by the group without wider sharing, so there should be no problem in submitting experimental graphs. In the second round, the submitted graphs will be evaluated for publication. The timeline is as follows:

Submission of 1st round prototype graphs - 1st of June 2015
Completion of evaluations of 1st round of prototype graphs - 22nd of June 2015 (Monday before call).
Submission of 2nd round prototype graphs - 15th of July 2015

After the 2nd round of submissions we anticipate working iteratively toward a publication targeted for September 2015.

##Contributors to the pilot (signup below!)

###Graph Contributors

Team-UCSC (Adam Novak, Maciek Smuga-Otto, Glenn Hickey, Benedict Paten, David Haussler). Will provide implementations for all pilot regions. Plans to provide 2 different implementations: one based upon the Cactus multiple sequence aligner, and one based upon the context-based mapping scheme that we call Camel (sticking with the desert theme).
Team-Hinx - (Erik Garrison @Sanger) Will provide implementations for all pilot regions using the variation graph assembler/aligner vg.
Team-BDG - (Frank Nothaft @ Berkeley) will provide implementations for parts 2,3 for all pilot regions using the avocado assembler.
Team-Oxford - (Phelim Bradley, Alexander Dilthey, Zamin Iqbal, Jerome Kelleher, Sorina Maciuca, Gil McVean). Will provide implementations for some pilot regions, plus some non-human examples, e.g. the MSP3.4 gene in P. falciparum.

###API Implementation Contributions

Reference-server team (Jerome Kelleher, Danny Colligan, Maciek Smuga-Otto, et al.)

###Analysis Contributors

Team-UCSC - a comparison of the underlying alignments of the sequences composed within the reference graph.
Team-UCSC - a side graph comparison methodology devised by David Haussler, described here.

##Relevant Publications

Members of the group have been developing some theory and prototypes relevant to the pilot. These can be listed here (feel free to add).

Paten, Novak, Haussler paper describing approaches to constructing a reference structure

[Novak, Rosen, Haussler, Paten paper describing mapping to a reference structure (only deals with string to string case, but concept generalises)](http://arxiv.org/pdf/1501.04128v1.pdf)

[Dilthey, Cox, Iqbal, Nelson, McVean paper describing applications of a graph-based reference structure to inference in the MHC](http://biorxiv.org/content/early/2014/07/08/006973)
