Human Genome Variation Map (HGVM) Pilot Project
The HGVM Pilot project aims to create a draft reference structure that represents all “common” genetic variation, providing a means to stably name and canonically identify each variant. It also aims to demonstrate that such a structure can improve upon current standard methods in genomics and enable new ones. It is run by members of the GA4GH reference-variation task team and is discussed on their regular biweekly calls. To join, please contact @benedictpaten or @skeenan.
The project is starting with 5 pilot regions, which will be used to test approaches at a scale more tractable than the complete human genome. The test regions are:
- Major Histocompatibility Complex (7 alt haps)
- Killer Cell Immunoglobulin-Like Receptor (KIR) Gene Cluster (30 alt haps)
- Spinal Muscular Atrophy (SMA) locus (2 alt haps)
- BRCA1 locus (plus LRG sequence)
- BRCA2 locus (plus LRG sequence)
We are using the GRCh38 human assembly and are including the available alternative haplotype sequences. Nancy Ouyang at Curoverse has curated the sequences here.
The sequences were gathered as described (with scripts) here. The FASTA files are named by gene name and NCBI gene ID (e.g. BRCA1-672.fa).
She did not mirror the IMGT HLA contents (which can be obtained here) due to their policy, even though mirroring was requested in the minutes from the DWG meeting.
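As an illustration of the file naming convention, the following Python sketch splits a name such as `BRCA1-672.fa` into its gene name and NCBI gene ID. The function `parse_region_filename` and the `RegionFile` type are hypothetical helpers written for this example, not part of the curation scripts, and the pattern assumes every file follows the `<gene name>-<gene id>.fa` form.

```python
import re
from typing import NamedTuple


class RegionFile(NamedTuple):
    """Gene name and NCBI gene ID recovered from a FASTA file name."""
    gene_name: str
    ncbi_gene_id: int


# Assumed layout: <gene name>-<numeric NCBI gene id>.fa
# The greedy gene-name group lets gene names contain hyphens themselves.
_FILENAME_PATTERN = re.compile(r"^(?P<gene>.+)-(?P<gene_id>\d+)\.fa$")


def parse_region_filename(filename: str) -> RegionFile:
    """Split a pilot-region FASTA file name into gene name and gene ID."""
    match = _FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected FASTA file name: {filename!r}")
    return RegionFile(match["gene"], int(match["gene_id"]))


region = parse_region_filename("BRCA1-672.fa")
print(region.gene_name, region.ncbi_gene_id)
```

Names that do not match the assumed pattern raise a `ValueError` rather than being silently skipped, which makes a misnamed file easy to spot.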
A pull request describes the changes to the API that integrate a graph model of reference variation. The pull request adds provisions for describing structural variation to the API for the first time, as well as providing a more general and potentially cleaner model of (mono-)allelic variation. We anticipate this pull request being accepted very soon (the proposed deadline is Friday 20th of January).
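To give a rough feel for what a graph model of reference variation means, here is a toy Python sketch. It is not the GA4GH API and not the schema proposed in the pull request: the `SequenceGraph` class, its methods, and the node names are all hypothetical, and the sketch ignores strand and orientation, which a real implementation would need to handle. It only shows that two alleles can be expressed as two paths through shared sequence.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class SequenceGraph:
    """Toy sequence graph: nodes hold bases, directed edges join node ids."""

    nodes: Dict[str, str] = field(default_factory=dict)
    edges: Set[Tuple[str, str]] = field(default_factory=set)

    def add_node(self, node_id: str, bases: str) -> None:
        self.nodes[node_id] = bases

    def add_edge(self, src: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.edges.add((src, dst))

    def spell(self, path: List[str]) -> str:
        """Concatenate the bases along a path, checking each step is an edge."""
        for src, dst in zip(path, path[1:]):
            if (src, dst) not in self.edges:
                raise ValueError(f"no edge from {src} to {dst}")
        return "".join(self.nodes[node_id] for node_id in path)


# A single-base substitution: shared flanks with two alternative middle nodes.
graph = SequenceGraph()
graph.add_node("left", "ACGT")
graph.add_node("ref", "A")
graph.add_node("alt", "G")
graph.add_node("right", "TTC")
for src, dst in [("left", "ref"), ("left", "alt"),
                 ("ref", "right"), ("alt", "right")]:
    graph.add_edge(src, dst)

reference_allele = graph.spell(["left", "ref", "right"])
alternate_allele = graph.spell(["left", "alt", "right"])
print(reference_allele, alternate_allele)
```

In this picture each allele is a walk through the graph, so larger structural differences could in principle be described the same way as small ones, by adding nodes and edges.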
Currently the plan for the pilot has three parts. Signup for contributions is below.
1. An implementation of the GA4GH API including the graph pull request. This will build on the reference server implementation and will be led by the reference server team.
2. The construction of a set of graph implementations, each represented through the GA4GH API. These will be provided by community members. Implementors can either create their own implementation of the GA4GH API serving their graph, or use the reference server implementation developed in (1). A backend format for importing data into the reference server is forthcoming.
3. The construction of a set of analyses using the GA4GH APIs. These will be provided by members of the group and may lead to further extension of the APIs.
At the end of the pilot, people who have contributed to any of these three areas will be included as authors on a marker paper describing the graphs, the implementations and the assessments. We are seeking provisional (this is not a binding contract!) commitments to contribute to these three aspects (please add your name/group below with a brief description). A timeline is forthcoming.
## Graph Contributors
- Team-UCSC (Adam Novak, Maciek Smuga-Otto, Benedict Paten, David Haussler) will provide implementations for all pilot regions. The team plans to provide two different implementations: one based upon the Cactus multiple sequence aligner and one based upon the context-based mapping scheme.
- Team-Hinx (Erik Garrison @Sanger) will provide implementations for all pilot regions using the variation graph assembler/aligner vg.
- Team-BDG (@fnothaft) will provide implementations for parts 2 and 3 for all pilot regions using the avocado assembler.
## API Implementation Contributions
- Reference-server team (Jerome Kelleher, Danny Colligan, Maciek Smuga-Otto, et al.)
## Analysis Contributors
- Team-UCSC - a comparison of the underlying alignments of the sequences composed within the reference graph.
## Relevant Publications
Members of the group have been developing some theory and prototypes relevant to the pilot. These can be listed here (feel free to add).
- Paten, Novak, Haussler paper describing approaches to constructing a reference structure
- [Novak, Rosen, Haussler, Paten paper describing mapping to a reference structure (only deals with string to string case, but concept generalises)](http://arxiv.org/pdf/1501.04128v1.pdf)