This pipeline is a companion to the main functional consequence & gene mapping pipeline and is intended to be used in conjunction with it. The main pipeline uses Ensembl's Variant Effect Predictor (VEP) to assign functional consequences to variants. The problem with VEP is that it does not support repeat expansion variants. They are a special class of duplication variants which represent uncertain number of repeats of a small set of nucleotides (see Table 1 in this abstract for examples). As a result of this limitation, the main pipeline cannot process these variants and ignores them. This sub-module was created to address this problem.
The pipeline has one input file and two output files.
The input file is ClinVar's TSV summary file.
The first output file, in TSV format, is the consequences table. It uses the same six-column format as the main pipeline. Example (tabs are replaced with spaces for readability):
RCV000986111 1 ENSG00000177570 SAMD12 short_tandem_repeat_expansion 0
RCV001003411 1 ENSG00000197386 HTT trinucleotide_repeat_expansion 0
The columns, in order, are:
- RCV ID: identifier of a ClinVar record containing the repeat variant. RCV IDs are used instead of the usual
CHROM:POS:REF:ALT
notation because it is not easily applied to repeat expansion variants. - Unused, retained for compatibility with the original pipeline.
- Ensembl gene ID.
- Ensembl gene name.
- Repeat expansion variant type, can be either
short_tandem_repeat_expansion
ortrinucleotide_repeat_expansion
. See more on that below. - Unused, retained for compatibility with the original pipeline.
The second output file, also in TSV format, is the dump of the Pandas dataframe used for data processing. It contains a number of intermediate columns which were used to make the decision on the final data appearing in the consequences table. This table can be used for debugging or discussion purposes.
See instructions in the main README file to install the necessary packages first. Then run the pipeline:
python3 run_repeat_expansion_variants.py \
--clinvar-summary-tsv variant_summary.txt.gz \
--output-consequences output_consequences.tsv \
--output-dataframe output_dataframe.tsv
The information which the pipeline needs to determine for each variant is fairly simple:
- Whether the variant is a trinucleotide expansion, or any other expansion, e. g. with the repeat unit length of 2.
- Which gene the variant affects.
This is made quite complicated by various systemic peculiarities in ClinVar data. In this section, the pipeline operation is explained step by step along with how those peculiarities are addressed.
As input data from Clinvar, we get four useful columns:
- Name: variant identifier, which can be in three possible formats, explained below.
- RCVaccession: accession of a ClinVar record associating this variant with a phenotype.
- GeneSymbol: symbol (name) of a gene which the variant affects.
- HGNC_ID: HGNC ID of a gene which the variant affects
RCVaccession and GeneSymbol columns may contain multiple values, delimited by semicolons. This means there are multiple records associating the variant with a phenotype, or the variant affecting multiple genes. For example, variant NM_001007026.1(ATN1):c.1462CAG[(49_55)] (p.Gln488[(49-55)])
is associated with accessions RCV000032095, RCV000032096, RCV000032097 and gene symbols ATN1 and LOC109461484.
In case there is a single gene in GeneSymbol column, then HGNC_ID will contain the ID of that gene. If there are multiple genes, the HGNC_ID column will be empty.
As the first preprocessing step, we split each record by RCVaccession and GeneSymbol columns. This means that, in the example above, a single record becomes six, one per every combination of ClinVar record and gene.
Next, we remove duplicates in the resulting table. They can occur because some (Name, RCVaccession, GeneSymbol) tuples appeared twice in the original table, with coordinates for GRCh37 and GRCh38. Since the genomic coordinates are not used by this pipeline, we simply remove the duplicates.
The Name column contains a variant description which can be in three possible formats:
- HGVS-like variant description in coding or genomic coordinates, e. g.
NM_000044.4(AR):c.172_174CAG(10_36) (p.Gln69_Gln80del)
. - HGVS-like variant description in protein coordinates, e. g.
NP_003915.2:p.Ala260(5_9)
. - Human-readable, however standardised, description, e. g.
ATN1, (CAG)n REPEAT EXPANSION
.
There are also two properties which we would like to extract:
- Length of the repeat unit. It can be determined using two approaches:
- Directly from the sequence. In examples 1 and 3 it would be len(CAG) = 3.
- Indirectly from the coordinate span of the variant. In example 1, it would be 174 – 172 + 1 = 3.
- Transcript in which the variant resides. In example 1, it would be NM_000044.
Here is the table describing which properties can be extracted from which identifier types:
Identifier type | Repeat sequence | Coordinate span | Transcript ID |
---|---|---|---|
HGVS-like, coding/genomic | sometimes* | sometimes* | yes |
HGVS-like, protein | – | – | – |
Free text | yes | – | – |
* At least one of (repeat sequence, coordinate span) is always present in this case.
At this step, we extract all properties which we can from the variant identifiers and save them to separate dataframe columns.
There are three columns which can be used to deduce gene information: HGNC_ID & GeneSymbol (from ClinVar data), and TranscriptID (parsed from the variant identifier). Because neither of them is present for all records, we successively try to annotate Ensembl gene ID based on each of those columns in that order. For example, if Ensembl gene ID for a particular record was determined based on HGNC_ID, then GeneSymbol and TranscriptID will not be tried for that column. Ensembl gene information is being annotated using their BioMart service.
In some cases, a mapping from HGNC ID produces multiple Ensembl hits. For example, HGNC:10560 is being resolved to ENSG00000285258 and ENSG00000163635. In this case, a record is further split into two, each mapping into their corresponding Ensembl gene IDs.
There are two repeat types detected by this pipeline:
trinucleotide_repeat_expansion
, corresponding to SO term http://sequenceontology.org/browser/current_release/term/SO:0002165. This is chosen when the repeat length appears to be either 3, or a multiple or 3. For example:NM_001081563.1:c.*224_*226[(50-?)]
— repeat unit spans 3 nucleotides based on coordinatesNM_001081560.2(DMPK):c.*224_*226CTG(51_?)
— spans 3 nucleotides based on coordinates, and the sequence length is also 3NM_000100.3(CSTB):c.-210CCCCGCCCCGCG(2_3)
— sequence length is 12, a multiple of 3
short_tandem_repeat_expansion
, corresponding to SO term http://sequenceontology.org/browser/current_release/term/SO:0002161. This is chosen in the cases when the repeat length can be determined by at least one method, but its unit length is not a multiple of 3.
Similarly to the previous step, repeat type can be determined using several fields. The determination algorithm is as follows:
- If the variant identifier is type 3, “Protein HGVS”: assume repeat is a
trinucleotide_repeat_expansion
, since it affects entire amino acids. (Each amino acid is encoded by a triplet of nucleotides. So if we see an insertion of N amino acids, it means an insertion of 3×N nucleotides took place, hence this will be a trinucleotide repeat expansion.) - Otherwise:
- Determine repeat unit length
- If available, use length determined directly from sequence as a priority
- Otherwise, use coordinate span
- If repeat unit length is a multiple of 3, assign
trinucleotide_repeat_expansion
. Otherwise, assignshort_tandem_repeat_expansion
.
- Determine repeat unit length
In case there is no evidence for either repeat type, the field will be left empty.
The record defined as being complete if all three fields are present: variant type; Ensembl gene ID; Ensembl gene name.
Dataframe is output as is to faciliate debugging and discussion. It includes both complete and incomplete records. Example of such a dataframe, generated on 2020-04-08: https://docs.google.com/spreadsheets/d/19rDkxD8rEbVKhIpg-1EyHuTSeFP4IrSmcm7PkO1bLIg/edit?usp=sharing.
The consequences table is obtained by removing the unnecessary columns and collapsing the dataframe. The dataframe is grouped by all triples of (RCVaccession, EnsemblGeneID, EnsemblGeneName), and there is a check to ensure that for each triple there is only one type of repeat predicted.
The consequences table only includes complete records and is saved in a six-column format suitable for use in the main Open Targets evidence string generation pipeline.
As of 2020-04-08, the pipeline was able to process 111 out of 115 records. The causes for missing the 4 records are:
- ClinVar data issues
- RCV000001021 has non-standard name “fragile site, folic acid type, rare, fra(12)(q13.1)” which cannot be parsed.
- RCV000192065 has an empty name.
- Ensembl missing a gene: RCV000000215 and RCV000006519 are mapped to two genes each,
ATXN8
andATXN8OS
, generating four RCV/gene record pairs. Mappings toATXN8OS
are processed normally; however, theATXN8
gene is missing from Ensembl 99 release.