gtf2gff3.pl doesn't work with recent versions of Ensembl #35
Comments
Hi Keiran, There are actually two different versions of this script: one that is part of the GBrowse distribution and one that is hosted at the Sequence Ontology. I'm working on figuring out which one is "best" and where it will be housed going forward; in that process, I'll address this bug. Scott
Hi Scott, I've had a little time to have a look and I think the following will make it work for Ensembl data where 'gene_biotype' has been defined (and so work with other incarnations of GTF).
It would be ideal if the 'transcript biotype' could be appended to the attributes output in the GFF3, but that's a nicety. Regards,
There's a typo too:
Hi,
I've found that some transcripts don't get converted correctly when using GTF files downloaded from Ensembl since they started attaching the correct biotype to the genes/transcripts.
I have a set of files I can email to an appropriate account, details of them are below.
In the examples, the e58 versions result in correct protein coding records that display fine in GBrowse. Under e71+, BRAF (plus many others) doesn't display in GBrowse if you choose a source of protein_coding. I think this is because the same ENSG can link to protein_coding and, in this example, nonsense_mediated_decay and retained_intron. In this case the gene record is tagged as retained_intron.
I think the solution is to have a gene record for each but I'm not too sure if this is possible as this will result in 3 gene entries for the same ENSG (but each would be for a different source and potentially a different length).
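To make the situation above easier to spot, here is a minimal Python sketch (not part of gtf2gff3.pl, which is a Perl script) that lists genes whose GTF rows carry more than one source value, together with the span each source covers. It assumes the biotype sits in the GTF source column (column 2) and that every row has a `gene_id` attribute; the function names `parse_attributes`, `sources_per_gene` and `mixed_source_genes` are made up for illustration.

```python
from collections import defaultdict


def parse_attributes(field):
    """Parse a GTF attribute column such as 'gene_id "X"; transcript_id "Y";'."""
    attrs = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def sources_per_gene(gtf_lines):
    """Map gene_id -> {source: (min_start, max_end)} over all GTF rows."""
    spans = defaultdict(dict)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed GTF row
        source, start, end = cols[1], int(cols[3]), int(cols[4])
        gene_id = parse_attributes(cols[8]).get("gene_id")
        if gene_id is None:
            continue
        lo, hi = spans[gene_id].get(source, (start, end))
        spans[gene_id][source] = (min(lo, start), max(hi, end))
    return spans


def mixed_source_genes(gtf_lines):
    """Keep only genes whose rows use more than one source value."""
    return {g: s for g, s in sources_per_gene(gtf_lines).items() if len(s) > 1}
```

Each entry in the result corresponds to one candidate gene record per source, which is the "one gene entry per source, possibly with a different length" idea described above. Whether GBrowse copes with several gene records sharing one ENSG is a separate question that this sketch does not answer.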
Hope I've provided sufficient info to get this looked at quickly as it's preventing us from upgrading our Ensembl version in the browser.
Kind regards,
Keiran Raine
Principal Bioinformatician
Cancer Genome Project
Wellcome Trust Sanger Institute
BRAF_e58.gtf/gff3 are the GTF records and corresponding gff3 generated using the default cfg found on sequenceontology.org.
BRAF_e71.gtf/gff3 as above.
grep_for_ENST00000496384.txt is the result of grepping the two gff3 files for what the filename describes. This suggests this bit is fine.
grep_for_ID=ENSG00000157764.txt is the result of grepping the two gff3 files for what the filename describes. In e71 the gene has been marked as a retained intron.
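For anyone wanting to repeat the second grep check without the attachments, a rough Python equivalent is sketched below. It reads GFF3 rows and returns the source column of every `gene` feature whose `ID` attribute matches. The helper name `gene_sources` is hypothetical, and the sketch assumes plain tab-separated GFF3 with `key=value` attributes.

```python
def gene_sources(gff3_lines, gene_id):
    """Return the source column of each 'gene' row whose ID equals gene_id."""
    found = []
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        # GFF3 attributes are 'key=value' pairs separated by ';'
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        if attrs.get("ID") == gene_id:
            found.append(cols[1])
    return found
```

Called on the lines of each converted file with `ENSG00000157764`, this should show at a glance whether the gene record ended up under protein_coding or under another source such as retained_intron.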