make my own index #81

h20gg702 · 2024-06-13T00:23:23Z

Hi developer,

I just want to make sure I am on the right track for building my own index. Because Repbase is not openly available, I downloaded the genome FASTA file from GENCODE (Genome sequence, primary assembly (GRCm38)) and a GTF file from the UCSC Table Browser (GRCm38), and then used gffread to extract the sequences into a FASTA file. What do you think of this approach? I know SalmonTE provides an mm reference, but I want to use the same GTF file for both another analysis and SalmonTE.
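
To make the extraction step concrete, here is a minimal Python sketch of what "pull sequences out of a genome FASTA using GTF coordinates" involves. It is not gffread's implementation: gffread with `-w` and `-g` writes spliced transcript sequences, whereas this sketch extracts one sequence per GTF record, which is closer to what a RepeatMasker-style TE annotation needs. The function names `read_fasta` and `extract_features`, the label format, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Translation table for complementing DNA (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {name: sequence}.

    The record name is the first whitespace-delimited word of the header.
    """
    chunks: Dict[str, list] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def extract_features(genome: Dict[str, str],
                     gtf_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (label, sequence) for every record in a GTF file.

    GTF coordinates are 1-based and end-inclusive, so the Python slice
    is [start - 1:end]. Minus-strand features are reverse-complemented.
    """
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        seq = genome[chrom][start - 1:end]
        if strand == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]
        yield f"{chrom}:{start}-{end}({strand})", seq


# Toy data (hypothetical), small enough to check by hand.
toy_genome = read_fasta([">chr1 toy contig", "ACGTAC", "GGTT"])
toy_gtf = [
    'chr1\ttoy\texon\t2\t5\t.\t+\t.\tgene_id "a";',
    'chr1\ttoy\texon\t7\t10\t.\t-\t.\tgene_id "b";',
]
for label, seq in extract_features(toy_genome, toy_gtf):
    print(label, seq)
# chr1:2-5(+) CGTA
# chr1:7-10(-) AACC
```

One thing worth checking before running a real extraction is that the chromosome names in the GTF match the FASTA headers exactly; in this sketch a mismatch raises a `KeyError`.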

Xiaofei-git · 2024-06-20T07:07:28Z

Hi @hyunhwan-jeong,

I also have some questions about the human reference, so I am commenting here instead of opening a new issue.

1. Which Repbase release did you use for the built-in hg index in SalmonTE?
2. I see you used the sequences in https://github.com/hyunhwan-jeong/SalmonTE/blob/main/scripts/hs_origin.fa for the index, and that FASTA file contains 1068 sequences. Why does the output have 687 rows, as stated in your paper, rather than 1068 TEs/rows? I believe 687 matches the number of rows in https://github.com/hyunhwan-jeong/SalmonTE/blob/main/scripts/annotation_hs.tsv .
3. A question similar to this thread's: I would like to use the complete human genome sequence from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_009914755.1/ . I think I need a GTF/GFF file of TE annotations to extract the TE sequences from the genome, but I cannot find where to get such an annotation file. Do you know?

Thanks!
Xiaofei
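
The mismatch between the number of FASTA records and the number of annotation rows can be checked directly. The following is a minimal sketch, assuming the annotation is a tab-separated file whose first column holds the TE name and whose first row is a header; both assumptions are exposed as parameters (`name_column`, `has_header`) and should be verified against the real file. The helper names and the toy records are hypothetical.

```python
import csv
import io
from typing import Iterable, List, Set


def fasta_names(handle: Iterable[str]) -> List[str]:
    """Return the first word of every FASTA header line, in file order."""
    names = []
    for line in handle:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                names.append(words[0])
    return names


def annotation_names(handle: Iterable[str],
                     name_column: int = 0,
                     has_header: bool = True) -> Set[str]:
    """Return the set of names found in one column of a TSV file."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if has_header:
        rows = rows[1:]
    return {row[name_column] for row in rows}


# Toy stand-ins for the FASTA file and the annotation table (hypothetical).
toy_fasta = io.StringIO(">L1HS some description\nACGT\n>AluY\nGGCC\n>(CA)n\nCACA\n")
toy_annotation = io.StringIO("name\tclass\nL1HS\tLINE\nAluY\tSINE\n")

names = fasta_names(toy_fasta)
kept = annotation_names(toy_annotation)
dropped = [name for name in names if name not in kept]

print(len(names), len(kept), dropped)
# 3 2 ['(CA)n']
```

With real files, the two `io.StringIO` objects would be replaced by `open(...)` handles; the `dropped` list then shows which sequences are in the FASTA but absent from the annotation.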

hyunhwan-bcm · 2024-06-20T23:12:12Z

@h20gg702 It would be possible, but are you trying to measure locus-level expression? SalmonTE is a tool for measuring the overall abundance of TEs, not locus-specific abundance. That's why SalmonTE uses Repbase to create the index. Otherwise, I may have some ideas for using a GTF/GFF file to build the index, but I'm not sure it would be better than the current approach.

@Xiaofei-git

1. It was Repbase 22.06.
2. 1,068 is the number of sequences before filtering, and 687 is the number after filtering.
3. I found this T2T RepeatMasker track in the UCSC Genome Browser: https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=2301588374_I821IjTNN6GhUF2VnET4x0sQ8nbV&db=hub_3671779_hs1&c=chr9&g=hub_3671779_t2tRepeatMasker

Best,

Hyun-Hwan Jeong

h20gg702 · 2024-06-20T23:46:04Z

Hyun-Hwan

Thank you. I just wanted to make an index containing both protein-coding genes and TEs. I know SalmonTE is not meant for locus-specific abundance, and I think that is challenging with regular bulk RNA-seq anyway.
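
A combined gene-plus-TE reference amounts to concatenating two FASTA files while keeping the record names unique and distinguishable. Below is a minimal sketch under stated assumptions: the `TE|` prefix, the function names, and the toy records are hypothetical and are not part of SalmonTE; the resulting FASTA would still need to be indexed by whatever quantifier is used.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (name, sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into a list of (name, sequence) records."""
    records: List[Record] = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def combine_references(genes: Iterable[Record],
                       tes: Iterable[Record],
                       te_prefix: str = "TE|") -> List[Record]:
    """Concatenate gene and TE records, tagging TE names with a prefix.

    Raises ValueError on a duplicate name, because duplicate reference
    names make the downstream quantification table ambiguous.
    """
    combined: List[Record] = []
    seen = set()
    tagged_tes = [(te_prefix + name, seq) for name, seq in tes]
    for name, seq in list(genes) + tagged_tes:
        if name in seen:
            raise ValueError(f"duplicate reference name: {name}")
        seen.add(name)
        combined.append((name, seq))
    return combined


def to_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Serialize records as FASTA text, wrapping sequences at `width`."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy records (hypothetical).
genes = parse_fasta(">tx1 some gene\nACGT\nAC\n")
tes = parse_fasta(">L1\nGGGG\n")
print(to_fasta(combine_references(genes, tes), width=4), end="")
# >tx1
# ACGT
# AC
# >TE|L1
# GGGG
```

Tagging the TE names makes it easy to split the quantification output back into gene rows and TE rows afterwards.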

Xiaofei-git · 2024-06-21T02:47:50Z

By filtering, do you mean excluding the elements listed below?
"we excluded the following elements: simple repeats and multi-copy genes, and DNA transposable."
