make my own index #81

Open
h20gg702 opened this issue Jun 13, 2024 · 4 comments

Comments

@h20gg702

Hi developer,

I just want to make sure I am on the right track in building my own index. Because Repbase is not openly accessible, I downloaded the genome FASTA file from GENCODE (Genome sequence, primary assembly (GRCm38)) and a GTF file from the UCSC Table Browser (GRCm38), and then used gffread to extract the sequences into a FASTA file. What do you think of this approach? I know SalmonTE provides an mm reference, but I want to use the same GTF file both for another analysis and for SalmonTE.
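For readers following the same route, here is a minimal sketch of a sanity check worth running before the extraction step: it confirms that every chromosome name used in the GTF also exists in the genome FASTA, then builds the gffread call. The file names and helper function names are hypothetical; `-g` (genome FASTA) and `-w` (write spliced transcript sequences) are standard gffread options. Whether the resulting FASTA suits SalmonTE is the open question of this thread, so treat this as an illustration only.

```python
import shlex


def fasta_names(lines):
    """Collect sequence names (first word after '>') from FASTA lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_seqnames(lines):
    """Collect the first column (seqname) of every non-comment GTF line."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def gffread_command(genome_fa, gtf, out_fa):
    """Build the gffread call: -g genome FASTA, -w output transcript FASTA."""
    return ["gffread", "-w", out_fa, "-g", genome_fa, gtf]


# Toy data standing in for the real files (hypothetical content).
genome = [">chr1 primary", "ACGTACGT", ">chr2 primary", "GGCCGGCC"]
gtf = [
    'chr1\trmsk\texon\t1\t4\t.\t+\t.\tgene_id "L1Md_A"; transcript_id "L1Md_A_dup1";',
    'chrUn\trmsk\texon\t1\t4\t.\t+\t.\tgene_id "B1"; transcript_id "B1_dup1";',
]

# Names present in the GTF but absent from the genome would make extraction fail or skip records.
missing = sorted(gtf_seqnames(gtf) - fasta_names(genome))
print("seqnames missing from genome:", missing)
print(shlex.join(gffread_command("GRCm38.primary_assembly.genome.fa", "te.gtf", "te.fa")))
```

With real files, pass open file handles instead of the toy lists, and run the printed command only once the list of missing names is empty.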

@Xiaofei-git

Hi @hyunhwan-jeong ,

I also have some questions about the human reference, so I am commenting here instead of opening a new issue.

  1. Which release of Repbase did you use for the built-in hg index in SalmonTE?
  2. I see you used the sequences in https://github.com/hyunhwan-jeong/SalmonTE/blob/main/scripts/hs_origin.fa for the index, and there are 1068 sequences in that FASTA file. Why does the output have 687 rows, as stated in your paper, rather than 1068 TEs/rows? I believe 687 is consistent with the rows in https://github.com/hyunhwan-jeong/SalmonTE/blob/main/scripts/annotation_hs.tsv .
  3. A question similar to this thread's: I want to use the complete human genome sequence from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_009914755.1/ . I think I need a GTF/GFF file of TE annotations to extract the TE sequences from the genome, but I don't see where to find such an annotation file. Do you know?

Thanks!
Xiaofei

@hyunhwan-bcm

@h20gg702 It would be possible, but are you trying to measure locus-level expression? SalmonTE is a tool for measuring the overall abundance of TEs, not locus-specific abundance. That's why SalmonTE uses Repbase to create the index. I do have some ideas for using a GTF/GFF file to build the index, but I am not sure it would be better than the current approach.

@Xiaofei-git

  1. It was 22.06
  2. The 1,068 is before the filtering, and 687 is the number after the filtering.
  3. I found https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=2301588374_I821IjTNN6GhUF2VnET4x0sQ8nbV&db=hub_3671779_hs1&c=chr9&g=hub_3671779_t2tRepeatMasker

Best,

Hyun-Hwan Jeong
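To make the 1,068 versus 687 relationship concrete, here is a rough sketch of how one could check which FASTA records survive into an annotation table. It is not the SalmonTE filtering code: the filtering criteria are not spelled out in this thread, and the assumption that the first column of the TSV holds the TE name (matching the first word of each FASTA header) is mine. The toy records below are invented.

```python
import csv
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs; name is the first word of each header."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def annotated_names(handle, column=0):
    """Return the set of values in one TSV column (assumed to hold TE names).

    A header row, if present, is included too; that is harmless unless a
    sequence happens to share the header's name.
    """
    reader = csv.reader(handle, delimiter="\t")
    return {row[column] for row in reader if row}


def filter_fasta(fasta_handle, keep):
    """Keep only the records whose name appears in `keep`."""
    return [(n, s) for n, s in read_fasta(fasta_handle) if n in keep]


# Invented stand-ins for hs_origin.fa and annotation_hs.tsv.
fasta = io.StringIO(">L1HS\tLINE\nACGT\nAC\n>(CA)n\tSimple\nCACA\n>AluY SINE\nGGCC\n")
annotation = io.StringIO("name\tclass\tclade\nL1HS\tLINE\tL1\nAluY\tSINE\tAlu\n")

kept = filter_fasta(fasta, annotated_names(annotation))
print(len(kept), "records kept:", [name for name, _ in kept])
```

Run against the real files (opened with `open(...)`), the count of kept records can be compared with the 687 rows mentioned above.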

@h20gg702
Author

Hyun-Hwan

Thank you. I just wanted to build an index containing both protein-coding genes and TEs. I know SalmonTE is not meant for locus-specific abundance; I think that is challenging with regular bulk RNA-seq.
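One way to picture such a combined reference is a single FASTA that holds both the gene transcripts and the TE sequences, with a guard against name collisions between the two sets. The sketch below does only that merging step; the function names and toy records are hypothetical, and whether SalmonTE accepts a reference built this way is not confirmed anywhere in this thread.

```python
def merge_records(*groups):
    """Concatenate groups of (name, sequence) records, rejecting duplicate names."""
    seen = set()
    merged = []
    for records in groups:
        for name, seq in records:
            if name in seen:
                raise ValueError(f"duplicate sequence name: {name}")
            seen.add(name)
            merged.append((name, seq))
    return merged


def format_fasta(records, width=60):
    """Render records as FASTA text, wrapping sequence lines at `width`."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Invented records standing in for real transcript and TE sequences.
genes = [("ENSMUST_example", "ACGTACGT")]
tes = [("L1Md_A", "GGCCGG")]

combined = merge_records(genes, tes)
print(format_fasta(combined, width=4), end="")
```

The resulting file could then be given to plain Salmon with `salmon index -t combined.fa -i combined_index`, which is a separate step from SalmonTE's own reference handling.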

@Xiaofei-git

  1. By filtering, do you mean excluding the elements listed below?
    "we excluded the following elements: simple repeats and multi-copy genes, and DNA transposable. "
