Unexpected antisense overlap and a question on extension #37

Open
jkniehaus opened this issue Oct 31, 2024 · 1 comment

@jkniehaus

Hello,

First, thanks for the tool!

I have a question about the new gene overlap implementation in 1.4.0:
If a gene already overlaps with an antisense gene, does it ignore this restriction?
It looks like that's the case since e.g. Zcchc17 extends into Fabp3.

[Image: UnexpectedExtension]

Additionally, can you think of a reason why peaks2utr would only extend halfway through a peak? I compared long reads (shortened to 500bp to ensure clear peaks) vs 10X data and while long reads worked well, the UTRs are only extended about halfway through the peaks (see below). I think the long read data is more reliable so it'd be nice to use long reads to generate references for 10X data as long as the extensions are complete.
[Image: LongReadIncompleteExtension]

Thanks again!
Jesse

Versions:
python/3.9.6
bedtools/2.30
peaks2utr/1.4.0

code:
peaks2utr --max-distance 10000 --extend-utr --no-strand-overlap -p 64 -o $out $gtf $bam

@haessar
Owner

haessar commented Nov 12, 2024

Hi Jesse,

Thanks for using the tool. You bring up a couple of interesting points.

For your first point, I can confirm that your assumption is correct. When deciding whether to annotate a UTR, the code looks for other genes in the vicinity and orders them by start base. If the --no-strand-overlap option is given, this search is essentially strand-agnostic (i.e. the next closest gene is allowed to be on the opposing strand). However, if the gene is already overlapping (as in your highlighted case), the criteria look to truncate at the 5' end of the next gene, which is assumed to occur at a lower base (for the reverse strand) than the existing gene's 3' end. See this condition in the code:

elif peak.strand == "-" and adj_transcript.end < transcript.start:

I hope this makes sense.
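
To illustrate why an already-overlapping antisense gene never acts as a truncation boundary, here is a minimal standalone sketch built around the condition quoted above. It is not the peaks2utr source: the `Transcript` class, the `limits_extension` helper, and all coordinates are hypothetical, and the forward-strand branch is an assumed mirror image of the quoted reverse-strand check.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical stand-in for an annotated transcript (not a peaks2utr class)."""
    start: int   # lowest base of the feature
    end: int     # highest base of the feature
    strand: str  # "+" or "-"


def limits_extension(peak_strand, transcript, adj_transcript):
    """Return True if adj_transcript lies wholly beyond transcript's 3' end.

    Only neighbours that pass this check can cap a UTR extension in this sketch.
    """
    if peak_strand == "+":
        # ASSUMPTION: mirror image of the reverse-strand condition.
        return adj_transcript.start > transcript.end
    # Reverse strand: same comparison as the quoted line.
    return adj_transcript.end < transcript.start


# Hypothetical coordinates, for illustration only.
gene = Transcript(start=5000, end=8000, strand="-")
separate_neighbour = Transcript(start=1000, end=3000, strand="+")
overlapping_neighbour = Transcript(start=4000, end=5500, strand="+")

print(limits_extension("-", gene, separate_neighbour))     # neighbour ends before the gene's 3' end
print(limits_extension("-", gene, overlapping_neighbour))  # neighbour already overlaps the gene
```

In this sketch the overlapping neighbour fails the comparison because its end (5500) is not below the gene's start (5000), so it is not treated as the next gene to truncate against.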

As for your second point, a truncation of the UTR mid-peak would usually only occur if either

  • MACS2 had determined that that is where the peak ends (see the .cache/forward_peaks.broadPeak file for the forward stranded peaks explicitly called by MACS2)
  • or there were a significant number of reads with polyA tails which terminated at that point (see the definition of SPAT algorithm in the paper https://doi.org/10.1093/bioinformatics/btad112). If you check the attributes for that UTR annotation in the GFF3/GTF output, and see colour=4, then this is likely what has happened.
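
Both checks can be scripted. The sketch below is only a rough aid under stated assumptions: the helper names `peaks_overlapping` and `features_with_colour` are hypothetical, the broadPeak parsing relies on the standard BED-like layout (chrom, start, end in the first three tab-separated columns, 0-based half-open), and the colour attribute is assumed to appear as `colour=4` in GFF3 or `colour "4"` in GTF. The demo records are made up.

```python
import re

# ASSUMPTION: the attribute is written as colour=4 (GFF3) or colour "4" (GTF).
COLOUR_RE = re.compile(r'\bcolour[= ]"?(\d+)"?')


def peaks_overlapping(broadpeak_lines, chrom, start, end):
    """Return (start, end) of broadPeak records overlapping chrom:start-end.

    Coordinates are treated as 0-based, half-open, as in BED.
    """
    hits = []
    for line in broadpeak_lines:
        fields = line.split("\t")
        if len(fields) < 3 or fields[0] != chrom:
            continue
        p_start, p_end = int(fields[1]), int(fields[2])
        if p_start < end and p_end > start:
            hits.append((p_start, p_end))
    return hits


def features_with_colour(annotation_lines, colour="4"):
    """Yield (seqid, start, end, strand) for GFF3/GTF lines carrying the given colour."""
    for line in annotation_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = COLOUR_RE.search(fields[8])
        if match and match.group(1) == colour:
            yield fields[0], int(fields[3]), int(fields[4]), fields[6]


# Made-up demo records; in practice read the lines from
# .cache/forward_peaks.broadPeak and from the peaks2utr output file.
demo_peaks = [
    "chr4\t100\t500\tpeak1\t50\t.\t3.2\t5.1\t4.0\n",
    "chr4\t900\t1200\tpeak2\t40\t.\t2.0\t3.0\t2.5\n",
]
demo_annotation = [
    "# comment line\n",
    "chr4\tpeaks2utr\tthree_prime_UTR\t100\t300\t.\t-\t.\tID=utr1;colour=4\n",
    'chr4\tpeaks2utr\tthree_prime_UTR\t400\t600\t.\t+\t.\tgene_id "g2"; colour "3";\n',
]

print(peaks_overlapping(demo_peaks, "chr4", 400, 950))
print(list(features_with_colour(demo_annotation)))
```

Comparing the peak boundaries returned by the first helper with the end of the annotated UTR should show whether MACS2 itself ended the peak there; a hit from the second helper points towards the polyA-tail explanation instead.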

Let me know if this helps.
