
intention of NO_READS_READLEN #74

Open
blex-max opened this issue Nov 21, 2024 · 4 comments

Comments

@blex-max

`_get_reads_length` returns nothing when the count threshold in the final `if` condition, the constant `NO_READS_READLEN`, is never reached, propagating an undefined variable down the chain until the program crashes. We've got users trying to run the tool on long-read data with low depth, which I appreciate the tool wasn't designed for. The most obvious fix would be to lower the value of `NO_READS_READLEN`, currently 50,000; judging by the commit history it did used to be lower: originally the return conditional checked against a hardcoded value of 1,000. However, that may be a very bad fix, and I don't want to go changing things without understanding the intention of these checks. @davidrajones, git reckons these bits of the codebase were written/modified by you. I know it was quite a while ago, but do you have any insight into this? Many thanks
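For illustration, here is a minimal sketch of the failure mode as I understand it, plus one possible defensive fix. The function and variable names are my own reconstruction, not the actual source:

```python
# Hypothetical reconstruction of the reported bug (names and structure
# are assumptions based on the issue description, not the real code).
NO_READS_READLEN = 50_000

def get_reads_length(read_lengths):
    """Return a read length once enough reads have been sampled.

    Mirrors the bug: if fewer than NO_READS_READLEN reads are seen,
    the final `if` never fires and the function falls off the end,
    implicitly returning None, which then propagates downstream.
    """
    count = 0
    for length in read_lengths:
        count += 1
        if count >= NO_READS_READLEN:
            return length
    # Bug: no explicit return here -> caller receives None.

def get_reads_length_fixed(read_lengths):
    """Defensive variant: fall back to the last observed length on
    low-depth input, and fail loudly when there are no reads at all."""
    length = None
    count = 0
    for length in read_lengths:
        count += 1
        if count >= NO_READS_READLEN:
            return length
    if length is None:
        raise ValueError("no reads found in input")
    return length  # low-depth input: use the last length we saw
```

With only 10 reads, the first variant silently returns `None` while the second returns a usable length.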

@AndyMenzies
Contributor

For long-read data there are likely to be far fewer reads than for short-read data, so it's possible that a long-read CRAM never reaches 50,000 reads in total.

This tool was built for short-read data, where there is very little variance in read lengths: usually there is just a little base clipping off the ends, and the majority of reads have the same length. That is not true of long-read data, where there is a wide distribution of possible read lengths. Without significant testing and a full code/logic review I don't know whether it is appropriate to run this over long-read data.
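A quick way to quantify that uniformity assumption (my own sketch, not part of the tool) is the fraction of reads sitting at the modal length; it is near 1.0 for short-read data and much lower for long-read data:

```python
from collections import Counter

def modal_length_fraction(read_lengths):
    """Fraction of reads at the most common read length.

    Short-read runs cluster tightly at one length (fraction near 1.0);
    long-read runs spread over many lengths (fraction much lower).
    A check like this could gate the short-read-only logic.
    """
    counts = Counter(read_lengths)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(read_lengths)
```

For example, a short-read-like sample of 98 reads at 150 bp plus two clipped reads scores 0.98, while a long-read-like spread of unique lengths scores near zero.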

@sb43
Member

sb43 commented Nov 27, 2024

Yes, I agree with @AndyMenzies. We might need to add a test case for long-read data, and possibly a parameter to run in a long-read mode that ignores the read-length and other short-read-specific cutoffs.
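Something along these lines, perhaps; the flag name and wiring are purely illustrative, not an existing option of the tool:

```python
import argparse

# Hypothetical CLI sketch for the proposed long-read mode.
# "--long-read" is an assumed flag name; nothing here exists yet.
def build_parser():
    parser = argparse.ArgumentParser(description="flagging tool")
    parser.add_argument(
        "--long-read",
        action="store_true",
        help="skip read-length and other short-read-specific cutoffs",
    )
    return parser

def effective_read_threshold(args, default=50_000):
    """Disable the sampled-read threshold entirely in long-read mode."""
    return 0 if args.long_read else default
```

Downstream checks would then consult `effective_read_threshold` instead of the hardcoded constant.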

@blex-max
Author

Also agreed, it's on the to-do list. Nevertheless, I'm still curious as to why 50,000 reads specifically.

@AndyMenzies
Contributor

If it's just pulling from the beginning of the CRAM file, it's likely starting at the beginning of chr1. Things map strangely at the ends of chromosomes, so the 50K-read threshold may be there to move away from the telomere and into better-quality sequence.
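If that is the rationale, the sampling logic would amount to something like this sketch (again my own illustration, with assumed parameter values), which discards an initial window of reads before sampling:

```python
from itertools import islice

def sample_after_skip(reads, skip=50_000, sample=1_000):
    """Skip the first `skip` reads (possibly poorly mapped telomeric
    sequence at the start of chr1) and sample the next `sample` reads
    from better sequence. Parameter values are illustrative."""
    it = iter(reads)
    # Standard itertools recipe: advance the iterator `skip` positions.
    next(islice(it, skip, skip), None)
    return list(islice(it, sample))
```

Equivalently, a count threshold of 50,000 before returning a read length has the effect of ignoring everything in that initial window.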
