Implement HMMER's handling of X's (for protein) and N's (for DNA) #117

Open
ihh opened this issue Jun 14, 2020 · 0 comments
ihh commented Jun 14, 2020

HMMER weights IUPAC degenerate emissions using the reciprocal of the perplexity of the underlying match state (see the esl_abc_FExpectScore function in the HMMER3 source).

This has the effect that the "score" for those emissions is the expectation of what you'd get if you randomized the X's using the underlying emission distribution. This was much to the chagrin of Roger Sewell, who argued that they should be treated as missing data (Sean's counterargument is that this would reward their alignment to the model). It's an old argument.
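
For concreteness, here is a rough Python sketch of the expected-score idea described above: the score of a degenerate symbol is the weighted mean of the scores of the canonical residues it could stand for. This is not HMMER's code. The function name `expected_score`, the partial `DNA_DEGENERACY` table, and the choice of which distribution to pass as `weights` are all assumptions made for illustration.

```python
# Hypothetical, partial IUPAC table: degenerate DNA code -> canonical residues.
DNA_DEGENERACY = {
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "N": "ACGT",  # any base
}


def expected_score(symbol, scores, weights, degeneracy=DNA_DEGENERACY):
    """Weighted mean of canonical scores over the residues `symbol` may represent.

    `scores` and `weights` are dicts keyed by canonical residue. Which
    distribution to use as `weights` is an assumption left to the caller.
    """
    # A canonical residue maps to itself, so its score is returned unchanged.
    residues = degeneracy.get(symbol, symbol)
    total_weight = sum(weights[r] for r in residues)
    if total_weight <= 0:
        raise ValueError("weights for %r sum to zero" % symbol)
    weighted_sum = sum(weights[r] * scores[r] for r in residues)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Made-up scores and weights for one match state.
    scores = {"A": 1.0, "C": 0.0, "G": -1.0, "T": -1.0}
    weights = {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}
    for symbol in "ARN":
        print(symbol, expected_score(symbol, scores, weights))
```

The missing-data alternative would instead sum the emission probabilities over the residues in the degeneracy set (so N would emit with probability 1), which is the behaviour the counterargument above objects to.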

Practically (as noted by @jordisr) this affects <1% of sequences, but for full HMMER compatibility we ought to include it.
