Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify the name tokeniser uncomp_len calculation (PR #803) #803

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jkbonfield
Copy link
Contributor

This includes all visible read name bytes plus 1 termination byte per name (e.g. '\0').

Fixes #802

Copy link

github-actions bot commented Jan 7, 2025

Changed PDFs as of bece1f7: CRAMcodecs (diff).

CRAMcodecs.tex Outdated
The serialised data stream starts with two unsigned little endian
32-bit integers holding the total size of uncompressed name buffer and
the number of read names. This is followed the array elements
themselves. Note the uncompressed size the sum of all name lengths
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
themselves. Note the uncompressed size the sum of all name lengths
themselves. Note the uncompressed size is calculated as the sum of all name lengths,

@cmnbroad
Copy link
Contributor

cmnbroad commented Jan 7, 2025

Looks good - thanks!

@jkbonfield jkbonfield added sam cram and removed sam labels Jan 7, 2025
This includes all visible read name bytes plus 1 termination byte per
name (e.g. '\0').

Fixes samtools#802
Copy link

github-actions bot commented Jan 7, 2025

Changed PDFs as of 4982e03: CRAMcodecs (diff).

Comment on lines +2455 to +2456
the number of read names. This is followed the array elements
themselves. Note the uncompressed size is calculated as the sum of
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serialised data stream starts with two unsigned little endian 32-bit integers... This is followed the array elements themselves.

This is unrelated to the length calculation, but note there is also a 1 byte flag between the 2 integers and the data stream:

Bytes Type Name
4 uint32 uncomp_length
4 uint32 num_reads
1 uint8 use_arith

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Needs review
Development

Successfully merging this pull request may close these issues.

Name tokenizer codec clarification
3 participants