How to extract the UMI info in an Illumina read name into a separate tag #533

Open
TendoLiu opened this issue Nov 11, 2019 · 1 comment
Labels
Priority: Medium Status: Pending In discussion to include in the project backlog Type: Question User/Developer question to be answer

Comments

@TendoLiu

Hi,
I have been working on UMI collapsing of Illumina DNA-seq data. The FASTQ header looks like the one below. Is there a way to transfer the UMI (here "TATGTNC+NNGAGCA") into a separate tag that duplicate markers could use?

@NS500211:808:HW27KAFXY:1:11101:12228:1057:TATGTNC+NNGAGCA 1:N:0:TCCGGAGA

Thanks.

@magicDGS magicDGS added Priority: Medium Status: Pending In discussion to include in the project backlog Type: Question User/Developer question to be answer labels Nov 14, 2019
@magicDGS
Owner

Hello @TendoLiu - the name of your read looks a bit weird to me, as it contains a Casava barcode (1:N:0:TCCGGAGA) and the UMI appended to the read name (TATGTNC+NNGAGCA). Is this a FASTQ or a BAM file?

ReadTools is a bit "picky" with read names, as it understands only two common formats:

  • Casava: e.g. @NS500211:808:HW27KAFXY:1:11101:12228:1057 1:N:0:TCCGGAGA, where the identified barcode will be TCCGGAGA
  • Illumina: e.g. @NS500211:808:HW27KAFXY:1:11101:12228:1057#TATGTNC+NNGAGCA, where the identified barcode will be TATGTNC+NNGAGCA. Note that, contrary to your case, the barcode is separated from the read name by # instead of :, and that only one barcode is detected because + is used for concatenation instead of the standard separator in the specs, which is -.
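
To make the difference between the two layouts concrete, here is a minimal Python sketch that classifies a header and pulls out the barcode. The regular expressions and the function name `parse_read_name` are my own approximations for illustration; they are not the patterns ReadTools uses internally.

```python
import re

# Approximate patterns (assumptions, not taken from ReadTools).
# Casava: "<name> <read>:<filtered>:<control>:<barcode>"
CASAVA = re.compile(
    r"^@?(?P<name>\S+) [12]:[YN]:\d+:(?P<barcode>\S+)$"
)
# Illumina: "<name>#<barcode>" with an optional "/1" or "/2" suffix
ILLUMINA = re.compile(
    r"^@?(?P<name>[^\s#]+)#(?P<barcode>[^\s/]+)(?:/[12])?$"
)


def parse_read_name(header):
    """Return (format, name, barcode), or None if neither layout matches."""
    for label, pattern in (("casava", CASAVA), ("illumina", ILLUMINA)):
        match = pattern.match(header)
        if match:
            return label, match.group("name"), match.group("barcode")
    return None
```

With these patterns, the header from the question is classified as Casava with barcode TCCGGAGA, and the UMI simply stays inside the read name, which is the mismatch described above.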

ReadTools can handle only one of the problems that you are facing: the barcode separator can be overridden with the Java property barcode_index_delimiter (so providing -Dbarcode_index_delimiter=+ in your case), although the overridden separator will then also be used for all the output files. Nevertheless, I am not sure that your use case matches AssignReadGroupByBarcode, as it is designed for barcodes (like the one after the space) and not for UMIs (I am not familiar with them, but maybe appending them to the read name with : as the separator is a standard there...).
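
The other problem, the : separator in front of the UMI, would have to be fixed outside ReadTools. As a rough sketch only: the Python helpers below (`split_umi` and `to_illumina_style` are hypothetical names, not part of ReadTools) split the last colon-separated field off the read name and rewrite the header in the #-separated layout. It assumes the UMI is always the final field of the name and contains only A, C, G, T, N, + or -. I have not checked that ReadTools accepts the rewritten header.

```python
UMI_ALPHABET = set("ACGTN+-")


def split_umi(header, umi_sep=":"):
    """Split '@<name>:<UMI> <comment>' into (name, umi, comment)."""
    if header.startswith("@"):
        header = header[1:]
    # Everything after the first space is the Casava-style comment.
    name_part, _, comment = header.partition(" ")
    # Assumption: the UMI is the last separator-delimited field of the name.
    name, sep, umi = name_part.rpartition(umi_sep)
    if not sep or not umi or set(umi) - UMI_ALPHABET:
        raise ValueError("no UMI found at the end of the read name")
    return name, umi, comment


def to_illumina_style(header):
    """Rewrite the header as '@<name>#<UMI>', dropping the comment."""
    name, umi, _comment = split_umi(header)
    return "@" + name + "#" + umi
```

Note that the rewritten header drops the Casava comment, so the sample barcode after the space is lost unless it is kept elsewhere. If the goal is a SAM tag rather than a read-name suffix, the SAM optional-field specification reserves RX for UMI sequences, but whether a given duplicate marker reads it depends on the tool.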

Could you please clarify, given this information? Thanks!
