Better support for read groups in distmap integration #518

Open
magicDGS opened this issue Aug 24, 2018 · 1 comment
Labels
Priority: Medium Status: Pending In discussion to include in the project backlog Type: Epic Tasks that should be split in different issues

Comments

@magicDGS
Owner

@robmaz would like to better integrate ReadTools' capabilities for every supported format (FASTQ/SAM/BAM/CRAM for now) into the distmap pipeline. The current pipeline is the following (all steps are called from within the distmap software):

  • ReadTools ReadsToDistmap: uploads the reads (with optional trimming) to HDFS in the compact and splittable distmap format. The current implementation only keeps the barcodes in the read name, but if barcode de-multiplexing has already been performed, keeping the read groups (@RG) is desirable to mark the reads properly. One suggestion is to dump the header with the @RG lines for later use on download (see below and ReadsToDistmap: dump sam header somewhere (feature request) #510), but this causes problems if multiple read groups are present, as reads cannot be re-assigned without a full de-multiplexing run.
  • Then, distmap maps the reads. Internally, it converts the distmap format into FASTQ, runs the mapper and outputs part files in SAM/BAM format. As each mapper has its own features, we cannot make any assumptions about what the header will look like (including @RG header lines) - this is one of the limiting factors outside our control.
  • ReadTools DownloadDistmapResult: downloads the part files (SAM/BAM) from HDFS and merges them into a combined file on the local path. It would be nice to provide a SAM header with read groups (or a master SAM header with more information) to be merged with the ones downloaded from the distmap run (requested in DownloadDistmapResult: merge SAM header (feature request) #511), but this is not trivial, as it needs specific merging rules and requires re-assigning a read group to each read (as in the first step).

To make it possible to round-trip reads->distmap->reads and keep the read group information from the original reads, there are several proposals under discussion:

  1. Only allow one read group on download (suggested here: DownloadDistmapResult: merge SAM header (feature request) #511 (comment)) and fail otherwise. This can be weird, because we allow uploading/transforming reads from multiple @RG but not downloading them if we want to retrieve the information. This is the option that requires the least effort, as it will just fail for multiple @RG and assign the single one otherwise. Still, it will need some rules to merge the rest of the header fields (unless the @RG lines are the only header lines allowed, apart from the version line).
  2. Integrate a new distmap format that supports adding the barcode to the read name if no @RG is present (@{{read_name}}#{{barcode_seq}}) or the read group ID/index (@{{read_name}}#{{rg_id}} or @{{read_name}}#{{rg_idx}}), which can be parsed afterwards. Some complications might arise from this: 1) the same version of ReadTools must always be used for upload and download; 2) @RG handling is unsupported for the legacy distmap format; 3) a header is required while downloading if the ID/index was used; 4) the raw barcode information is lost if only the RG is handled. Nevertheless, this is just a first draft and can be modified to address these issues and discussed with @robmaz.
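
To make the read-name layout of option 2 concrete, here is a minimal Python sketch of how the suffix could be written on upload and parsed on download. It is only an illustration: the function names, the `known_rg_ids` parameter and the rule "a suffix that matches a known @RG ID is a read group, anything else is a barcode" are assumptions, not part of ReadTools or distmap, and the leading `@` of the FASTQ record is left out.

```python
from typing import Iterable, Optional, Tuple

SEPARATOR = "#"  # separator used in the proposed read-name layout


def encode_read_name(read_name: str, rg_id: Optional[str] = None,
                     barcode: Optional[str] = None) -> str:
    """Append the read-group ID if there is one, otherwise the barcode.

    Hypothetical helper: when both are given, only the read-group ID is
    kept, so the raw barcode is lost (complication 4 in the list above).
    """
    suffix = rg_id if rg_id is not None else barcode
    if suffix is None:
        return read_name
    if SEPARATOR in suffix:
        raise ValueError(f"suffix must not contain {SEPARATOR!r}: {suffix}")
    return f"{read_name}{SEPARATOR}{suffix}"


def decode_read_name(encoded: str, known_rg_ids: Iterable[str]
                     ) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (read_name, rg_id, barcode) for an encoded read name.

    The suffix is treated as a read-group ID only if it appears in
    known_rg_ids, which is why a header is needed on download
    (complication 3). Any other suffix is treated as a barcode.
    """
    read_name, sep, suffix = encoded.rpartition(SEPARATOR)
    if not sep:
        # No separator at all: nothing was appended on upload.
        return encoded, None, None
    if suffix in set(known_rg_ids):
        return read_name, suffix, None
    return read_name, None, suffix
```

For example, `decode_read_name("read1#sampleA", ["sampleA"])` gives back the read name with `sampleA` as the read group, while the same call with an empty ID list would report `sampleA` as a barcode. A read name that already contains `#` but carries no suffix would be misparsed, so a real format would need an escaping rule or a mandatory suffix.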

I think that a quick implementation of option 1 is good for having this support to some extent, with a warning on upload and an error on download for more than 1 RG in the header file (saying that this limitation might be removed in the future), and then evolving the new format for distmap (#404) to contain information for the read group and maybe some arbitrary information. Another option is to change distmap to use the map-reduce code from Hadoop-BAM to split the input file, and completely remove the need for the custom distmap format.
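
As a rough illustration of the option 1 behaviour (warn on upload, fail on download when the header has more than one @RG), the check could look like the Python sketch below. It parses the SAM header as plain tab-separated text. All function names and messages are hypothetical, and it does not use the ReadTools or htsjdk API.

```python
import warnings
from typing import List, Optional


def read_group_ids(header_text: str) -> List[str]:
    """Collect the ID field of every @RG line in a SAM header."""
    ids = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@RG":
            continue
        for field in fields[1:]:
            if field.startswith("ID:"):
                ids.append(field[3:])
    return ids


def check_on_upload(header_text: str) -> None:
    """Warn (but do not fail) if the input carries several read groups."""
    count = len(read_group_ids(header_text))
    if count > 1:
        warnings.warn(
            f"{count} read groups found: they cannot be restored on "
            "download (this limitation might be removed in the future)"
        )


def single_read_group_for_download(header_text: str) -> Optional[str]:
    """Return the only read-group ID, None if there is none, else fail."""
    ids = read_group_ids(header_text)
    if len(ids) > 1:
        raise ValueError(
            f"{len(ids)} read groups in the header, only one is supported "
            "(this limitation might be removed in the future)"
        )
    return ids[0] if ids else None
```

The ID returned by `single_read_group_for_download` would then be assigned to every downloaded read. Rules for merging the remaining header lines are not covered here.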

@magicDGS magicDGS added Priority: Medium Status: Pending In discussion to include in the project backlog Type: Epic Tasks that should be split in different issues labels Aug 24, 2018
@magicDGS
Owner Author

magicDGS commented Aug 24, 2018

@robmaz - let's discuss the requirements for this integration with distmap here instead of in independent issues. We can re-open or create new issues for the required simple components later, once we decide on the design.
