Better support for read groups in distmap integration #518

magicDGS · 2018-08-24T10:19:52Z

@robmaz would like to better integrate ReadTools' support for every supported format (currently FASTQ/SAM/BAM/CRAM) into the distmap pipeline. The current pipeline is the following (all steps are called from within the distmap software):

ReadTools ReadsToDistmap: uploads the reads (with optional trimming) to HDFS in the compact and splittable distmap format. The current implementation only keeps the barcodes in the read name, but if barcode de-multiplexing has already been performed, keeping the read groups (@RG) is desirable to mark the reads properly. One suggestion is to dump the header with the @RG lines for later use on download (see below and ReadsToDistmap: dump sam header somewhere (feature request) #510), but this causes problems if multiple read groups are present, as reads cannot be re-assigned without a full de-multiplexing run.
Then, distmap maps the reads. Internally, it converts the distmap format into FASTQ, runs the mapper and outputs part files in SAM/BAM format. As each mapper has its own features, we cannot make any assumption about what the header will look like (including @RG header lines); this is one of the limiting factors outside our control.
ReadTools DownloadDistmapResult: downloads the part files (SAM/BAM) from HDFS and merges them into a combined file on the local path. It would be nice to provide a SAM header with read groups (or a master SAM header with more information) to be merged with the ones downloaded from the distmap run (requested in DownloadDistmapResult: merge SAM header (feature request) #511), but this is not trivial, as it needs specific rules and requires re-assigning the read group of each read (as in the first step).

To make it possible to round-trip reads->distmap->reads and keep the read group information from the original reads, there are several proposals under discussion:

Only allow one read group on download (suggested here: DownloadDistmapResult: merge SAM header (feature request) #511 (comment)) and fail otherwise. This can be weird, because we allow uploading/transforming reads from multiple @RG but not downloading them if we want to retrieve the information. This is the option that requires the least effort, as it will just fail for multiple @RG and assign the single one otherwise. Still, it will need some rules to merge the rest of the header fields (unless the @RG lines are the only header lines allowed, apart from the version one).
Integrate a new distmap format that supports adding to the read name either the barcodes if no @RG is present (@{{read_name}}#{{barcode_seq}}) or the read-group ID/index (@{{read_name}}#{{rg_id}} or @{{read_name}}#{{rg_idx}}), which can be parsed afterwards. Some complications might arise from this: 1) the same version of ReadTools would always be required for upload and download; 2) no @RG handling for the legacy distmap format; 3) a header would be required on download if the ID/index was used; 4) loss of raw-barcode information if only the RG is handled. Nevertheless, this is just a first draft and can be modified to address these issues and discussed with @robmaz.
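
To make the read-name scheme from the second proposal concrete, here is a minimal Python sketch of the encode/decode round trip. It is only an illustration of the idea, not ReadTools code: the function names (`encode_read_name`, `decode_read_name`, `resolve_rg_index`) and the choice to split on the last `#` are assumptions made for this example.

```python
from typing import List, NamedTuple, Optional

# Separator taken from the proposal: @{{read_name}}#{{suffix}}
SEPARATOR = "#"


class TaggedName(NamedTuple):
    """Read name plus the optional suffix (barcode, RG ID or RG index)."""
    read_name: str
    suffix: Optional[str]


def encode_read_name(read_name: str, suffix: Optional[str]) -> str:
    """Append the suffix to the read name (hypothetical helper)."""
    if suffix is None:
        return read_name
    if SEPARATOR in suffix:
        raise ValueError(f"suffix must not contain {SEPARATOR!r}: {suffix}")
    return f"{read_name}{SEPARATOR}{suffix}"


def decode_read_name(encoded: str) -> TaggedName:
    """Split on the LAST separator, so a '#' inside the name is kept."""
    name, sep, suffix = encoded.rpartition(SEPARATOR)
    if not sep:
        # No separator at all: the name carries no suffix.
        return TaggedName(encoded, None)
    return TaggedName(name, suffix)


def resolve_rg_index(suffix: str, header_rg_ids: List[str]) -> str:
    """Map an index suffix back to a read-group ID using the header.

    This is why a header is required on download when an index is used.
    """
    idx = int(suffix)
    if not 0 <= idx < len(header_rg_ids):
        raise ValueError(f"read-group index out of range: {idx}")
    return header_rg_ids[idx]
```

Note that a suffix-free read name that itself contains `#` cannot be told apart from an encoded one in this sketch, so a real format would need an explicit marker or version tag.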

I think that a quick implementation of option 1 is good for having this support to some extent, with a warning on upload and an error on download for more than one RG in the header file (saying that this limitation might be removed in the future), and then evolving the new format for distmap (#404) to contain information for the read group and maybe some arbitrary information. Another option is to change distmap to use the map-reduce code from Hadoop-BAM to split the input file, and completely remove the need for the distmap custom format.
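
The warn-on-upload / fail-on-download rule for option 1 could look roughly like the following Python sketch. It works on plain SAM header text (tab-separated `@RG` lines with an `ID:` field); the function names and messages are hypothetical and do not correspond to existing ReadTools code.

```python
import warnings
from typing import List, Optional


def read_group_ids(header_text: str) -> List[str]:
    """Collect the ID of every @RG line in a SAM header given as text."""
    ids = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@RG":
            continue
        for field in fields[1:]:
            if field.startswith("ID:"):
                ids.append(field[len("ID:"):])
                break
    return ids


def check_on_upload(header_text: str) -> None:
    """Upload side: only warn, because the reads can still be transformed."""
    if len(read_group_ids(header_text)) > 1:
        warnings.warn(
            "multiple read groups found; they cannot be restored on "
            "download (this limitation might be removed in the future)"
        )


def single_read_group_on_download(header_text: str) -> Optional[str]:
    """Download side: fail for more than one @RG, else return the single ID."""
    ids = read_group_ids(header_text)
    if len(ids) > 1:
        raise ValueError(
            f"expected at most one read group, found {len(ids)}: {ids}"
        )
    return ids[0] if ids else None
```

With a single `@RG` line the download check returns its ID so that it can be assigned to every read; rules for merging the remaining header fields are left out of this sketch.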


magicDGS · 2018-08-24T10:20:49Z

@robmaz - let's discuss the requirements for this integration with distmap here instead of in independent issues. We can re-open or create new issues for the required simple components later, once we decide on the design.

magicDGS added the labels "Priority: Medium", "Status: Pending" (In discussion to include in the project backlog) and "Type: Epic" (Tasks that should be split in different issues) on Aug 24, 2018

This was referenced Aug 24, 2018

ReadsToDistmap: dump sam header somewhere (feature request) #510

Closed

DownloadDistmapResult: merge SAM header (feature request) #511

Closed
