Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge_bams_bulk #6

Open
wants to merge 125 commits into
base: master
Choose a base branch
from
Open
Changes from 123 commits
Commits
Show all changes
125 commits
Select commit Hold shift + click to select a range
fa66e5d
added assemble_denovo_bulk with one task inside scatter
lakras Nov 12, 2019
77c7c4d
added second task as-is to workflow
lakras Nov 12, 2019
4bd3f22
gave the tasks aliases
lakras Nov 12, 2019
2ec6e0f
one task in the workflow, with aliasing
lakras Nov 12, 2019
92d119d
added dx-defaults-assemble_denovo_bulk faile
lakras Nov 12, 2019
1333e6d
made lastal_db_fasta from first task a workflow variable outside the …
lakras Nov 12, 2019
7327f77
added all tasks without aliasing with input variables as global varia…
lakras Nov 12, 2019
c14ab66
added missing commas, oops
lakras Nov 12, 2019
ab5e06c
limited to two tasks in workflow
lakras Nov 12, 2019
cda9c49
three tasks in workflow
lakras Nov 12, 2019
5f56e1e
deleted extra whitespace to see if that makes any difference
lakras Nov 12, 2019
309b84a
added trim_clip_db as a workflow-level variable
lakras Nov 12, 2019
2103912
added remaining 2 tasks to workflow with workflow-level variables
lakras Nov 12, 2019
49e36d7
moved default value of trim_clip_db to workflow-level variable in jso…
lakras Nov 12, 2019
6899f34
turned interior of scatter in assemble_denovo_bulk into subworkflow c…
lakras Nov 12, 2019
0eb1b28
actually named the subworkflow oops
lakras Nov 12, 2019
2b475f9
added additional arguments
lakras Nov 12, 2019
6ddae5b
actually named the subworkflow again gah
lakras Nov 12, 2019
f905e27
added lastal_db_fasta and reference_genome_fasta as workflow-level va…
lakras Nov 12, 2019
2c69f08
moved trim_clip_db out to workflow-level variable and added it to the…
lakras Nov 12, 2019
1af46b3
fixed dumb syntax error
lakras Nov 12, 2019
011403a
actually fixed dumb syntax error
lakras Nov 12, 2019
4d79c3b
reverted back to before subworkflow
lakras Nov 12, 2019
fbea2de
...now without dumb syntax error
lakras Nov 12, 2019
589a18f
added option to add novocraft license
lakras Nov 12, 2019
ad3861e
removed empty defaults file for assemble_denovo_bulk
lakras Nov 12, 2019
fb2644f
trying subworkflows again...
lakras Nov 12, 2019
135c0f1
without syntax error
lakras Nov 12, 2019
6ca90ed
renamed subworkflow to avoid confusion
lakras Nov 12, 2019
1280d6b
fixed typo
lakras Nov 12, 2019
0023bd6
gave up on resolving 'Found Errors in generated WDL source' bug; reve…
lakras Nov 12, 2019
237836b
added trim_clip_db default value
lakras Nov 12, 2019
cb651d7
reordered workflow-level input variables
lakras Nov 12, 2019
b60a71c
pulled align_and_plot file variables out to workflow-level variables
lakras Nov 12, 2019
3f17184
added scatter on bam files
lakras Nov 12, 2019
d5b3f39
added alignment options to workflow-level variables so they'll show u…
lakras Nov 12, 2019
bc785a4
moved binary and string variable values into task call
lakras Nov 12, 2019
6a08adf
moved non-file variables back to 'common'
lakras Nov 12, 2019
c8e9c22
pulled variables out to workflow-level for merge_bams_bulk
lakras Nov 12, 2019
f57a1f0
added input table and started working on reading input table
lakras Nov 12, 2019
13d98da
made in_bams a two-element array
lakras Nov 12, 2019
2270e2a
automatically maps input files to the output basename; temporarily ma…
lakras Nov 14, 2019
e2601d6
removed first embedded scatter
lakras Nov 14, 2019
41463b0
removed all nested scatters, since apparently those don't exist
lakras Nov 14, 2019
d69cff6
moved contents of outer scatter to a task
lakras Nov 15, 2019
b5aa092
removed text right at start of scatter?
lakras Nov 15, 2019
47165fc
removed inside of scatter
lakras Nov 15, 2019
3d14888
added call to a task
lakras Nov 15, 2019
b059954
added some code inside the scatter
lakras Nov 15, 2019
074eeb2
just one really easy line of code in the scatter
lakras Nov 15, 2019
7b5133d
and a scatter in the scatter
lakras Nov 15, 2019
9cba71a
fixed stupid stupid syntax errors and brought back the nested scatter…
lakras Nov 15, 2019
622b060
removed unnecessary index step
lakras Nov 15, 2019
2daf2f7
renamed workflow so it gets validated first
lakras Nov 15, 2019
0a2ca27
deleted no longer useful commented-out code at the end
lakras Nov 15, 2019
8f87d81
renamed out_basenames_file
lakras Nov 15, 2019
59763a0
replaced inner scanner with placeholder line
lakras Nov 15, 2019
c0fda0c
moved scatter indices array out of the scatter
lakras Nov 15, 2019
e2c676e
removed basename list traversal by index
lakras Nov 15, 2019
c5bd718
used select_all to get rid of optional files
lakras Nov 15, 2019
306652a
added back inside scatter and coded matching task, without task call
lakras Nov 15, 2019
56d6a93
added call to does_in_bam_match_out_basename
lakras Nov 15, 2019
3f240ca
added reheader table
lakras Nov 16, 2019
d75a7d1
added task for exact matching input file name; added optional in_bam_…
lakras Nov 16, 2019
26445d2
removed else statement, since those don't exist here, and added some …
lakras Nov 16, 2019
b405037
added comments to tasks and separated out regex if statement to be le…
lakras Nov 16, 2019
01284c9
changed elsifs to elifs
lakras Nov 16, 2019
af63b20
moved declaration of relevant_in_bams_optional to outside the if stat…
lakras Nov 16, 2019
cdde672
resolved syntax errors
lakras Nov 16, 2019
8d09e93
added declaration of relevant_in_bams inside if statements
lakras Nov 16, 2019
b6baf72
switched back to what we had, with no input table option
lakras Nov 16, 2019
db53b41
readded accidentally deleted select_allline, oops
lakras Nov 16, 2019
f2f5ee2
added comments to regex task
lakras Nov 16, 2019
f78f56c
renamed merge_bams_bulk back to merge_bams_bulk
lakras Nov 16, 2019
8fff2bb
condensed some wordy code into single line
lakras Nov 17, 2019
19d7625
added input variables of different interesting filetypes to see how t…
lakras Nov 18, 2019
d2d03dc
removed duplicate variable with same name, oops
lakras Nov 18, 2019
81ccd39
commented out array_of_pairs
lakras Nov 18, 2019
8ad0efc
commented out everything with Pairs
lakras Nov 18, 2019
2003e73
deteleted Array[Pair]
lakras Nov 18, 2019
3909690
new version that takes a map of input bam file name to output bam fil…
lakras Nov 19, 2019
5a55b42
shortened really long if statement and fixed in_bam out_bam confusion…
lakras Nov 19, 2019
0d48fdb
added task to retrieve out_bam names from in_bam_out_bam_table
lakras Nov 19, 2019
3e6a314
added check to verify that key is defined in a map
lakras Nov 19, 2019
e7403be
fixed map-file confusion
lakras Nov 19, 2019
c6f83c0
replaced read_string with read_lines and made unique_values an array
lakras Nov 19, 2019
34ec57d
moved out_bams list to within a file rather than automatically retrieved
lakras Nov 19, 2019
ea81aef
renamed merge_bams_bulk so it gets put together earlier
lakras Nov 19, 2019
86b7145
assuming that filenames in the table don't end in .bam
lakras Nov 19, 2019
d23ae27
moved in_bam_to_out_bam to hardcoded literal instead of in file
lakras Nov 19, 2019
2cd8b34
made out_bams a hardcoded literal too
lakras Nov 19, 2019
e6bef52
made in_bam_to_out_bam read in from a table again
lakras Nov 19, 2019
1898c7f
made read_map go through bash script instead of reading directly from…
lakras Nov 19, 2019
70c1794
tried piping map file through stdout
lakras Nov 19, 2019
efc4cb0
added hardcoded map to task outputs so it will hopefully be displayed
lakras Nov 19, 2019
48824f5
resolved reused names oops
lakras Nov 20, 2019
2819c29
replaced map with map_output
lakras Nov 20, 2019
e8136f6
reading directly from file again
lakras Nov 20, 2019
3f6314b
hardcoded map again
lakras Nov 20, 2019
e47d6a5
trying to access elements of map at different scopes
lakras Nov 20, 2019
c0f3ab4
fixed dumb error oops
lakras Nov 20, 2019
e5c81d8
put hardcoded test first
lakras Nov 20, 2019
4281ef3
removed references to hardcoded map
lakras Nov 20, 2019
183c550
removed all except the inner-most test map access
lakras Nov 20, 2019
bd16496
pulled map access out of if statement
lakras Nov 20, 2019
38e21fe
map access in outer scatter
lakras Nov 20, 2019
b2ea552
map access outside of scatter
lakras Nov 20, 2019
4575496
map access in inner scatter, outer scatter, and outside of scatter
lakras Nov 20, 2019
3e45419
reading out_bams from in_bam_out_bam_table instead of its own file or…
lakras Nov 20, 2019
20ccbbd
map access in outer scatter commented out
lakras Nov 20, 2019
280ff5f
deleted commented out code
lakras Nov 20, 2019
a5ecff6
tried adding length of map
lakras Nov 20, 2019
d0f4166
added length of map as access to map, instead of accessing an element…
lakras Nov 20, 2019
9456b6f
trying write_map as map touch
lakras Nov 20, 2019
4e330d3
cleaned it up
lakras Nov 20, 2019
b10c66d
renamed aa_merge_bams_bulk.wdl back to merge_bams_bulk.wdl
lakras Nov 20, 2019
c211760
added function to remove .bam from inside in_bam_out_bam file if it a…
lakras Dec 15, 2019
ad5f699
removed align_and_plot_bulk and assemble_denovo_bulk since the batch …
lakras Dec 15, 2019
d03aa0e
removed variable declaration for cleaned in_bam_to_out_bam file
lakras Dec 15, 2019
888bb2a
improved comment
lakras Dec 15, 2019
85d8e62
renamed intermediate file cleaned_table
lakras Dec 15, 2019
97ddcbd
deleted defaults for assemble_denovo_bulk
lakras Dec 16, 2019
4e6d91e
removed out_bams variable that is only used once
lakras Dec 16, 2019
8c69883
Merge branch 'master' into lk-bulkifying
tomkinsc Jun 8, 2020
877f573
Merge branch 'master' into lk-bulkifying
tomkinsc Jun 14, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
68 changes: 68 additions & 0 deletions pipes/WDL/workflows/merge_bams_bulk.wdl
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
import "tasks_demux.wdl" as demux

workflow merge_bams_bulk {
Array[File]+ in_bams # any order
File in_bam_out_bam_table # first column: input bam file basename, second column: output bam file basename (one line per INPUT file)
File? reheader_table
String? docker="quay.io/broadinstitute/viral-core"

# removes ".bam"s from ends of filenames in in_bam_out_bam_table
call clean_in_bam_out_bam_table {
input: table = in_bam_out_bam_table
}

# generates map with key: input bam file name -> value: output bam file basename
Map[String, String] in_bam_to_out_bam = read_map(clean_in_bam_out_bam_table.clean_table)

# retrieves unique output bam file basenames (no repeats)
call unique_values_in_second_column {
input: table = clean_in_bam_out_bam_table.clean_table
}

# collects and merges input bam files for each output bam file
scatter (out_bam in unique_values_in_second_column.unique_values) {
String placeholder = write_map(in_bam_to_out_bam) # need to touch map in outer scatter for it to be seen in inner scatter

# retrieves the input bam files for this output bam file
scatter (in_bam in in_bams) {
String in_bam_basename = basename(in_bam, ".bam")
if(in_bam_to_out_bam[in_bam_basename] == out_bam) {
File relevant_in_bam = in_bam
}
}
Array[File] relevant_in_bams = select_all(relevant_in_bam) # gathers input bam files from the scatter
Comment on lines +24 to +33
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fascinating -- as I stare at this, it appears you're basically trying to pull off two things in these particular lines of code highlighted here:

  1. invert the data structure (the Map) that might be transposed from what would make the workflow a bit more intuitive (you wanted to scatter on the unique values instead of the unique keys, so arguably this Map is backwards from what you want)
  2. find a way to resolve your filenames-in-a-table Strings into proper Files, something I never thought was possible, but this pulls it off by converting every File back into a String (with basename) and testing one by one in the if statement

Is it possible we could avoid all this if the workflow took, as input, an Map[String, Array[File]] out_filename_to_in_bams, where the keys were output filenames (as Strings) and the values were all the input BAM files (as WDL File objects, not text) that are to be merged into one output? Then this whole workflow could be as easy as:

workflow merge_bams_bulk {
  Map[String, Array[File]] out_filename_to_in_bams
  File? reheader_table
  String? docker="quay.io/broadinstitute/viral-core"

  scatter (pair in out_filename_to_in_bams) {
    call demux.merge_and_reheader_bams {
      input:
        out_basename = basename(pair.left, ".bam"),
        in_bams = pair.right,
        reheader_table = reheader_table,
        docker = docker
    }
  }
}

Then maybe if there's really a UI-based desire to present the input mapping as a table with filenames, perhaps we make another WDL workflow (merge_bams_bulk_from_tsv or something like that?) that reads the input text table and Array[File] input like you have here, and converts them all to a Map[String,Array[File]] using the above magic you've already written, and then invokes the simplified workflow I proposed as a subworkflow.

Copy link
Contributor Author

@lakras lakras Dec 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be so glorious. Unfortunately Map[String, Array[File]] is not supported by DNAnexus, or at least not in any way I can understand.

The closest thing I could find to an explanation is at https://github.com/dnanexus/dxWDL/blob/master/doc/Internals_v0_81.md:

Ragged arrays of files (Array[Array[File]]), and other more complex WDL types, are mapped to two fields: a flat array of files, and a hash, which is a json serialized representation of the WDL value. The flat file array informs the job manager about data objects that need to be closed and cloned into the workspace.

—or at https://libraries.io/github/dnanexus/dxWDL under "Design":

Ragged arrays of files (Array[Array[File]]), and other more complex WDL types, are mapped to two fields: a flat array of files, and a file, which is a json serialized representation of the WDL value. The flat file array informs the job manager about data objects that need to be closed and cloned into the workspace.

I don't actually understand what this is saying, or if the takeaway is that there is some way to make complex types work with the UI. If you have any insights beyond mine then maybe we could make it work.

You can play around with a demo of what this and other complex datatypes are like at Hep A MA DPH/pipelines/2.0.11.0-86-g2003e73-lk-bulkifying_demo_of_complex_input_types/merge_bams_bulk (I can't figure out how to link to it directly, sorry.) Basically there are two inputs created for each complex type: the opportunity to select files and a string input. What to do with them, or if there is anything that can be done with them, I couldn't figure out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow that quote under "Design" sounds exactly what you're doing in your pipeline. You take an input Array[File] of input bam Files and then you take a text file input that contains a serialized representation of the two dimensional data structure. Is there any reason you have to use a tab file instead of a json with the exact same info? I assume you're creating your large tab file programmatically anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I previously just didn't parse that at all, but no reason except that I couldn't figure out how the actual UI inputs work.

For any complex data type (in this case we are looking at map_of_strings_to_arrays_of_files), there are two inputs. The first takes a flat array of files, as described:
Screen Shot 2019-12-16 at 12 53 08 PM
Screen Shot 2019-12-16 at 12 54 07 PM

The second takes a string:
Screen Shot 2019-12-16 at 12 53 24 PM

Somehow, this string should contain the json Map[String, Array[File]]. But how? If the user has to copy/paste the mapping, I actually think the current version of merge_bams_bulk, where the mapping lives in a file, is better for reproducibility and figuring out any errors. Or should it be a hash, as described in the first of the two explanations? If so, a hash of what? created with what hash function? (And if I can't figure it out, how would an end user know what to do?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my main issue. Anytime we write a WDL pipeline, I want it to be portable across different execution environments--otherwise, why are we writing it in WDL in the first place? So it should run just fine on your laptop, on a cluster, on DNAnexus, on Terra, and other environments I've never used before. One of the main principles in making that happen is to try to remove as much of the unix file system concept from the high level pipeline itself, leaving it only to the code within each task, because most of the time, these pipelines run in places where directories don't exist, and so forth.

Talking about a bunch of filenames in the text content of an input file starts getting very infrastructure specific, because the way files are organized and referred to are very infrastructure specific. For example, in the case of DNAnexus, the official / unambiguous way to refer to a file object is by its DNAnexus opaque file identifier (project-ABC:file-XYZ). In other places, you have directories. Sometimes there's a non-opaque file identifier, such as with most cloud buckets. So what the execution engines generally promote is capturing the input files in an infrastructure-specific way when you execute the workflow and describe the Files, but handling the plumbing of the pipelines in an opaque (don't make any assumptions) kind of way. So using DNAnexus's preferred way of taking in those input mappings is at least something that doesn't mess with compatibility with other platforms. Another approach, as I mentioned before, is to keep the basic part of this pipeline clean and generic (take in the Map[String, Array[File]]) and use an optional/add-on task right before it to make life easier for DNAnexus users (perhaps even named in a way that makes it clear that it's platform-specific).

As for the json representation, almost every WDL executor uses json anyway for describing inputs to executions (that's what you used in that dx-defaults file you deleted). It'd be something to the effect of:

 {
   "outfilename1.bam": ["project-ABC:file-XYZ1A", "project-ABC:file-XYZ1B"]
   "outfilename2.bam": ["project-ABC:file-XYZ2A", "project-ABC:file-XYZ2B"] 
 }

I mean, maybe those input files could be presented in a directory/filename-like representation instead of the opaque identifiers I used above, but maybe not -- the dx-defaults don't accept those forms, but you never know until you try. Anyway, although the above json would be annoying to human-write, I'm assuming your tabfile was script-produced anyway, and there's no real reason it couldn't produce this form of the output as well.

This might seem like splitting hairs between two ways that are semantically doing the same thing (providing a serialized mapping of the two-dimensional data structure, one in tsv form, one in json, and the former is certainly simpler), but I think there's actually a significant difference philosophically between the two: in the latter case, we're telling the execution engine (whether it's DNAnexus or something else) the data structure in a way it understands and having it resolve it. In the former case, we're telling our own custom-written code about the mapping and having it resolve it -- and when it comes to File objects, I think this potentially gets dangerous, as it breaks an intended abstraction layer, and could easily have difficulty porting more broadly.

Also, if the primary purpose of the preceding bits is to solve a UI issue on DNAnexus, it seems that it would be sensible to make this part of the solution very DNAnexus-specific code. Similar to the assembly stats python script (that is DNAnexus specific), we could imagine a "parse-the-tabfile-to-json" script that is also DNAnexus specific and provided separately from all the WDL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on all counts. My only concern is that I reaaaaally hope it doesn't turn out that we need to use "project-ABC:file-XYZ1A"-type filenames in the json. I like being able to see the mapping for debugging and to verify that I didn't mess anything up—though that can be/is printed as an intermediate output, so at least we could see that. But also because an input file with those filenames would be a lot harder to put together. I guess we'll see! I'll play around with actually using the json inputs later this week.


# merges the relevant input bam files to produce this output file
call demux.merge_and_reheader_bams {
input:
out_basename = basename(out_bam, ".bam"),
in_bams = relevant_in_bams,
reheader_table = reheader_table,
docker = docker
}
}
}

task clean_in_bam_out_bam_table {
File table

command {
cat ${table} | sed 's/[.]bam$//g' | sed $'s/[.]bam\t/\t/g' | tee in_bam_out_bam_table
}

output {
File clean_table = "in_bam_out_bam_table"
}
}
Comment on lines +46 to +56
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This task seems unnecessary -- you're already calling the WDL basename function on all the members of the table, so why do it twice (once in WDL, once in a task)?

Copy link
Contributor Author

@lakras lakras Dec 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(As you definitely already can tell,) my thinking here was to deal with the user potentially including the .bam extension in the input and/or output bam filenames listed in in_bam_out_bam_table. I'd like to be able to handle, for example, a mapping like

input_a1.bam	output_a
input_a2	output_a
input_b1	output_b.bam
input_b2	output_b
input_c1.bam	output_c.bam
input_d1	output_d.bam

or even a more reasonable

input_a1.bam	output_a.bam
input_a2.bam	output_a.bam

vs.

input_a1	output_a
input_a2	output_a

I ran into problems with this line:
if(in_bam_to_out_bam[in_bam_basename] == out_bam)

Note that in_bam_basename is the basename of an input bam file from the list of bam files provided by the user (Array[File]+ in_bams), while in_bam_to_out_bam is the above mapping file provided by the user. (This map is built directly from the contents of the file, and is not mutable.)

This would be fine if we could check the map for in_bam_basename and in_bam_basename + ".bam". Unfortunately, I couldn't find a way to check if a key is present in a map—and if the key is not in the map, in_bam_to_out_bam[in_bam_basename] gives me an error. So I decided to clean out the .bams from inside the user-provided mapping file before the map is created.


task unique_values_in_second_column {
File table

command {
cut -f2 ${table} | sort | uniq | tee unique_values
}

output {
Array[String] unique_values = read_lines("unique_values")
}
}