-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
merge_bams_bulk #6
base: master
Are you sure you want to change the base?
Changes from 123 commits
fa66e5d
77c7c4d
4bd3f22
2ec6e0f
92d119d
1333e6d
7327f77
c14ab66
ab5e06c
cda9c49
5f56e1e
309b84a
2103912
49e36d7
6899f34
0eb1b28
2b475f9
6ddae5b
f905e27
2c69f08
1af46b3
011403a
4d79c3b
fbea2de
589a18f
ad3861e
fb2644f
135c0f1
6ca90ed
1280d6b
0023bd6
237836b
cb651d7
b60a71c
3f17184
d5b3f39
bc785a4
6a08adf
c8e9c22
f57a1f0
13d98da
2270e2a
e2601d6
41463b0
d69cff6
b5aa092
47165fc
3d14888
b059954
074eeb2
7b5133d
9cba71a
622b060
2daf2f7
0a2ca27
8f87d81
59763a0
c0fda0c
e2c676e
c5bd718
306652a
56d6a93
3f240ca
d75a7d1
26445d2
b405037
01284c9
af63b20
cdde672
8d09e93
b6baf72
db53b41
f2f5ee2
f78f56c
8fff2bb
19d7625
d2d03dc
81ccd39
8ad0efc
2003e73
3909690
5a55b42
0d48fdb
3e6a314
e7403be
c6f83c0
34ec57d
ea81aef
86b7145
d23ae27
2cd8b34
e6bef52
1898c7f
70c1794
efc4cb0
48824f5
2819c29
e8136f6
3f6314b
e47d6a5
c0f3ab4
e5c81d8
4281ef3
183c550
bd16496
38e21fe
b2ea552
4575496
3e45419
20ccbbd
280ff5f
a5ecff6
d0f4166
9456b6f
4e330d3
b10c66d
c211760
ad5f699
d03aa0e
888bb2a
85d8e62
97ddcbd
4e6d91e
8c69883
877f573
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
import "tasks_demux.wdl" as demux | ||
|
||
workflow merge_bams_bulk { | ||
Array[File]+ in_bams # any order | ||
File in_bam_out_bam_table # first column: input bam file basename, second column: output bam file basename (one line per INPUT file) | ||
File? reheader_table | ||
String? docker="quay.io/broadinstitute/viral-core" | ||
|
||
# removes ".bam"s from ends of filenames in in_bam_out_bam_table | ||
call clean_in_bam_out_bam_table { | ||
input: table = in_bam_out_bam_table | ||
} | ||
|
||
# generates map with key: input bam file name -> value: output bam file basename | ||
Map[String, String] in_bam_to_out_bam = read_map(clean_in_bam_out_bam_table.clean_table) | ||
|
||
# retrieves unique output bam file basenames (no repeats) | ||
call unique_values_in_second_column { | ||
input: table = clean_in_bam_out_bam_table.clean_table | ||
} | ||
|
||
# collects and merges input bam files for each output bam file | ||
scatter (out_bam in unique_values_in_second_column.unique_values) { | ||
String placeholder = write_map(in_bam_to_out_bam) # need to touch map in outer scatter for it to be seen in inner scatter | ||
|
||
# retrieves the input bam files for this output bam file | ||
scatter (in_bam in in_bams) { | ||
String in_bam_basename = basename(in_bam, ".bam") | ||
if(in_bam_to_out_bam[in_bam_basename] == out_bam) { | ||
File relevant_in_bam = in_bam | ||
} | ||
} | ||
Array[File] relevant_in_bams = select_all(relevant_in_bam) # gathers input bam files from the scatter | ||
|
||
# merges the relevant input bam files to produce this output file | ||
call demux.merge_and_reheader_bams { | ||
input: | ||
out_basename = basename(out_bam, ".bam"), | ||
in_bams = relevant_in_bams, | ||
reheader_table = reheader_table, | ||
docker = docker | ||
} | ||
} | ||
} | ||
|
||
task clean_in_bam_out_bam_table { | ||
File table | ||
|
||
command { | ||
cat ${table} | sed 's/[.]bam$//g' | sed $'s/[.]bam\t/\t/g' | tee in_bam_out_bam_table | ||
} | ||
|
||
output { | ||
File clean_table = "in_bam_out_bam_table" | ||
} | ||
} | ||
Comment on lines
+46
to
+56
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This task seems unnecessary -- you're already calling the WDL There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (As you definitely already can tell,) my thinking here was to deal with the user potentially including the .bam extension in the input and/or output bam filenames listed in
or even a more reasonable
vs.
I ran into problems with this line: Note that This would be fine if we could check the map for |
||
|
||
task unique_values_in_second_column { | ||
File table | ||
|
||
command { | ||
cut -f2 ${table} | sort | uniq | tee unique_values | ||
} | ||
|
||
output { | ||
Array[String] unique_values = read_lines("unique_values") | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fascinating -- as I stare at this, it appears you're basically trying to pull off two things in these particular lines of code highlighted here:
basename
) and testing one by one in theif
statementIs it possible we could avoid all this if the workflow took, as input, an
Map[String, Array[File]] out_filename_to_in_bams
, where the keys were output filenames (as Strings) and the values were all the input BAM files (as WDL File objects, not text) that are to be merged into one output? Then this whole workflow could be as easy as:Then maybe if there's really a UI-based desire to present the input mapping as a table with filenames, perhaps we make another WDL workflow (merge_bams_bulk_from_tsv or something like that?) that reads the input text table and
Array[File]
input like you have here, and converts them all to aMap[String,Array[File]]
using the above magic you've already written, and then invokes the simplified workflow I proposed as a subworkflow.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be so glorious. Unfortunately
Map[String, Array[File]]
is not supported by DNAnexus, or at least not in any way I can understand.The closest thing I could find to an explanation is at https://github.com/dnanexus/dxWDL/blob/master/doc/Internals_v0_81.md:
—or at https://libraries.io/github/dnanexus/dxWDL under "Design":
I don't actually understand what this is saying, or if the takeaway is that there is some way to make complex types work with the UI. If you have any insights beyond mine then maybe we could make it work.
You can play around with a demo of what this and other complex datatypes are like at
Hep A MA DPH/pipelines/2.0.11.0-86-g2003e73-lk-bulkifying_demo_of_complex_input_types/merge_bams_bulk
(I can't figure out how to link to it directly, sorry.) Basically there are two inputs created for each complex type: the opportunity to select files and a string input. What to do with them, or if there is anything that can be done with them, I couldn't figure out.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow that quote under "Design" sounds exactly what you're doing in your pipeline. You take an input
Array[File]
of input bam Files and then you take a text file input that contains a serialized representation of the two dimensional data structure. Is there any reason you have to use a tab file instead of a json with the exact same info? I assume you're creating your large tab file programmatically anyway?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I previously just didn't parse that at all, but no reason except that I couldn't figure out how the actual UI inputs work.
For any complex data type (in this case we are looking at
map_of_strings_to_arrays_of_files
), there are two inputs. The first takes a flat array of files, as described:The second takes a string:
Somehow, this string should contain the json
Map[String, Array[File]]
. But how? If the user has to copy/paste the mapping, I actually think the current version of merge_bams_bulk, where the mapping lives in a file, is better for reproducibility and figuring out any errors. Or should it be a hash, as described in the first of the two explanations? If so, a hash of what? created with what hash function? (And if I can't figure it out, how would an end user know what to do?)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's my main issue. Anytime we write a WDL pipeline, I want it to be portable across different execution environments--otherwise, why are we writing it in WDL in the first place? So it should run just fine on your laptop, on a cluster, on DNAnexus, on Terra, and other environments I've never used before. One of the main principles in making that happen is to try to remove as much of the unix file system concept from the high level pipeline itself, leaving it only to the code within each task, because most of the time, these pipelines run in places where directories don't exist, and so forth.
Talking about a bunch of filenames in the text content of an input file starts getting very infrastructure specific, because the way files are organized and referred to are very infrastructure specific. For example, in the case of DNAnexus, the official / unambiguous way to refer to a file object is by its DNAnexus opaque file identifier (
project-ABC:file-XYZ
). In other places, you have directories. Sometimes there's a non-opaque file identifier, such as with most cloud buckets. So what the execution engines generally promote is capturing the input files in an infrastructure-specific way when you execute the workflow and describe the Files, but handling the plumbing of the pipelines in an opaque (don't make any assumptions) kind of way. So using DNAnexus's preferred way of taking in those input mappings is at least something that doesn't mess with compatibility with other platforms. Another approach, as I mentioned before, is to keep the basic part of this pipeline clean and generic (take in theMap[String, Array[File]]
) and use an optional/add-on task right before it to make life easier for DNAnexus users (perhaps even named in a way that makes it clear that it's platform-specific).As for the json representation, almost every WDL executor uses json anyway for describing inputs to executions (that's what you used in that dx-defaults file you deleted). It'd be something to the effect of:
I mean, maybe those input files could be presented in a directory/filename-like representation instead of the opaque identifiers I used above, but maybe not -- the dx-defaults don't accept those forms, but you never know until you try. Anyway, although the above json would be annoying to human-write, I'm assuming your tabfile was script-produced anyway, and there's no real reason it couldn't produce this form of the output as well.
This might seem like splitting hairs between two ways that are semantically doing the same thing (providing a serialized mapping of the two-dimensional data structure, one in tsv form, one in json, and the former is certainly simpler), but I think there's actually a significant difference philosophically between the two: in the latter case, we're telling the execution engine (whether it's DNAnexus or something else) the data structure in a way it understands and having it resolve it. In the former case, we're telling our own custom-written code about the mapping and having it resolve it -- and when it comes to
File
objects, I think this potentially gets dangerous, as it breaks an intended abstraction layer, and could easily have difficulty porting more broadly.Also, if the primary purpose of the preceding bits is to solve a UI issue on DNAnexus, it seems that it would be sensible to make this part of the solution very DNAnexus-specific code. Similar to the assembly stats python script (that is DNAnexus specific), we could imagine a "parse-the-tabfile-to-json" script that is also DNAnexus specific and provided separately from all the WDL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on all counts. My only concern is that I reaaaaally hope it doesn't turn out that we need to use
"project-ABC:file-XYZ1A"
-type filenames in the json. I like being able to see the mapping for debugging and to verify that I didn't mess anything up—though that can be/is printed as an intermediate output, so at least we could see that. But also because an input file with those filenames would be a lot harder to put together. I guess we'll see! I'll play around with actually using the json inputs later this week.