Appendix A: WDL Value Serialization and Deserialization
This section suggests ways to handle primitive and compound values in the task command section. When a WDL execution engine instantiates a command specified in the command section of a task, it must evaluate all expression placeholders (~{...} and ${...}) in the command and coerce their values to strings. WDL values can be communicated to the command(s) called in the command section in several ways, and the best method will vary by command.
For example, a task that wraps a tool that operates on an Array of FASTQ files has several ways that it can specify the list of files to the tool:
- A file containing one file path per line, e.g.
  Rscript analysis.R --files=fastq_list.txt
- A file containing a JSON list, e.g.
  Rscript analysis.R --files=fastq_list.json
- Enumerated on the command line, e.g.
  Rscript analysis.R 1.fastq 2.fastq 3.fastq
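The three options above can be compared side by side with a small Python sketch. It is only an illustration of the resulting file contents and argument string, not how an execution engine produces them; the helper names (as_lines_file, as_json_file, as_arguments) are hypothetical.

```python
import json
import shlex
import tempfile
from pathlib import Path


def as_lines_file(paths, directory):
    """Write one path per line and return the file's location."""
    target = Path(directory) / "fastq_list.txt"
    target.write_text("".join(f"{p}\n" for p in paths))
    return target


def as_json_file(paths, directory):
    """Write the paths as a JSON list and return the file's location."""
    target = Path(directory) / "fastq_list.json"
    target.write_text(json.dumps(paths))
    return target


def as_arguments(paths):
    """Quote each path for the shell and join the results with spaces."""
    return " ".join(shlex.quote(p) for p in paths)


fastqs = ["1.fastq", "2.fastq", "3.fastq"]
with tempfile.TemporaryDirectory() as tmp:
    lines_text = as_lines_file(fastqs, tmp).read_text()
    json_text = as_json_file(fastqs, tmp).read_text()

print(f"Rscript analysis.R {as_arguments(fastqs)}")
```

Which form is appropriate depends entirely on what the wrapped tool accepts; the file-based forms avoid command line length limits for long lists.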
On the other end, command line tools will output results in files or to standard output, and these outputs need to be converted to WDL values to be used as task outputs. For example, the FASTQ processor task mentioned above outputs a mapping of the input files to the number of reads in each file. This output might be represented as a two-column TSV or as a JSON object, both of which would need to be deserialized to a WDL Map[File, Int] value.
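As a rough illustration of the two output representations mentioned above, the following Python sketch parses a two-column TSV and a JSON object into the same mapping of file path to read count. The function names and the sample read counts are made up for the example; in WDL the equivalent work is done by the read_* functions.

```python
import json


def counts_from_tsv(text):
    """Parse two tab-separated columns (path, read count) into a dict."""
    counts = {}
    for line in text.splitlines():
        if not line:
            continue  # ignore blank lines
        path, count = line.split("\t")
        counts[path] = int(count)
    return counts


def counts_from_json(text):
    """Parse a JSON object whose values are read counts."""
    return {path: int(count) for path, count in json.loads(text).items()}


# Hypothetical tool output in both formats.
tsv_output = "1.fastq\t100\n2.fastq\t250\n"
json_output = '{"1.fastq": 100, "2.fastq": 250}'

tsv_counts = counts_from_tsv(tsv_output)
json_counts = counts_from_json(json_output)
```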
The various methods for serializing and deserializing primitive and compound values are enumerated below.
§Primitive Values
WDL primitive values are naturally converted to string values. This is described in detail in the string interpolation section.
Deserialization of primitive values is done via one of the read_* functions, each of which deserializes a different type of primitive value from a file. The file must contain a single value of the expected type, with optional whitespace. The value is read as a string and then converted to the appropriate type, or raises an error if the value cannot be converted.
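The read-then-convert behavior described above can be sketched in Python as follows. This is an approximation for illustration only: parse_string and parse_int are hypothetical stand-ins, and the exact whitespace handling (here, trailing line endings for strings and surrounding whitespace for integers) is an assumption rather than a statement of what any engine does.

```python
import re


def parse_string(text):
    """Return the content without trailing end-of-line characters (assumed)."""
    return text.rstrip("\r\n")


def parse_int(text):
    """Convert the content to an int, or raise ValueError if it is not one."""
    value = text.strip()
    if not re.fullmatch(r"[+-]?[0-9]+", value):
        raise ValueError(f"cannot convert {value!r} to Int")
    return int(value)


string_value = parse_string("hello\n")
int_value = parse_int(" 42\n")
```

Calling parse_int on the content "hello" raises an error, which mirrors the commented-out read_int call in the task below.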
Example: read_write_primitives_task.wdl
version 1.2
task read_write_primitives {
input {
String s
Int i
}
command <<<
printf ~{s} > str_file
printf ~{i} > int_file
>>>
output {
String sout = read_string("str_file")
String istr = read_string("int_file")
Int iout = read_int("int_file")
# This would cause an error since "hello" cannot be converted to an Int:
#Int sint = read_int("str_file")
}
requirements {
container: "ubuntu:latest"
}
}
Example input:
{
"read_write_primitives.s": "hello",
"read_write_primitives.i": 42
}
Example output:
{
"read_write_primitives.sout": "hello",
"read_write_primitives.istr": "42",
"read_write_primitives.iout": 42
}
§Compound Values
A compound value such as Array or Map must be serialized to a string before it can be used in the command. There are two general strategies for converting a compound value to a string:
- JSON: most compound values can be written to JSON format using write_json.
- Delimitation: convert each element of the compound value to a string, then join them together into a single string using a delimiter. Some common approaches are:
  - Separate values by a tool-specific delimiter (e.g., whitespace or comma) and pass the string as a single command line argument. This can be accomplished with the sep function.
  - Prefix each value with a command line option. This can be accomplished with the prefix function.
  - Separate values by newlines (\n) and write them to a file. This can be accomplished with the write_lines function.
  - For nested types such as Structs and Objects, separate the fields of each value with a tab (\t), and write each tab-delimited line to a file. This is commonly called tab-separated values (TSV) format. This can be accomplished using write_tsv, write_map, write_object, or write_objects.
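The delimitation approaches can be mimicked with a few lines of Python. The functions below reuse the names prefix, squote, and sep purely as stand-ins for the WDL functions of the same name; they are a sketch of the string manipulation involved, not the engine's implementation.

```python
def prefix(option, values):
    """Prepend the option to each value, coercing the value to a string."""
    return [f"{option}{value}" for value in values]


def squote(values):
    """Wrap each value in single quotes."""
    return [f"'{value}'" for value in values]


def sep(delimiter, values):
    """Join the values into one string using the delimiter."""
    return delimiter.join(values)


args = squote(prefix("-n", [1, 2]))
command_fragment = sep(" ", args)
print(command_fragment)  # prints '-n1' '-n2'
```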
Similarly, data output by a command must be deserialized to be used in WDL. Commands generally either write output to stdout (or sometimes stderr) or to a regular file. The contents of stdout and stderr can be read as files using the stdout and stderr functions. The general strategies for deserializing data from a file are:
- If the output is in JSON format, it can be read into a WDL value using read_json.
- If the output is line-oriented (i.e., one value per line), it can be read into a WDL Array using read_lines.
- If the output is tab-delimited (TSV), it can be read into a structured value using read_tsv, read_map, read_object, or read_objects.
Specific examples of serializing and deserializing each type of compound value are given below.
§Array
§Array serialization by delimitation
This method applies to an array of a primitive type. Each element of the array is coerced to a string, and the strings are then joined into a single string separated by a delimiter. This is done using the sep function.
Example: serialize_array_delim_task.wdl
version 1.2
task serialize_array_delim {
input {
File infile
Array[Int] counts
}
Array[String] args = squote(prefix("-n", counts))
command <<<
for arg in ~{sep(" ", args)}; do
head $arg ~{infile}
done
>>>
output {
Array[String] heads = read_lines(stdout())
}
requirements {
container: "ubuntu:latest"
}
}
Example input:
{
"serialize_array_delim.infile": "data/greetings.txt",
"serialize_array_delim.counts": [1, 2]
}
Example output:
{
"serialize_array_delim.heads": [
"hello world",
"hello world",
"hi_world"
]
}
Given an array [1, 2], the instantiated command would be:
for arg in '-n1' '-n2'; do
head $arg greetings.txt
done
§Array serialization/deserialization using write_lines()/read_lines()
This method applies to an array of a primitive type. Using write_lines, each element of the array is coerced to a string, and the strings are written to a file, one element per line. Using read_lines, each line of the file is read as a String and coerced to the target type.
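A minimal Python sketch of this round trip is shown below. The functions borrow the names write_lines and read_lines only for readability; unlike the WDL functions, they take an explicit path, and the temporary directory stands in for whatever location an engine would choose.

```python
import tempfile
from pathlib import Path


def write_lines(values, path):
    """Write each value on its own line, coercing it to a string first."""
    Path(path).write_text("".join(f"{value}\n" for value in values))


def read_lines(path):
    """Return the lines of the file as strings, without line terminators."""
    return Path(path).read_text().splitlines()


with tempfile.TemporaryDirectory() as tmp:
    patterns_file = Path(tmp) / "patterns"
    write_lines(["hello", "world"], patterns_file)
    restored = read_lines(patterns_file)

print(restored)  # prints ['hello', 'world']
```

Because every line comes back as a string, any conversion to another primitive type happens after the lines have been read.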
Example: serde_array_lines_task.wdl
version 1.2
task serde_array_lines {
input {
File infile
Array[String] patterns
}
command <<<
while read -r pattern; do
grep -c "$pattern" ~{infile}
done < ~{write_lines(patterns)}
>>>
output {
Array[String] matches = read_lines(stdout())
}
}
Example input:
{
"serde_array_lines.infile": "data/greetings.txt",
"serde_array_lines.patterns": ["hello", "world"]
}
Example output:
{
"serde_array_lines.matches": ["2", "2"]
}
Given an array of patterns ["hello", "world"], the instantiated command would be:
while read -r pattern; do
grep -c "$pattern" greetings.txt
done < /jobs/564758/patterns
Where /jobs/564758/patterns contains:
hello
world
§Array serialization/deserialization using write_json()/read_json()
This method applies to an array of any type that can be serialized to JSON. Calling write_json with an Array parameter results in the creation of a file containing a JSON array.
Example: serde_array_json_task.wdl
version 1.2
task serde_array_json {
input {
Map[String, Int] string_to_int
}
command <<<
python <<CODE
import json
import sys
with open("~{write_json(string_to_int)}") as j:
    d = json.load(j)
    json.dump(list(d.keys()), sys.stdout)
CODE
>>>
output {
Array[String] keys = read_json(stdout())
}
requirements {
container: "python:latest"
}
}
Example input:
{
"serde_array_json.string_to_int": {
"a": 1,
"b": 2
}
}
Example output:
{
"serde_array_json.keys": ["a", "b"]
}
Given the Map {"a": 1, "b": 2}, the instantiated command would be:
import json
import sys
with open("/jobs/564758/string_to_int.json") as j:
    d = json.load(j)
    json.dump(list(d.keys()), sys.stdout)
Where /jobs/564758/string_to_int.json would contain:
{
"a": 1,
"b": 2
}
§Pair
A Pair cannot be directly serialized to a String, nor can it be deserialized from a string or a file.
The most common approach to Pair serialization is to serialize the left and right values separately, e.g., by converting each to a String or writing each to a separate file using one of the write_* functions. Similarly, two values can be deserialized independently and then used to create a Pair.
Example: serde_pair.wdl
version 1.2
task tail {
input {
Pair[File, Int] to_tail
}
command <<<
tail -n ~{to_tail.right} ~{to_tail.left}
>>>
output {
Array[String] lines = read_lines(stdout())
}
}
workflow serde_pair {
input {
Map[File, Int] to_tail
}
scatter (item in as_pairs(to_tail)) {
call tail {
to_tail = item
}
Pair[String, String]? two_lines =
if item.right >= 2 then (tail.lines[0], tail.lines[1]) else None
}
output {
Map[String, String] tails_of_two = as_map(select_all(two_lines))
}
}
Example input:
{
"serde_pair.to_tail": {
"data/cities.txt": 2,
"data/hello.txt": 1
}
}
Example output:
{
"serde_pair.tails_of_two": {
"Chicago": "Piscataway"
}
}
§Homogeneous Pair serialization/deserialization as Array
A homogeneous Pair[X, X] can be converted to/from an Array and then serialized/deserialized by any of the methods in the previous section.
Example: serde_homogeneous_pair.wdl
version 1.2
task serde_int_strings {
input {
Pair[String, String] int_strings
}
Array[String] pair_array = [int_strings.left, int_strings.right]
command <<<
cat ~{write_lines(pair_array)}
>>>
output {
Array[String] ints = read_lines(stdout())
}
}
workflow serde_homogeneous_pair {
input {
Map[String, String] int_strings
}
scatter (pair in as_pairs(int_strings)) {
call serde_int_strings { int_strings = pair }
}
output {
Array[String] ints = flatten(serde_int_strings.ints)
}
}
Example input:
{
"serde_homogeneous_pair.int_strings": {
"1": "2",
"3": "4"
}
}
Example output:
{
"serde_homogeneous_pair.ints": ["1", "2", "3", "4"]
}
§Pair serialization/deserialization using read_json/write_json
A Pair[X, Y] can be converted to JSON and then serialized using write_json and deserialized using read_json.
§Map
§Map serialization by delimitation
A Map is a common way to represent a set of arguments that need to be passed to a command. Each key/value pair can be converted to a String using a scatter, or the keys and values can be independently converted to Bash arrays and referenced by index.
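The first approach, turning each key/value pair into a single option string, amounts to the following Python sketch. The helper as_option_strings is hypothetical, and the option names, pattern, and file path are simply borrowed from the example input further down.

```python
def as_option_strings(args):
    """Render each key/value pair as a single "key=value" argument."""
    return [f"{option}={value}" for option, value in args.items()]


options = as_option_strings({"--after-context": "1", "--max-count": "1"})
command = " ".join(["grep", *options, "hello", "data/greetings.txt"])
print(command)
# prints grep --after-context=1 --max-count=1 hello data/greetings.txt
```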
Example: serialize_map.wdl
version 1.2
task grep1 {
input {
File infile
String pattern
Array[String] args
}
command <<<
grep ~{sep(" ", args)} ~{pattern} ~{infile}
>>>
output {
Array[String] results = read_lines(stdout())
}
}
task grep2 {
input {
File infile
String pattern
Map[String, String] args
}
Pair[Array[String], Array[String]] opts_and_values = unzip(as_pairs(args))
Int n = length(opts_and_values.left)
command <<<
opts=( ~{sep(" ", quote(opts_and_values.left))} )
values=( ~{sep(" ", quote(opts_and_values.right))} )
command="grep"
for i in {0..~{n-1}}; do
command="$command ${opts[i]}"="${values[i]}"
done
$command ~{pattern} ~{infile}
>>>
output {
Array[String] results = read_lines(stdout())
}
}
workflow serialize_map {
input {
File infile
String pattern
Map[String, String] args
}
scatter (arg in as_pairs(args)) {
String arg_str = "~{arg.left}=~{arg.right}"
}
call grep1 { infile, pattern, args = arg_str }
call grep2 { infile, pattern, args }
output {
Array[String] results1 = grep1.results
Array[String] results2 = grep2.results
}
}
Example input:
{
"serialize_map.infile": "data/greetings.txt",
"serialize_map.pattern": "hello",
"serialize_map.args": {
"--after-context": "1",
"--max-count": "1"
}
}
Example output:
{
"serialize_map.results1": ["hello world", "hi_world"],
"serialize_map.results2": ["hello world", "hi_world"]
}
§Map serialization/deserialization using write_map()/read_map()
A Map[String, String] value can be serialized as a two-column TSV file using write_map, and deserialized from a two-column TSV file using read_map.
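The two-column TSV layout can be illustrated with a short Python sketch. The functions reuse the names write_map and read_map for readability only and operate on strings rather than files; they show the format, not an engine's implementation.

```python
def write_map(items):
    """Render each key/value pair as one tab-delimited line."""
    return "".join(f"{key}\t{value}\n" for key, value in items.items())


def read_map(text):
    """Parse two tab-delimited columns per line into a dict of strings."""
    result = {}
    for line in text.splitlines():
        key, value = line.split("\t")  # raises ValueError unless 2 columns
        result[key] = value
    return result


serialized = write_map({"a": "b", "c": "d", "e": "f"})
restored = read_map(serialized)
```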
Example: serde_map_tsv_task.wdl
version 1.2
task serde_map_tsv {
input {
Map[String, String] items
}
File item_file = write_map(items)
command <<<
cut -f 1 ~{item_file} >> lines
cut -f 2 ~{item_file} >> lines
paste -s -d '\t\n' lines
>>>
output {
Map[String, String] new_items = read_map(stdout())
}
}
Example input:
{
"serde_map_tsv.items": {
"a": "b",
"c": "d",
"e": "f"
}
}
Example output:
{
"serde_map_tsv.new_items": {
"a": "c",
"e": "b",
"d": "f"
}
}
Given a Map { "a": "b", "c": "d", "e": "f" }, the instantiated command would be:
cut -f 1 /jobs/564757/item_file >> lines
cut -f 2 /jobs/564757/item_file >> lines
paste -s -d '\t\n' lines
Where /jobs/564757/item_file would contain:
a\tb
c\td
e\tf
And the output of paste, written to stdout, would contain:
a\tc
e\tb
d\tf
Which is deserialized to the Map {"a": "c", "e": "b", "d": "f"}.
§Map serialization/deserialization using write_json()/read_json()
A Map[String, Y] value can be serialized as a JSON object using write_json, and a JSON object can be read into a Map[String, Y] using read_json so long as all the values of the JSON object are coercible to Y.
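The requirement that every value be coercible to Y can be pictured with the following Python sketch for the case Y = Int. The function read_int_map is hypothetical, and rejecting anything that is not a JSON integer is a simplifying assumption about what "coercible" means here.

```python
import json


def read_int_map(text):
    """Parse a JSON object and require every value to be an integer."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, value in data.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"value for {key!r} is not an Int")
    return data


ascii_values = read_int_map('{"read1": 65, "read2": 74, "read3": 88}')
```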
Example: serde_map_json_task.wdl
version 1.2
task serde_map_json {
input {
Map[String, Int] read_quality_scores
}
command <<<
python <<CODE
import json
import sys
with open("~{write_json(read_quality_scores)}") as j:
    d = json.load(j)
    for key in d.keys():
        d[key] += 33
    json.dump(d, sys.stdout)
CODE
>>>
output {
Map[String, Int] ascii_values = read_json(stdout())
}
requirements {
container: "python:latest"
}
}
Example input:
{
"serde_map_json.read_quality_scores": {
"read1": 32,
"read2": 41,
"read3": 55
}
}
Example output:
{
"serde_map_json.ascii_values": {
"read1": 65,
"read2": 74,
"read3": 88
}
}
Given a Map { "read1": 32, "read2": 41, "read3": 55 }, the instantiated command would be:
import json
import sys
with open("/jobs/564757/sample_quality_scores.json") as j:
    d = json.load(j)
    for key in d.keys():
        d[key] += 33
    json.dump(d, sys.stdout)
Where /jobs/564757/sample_quality_scores.json would contain:
{
"read1": 32,
"read2": 41,
"read3": 55
}
§Struct and Object serialization/deserialization
There are two alternative serialization formats for Structs and Objects:
- JSON: Structs and Objects are serialized identically using write_json. A JSON object is deserialized to a WDL Object using read_json, which can then be coerced to a Struct type if necessary.
- TSV: Structs and Objects can be serialized to TSV format using write_object. The generated file has two tab-delimited lines: a header with the member names, and a line with the values, which must be coercible to Strings. An array of Structs or Objects can be written using write_objects, in which case the generated file has one line of values for each struct/object. Structs and Objects can be deserialized from the same TSV format using read_object/read_objects. Object member values are always of type String, whereas struct member types must be coercible from String.
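The TSV layout described above (one header line of member names followed by one line of values per struct/object) can be sketched in Python as follows. The functions reuse the names write_objects and read_objects for readability only, work on strings rather than files, and use made-up member names; they illustrate the format and are not an engine's implementation.

```python
def write_objects(records):
    """Render a header of member names, then one line of values per record."""
    names = list(records[0])
    lines = ["\t".join(names)]
    for record in records:
        lines.append("\t".join(str(record[name]) for name in names))
    return "\n".join(lines) + "\n"


def read_objects(text):
    """Parse the header and rows back into dicts with string values."""
    header, *rows = text.splitlines()
    names = header.split("\t")
    return [dict(zip(names, row.split("\t"))) for row in rows]


# Hypothetical records with members "name" and "count".
serialized = write_objects([{"name": "a", "count": 1}, {"name": "b", "count": 2}])
restored = read_objects(serialized)
```

Note that the values come back as strings, consistent with Object member values always being of type String; converting them to struct member types is a separate coercion step.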