From Script to CWL

This starter’s guide to FAIRifying script-based data analysis workflows walks you through the first steps of turning an existing script into a reusable CWL workflow. Whether you use Python, R, Bash, or another language, the same logic applies.

Refactor the script

Look through your script and identify what data it reads and what data it writes. These are your inputs and outputs, for example:

Inputs: path to a FASTA file, a CSV file (e.g. data.csv)
Outputs: path to where a result is written, e.g. a figure, a processed file, a table (sorted.csv)

For easier overview and handling, move the input and output variables to their own section (e.g. at the top of your script).
Replace any paths pointing outside the ARC.
Replace absolute paths with relative paths.
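The refactoring steps above might look like this in Python. This is only a minimal sketch: the names `INPUT_FILE`, `OUTPUT_FILE`, `sort_rows` and `run` are illustrative, and it uses the standard-library `csv` module instead of pandas so that it has no external dependencies.

```python
import csv
from pathlib import Path

# --- Inputs and outputs: collected in one place, as relative paths ---
INPUT_FILE = Path("data.csv")     # illustrative input file
OUTPUT_FILE = Path("sorted.csv")  # illustrative output file


def sort_rows(rows):
    """Keep the header row first and sort the remaining rows by their first column.

    Values are compared as text, because csv.reader returns strings.
    """
    header, *body = rows
    return [header] + sorted(body, key=lambda row: row[0])


def run(input_file=INPUT_FILE, output_file=OUTPUT_FILE):
    """Read input_file, sort its rows, and write the result to output_file."""
    with open(input_file, newline="") as handle:
        rows = list(csv.reader(handle))
    with open(output_file, "w", newline="") as handle:
        csv.writer(handle).writerows(sort_rows(rows))
```

Calling `run()` from the folder that contains data.csv writes sorted.csv next to it. Because both paths are defined once at the top, they are easy to replace with command-line arguments later.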

From interactive to reusable

Replace hard-coded paths (pointing to your inputs and outputs) with command-line arguments.

Example

Each script below reads a CSV file (data.csv), sorts it by the first column, and writes the result to sorted.csv.

Change this…

import pandas as pd

data = pd.read_csv("data.csv")
data_sorted = data.sort_values(by=data.columns[0])
data_sorted.to_csv("sorted.csv", index=False)

data <- read.csv("data.csv")
data_sorted <- data[order(data[[1]]), ]
write.csv(data_sorted, "sorted.csv", row.names=FALSE)

#!/bin/bash

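# Keep the header line, then sort the remaining lines by the first comma-separated column
(head -n 1 data.csv && tail -n +2 data.csv | sort -t, -k1,1) > sorted.csv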

…to this:

import sys
import pandas as pd

input_file = sys.argv[1]
output_file = sys.argv[2]

data = pd.read_csv(input_file)
data_sorted = data.sort_values(by=data.columns[0])
data_sorted.to_csv(output_file, index=False)

args <- commandArgs(trailingOnly=TRUE)
input_file <- args[1]
output_file <- args[2]

data <- read.csv(input_file)
data_sorted <- data[order(data[[1]]), ]
write.csv(data_sorted, output_file, row.names=FALSE)

#!/bin/bash
input_file="$1"
output_file="$2"

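# Keep the header line, then sort the remaining lines by the first comma-separated column
(head -n 1 "$input_file" && tail -n +2 "$input_file" | sort -t, -k1,1) > "$output_file"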

The script can then be run via

python sort-csv.py data.csv sorted.csv

Rscript sort-csv.R data.csv sorted.csv

bash sort-csv.sh data.csv sorted.csv
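Reading `sys.argv` directly works well for two positional arguments. As an optional variation, the standard-library `argparse` module adds a help text and an error message when an argument is missing. The sketch below is one possible layout; the helper names `build_parser` and `parse_args` are illustrative.

```python
import argparse


def build_parser():
    """Describe the two positional arguments the script expects."""
    parser = argparse.ArgumentParser(description="Sort a CSV file by its first column.")
    parser.add_argument("input_file", help="path to the CSV file to read")
    parser.add_argument("output_file", help="path the sorted CSV is written to")
    return parser


def parse_args(argv=None):
    """Parse argv; when argv is None, argparse reads the real command line."""
    return build_parser().parse_args(argv)
```

Inside the script, `args = parse_args()` would then provide `args.input_file` and `args.output_file` in place of `sys.argv[1]` and `sys.argv[2]`. The command lines shown above stay the same.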

Write a minimal workflow CWL descriptor

Once the interactive script has been converted into a generally executable script, it can be described with a CWL descriptor file that wraps the script as a CommandLineTool.

Example

Create a file named workflow.cwl

cwlVersion: v1.2
class: CommandLineTool
requirements:
  - class: InitialWorkDirRequirement
    listing:
      - entryname: sort-csv.py
        entry:
          $include: sort-csv.py
baseCommand: [python3, sort-csv.py]
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
  output_filename:
    type: string
    inputBinding:
      position: 2
outputs:
  output_file:
    type: File
    outputBinding:
      glob: $(inputs.output_filename)

This basically runs python3 sort-csv.py data.csv sorted.csv.
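To see how the descriptor leads to that command line, the following sketch mimics the ordering rule in Python: values are appended to `baseCommand` in order of their `inputBinding` position. It is a simplified illustration and not how a CWL runner is implemented. The function `build_command` is hypothetical, and a real runner also handles prefixes, arrays and file staging, so the path it passes for a File input is usually a staged path rather than the literal `data.csv`.

```python
def build_command(base_command, positions, job):
    """Append input values to base_command, ordered by inputBinding position.

    positions maps input name -> position; job maps input name -> value.
    """
    ordered = sorted(positions, key=lambda name: positions[name])
    return list(base_command) + [str(job[name]) for name in ordered]


command = build_command(
    ["python3", "sort-csv.py"],
    {"output_filename": 2, "input_file": 1},  # declaration order does not matter
    {"input_file": "data.csv", "output_filename": "sorted.csv"},
)
print(" ".join(command))  # prints: python3 sort-csv.py data.csv sorted.csv
```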

Save it alongside the sort-csv.py script.
Store the workflow.cwl and script in an ARC’s workflows folder:
- …
- workflows/
  - sort-csv/
    - sort-csv.py
    - workflow.cwl
- runs/
  - …
- …

For the R script, create a file named workflow.cwl

cwlVersion: v1.2
class: CommandLineTool
requirements:
  - class: InitialWorkDirRequirement
    listing:
      - entryname: sort-csv.R
        entry:
          $include: sort-csv.R
baseCommand: [Rscript, sort-csv.R]
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
  output_filename:
    type: string
    inputBinding:
      position: 2
outputs:
  output_file:
    type: File
    outputBinding:
      glob: $(inputs.output_filename)

This basically runs Rscript sort-csv.R data.csv sorted.csv.

Save it alongside the sort-csv.R script.
Store the workflow.cwl and script in an ARC’s workflows folder:
- …
- workflows/
  - sort-csv/
    - sort-csv.R
    - workflow.cwl
- runs/
  - …
- …

For the Bash script, create a file named workflow.cwl

cwlVersion: v1.2
class: CommandLineTool
requirements:
  - class: InitialWorkDirRequirement
    listing:
      - entryname: sort-csv.sh
        entry:
          $include: sort-csv.sh
baseCommand: [bash, sort-csv.sh]
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
  output_filename:
    type: string
    inputBinding:
      position: 2
outputs:
  output_file:
    type: File
    outputBinding:
      glob: $(inputs.output_filename)

This basically runs bash sort-csv.sh data.csv sorted.csv.

Save it alongside the sort-csv.sh script.
Store the workflow.cwl and script in an ARC’s workflows folder:
- …
- workflows/
  - sort-csv/
    - sort-csv.sh
    - workflow.cwl
- runs/
  - …
- …

Add a minimal `Run`

Example

Create a run.cwl file that uses the workflow.cwl.

cwlVersion: v1.2
class: Workflow

inputs:
  input_file: File
  output_filename: string

steps:
  step1:
    run: ../../workflows/sort-csv/workflow.cwl
    in:
      input_file: input_file
      output_filename: output_filename
    out: [output_file]

outputs:
  output_file:
    type: File
    outputSource: step1/output_file

Create a run.yml to provide the parameters required by the run.cwl:

input_file:
  class: File
  path: data.csv
output_filename: sorted.csv

Place both files in an ARC’s runs folder:
- …
- workflows/
  - sort-csv/
    - sort-csv.py / .R / .sh
    - workflow.cwl
- runs/
  - sort-my-data-table/
    - run.cwl
    - run.yml
- …

The workflow can now be executed with

cwltool run.cwl run.yml
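If the same run is needed for many input files, the parameters can also be generated by a script. The sketch below builds the same parameters as run.yml and writes them as JSON. It assumes the CWL runner accepts a JSON job file in place of the YAML one, which cwltool does. The helper names `make_job` and `write_job` and the file name run.json are illustrative.

```python
import json


def make_job(input_path, output_filename):
    """Build the parameters that run.yml provides, as a plain dictionary."""
    return {
        "input_file": {"class": "File", "path": input_path},
        "output_filename": output_filename,
    }


def write_job(job, job_path):
    """Write the job parameters to job_path as JSON."""
    with open(job_path, "w") as handle:
        json.dump(job, handle, indent=2)
```

For example, `write_job(make_job("data.csv", "sorted.csv"), "run.json")` creates a job file that could be passed to the runner as `cwltool run.cwl run.json`.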

Specify Required Tools, Packages and Environment

Go back to your script and list all external packages and tools it depends on.

Look for e.g.

Python packages (pandas version 1.5.3, numpy version 1.23.0)
R libraries (ggplot2)
Command-line tools (samtools, awk)
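For Python scripts, a first list of candidate packages can be collected automatically. The sketch below uses the standard-library `ast` module to find the top-level modules a script imports. The function name `find_imports` is illustrative, and the result still needs a manual check: it includes standard-library modules such as `sys`, and it cannot see command-line tools that the script calls.

```python
import ast


def find_imports(source):
    """Return the sorted top-level module names imported by Python source code."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import pandas as pd" or "import os.path" -> "pandas", "os"
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from os import path" -> "os"; relative imports (level > 0) are skipped
            modules.add(node.module.split(".")[0])
    return sorted(modules)
```

Called on the text of sort-csv.py, this would report `pandas` and `sys`. The installed version of a package can then be looked up with `importlib.metadata.version("pandas")`, keeping in mind that the import name and the installed package name are not always identical.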

In the workflow.cwl describing your script, such software dependencies and resource requirements can be specified under the sections hints (i.e. “soft requirements”) or requirements (i.e. “hard requirements”).

SoftwareRequirement lets you specify the software, its version and a reference:
- package: the name of the software or package
- version: the version of the software or package
- specs: a reference URL for the software or package (e.g. from bio.tools or SciCrunch)
ResourceRequirement lets you specify the required compute resources.

Example

...
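hints:
  SoftwareRequirement:
    packages:
      - package: python
        # Quote versions so that YAML does not read 3.10 as the number 3.1
        version: ["3.10"]
      - package: pandas
        version: ["1.5.3"]
requirements:
  ResourceRequirement:
    coresMin: 1
    # ramMin is given in mebibytes
    ramMin: 500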
...
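When a script has many dependencies, the packages list can be generated from a mapping of package names to versions. The following sketch builds the SoftwareRequirement structure as a Python dictionary and prints it as JSON, which is also valid YAML. The function name `software_requirement` is illustrative. Versions are kept as strings so that a version like 3.10 is not shortened to the number 3.1.

```python
import json


def software_requirement(pins):
    """Build a SoftwareRequirement hint from a {package: version} mapping."""
    return {
        "SoftwareRequirement": {
            "packages": [
                # Each version is a list of strings, as in the example above
                {"package": name, "version": [str(version)]}
                for name, version in pins.items()
            ]
        }
    }


hints = software_requirement({"python": "3.10", "pandas": "1.5.3"})
print(json.dumps({"hints": hints}, indent=2))
```

The printed block can be pasted into workflow.cwl as it is, or rewritten in the shorter YAML style used above.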

Add a container

For full portability, specify a container with all dependencies. Use the DockerRequirement to load a published Docker image or reference a local Dockerfile.

Example

Load a public image

...
requirements:
  - class: DockerRequirement
    dockerPull: python:3.10-slim
...

Example

If you cannot find a suitable container that matches your dependencies, you can write your own Dockerfile.

Create your own Dockerfile

FROM python:3.10-slim
RUN pip install pandas==1.5.3

Load the Dockerfile in workflow.cwl

...
hints:
  DockerRequirement:
    dockerImageId: "mydocker"
    dockerFile: {$include: "Dockerfile"}
...

Namespaces and schemas

Adding namespaces and schemas lets you reuse their terms elsewhere in a CWL document, for example as prefixed names such as s:author.

Example

...
$namespaces:
  s: https://schema.org/
  edam: http://edamontology.org/

$schemas:
  - https://schema.org/version/latest/schemaorg-current-https.rdf
  - http://edamontology.org/EDAM_1.18.owl
...
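A namespace prefix is a shorthand for the start of a full IRI. The sketch below illustrates that substitution in Python. It is a simplified picture of what a CWL parser does when it resolves prefixed names, and the function name `expand` is illustrative.

```python
NAMESPACES = {
    "s": "https://schema.org/",
    "edam": "http://edamontology.org/",
}


def expand(name, namespaces=NAMESPACES):
    """Expand a prefixed name such as 's:author' into a full IRI."""
    prefix, separator, local = name.partition(":")
    if separator and prefix in namespaces:
        return namespaces[prefix] + local
    return name  # no known prefix: leave the name unchanged


print(expand("s:author"))  # prints: https://schema.org/author
```

With the namespaces declared once, the short form `s:author` can be used anywhere in the document in place of the full `https://schema.org/author`.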

Attribute authors and contributors

Example

...
s:author:
  - class: s:Person
    s:identifier: <author ORCID>
    s:email: mailto:<author email>
    s:name: <author name>

s:contributor:
  - class: s:Person
    s:identifier: <contributor ORCID>
    s:email: mailto:<contributor email>
    s:name: <contributor name>
...
