
From Script to CWL

This starter’s guide to FAIRifying script-based data analysis workflows walks you through the initial steps of turning an existing script into a reusable CWL workflow. Whether you use Python, R, Bash, or another language, the same logic applies.

Look through your script and identify what data it reads and what data it writes. These are your inputs and outputs, for example:

  • Inputs: path to a FASTA file, a CSV file (e.g. data.csv)
  • Outputs: path to where a result is written, e.g. a figure, a processed file, a table (sorted.csv)
  1. For easier overview and handling, move the input and output variables to their own section (e.g. at the top of your script).
  2. Replace any paths pointing outside the ARC.
  3. Replace absolute paths with relative paths.
  4. Replace hard-coded paths (pointing to your inputs and outputs) with command-line arguments.

Example

Each script below reads a CSV file (data.csv), sorts it by the first column, and writes the result to sorted.csv.

Change this…

import sys
import pandas as pd
data = pd.read_csv("data.csv")
data_sorted = data.sort_values(by=data.columns[0])
data_sorted.to_csv("sorted.csv", index=False)

…to this:

sort-csv.py
import sys
import pandas as pd
input_file = sys.argv[1]
output_file = sys.argv[2]
data = pd.read_csv(input_file)
data_sorted = data.sort_values(by=data.columns[0])
data_sorted.to_csv(output_file, index=False)

The script can then be run via

python sort-csv.py data.csv sorted.csv
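
Positional `sys.argv` indices are easy to mix up once a script takes more than two parameters. As an optional variation (not part of the original script), the same two arguments could be declared with the standard-library `argparse` module. The function names `parse_args` and `sort_csv` are illustrative, and pandas is imported inside `sort_csv` only so that the argument parsing can be tried without pandas installed.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments used by sort-csv.py."""
    parser = argparse.ArgumentParser(description="Sort a CSV file by its first column.")
    parser.add_argument("input_file", help="path to the CSV file to read")
    parser.add_argument("output_file", help="path the sorted CSV is written to")
    return parser.parse_args(argv)


def sort_csv(input_file, output_file):
    """Same logic as the script above: sort by the first column and write the result."""
    import pandas as pd  # imported here so parse_args works without pandas

    data = pd.read_csv(input_file)
    data.sort_values(by=data.columns[0]).to_csv(output_file, index=False)


# In the script itself you would finish with:
#     args = parse_args()
#     sort_csv(args.input_file, args.output_file)
```

The command line stays the same (`python sort-csv.py data.csv sorted.csv`), and running the script with `--help` prints a short usage message.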

Write a minimal workflow CWL descriptor


Once the interactive script has been converted into a script that takes its inputs and outputs as command-line arguments, it can be described with a CWL descriptor file that wraps the script as a CommandLineTool.

Example

  1. Create a file named workflow.cwl

    workflow.cwl
    cwlVersion: v1.2
    class: CommandLineTool
    requirements:
      - class: InitialWorkDirRequirement
        listing:
          - entryname: sort-csv.py
            entry:
              $include: sort-csv.py
    baseCommand: [python3, sort-csv.py]
    inputs:
      input_file:
        type: File
        inputBinding:
          position: 1
      output_filename:
        type: string
        inputBinding:
          position: 2
    outputs:
      output_file:
        type: File
        outputBinding:
          glob: $(inputs.output_filename)

    This basically runs python3 sort-csv.py data.csv sorted.csv.

  2. Save it alongside the sort-csv.py script.

  3. Store the workflow.cwl and script in an ARC’s workflows folder

    • workflows/
      • sort-csv/
        • sort-csv.py
        • workflow.cwl
    • runs/
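
To see why the descriptor above amounts to `python3 sort-csv.py data.csv sorted.csv`, the following sketch mimics how the `position` values of the input bindings determine the argument order. It is a deliberately simplified illustration, not how a CWL runner is implemented; the function name `build_command` is made up for this example.

```python
def build_command(base_command, positions, job):
    """Order bound inputs by position and append them to the base command.

    Simplified illustration only: real CWL runners also handle prefixes,
    arrays, staging of File inputs, and more.
    """
    ordered = sorted(positions, key=lambda name: positions[name])
    return list(base_command) + [str(job[name]) for name in ordered]


command = build_command(
    ["python3", "sort-csv.py"],
    {"input_file": 1, "output_filename": 2},  # the inputBinding positions
    {"output_filename": "sorted.csv", "input_file": "data.csv"},
)
print(" ".join(command))  # python3 sort-csv.py data.csv sorted.csv
```

In a real run, a File input is passed as the path of the staged file rather than the bare name used here.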

Example

  1. Create a run.cwl file that uses the workflow.cwl.

    run.cwl
    cwlVersion: v1.2
    class: Workflow
    inputs:
      input_file: File
      output_filename: string
    steps:
      step1:
        run: ../../workflows/sort-csv/workflow.cwl
        in:
          input_file: input_file
          output_filename: output_filename
        out: [output_file]
    outputs:
      output_file:
        type: File
        outputSource: step1/output_file
  2. Create a run.yml to provide the parameters required by the run.cwl:

    run.yml
    input_file:
      class: File
      path: data.csv
    output_filename: sorted.csv
  3. Place both files in an ARC’s runs folder

    • workflows/
      • sort-csv/
        • sort-csv.py / .R / .sh
        • workflow.cwl
    • runs/
      • sort-my-data-table/
        • run.cwl
        • run.yml

The workflow can now be executed with

cwltool run.cwl run.yml
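
If you need many runs with different parameters, the parameter file can also be generated rather than written by hand. The sketch below builds the same structure as run.yml as a Python dictionary and prints it as JSON. It assumes that your CWL runner accepts a JSON parameter file in place of YAML; the helper name `make_job` is hypothetical.

```python
import json


def make_job(input_path, output_filename):
    """Return the parameters expected by run.cwl as a plain dict."""
    return {
        "input_file": {"class": "File", "path": input_path},
        "output_filename": output_filename,
    }


job = make_job("data.csv", "sorted.csv")
print(json.dumps(job, indent=2))
```

Saving the printed text to a file such as run.json and passing it instead of run.yml should give the same run, provided the assumption above holds for your runner.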

Specify Required Tools, Packages and Environment


Go back to your script and list all external packages and tools it depends on.

Look for e.g.

  • Python packages (pandas, version 1.5.3, numpy, version 1.23.0)
  • R libraries (ggplot2)
  • Command-line tools (samtools, awk)
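
For Python scripts, a first draft of this list can be produced automatically. The following sketch uses the standard-library `ast` module to collect the top-level modules a script imports. The function name `list_imports` is illustrative, and the result still needs manual review: it includes standard-library modules such as `sys` and cannot see command-line tools called from the script.

```python
import ast


def list_imports(source):
    """Return the sorted top-level module names imported by Python source code."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return sorted(modules)


script = "import sys\nimport pandas as pd\ndata = pd.read_csv(sys.argv[1])\n"
print(list_imports(script))  # ['pandas', 'sys']
```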

In the workflow.cwl describing your script, such software dependencies and resource requirements can be specified under the sections hints (i.e. “soft requirements”) or requirements (i.e. “hard requirements”).

  • SoftwareRequirement lets you specify the software, its version, and a reference:

    • package: the name of the software or package
    • version: the version(s) of the software or package known to work
    • specs: a reference URL for the software or package (e.g. from bio.tools or SciCrunch)
  • ResourceRequirement lets you specify the required compute resources

Example

workflow.cwl
...
hints:
  SoftwareRequirement:
    packages:
      - package: python
        version: ["3.10"]
      - package: pandas
        version: ["1.5.3"]
requirements:
  ResourceRequirement:
    coresMin: 1
    ramMin: 500
...
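
One detail worth checking in version lists: YAML parsers typically read an unquoted `3.10` as a floating-point number, so the trailing zero is lost and the value looks like version 3.1. Writing versions as quoted strings (e.g. `"3.10"`) avoids this. The snippet below mimics that conversion in plain Python; it does not call a YAML parser, and the function name is made up for illustration.

```python
def as_yaml_number(text):
    """Mimic reading an unquoted scalar such as 3.10 as a float (not a real YAML parser)."""
    return str(float(text))


print(as_yaml_number("3.10"))  # 3.1
```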

For full portability, specify a container with all dependencies. Use the DockerRequirement to load a published Docker image or reference a local Dockerfile.

Example

Load a public image

workflow.cwl
...
requirements:
  - class: DockerRequirement
    dockerPull: python:3.10-slim
...

Example

If you cannot find a suitable container that matches your dependencies, you can write your own Dockerfile.

  1. Create your own Dockerfile

    Dockerfile
    FROM python:3.10-slim
    RUN pip install pandas==1.5.3
  2. Load the Dockerfile in workflow.cwl

    workflow.cwl
    ...
    hints:
      DockerRequirement:
        dockerImageId: "mydocker"
        dockerFile: {$include: "Dockerfile"}
    ...
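
When the dependency list grows, keeping the Dockerfile in sync with the pinned versions by hand becomes error-prone. As a rough sketch, the Dockerfile text could be rendered from a dictionary of pinned packages; the helper `render_dockerfile` is hypothetical and only covers pip-installable packages.

```python
def render_dockerfile(base_image, pins):
    """Build Dockerfile text that installs exactly the pinned pip packages."""
    lines = [f"FROM {base_image}"]
    if pins:
        specs = " ".join(f"{name}=={version}" for name, version in pins.items())
        lines.append(f"RUN pip install {specs}")
    return "\n".join(lines) + "\n"


print(render_dockerfile("python:3.10-slim", {"pandas": "1.5.3"}), end="")
```

For the inputs shown, this prints the same two lines as the Dockerfile above.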

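Declaring namespaces and schemas at the top level of a CWL document lets you reuse their prefixes and terms elsewhere in the document, for example to annotate authors and contributors.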

Example

workflow.cwl
...
$namespaces:
  s: https://schema.org/
  edam: http://edamontology.org/
$schemas:
  - https://schema.org/version/latest/schemaorg-current-https.rdf
  - http://edamontology.org/EDAM_1.18.owl
...
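
Conceptually, a namespace prefix is shorthand for the IRI it is mapped to. The following sketch shows the expansion for the two prefixes declared above. It illustrates the idea only and is not the resolution code used by any CWL runner; the function name `expand` is an assumption.

```python
NAMESPACES = {
    "s": "https://schema.org/",
    "edam": "http://edamontology.org/",
}


def expand(term, namespaces=NAMESPACES):
    """Expand a prefixed term such as s:author into a full IRI."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix] + local
    return term  # no known prefix: leave unchanged


print(expand("s:author"))  # https://schema.org/author
print(expand("s:Person"))  # https://schema.org/Person
```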

Attribute authors and contributors


Example

workflow.cwl
...
s:author:
  - class: s:Person
    s:identifier: <author ORCID>
    s:email: mailto:<author email>
    s:name: <author name>
s:contributor:
  - class: s:Person
    s:identifier: <contributor ORCID>
    s:email: mailto:<contributor email>
    s:name: <contributor name>
...