This starter’s guide to FAIRifying script-based data analysis workflows walks you through the initial steps to turn your existing script into a reusable CWL workflow. Whether you use Python, R, Bash, or another language, the same logic applies.
Once the interactive script has been converted into a generally executable script, it can be described with a CWL descriptor file that wraps the script as a CommandLineTool.
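As a rough sketch of what such a descriptor might look like, the fragment below wraps a Python script as a CommandLineTool. The input names (script, data) and the output file name results.csv are hypothetical placeholders, and the sketch assumes the script takes the data file as its first argument and writes its result to the working directory.

```yaml
cwlVersion: v1.2
class: CommandLineTool
baseCommand: python          # interpreter that runs the script
inputs:
  script:                    # hypothetical input: the script file itself
    type: File
    inputBinding:
      position: 1
  data:                      # hypothetical input: the data file to analyse
    type: File
    inputBinding:
      position: 2
outputs:
  result:
    type: File
    outputBinding:
      glob: results.csv      # assumed name of the file the script writes
```

Passing the script as a File input means the runner stages it alongside the data, so the tool does not depend on the script being present at a fixed path.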
Go back to your script and list all external packages and tools it depends on.
Look for, for example:
Python packages (pandas version 1.5.3, numpy version 1.23.0)
R libraries (ggplot2)
Command-line tools (samtools, awk)
In the workflow.cwl describing your script, such software dependencies and resource requirements can be specified under the sections hints (i.e. “soft requirements”) or requirements (i.e. “hard requirements”).
SoftwareRequirement lets you specify the software, its version, and a reference
package: the name of the software or package
version: the version(s) of the software or package known to be compatible
specs: a reference URL for the software or package (e.g. from bio.tools or SciCrunch)
ResourceRequirement lets you specify the required compute resources (e.g. CPU cores, memory, disk space)
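As a minimal sketch of how these two classes might appear in the descriptor, the fragment below declares the Python packages named above as hints. The specs URLs and the resource values are illustrative assumptions rather than recommendations; ramMin is given in mebibytes.

```yaml
hints:
  - class: SoftwareRequirement
    packages:
      - package: pandas
        version: ["1.5.3"]
        specs: ["https://pypi.org/project/pandas/"]  # assumed reference URL
      - package: numpy
        version: ["1.23.0"]
        specs: ["https://pypi.org/project/numpy/"]   # assumed reference URL
  - class: ResourceRequirement
    coresMin: 1    # illustrative value: minimum number of CPU cores
    ramMin: 2048   # illustrative value: minimum RAM in MiB
```

Listing them under hints means a runner may proceed even if it cannot satisfy them; moving the same entries under requirements makes them mandatory.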
For full portability, specify a container with all dependencies.
Use the DockerRequirement to load a published Docker image or reference a local Dockerfile.
Example
Load a public image
workflow.cwl
...
requirements:
  - class: DockerRequirement
    dockerPull: python:3.10-slim
...
Example
If you cannot find a suitable container matching your dependencies, you can also design a Dockerfile.
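One way this could look, as a sketch only: DockerRequirement accepts the contents of a Dockerfile through its dockerFile field, together with a dockerImageId naming the image to build. The image name my-analysis:latest is a hypothetical placeholder, and the pinned packages are the ones used as examples earlier in this guide.

```yaml
requirements:
  - class: DockerRequirement
    dockerImageId: my-analysis:latest   # hypothetical name for the built image
    dockerFile: |
      # Start from the public base image used in the previous example
      FROM python:3.10-slim
      # Install the pinned Python dependencies of the script
      RUN pip install --no-cache-dir pandas==1.5.3 numpy==1.23.0
```

Pinning exact versions in the Dockerfile keeps the image consistent with the versions declared in SoftwareRequirement.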