Consume ARCs from DataHUB

This guide demonstrates how to consume ARCs via ARCtrl directly from the DataHUB, i.e. loading an ARC without cloning the ARC repository locally.

First, we need to install the arctrl and python-gitlab packages.

Terminal window
pip install arctrl
pip install python-gitlab

Then we can make the required imports:

consume-datahub.py
from arctrl import ARC, Contract
from arctrl.py.fable_modules.fs_spreadsheet_py.xlsx import Xlsx
from arctrl.py.Contract.contract import DTO
from io import BytesIO
import gitlab

With the GitLab API package set up, we can define some helper functions to download files from GitLab repositories. We need three functions:

  • Retrieve all file paths in a GitLab repository
  • Download a file from GitLab as a string (for CWL files)
  • Download a file from GitLab as an in-memory bytes buffer (for xlsx files)
consume-datahub.py
# For a given repository, list all files in the repository using the GitLab API
def list_repo_files(project_path: str, branch: str = "main", token: str | None = None) -> list[str]:
    gl = gitlab.Gitlab("https://git.nfdi4plants.org", private_token=token)
    project = gl.projects.get(project_path)
    items = project.repository_tree(ref=branch, recursive=True, all=True)
    # Keep only files ("blob"); directories are reported as "tree"
    return [item["path"] for item in items if item["type"] == "blob"]

# Download a specific file from the repository as a string using the GitLab API
def download_file(project_path: str, file_path: str, branch: str = "main", token: str | None = None) -> str:
    gl = gitlab.Gitlab("https://git.nfdi4plants.org", private_token=token)
    project = gl.projects.get(project_path)
    file = project.files.get(file_path=file_path, ref=branch)
    # file.decode() yields the raw bytes, which are then decoded as UTF-8 text
    return file.decode().decode("utf-8")

# Download a specific file from the repository as an in-memory bytes buffer using the GitLab API
def download_file_bytes(project_path: str, file_path: str, branch: str = "main", token: str | None = None) -> BytesIO:
    gl = gitlab.Gitlab("https://git.nfdi4plants.org", private_token=token)
    project = gl.projects.get(project_path)
    file = project.files.get(file_path=file_path, ref=branch)
    return BytesIO(file.decode())
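
To illustrate what these helpers do with the API responses without contacting the DataHUB, here is a minimal offline sketch. The sample tree entries and the CWL snippet are made up for illustration, and `blob_paths` is a hypothetical name; the sketch only assumes that each tree entry carries `type` and `path` keys, as used in `list_repo_files` above.

```python
from io import BytesIO

# Made-up tree entries in the shape used by list_repo_files:
# directories have type "tree", files have type "blob".
sample_tree = [
    {"type": "tree", "path": "assays"},
    {"type": "blob", "path": "isa.investigation.xlsx"},
    {"type": "tree", "path": "assays/RNASeq"},
    {"type": "blob", "path": "assays/RNASeq/isa.assay.xlsx"},
]

def blob_paths(items: list[dict]) -> list[str]:
    """Keep only file entries, mirroring the filter in list_repo_files."""
    return [item["path"] for item in items if item["type"] == "blob"]

file_paths = blob_paths(sample_tree)
print(file_paths)

# Made-up raw content, standing in for the bytes a downloaded file provides.
raw = b"cwlVersion: v1.2\nclass: CommandLineTool\n"

as_text = raw.decode("utf-8")   # what download_file returns (str)
as_buffer = BytesIO(raw)        # what download_file_bytes returns (buffer)
print(type(as_text).__name__, type(as_buffer).__name__)
```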

ARCtrl IO is built on a contract-handling layer that makes it possible to load and write ARCs flexibly from and to different sources. To load an ARC from the DataHUB, we need to implement a custom contract-handling function that uses the GitLab helper functions defined above to retrieve the required files.

consume-datahub.py
# Fulfill a read contract by downloading the file according to the expected DTO type
def handle_read_contract(project_path: str, contract: Contract) -> Contract:
    # CWL files are handled as text
    if "CWL" in contract.DTOType.name:
        cwl_str = download_file(project_path, contract.Path)
        contract.DTO = DTO(1, cwl_str)
    # ISA files are handled as xlsx spreadsheets
    if "ISA" in contract.DTOType.name:
        xlsx_bytes = download_file_bytes(project_path, contract.Path)
        contract.DTO = DTO(0, Xlsx.from_xlsx_bytes(xlsx_bytes))
    return contract
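
The routing in `handle_read_contract` depends only on the name of the contract's DTO type. The following offline sketch mirrors that decision with stand-in types. `StubDTOType`, its members, and `payload_kind` are hypothetical names introduced for illustration and are not part of ARCtrl.

```python
from enum import Enum
from typing import Optional

# Hypothetical stand-in for the DTO type of a contract (not the ARCtrl type).
class StubDTOType(Enum):
    ISA_Assay = 1
    ISA_Investigation = 2
    CWL = 3
    PlainText = 4

def payload_kind(dto_type: StubDTOType) -> Optional[str]:
    """Return how a file of this type would be fetched, or None if it is skipped."""
    if "CWL" in dto_type.name:
        return "text"          # downloaded as a string
    if "ISA" in dto_type.name:
        return "spreadsheet"   # downloaded as bytes and parsed as xlsx
    return None

for dto_type in StubDTOType:
    print(dto_type.name, "->", payload_kind(dto_type))
```

A type that matches neither check yields `None` here, which corresponds to the handler above returning the contract unchanged.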

Finally, we can define a function download_arc that takes the GitLab repository identifier as input and returns an ARC object that represents the loaded ARC. This function:

  • Retrieves the file paths from the GitLab repository
  • Creates an empty ARC object from these file paths
  • Gets empty read contracts for the ARC object
  • Uses the contract handling function to fill the read contracts
  • Injects filled read contracts back into the ARC object and returns it
consume-datahub.py
def download_arc(project_path: str, branch: str = "main", token: str | None = None) -> ARC:
    print("retrieve file paths")
    filepaths = list_repo_files(project_path, branch, token)
    print("init arc from file paths")
    arc = ARC.from_file_paths(filepaths)
    print("retrieve and fulfill metadata file read contracts")
    contracts = [
        handle_read_contract(project_path, contract) for contract in arc.GetReadContracts()
    ]
    print("inject metadata from contracts into ARC")
    arc.SetISAFromContracts(contracts)
    return arc

Now we can simply call the download_arc function with the GitLab repository identifier of the ARC we want to load from the DataHUB. In this example, we load the brilator/Facultative-CAM-in-Talinum ARC.

consume-datahub.py
project_path = "brilator/Facultative-CAM-in-Talinum"
arc = download_arc(project_path)
print("check correctness of assay identifiers")
print(arc.AssayIdentifiers)

The final print statement outputs:

Terminal window
['GCqTOF_targets', 'MassHunter_targets', 'RNASeq']
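
As a cross-check that does not rely on ARCtrl, the assay identifiers can also be derived from the file paths alone. This is a minimal sketch, assuming the standard ARC layout in which each assay lives at `assays/<identifier>/isa.assay.xlsx`. The helper name `assay_ids_from_paths` and the sample paths are illustrative, not taken from the repository listing.

```python
from pathlib import PurePosixPath

def assay_ids_from_paths(paths: list[str]) -> list[str]:
    """Collect <identifier> from every assays/<identifier>/isa.assay.xlsx path."""
    ids = set()
    for path in paths:
        parts = PurePosixPath(path).parts
        if len(parts) == 3 and parts[0] == "assays" and parts[2] == "isa.assay.xlsx":
            ids.add(parts[1])
    return sorted(ids)

# Illustrative paths only; a real listing would come from list_repo_files.
sample_paths = [
    "isa.investigation.xlsx",
    "assays/RNASeq/isa.assay.xlsx",
    "assays/RNASeq/dataset/reads.fastq",
    "assays/GCqTOF_targets/isa.assay.xlsx",
    "assays/MassHunter_targets/isa.assay.xlsx",
    "studies/TalinumFacultativeCAM/isa.study.xlsx",
]

derived_ids = assay_ids_from_paths(sample_paths)
print(derived_ids)
```

With a live ARC, the result of `assay_ids_from_paths(list_repo_files(project_path))` could then be compared against `sorted(arc.AssayIdentifiers)`.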