Consume ARCs from DataHUB

This guide demonstrates how to consume ARCs via ARCtrl directly from the DataHUB, i.e. loading an ARC without cloning the ARC repository locally.

Prerequisites

First, install the arctrl and python-gitlab packages:

pip install arctrl
pip install python-gitlab

Then we can make the required imports:

Python

from arctrl import ARC, Contract
from arctrl.py.fable_modules.fs_spreadsheet_py.xlsx import Xlsx
from arctrl.py.Contract.contract import DTO

from io import BytesIO
import gitlab

Define GitLab download helper functions

With the GitLab API package set up, we can define helper functions to download files from GitLab repositories. We need three functions:

1. Retrieve all file paths in a GitLab repository
2. Download a file from GitLab as a string (for CWL files)
3. Download a file from GitLab as an in-memory byte stream (for xlsx files)

Python

# List all files (blobs) in a repository using the GitLab API
def list_repo_files(project_path: str, branch: str = "main", token: str | None = None) -> list[str]:
    gl = gitlab.Gitlab("https://git.nfdi4plants.org", private_token=token)
    project = gl.projects.get(project_path)

    # recursive=True walks subdirectories, all=True disables pagination
    items = project.repository_tree(ref=branch, recursive=True, all=True)
    return [item["path"] for item in items if item["type"] == "blob"]

# Download a specific file from the repository as a string using the GitLab API
def download_file(project_path: str, file_path: str, branch: str = "main", token: str | None = None) -> str:
    gl = gitlab.Gitlab("https://git.nfdi4plants.org", private_token=token)
    project = gl.projects.get(project_path)

    file = project.files.get(file_path=file_path, ref=branch)
    # file.decode() returns the raw bytes; decode them as UTF-8 text
    return file.decode().decode("utf-8")

# Download a specific file from the repository as an in-memory byte stream using the GitLab API
def download_file_bytes(project_path: str, file_path: str, branch: str = "main", token: str | None = None) -> BytesIO:
    gl = gitlab.Gitlab("https://git.nfdi4plants.org", private_token=token)
    project = gl.projects.get(project_path)

    file = project.files.get(file_path=file_path, ref=branch)
    # Wrap the raw bytes in BytesIO so they can be read like a file
    return BytesIO(file.decode())
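
Each helper above creates a new client and calls gl.projects.get again, which costs an extra API request for every downloaded file. As a minimal sketch (the names list_blob_paths, read_text and read_stream are made up for illustration and are not part of ARCtrl or python-gitlab), the same three operations could instead take an already fetched project object:

```python
from io import BytesIO
from typing import Any


# Hypothetical variants of the helpers above: they accept a project handle
# (e.g. the result of gl.projects.get(...)) instead of creating a client per call.
def list_blob_paths(project: Any, branch: str = "main") -> list[str]:
    # Same python-gitlab call as in list_repo_files above
    items = project.repository_tree(ref=branch, recursive=True, all=True)
    return [item["path"] for item in items if item["type"] == "blob"]


def read_text(project: Any, file_path: str, branch: str = "main") -> str:
    # files.get(...).decode() yields the raw bytes; decode them as UTF-8 text
    return project.files.get(file_path=file_path, ref=branch).decode().decode("utf-8")


def read_stream(project: Any, file_path: str, branch: str = "main") -> BytesIO:
    # Wrap the raw bytes so they can be read like a file
    return BytesIO(project.files.get(file_path=file_path, ref=branch).decode())
```

With this shape, the project would be fetched once, for example via gitlab.Gitlab("https://git.nfdi4plants.org", private_token=token).projects.get(project_path), and then passed to every call.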

Define Contract handling function

ARCtrl's IO is built on a contract-handling layer that makes it possible to load and write ARCs flexibly from different sources. To load an ARC from the DataHUB, we need to implement a custom contract-handling function that uses the GitLab helper functions defined above to retrieve the required files.

Python

# Fulfill a read contract by downloading the file that matches its expected DTO type.
# branch and token are forwarded to the download helpers.
def handle_read_contract(project_path: str, contract: Contract, branch: str = "main", token: str | None = None) -> Contract:
    # CWL files are handled as text (DTO case 1)
    if "CWL" in contract.DTOType.name:
        cwl_str = download_file(project_path, contract.Path, branch, token)
        contract.DTO = DTO(1, cwl_str)
    # ISA files are handled as xlsx spreadsheets (DTO case 0)
    if "ISA" in contract.DTOType.name:
        xlsx_bytes = download_file_bytes(project_path, contract.Path, branch, token)
        contract.DTO = DTO(0, Xlsx.from_xlsx_bytes(xlsx_bytes))
    return contract

Define a function to load an ARC from DataHUB

Finally, we can define a function download_arc that takes the GitLab repository identifier as input and returns an ARC object representing the loaded ARC. This function:

1. Retrieves the file paths from the GitLab repository
2. Creates an empty ARC object from these file paths
3. Gets empty read contracts for the ARC object
4. Uses the contract handling function to fill the read contracts
5. Injects the filled read contracts back into the ARC object and returns it

Python

def download_arc(project_path: str, branch: str = "main", token: str | None = None) -> ARC:
    print("retrieve file paths")
    filepaths = list_repo_files(project_path, branch, token)

    print("init arc from file paths")
    arc = ARC.from_file_paths(filepaths)

    print("retrieve and fulfill metadata file read contracts")
    # Use the same branch and token for the file downloads as for the file listing
    contracts = [
        handle_read_contract(project_path, contract, branch, token) for contract in arc.GetReadContracts()
    ]

    print("inject metadata from contracts into ARC")
    arc.SetISAFromContracts(contracts)

    return arc
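
The contract mechanism itself does not depend on GitLab: a handler only needs a path and the expected DTO type, and fills in the content. The following offline sketch illustrates that idea with stand-in classes, reading from an in-memory dictionary instead of the DataHUB. FakeContract, FakeDTOType and fulfill are hypothetical and are not ARCtrl types; they only mimic the Path, DTOType.name and DTO attributes used above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Union


# Hypothetical stand-ins, NOT ARCtrl's classes: they only mimic the three
# attributes the handler in this guide touches (Path, DTOType.name, DTO).
class FakeDTOType(Enum):
    ISA_Assay = 0
    CWL = 1


@dataclass
class FakeContract:
    Path: str
    DTOType: FakeDTOType
    DTO: Optional[Union[str, bytes]] = None


def fulfill(
    contract: FakeContract,
    read_text: Callable[[str], str],
    read_bytes: Callable[[str], bytes],
) -> FakeContract:
    """Fill contract.DTO using whichever reader matches the expected type."""
    kind = contract.DTOType.name
    if "CWL" in kind:
        contract.DTO = read_text(contract.Path)
    elif "ISA" in kind:
        contract.DTO = read_bytes(contract.Path)
    return contract


# An in-memory "repository" plays the role of the DataHUB here.
repo = {
    "workflows/align/workflow.cwl": b"cwlVersion: v1.2\n",
    "assays/RNASeq/isa.assay.xlsx": b"<xlsx bytes>",
}

contracts = [
    FakeContract("workflows/align/workflow.cwl", FakeDTOType.CWL),
    FakeContract("assays/RNASeq/isa.assay.xlsx", FakeDTOType.ISA_Assay),
]

filled = [
    fulfill(c, lambda p: repo[p].decode("utf-8"), lambda p: repo[p])
    for c in contracts
]

for c in filled:
    print(c.Path, type(c.DTO).__name__)
```

Swapping the two reader functions for the GitLab helpers gives the DataHUB behaviour; swapping them for local file reads would give a disk-based loader, without changing the dispatch logic.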

Profit

Now we can simply call the download_arc function with the GitLab repository identifier of the ARC we want to load from the DataHUB. In this example, we load the brilator/Facultative-CAM-in-Talinum ARC.

Python

project_path = "brilator/Facultative-CAM-in-Talinum"

arc = download_arc(project_path)

print("check correctness of assay identifiers")
print(arc.AssayIdentifiers)

The last line prints:

['GCqTOF_targets', 'MassHunter_targets', 'RNASeq']
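
The example above needs no token. For an ARC that is not publicly readable, download_arc accepts a token argument. One way to avoid hard-coding the token is to read it from an environment variable. In this sketch the variable name DATAHUB_TOKEN and the helper datahub_token are assumptions made for illustration, not something ARCtrl or the DataHUB prescribes.

```python
import os
from typing import Optional


def datahub_token(var_name: str = "DATAHUB_TOKEN") -> Optional[str]:
    """Return the token stored in the given environment variable, or None."""
    value = os.environ.get(var_name, "").strip()
    # An unset or blank variable means "no token", i.e. anonymous access
    return value or None


# Usage with the function defined earlier in this guide:
# arc = download_arc("brilator/Facultative-CAM-in-Talinum", token=datahub_token())
```

Returning None for a missing variable keeps the call compatible with the default token=None used throughout the helper functions.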