Consume ARCs from DataHUB
This guide demonstrates how to consume ARCs via ARCtrl directly from the DataHUB, i.e. loading an ARC without cloning the ARC repository locally.
First, we need to install the arctrl and python-gitlab packages.
pip install arctrl
pip install python-gitlab

Then we can make the required imports:

from arctrl import ARC, Contract
from arctrl.py.fable_modules.fs_spreadsheet_py.xlsx import Xlsx
from arctrl.py.Contract.contract import DTO

from io import BytesIO
import gitlab

Define GitLab download helper functions

With the GitLab API package set up, we can define some helper functions to download files from GitLab repositories. We need three functions:
- Retrieve all file paths in a GitLab repository
- Download a file from GitLab as string (for cwl files)
- Download a file from GitLab as bytes (for xlsx files)
# For a given repository, list all files in the repository using the GitLab API
def list_repo_files(project_path: str, branch: str = "main", token: str | None = None) -> list[str]:
    gl = gitlab.Gitlab("https://git.nfdi4plants.org", private_token=token)
    project = gl.projects.get(project_path)

    items = project.repository_tree(ref=branch, recursive=True, all=True)
    # Keep only files ("blob") and skip directories
    return [item["path"] for item in items if item["type"] == "blob"]

# Download a specific file from the repository as a string using the GitLab API
def download_file(project_path: str, file_path: str, branch: str = "main", token: str | None = None) -> str:
    gl = gitlab.Gitlab("https://git.nfdi4plants.org", private_token=token)
    project = gl.projects.get(project_path)

    file = project.files.get(file_path=file_path, ref=branch)
    # file.decode() yields the raw file bytes, which are then decoded as UTF-8 text
    return file.decode().decode("utf-8")

# Download a specific file from the repository as an in-memory byte stream using the GitLab API
def download_file_bytes(project_path: str, file_path: str, branch: str = "main", token: str | None = None) -> BytesIO:
    gl = gitlab.Gitlab("https://git.nfdi4plants.org", private_token=token)
    project = gl.projects.get(project_path)

    file = project.files.get(file_path=file_path, ref=branch)
    # Wrap the raw bytes in a BytesIO object so they can be read like a file
    return BytesIO(file.decode())

Define Contract handling function
ARCtrl IO is based on a contract handling layer that allows ARCs to be loaded from and written to different sources flexibly. To load an ARC from the DataHUB, we need to implement a custom contract handling function that uses the GitLab helper functions defined above to retrieve the required files.
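The idea behind such a contract layer can be illustrated without ARCtrl or GitLab at all. The following is only a toy sketch: `ReadRequest`, `fulfill`, and `fake_repo` are hypothetical stand-ins invented for this illustration and are not ARCtrl types. It shows a request that names a path, and a caller-supplied fetch function that fills it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReadRequest:
    """Hypothetical stand-in for a read contract: a path plus an empty payload slot."""
    path: str
    payload: Optional[str] = None


def fulfill(request: ReadRequest, fetch: Callable[[str], str]) -> ReadRequest:
    """Fill the request using whatever fetch function the caller supplies."""
    request.payload = fetch(request.path)
    return request


# An in-memory "repository" standing in for a local disk, GitLab, or any other source.
fake_repo = {"workflows/demo/workflow.cwl": "cwlVersion: v1.2"}

request = fulfill(ReadRequest("workflows/demo/workflow.cwl"), fake_repo.__getitem__)
print(request.payload)
```

Because the code that describes what must be read is separate from the code that fetches it, swapping the local file system for the DataHUB only requires a different fetch function.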
# Fulfill a read contract by downloading the file according to the expected DTO type.
# branch and token are passed through to the download helpers so that
# non-default branches and private repositories are handled consistently.
def handle_read_contract(project_path: str, contract: Contract, branch: str = "main", token: str | None = None) -> Contract:
    # cwl files are handled as text
    if "CWL" in contract.DTOType.name:
        cwl_str = download_file(project_path, contract.Path, branch, token)
        contract.DTO = DTO(1, cwl_str)
    # isa files are handled as xlsx spreadsheets
    if "ISA" in contract.DTOType.name:
        xlsx_bytes = download_file_bytes(project_path, contract.Path, branch, token)
        contract.DTO = DTO(0, Xlsx.from_xlsx_bytes(xlsx_bytes))
    return contract

Define a function to load an ARC from DataHUB

Finally, we can define a function download_arc that takes the GitLab repository identifier as input and returns an ARC object that represents the loaded ARC. This function:
- Retrieves the file paths from the GitLab repository
- Creates an empty ARC object from these file paths
- Gets empty read contracts for the ARC object
- Uses the contract handling function to fill the read contracts
- Injects filled read contracts back into the ARC object and returns it

def download_arc(project_path: str, branch: str = "main", token: str | None = None) -> ARC:
    print("retrieve file paths")
    filepaths = list_repo_files(project_path, branch, token)

    print("init arc from file paths")
    arc = ARC.from_file_paths(filepaths)

    print("retrieve and fulfill metadata file read contracts")
    contracts = [
        handle_read_contract(project_path, contract, branch, token)
        for contract in arc.GetReadContracts()
    ]

    print("inject metadata from contracts into ARC")
    arc.SetISAFromContracts(contracts)

    return arc

Now we can simply call the download_arc function with the GitLab repository identifier of the ARC we want to load from the DataHUB. In this example, we load the brilator/Facultative-CAM-in-Talinum ARC.
project_path = "brilator/Facultative-CAM-in-Talinum"
arc = download_arc(project_path)
print("check correctness of assay identifiers")
print(arc.AssayIdentifiers)

This returns:
['GCqTOF_targets', 'MassHunter_targets', 'RNASeq']
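As an additional sanity check, the assay identifiers can also be derived from the repository file paths alone, before any metadata file is downloaded. The sketch below assumes the usual ARC layout, in which each assay lives at assays/<identifier>/isa.assay.xlsx. The helper name `assay_ids_from_paths` and the example paths are made up for illustration and are not part of ARCtrl.

```python
from pathlib import PurePosixPath


def assay_ids_from_paths(filepaths: list[str]) -> list[str]:
    """Collect <identifier> from every path shaped like assays/<identifier>/isa.assay.xlsx."""
    ids = set()
    for path in filepaths:
        parts = PurePosixPath(path).parts
        if len(parts) == 3 and parts[0] == "assays" and parts[2] == "isa.assay.xlsx":
            ids.add(parts[1])
    return sorted(ids)


# Hypothetical file listing, shaped like the output of list_repo_files
example_paths = [
    "isa.investigation.xlsx",
    "assays/RNASeq/isa.assay.xlsx",
    "assays/RNASeq/dataset/reads.fastq",
    "studies/demo/isa.study.xlsx",
]
print(assay_ids_from_paths(example_paths))
```

With a real repository, the result of assay_ids_from_paths(list_repo_files(project_path)) could be compared against sorted(arc.AssayIdentifiers). A mismatch would suggest that an assay metadata file was not loaded.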