
Validate, Review & Publish your ARC


The ARC offers a lot of flexibility in how I structure my data and annotation. How can I make sure that it is intact or holds enough data to be published and receive a DOI?

Your ARC can be validated in the PLANTDataHUB against a selection of validation packages. For instance, you can validate the ARC

  1. against the arc_specification – to make sure that your ARC is intact
  2. against the invenio requirements – Invenio is the basis of DataPLANT’s publication service

You can select the validation packages directly in ARCitect.

ARCitect

  1. Open the ARC in ARCitect
  2. In the sidebar, go to the Validation menu
  3. Check the box next to the arc_specification dropdown
  4. In the section “Custom Validation Packages”, select the invenio package
  5. Select the latest version for both packages and click Save
  6. Commit your changes via the Commit menu
  7. Upload your changes via the DataHUB Sync menu
  8. Open your ARC in the DataHUB
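For those who prefer the command line, the Commit and DataHUB Sync steps above correspond to an ordinary Git commit and push. A minimal sketch, using a local bare repository as a stand-in for the DataHUB remote (all paths, the identity, and the commit message are illustrative):

```shell
# Sketch only: ARCitect's "Commit" and "DataHUB Sync" menus perform the
# equivalent of a git commit and push under the hood.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/datahub.git"      # local stand-in for the DataHUB remote
git init -q "$tmp/my-arc"
cd "$tmp/my-arc"
git config user.email "me@example.org"     # placeholder identity
git config user.name "Me"
echo "demo" > placeholder.txt              # stands in for your edited ARC files
git add -A
git commit -q -m "Select validation packages"
git remote add origin "$tmp/datahub.git"
git push -q origin HEAD:main
git ls-remote --heads origin | grep -q main && echo "synced"
```

In practice, `origin` would point at your ARC's repository URL on the DataHUB, and large files would be tracked via Git LFS.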

The validation runs in the PLANTDataHUB, where you can also find the validation results.

PLANTDataHUB

  1. Sign in to the PLANTDataHUB
  2. Navigate to the validated ARC
  3. In the right sidebar, you will find several validation badges (e.g. arc specification and invenio)
  4. Click the arc specification badge to explore the test results.

The validation tests check many things, but they cannot catch everything. Whenever you are getting ready to publish an ARC, we suggest you have a look at our guide for reviewing your ARC.

Reviewing your ARC before publication


Before an ARC can be published, it must clearly and completely describe the data it contains. In this guide we outline things to consider before submitting your ARC for publication.

Check the ARC repository on DataHUB


First, check the ARC’s top level information:

DataHUB repository

  1. Open your ARC on PLANTDataHUB
  2. Is the name of the repository descriptive?
  3. The repository should be set to public. This is currently the best way to ensure that the people reviewing the corresponding publication have access to the ARC. Check the visibility via the icon next to the repository name.
    • You can change the visibility under Settings -> General -> Visibility, project features, permissions.
  4. Does the ARC have a README.md with useful information? This is the first thing people see when they look at your ARC on DataHUB and is a useful way to orient the reader within the ARC. (However, make sure that any important information in the README is also contained within the actual ARC structure in order to adhere to the ISA framework.)
  5. Does your repository have a LICENSE file? This is required to allow re-use of your ARC. Follow these instructions and recommendations to add an open license such as MIT or Creative Commons to your ARC. [IMPORTANT]
  6. Check the Project Storage on the right side. Is most of the storage space “LFS” storage? As a rule of thumb, the usage under “Repository” storage should be less than 1 GB.
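The file checks above can also be scripted on a local clone of the ARC. A minimal sketch, using a temporary directory as a stand-in for your ARC clone and 10 MB as an arbitrary "should be in LFS" threshold:

```shell
# Sketch: check an ARC working copy for the files this guide asks about.
# ARC_DIR and the size threshold are illustrative assumptions.
set -e
ARC_DIR=$(mktemp -d)                  # stand-in for your local ARC clone
touch "$ARC_DIR/README.md" "$ARC_DIR/LICENSE"
for f in README.md LICENSE; do
  if [ -f "$ARC_DIR/$f" ]; then
    echo "OK: $f present"
  else
    echo "MISSING: $f"
  fi
done
# Files this large should live in LFS storage, not plain "Repository" storage.
find "$ARC_DIR" -type f -size +10M -not -path '*/.git/*'
```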

Check the ARC (e.g. in ARCitect)


Next, open the ARC itself and run through everything one last time, or ask a friend or colleague to have a look at your ARC to make sure it is understandable.

Investigation

  1. The title and description should be clear and informative.
  2. Check the contacts. Are all collaborators included? Is all data (name, email, ORCiD, affiliation) for the contacts correct and complete?

Studies and Assays

For each study and assay in your ARC, check that:

  1. The top-level metadata is present and informative, including the measurement/technology metadata (in the case of assays).
  2. Within all annotation tables:
    • Columns use ontology terms where possible.
    • Column types (e.g. characteristic, parameter, factor) are correct. See the Annotation Principles for more details.
    • Fixed ISA column headers (e.g. Protocol REF) must not be manually renamed.
    • Input column headers match the ‘type’ of input in the column (this can be Input[Source Name], Input[Sample Name], Input[Material], or Input[Data]).
    • Output column headers match the ‘type’ of output in the column (this can be Output[Source Name], Output[Sample Name], Output[Material], or Output[Data]).
    • Columns have units (where applicable).
    • The values in the columns make sense (e.g. numerical values in a temperature column): no obvious copy-paste errors, Excel autofill (dragging) errors, or missing values.
    • There are no missing columns.
      • Important metadata present in a protocol document has been added to the annotation table.
      • Any community requirements such as templates or validation packages are utilized.
  3. Protocol documents in the protocols folder must be referenced in the corresponding annotation sheet via a Protocol REF column.
  4. [Assays] The data contained in the dataset folder is organised in an understandable way AND is referenced in the Output[Data] column of the corresponding annotation sheet.
  5. [Assays] Check the Datamap if one is present.
  6. Make sure that all column headers within a table are unique.
  7. Did you create or last edit your ISA files in MS Excel rather than in one of the ARC tools? If so, make sure that:
    • Your ISA tables are tables in the MS Excel sense rather than plain cell ranges. A table should include all rows and columns that are meant to contain ISA-formatted data.
    • To check this, right-click a cell that should belong to the table. If the context menu offers “Table” with the sub-item “Convert to Range”, the cell is part of a table and you are fine. Otherwise, select all cells of the table, go to the “Insert” tab and click the “Table” button.

Runs and Workflows

  1. Analysis code is included either as a CWL workflow in the workflows or runs folder, or as a virtual assay. See also our guide on adding data analysis.
  2. Was any external data used in the analysis? Is it clear where this came from?
  3. [CWL workflow] Ensure that it runs using cwltool. See guide for more details.
  4. [Virtual Assay] Code should be in the “protocols” folder, and output data in “dataset” folder. Ensure that the code runs. Typically, the “input” data consumed by the code is already stored somewhere in the ARC, e.g. assay datasets or study resources.
  5. [Virtual Assay] For non-code-based data analysis using software, e.g. for statistics or graphics, we recommend the same logic: describe the data analysis as comprehensibly as possible in the “protocols” folder, and provide the resulting data in the “dataset” folder.
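For CWL workflows, cwltool can both statically validate and run a workflow description. A minimal sketch with an invented single-tool example (the file name and input are illustrative, and cwltool must be installed separately, e.g. via pip install cwltool):

```shell
# Sketch: a minimal CWL CommandLineTool and the cwltool calls used to check it.
cat > echo-tool.cwl <<'EOF'
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs: []
EOF
# Static validation of the description, then a trial run:
#   cwltool --validate echo-tool.cwl
#   cwltool echo-tool.cwl --message "hello"
```

Running the actual workflow with a small test input before publication is the most reliable way to confirm it is reproducible by others.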

All together

To check the overall consistency of your ARC, make sure that the full connection from sample to raw data to processed data is there.

  1. For every output, can you trace its origins back through the ARC? Is the provenance of all data fully described?
    • A mermaid graph is a quick way to visualize this provenance, showing how the inputs and outputs from your studies and assays are connected. Learn more about the arcIsaProcessMermaid tool here
    • The arc-summary.md lists the files that are properly linked in the ARC. This is provided as a downloadable artifact by the CI/CD job “Create ARC json”.
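Such a provenance graph might look like the following sketch; the node names and process labels are invented for illustration:

```mermaid
graph LR
    A[Source: plant_1] -->|Study: growth| B[Sample: leaf_1]
    B -->|Assay: rna_extraction| C[Sample: rna_1]
    C -->|Assay: sequencing| D[Data: rna_1.fastq]
    D -->|Run: dge_analysis| E[Data: counts.csv]
```

Each arrow corresponds to a process in an annotation table, whose Input and Output columns connect the nodes; a break in the chain indicates missing provenance.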

I think my ARC is now ready for publication. How can I submit my ARC for a data publication?

PLANTDataHUB

  1. Sign in to the PLANTDataHUB

  2. Navigate to the ARC you want to publish

  3. In the right sidebar, click on the “Invenio” badge

  4. You are automatically redirected to the DataPLANT Publication Service (https://archigator.nfdi4plants.org)

  5. Login with your DataPLANT account

  6. A summary of the ARC is shown, including the investigation metadata.

  7. You can follow the steps in the right sidebar to publish your ARC.