
Validate, Review & Publish your ARC


The ARC offers a lot of flexibility in how I structure my data and annotation. How can I make sure that it is intact or holds enough data to be published and receive a DOI?

Your ARC can be validated in the PLANTDataHUB against a selection of validation packages. For instance, you can validate the ARC

  1. against the arc_specification – to make sure that your ARC is intact
  2. against the invenio requirements – Invenio is the basis of DataPLANT’s publication service

You can select the validation packages directly in ARCitect.

ARCitect

  1. Open the ARC in ARCitect
  2. In the sidebar, go to the Validation menu
  3. Check the box next to the arc_specification dropdown
  4. In the section “Custom Validation Packages”, select the invenio package
  5. Select the latest version for both packages and click Save
  6. Commit your changes via the Commit menu
  7. Upload your changes via the DataHUB Sync menu
  8. Open your ARC in the DataHUB
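For those who prefer the command line, the Commit and DataHUB Sync steps above correspond to an ordinary Git commit and push. A minimal sketch, using a local bare repository as a stand-in for the DataHUB remote (all paths, the identity, and the commit message are illustrative):

```shell
# Sketch only: ARCitect's "Commit" and "DataHUB Sync" menus perform the
# equivalent of a git commit and push under the hood.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/datahub.git"      # local stand-in for the DataHUB remote
git init -q "$tmp/my-arc"
cd "$tmp/my-arc"
git config user.email "me@example.org"     # placeholder identity
git config user.name "Me"
echo "demo" > placeholder.txt              # stands in for your edited ARC files
git add -A
git commit -q -m "Select validation packages"
git remote add origin "$tmp/datahub.git"
git push -q origin HEAD:main
git ls-remote --heads origin | grep -q main && echo "synced"
```

In practice, `origin` would point at your ARC's repository URL on the DataHUB, and large files would be tracked via Git LFS.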

The validation runs in the PLANTDataHUB, where you can also find the validation results.

PLANTDataHUB

  1. Sign in to the PLANTDataHUB
  2. Navigate to the validated ARC
  3. In the right sidebar, you will find several validation badges (e.g. arc specification and invenio)
  4. Click the arc specification badge to explore the test results.

The validation tests check many things, but they cannot catch everything. Whenever you are getting ready to publish an ARC, we suggest you have a look at our guide for reviewing your ARC.

Reviewing your ARC before publication


Before an ARC can be published, it must clearly and completely describe the data it contains. In this guide we outline things to consider before submitting your ARC for publication.

Check the ARC repository on DataHUB


First, check the ARC’s top level information:

DataHUB repository

  1. Open your ARC on PLANTDataHUB
  2. Is the name of the repository descriptive?
  3. The repository should be set to public. This is currently the best way to ensure that the people reviewing the corresponding publication have access to the ARC. Check the visibility via the icon next to the repository name.
    • You can change the visibility under Settings -> General -> Visibility, project features, permissions.
  4. Does the ARC have a README.md with useful information? This is the first thing people see when they look at your ARC on DataHUB and is a useful way to orient the reader within the ARC. (However, make sure that any important information in the README is also contained within the actual ARC structure in order to adhere to the ISA framework.)
  5. Does your repository have a LICENSE file? This is required to allow re-use of your ARC. Follow these instructions and recommendations to add an open license such as MIT or Creative Commons to your ARC. [IMPORTANT]
  6. Check the Project Storage on the right side. Is most of the storage space “LFS” storage? As a rule of thumb, the usage under “Repository” storage should be less than 1 GB.
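The file checks above can also be scripted on a local clone of the ARC. A minimal sketch, using a temporary directory as a stand-in for your ARC clone and 10 MB as an arbitrary "should be in LFS" threshold:

```shell
# Sketch: check an ARC working copy for the files this guide asks about.
# ARC_DIR and the size threshold are illustrative assumptions.
set -e
ARC_DIR=$(mktemp -d)                  # stand-in for your local ARC clone
touch "$ARC_DIR/README.md" "$ARC_DIR/LICENSE"
for f in README.md LICENSE; do
  if [ -f "$ARC_DIR/$f" ]; then
    echo "OK: $f present"
  else
    echo "MISSING: $f"
  fi
done
# Files this large should live in LFS storage, not plain "Repository" storage.
find "$ARC_DIR" -type f -size +10M -not -path '*/.git/*'
```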

Check the ARC (e.g. in ARCitect)


Next, open the ARC itself and run through everything one last time, or ask a friend or colleague to have a look at your ARC to make sure it is understandable.

Investigation

  1. The title and description should be clear and informative.
  2. Check the contacts. Are all collaborators included? Is all data (name, email, ORCiD, affiliation) for the contacts correct and complete?

Studies and Assays

For each study and assay in your ARC, check that:

  1. The top-level metadata is present and informative, including the measurement/technology metadata (in the case of assays).
  2. Within all annotation tables:
    • Columns use ontology terms where possible.
    • Column types (e.g. characteristic, parameter, factor) are correct. See the Annotation Principles for more details.
    • Fixed ISA column headers (e.g. Protocol REF) must not be manually renamed.
    • Input column headers match the ‘type’ of input in the column (this can be Input[Source Name], Input[Sample Name], Input[Material], or Input[Data]).
    • Output column headers match the ‘type’ of output in the column (this can be Output[Source Name], Output[Sample Name], Output[Material], or Output[Data]).
    • Columns have units (where applicable).
    • The values in the columns make sense (e.g. numerical values in a temperature column): no obvious copy-paste errors, Excel autofill (dragging) errors, or missing values.
    • There are no missing columns.
      • Important metadata present in a protocol document has been added to the annotation table.
      • Any community requirements such as templates or validation packages are utilized.
  3. Protocol documents in the protocols folder must be referenced in the corresponding annotation sheet via a Protocol REF column.
  4. [Assays] The data contained in the dataset folder is organised in an understandable way AND is referenced in the Output[Data] column of the corresponding annotation sheet.
  5. [Assays] Check the Datamap if one is present.
  6. Make sure that all column headers within a table are unique.
  7. Did you create or last edit your ISA files in MS Excel rather than in one of the ARC tools? If so, make sure that:
    • Your ISA tables are tables in the MS Excel sense rather than plain cell ranges. A table should include all rows and columns that are meant to contain ISA-formatted data.
    • To check this, right-click a cell that should belong to the table. If the context menu offers “Table” with the sub-item “Convert to Range”, the cell is part of a table and you are fine. Otherwise, select all cells of the table, go to the “Insert” tab and click the “Table” button.

Runs and Workflows

  1. Analysis code is included either as a CWL workflow in the workflows or runs folder, or as a virtual assay. See also our guide on adding data analysis.
  2. Was any external data used in the analysis? Is it clear where this came from?
  3. [CWL workflow] Ensure that it runs using cwltool. See guide for more details.
  4. [Virtual Assay] Code should be in the “protocols” folder, and output data in “dataset” folder. Ensure that the code runs. Typically, the “input” data consumed by the code is already stored somewhere in the ARC, e.g. assay datasets or study resources.
  5. [Virtual Assay] For non-code-based data analysis using software, e.g. for statistics or graphics, we recommend the same logic: describe the data analysis as comprehensibly as possible in the “protocols” folder, and provide the resulting data in the “dataset” folder.
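For CWL workflows, cwltool can both statically validate and run a workflow description. A minimal sketch with an invented single-tool example (the file name and input are illustrative, and cwltool must be installed separately, e.g. via pip install cwltool):

```shell
# Sketch: a minimal CWL CommandLineTool and the cwltool calls used to check it.
cat > echo-tool.cwl <<'EOF'
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs: []
EOF
# Static validation of the description, then a trial run:
#   cwltool --validate echo-tool.cwl
#   cwltool echo-tool.cwl --message "hello"
```

Running the actual workflow with a small test input before publication is the most reliable way to confirm it is reproducible by others.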

All together

To check the overall consistency of your ARC, make sure that the full connection from sample to raw data to processed data is there.

  1. For every output, can you trace its origins back through the ARC? Is the provenance of all data fully described?
    • A mermaid graph is a quick way to visualize this provenance, showing how the inputs and outputs from your studies and assays are connected. Learn more about the arcIsaProcessMermaid tool here
    • The arc-summary.md lists the files that are properly linked in the ARC. This is provided as a downloadable artifact by the CI/CD job “Create ARC json”.
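Such a provenance graph might look like the following sketch; the node names and process labels are invented for illustration:

```mermaid
graph LR
    A[Source: plant_1] -->|Study: growth| B[Sample: leaf_1]
    B -->|Assay: rna_extraction| C[Sample: rna_1]
    C -->|Assay: sequencing| D[Data: rna_1.fastq]
    D -->|Run: dge_analysis| E[Data: counts.csv]
```

Each arrow corresponds to a process in an annotation table, whose Input and Output columns connect the nodes; a break in the chain indicates missing provenance.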

I think my ARC is now ready for publication. How can I submit my ARC for a data publication?

PLANTDataHUB

  1. Sign in to the PLANTDataHUB

  2. Navigate to the ARC you want to publish

  3. In the right sidebar, click on the “Invenio” badge

  4. You are automatically redirected to the DataPLANT Publication Service (https://archigator.nfdi4plants.org)

  5. Login with your DataPLANT account

  6. A summary of the ARC is shown, including the investigation metadata.

  7. You can follow the steps in the right sidebar to publish your ARC.