External data

How can I annotate datasets from external sources in my ARC, for instance datasets from online repositories or supplemental data from publications?

Research projects rarely start out of the blue. Most projects build on previous findings and on published or unpublished datasets. This example shows a recommended routine for adding data from external sources to your ARC.

To properly re-use and reference such a dataset, we recommend adding a dedicated study to your ARC.

  1. Add a study to your ARC, e.g. EnsemblReferences

    • studies/
      • EnsemblReferences/
        • resources/
          • ArabidopsisCDS.fasta
          • HordeumCDS.fasta
        • protocols/
          • DownloadFromEnsembl.md
        • isa.study.xlsx
  2. In the protocols directory you can add notes on how you retrieved the data and from where, e.g.

    protocols/DownloadFromEnsembl.md
    # Download CDS fasta files from Ensembl Plants
    1. Go to [Ensembl Plants](https://plants.ensembl.org/index.html)
    2. Search for
    - Arabidopsis thaliana
    - Hordeum vulgare
    3. Go to the "Download DNA, cDNA, ncRNA and protein sequences" section
    4. Download the "cDNA FASTA" file
  3. Add publication details to the study metadata, e.g.

    EnsemblReferences: Study
    Identifier: EnsemblReferences
    Title: Reference CDS for Arabidopsis thaliana
    Publications
    - DOI: https://doi.org/10.1111/tpj.13415
    - Title: Araport11: a complete reannotation of the Arabidopsis thaliana reference genome
    - Authors: Chia‐Yi Cheng, Vivek Krishnakumar, Agnes P. Chan, Françoise Thibaud‐Nissen, Seth Schobel, Christopher D. Town
  4. Contextualize the data (stored in resources) via a datamap

To attach more meaning to the external data, you can annotate the data files in the annotation table of the study.

| ProtocolUri | Factor [Organism] | Output[Data] |
| --- | --- | --- |
| ./protocols/DownloadFromEnsembl.md | Arabidopsis thaliana | ArabidopsisCDS.fasta |
| ./protocols/DownloadFromEnsembl.md | Hordeum vulgare | HordeumCDS.fasta |
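
The study layout from step 1 and the rows of the annotation table can also be prepared with a few lines of script. The following is a minimal sketch in plain Python, not an official ARC tool: the helper names `scaffold_study`, `annotation_rows` and `to_tsv` are hypothetical, and the output is tab-separated text that you would still have to transfer into `isa.study.xlsx` yourself. The folder names, file names and column headers are taken from the example above.

```python
import tempfile
from pathlib import Path

STUDY = "EnsemblReferences"
PROTOCOL = "./protocols/DownloadFromEnsembl.md"

# Organism -> file name in resources/, as used in the example study.
REFERENCES = {
    "Arabidopsis thaliana": "ArabidopsisCDS.fasta",
    "Hordeum vulgare": "HordeumCDS.fasta",
}


def scaffold_study(arc_root: Path, study: str = STUDY) -> Path:
    """Create studies/<study>/{resources,protocols} and return the study folder."""
    study_dir = arc_root / "studies" / study
    for sub in ("resources", "protocols"):
        (study_dir / sub).mkdir(parents=True, exist_ok=True)
    return study_dir


def annotation_rows(references=REFERENCES, protocol=PROTOCOL):
    """Build one annotation row per external data file."""
    return [
        {
            "ProtocolUri": protocol,
            "Factor [Organism]": organism,
            "Output[Data]": filename,
        }
        for organism, filename in references.items()
    ]


def to_tsv(rows):
    """Render the rows as tab-separated text with a header line."""
    header = list(rows[0])
    lines = ["\t".join(header)]
    lines += ["\t".join(row[col] for col in header) for row in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    # A temporary folder stands in for the ARC root in this sketch.
    with tempfile.TemporaryDirectory() as tmp:
        study_dir = scaffold_study(Path(tmp))
        print(sorted(p.name for p in study_dir.iterdir()))
    print(to_tsv(annotation_rows()))
```

Keeping the organism-to-file mapping in one dictionary means a further reference genome only needs one new entry to appear in the table.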

Having the dataset files properly annotated and contextualized, you can now integrate them into your ARC just as you would do for any other dataset. For example, you can use the downloaded reference sequences as input for a mapping step in a transcriptomics workflow or assay.

| Input[Data] | ProtocolUri | Output[Data] |
| --- | --- | --- |
| ./assays/RNA-Seq/dataset/Sample1.fastq.gz | ./protocols/RNASeq-mapping.md | Sample1-counts.tsv |
| ./studies/EnsemblReferences/resources/ArabidopsisCDS.fasta | ./protocols/RNASeq-mapping.md | Sample1-counts.tsv |
| ./assays/RNA-Seq/dataset/Sample2.fastq.gz | ./protocols/RNASeq-mapping.md | Sample2-counts.tsv |
| ./studies/EnsemblReferences/resources/ArabidopsisCDS.fasta | ./protocols/RNASeq-mapping.md | Sample2-counts.tsv |
| ./assays/RNA-Seq/dataset/Sample3.fastq.gz | ./protocols/RNASeq-mapping.md | Sample3-counts.tsv |
| ./studies/EnsemblReferences/resources/ArabidopsisCDS.fasta | ./protocols/RNASeq-mapping.md | Sample3-counts.tsv |
| ./assays/RNA-Seq/dataset/Sample4.fastq.gz | ./protocols/RNASeq-mapping.md | Sample4-counts.tsv |
| ./studies/EnsemblReferences/resources/ArabidopsisCDS.fasta | ./protocols/RNASeq-mapping.md | Sample4-counts.tsv |
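
The mapping table follows a regular pattern: every sample contributes two rows, one for its reads and one for the shared reference, and both rows point to the same output file. A short Python sketch of that pattern is given below. The function name `mapping_rows` is hypothetical and the paths are simply those of the example; the script only prints tab-separated rows and does not write to any ISA file.

```python
REFERENCE = "./studies/EnsemblReferences/resources/ArabidopsisCDS.fasta"
PROTOCOL = "./protocols/RNASeq-mapping.md"


def mapping_rows(samples, reference=REFERENCE, protocol=PROTOCOL):
    """Return (Input[Data], ProtocolUri, Output[Data]) rows for each sample.

    Each sample yields two rows that share one output: the first row has
    the sample's reads as input, the second has the reference sequence.
    """
    rows = []
    for sample in samples:
        reads = f"./assays/RNA-Seq/dataset/{sample}.fastq.gz"
        output = f"{sample}-counts.tsv"
        for input_path in (reads, reference):
            rows.append((input_path, protocol, output))
    return rows


if __name__ == "__main__":
    print("\t".join(("Input[Data]", "ProtocolUri", "Output[Data]")))
    for row in mapping_rows(["Sample1", "Sample2", "Sample3", "Sample4"]):
        print("\t".join(row))
```

Because the reference path is a single parameter, switching the mapping to `HordeumCDS.fasta` would only require passing a different `reference` value.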