External data
How can I annotate datasets from external sources in my ARC, for instance datasets from online repositories or supplemental data from publications?
Research projects rarely start out of the blue. Most projects build on previous findings and published or unpublished datasets. In this example we recommend routines to add data from external sources to your ARC.
Add a study for the external data
To properly re-use and reference such a dataset, we recommend adding a study to your ARC.
- Add a study to your ARC, e.g. `EnsemblReferences`:

  - studies/
    - EnsemblReferences/
      - resources/
        - ArabidopsisCDS.fasta
        - HordeumCDS.fasta
      - protocols/
        - DownloadFromEnsembl.md
      - isa.study.xlsx

- In the `protocols` directory you can add notes on how you retrieved the data and from where, e.g. `protocols/DownloadFromEnsembl.md`:

      # Download CDS fasta files from Ensembl Plants

      1. Go to [Ensembl Plants](https://plants.ensembl.org/index.html)
      2. Search for
         - Arabidopsis thaliana
         - Hordeum vulgare
      3. Go to the "Download DNA, cDNA, ncRNA and protein sequences" section
      4. Download the "cDNA FASTA" file

- Add publication details to the study metadata, e.g.

      EnsemblReferences:
      Study Identifier: EnsemblReferences
      Title: Reference CDS for Arabidopsis thaliana
      Publications
      - DOI: https://doi.org/10.1111/tpj.13415
      - Title: Araport11: a complete reannotation of the Arabidopsis thaliana reference genome
      - Authors: Chia‐Yi Cheng, Vivek Krishnakumar, Agnes P. Chan, Françoise Thibaud‐Nissen, Seth Schobel, Christopher D. Town

- Contextualize the data (stored in `resources`) via a datamap.
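The folder layout from the steps above can be scaffolded with a few lines of Python. This is only a minimal sketch: ARCs are normally created and edited with dedicated ARC tooling, and the helper name `scaffold_external_study` is hypothetical. It creates the folders and the protocol note only; `isa.study.xlsx` and the downloaded FASTA files are not generated here.

```python
from pathlib import Path

# Protocol note text, copied from the example in the steps above.
PROTOCOL_NOTE = """\
# Download CDS fasta files from Ensembl Plants

1. Go to [Ensembl Plants](https://plants.ensembl.org/index.html)
2. Search for
   - Arabidopsis thaliana
   - Hordeum vulgare
3. Go to the "Download DNA, cDNA, ncRNA and protein sequences" section
4. Download the "cDNA FASTA" file
"""


def scaffold_external_study(arc_root: Path, study_id: str) -> Path:
    """Create the folders for an external-data study and write the protocol note.

    Hypothetical helper for illustration; it does not create isa.study.xlsx.
    """
    study_dir = arc_root / "studies" / study_id
    (study_dir / "resources").mkdir(parents=True, exist_ok=True)
    (study_dir / "protocols").mkdir(parents=True, exist_ok=True)

    note = study_dir / "protocols" / "DownloadFromEnsembl.md"
    if not note.exists():  # do not overwrite a note that was edited by hand
        note.write_text(PROTOCOL_NOTE, encoding="utf-8")
    return study_dir
```

Calling `scaffold_external_study(Path("."), "EnsemblReferences")` from the ARC root would produce the `studies/EnsemblReferences` folders shown above, ready for the downloaded files to be placed in `resources`.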
Annotate the downloaded dataset files
To attach more meaning to the external data, you can annotate the data files in the annotation table of the study.
| ProtocolUri | Factor [Organism] | Output[Data] |
|---|---|---|
| ./protocols/DownloadFromEnsembl.md | Arabidopsis thaliana | ArabidopsisCDS.fasta |
| ./protocols/DownloadFromEnsembl.md | Hordeum vulgare | HordeumCDS.fasta |
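A quick consistency check can catch annotation rows that point at files that were never downloaded. The sketch below assumes the rows of the annotation table above are available as plain tuples and that the output files live in the study's `resources` folder; `missing_outputs` is a hypothetical helper, not part of any ARC tool.

```python
from pathlib import Path

# Rows mirroring the annotation table above: (ProtocolUri, organism, output file).
ROWS = [
    ("./protocols/DownloadFromEnsembl.md", "Arabidopsis thaliana", "ArabidopsisCDS.fasta"),
    ("./protocols/DownloadFromEnsembl.md", "Hordeum vulgare", "HordeumCDS.fasta"),
]


def missing_outputs(study_dir: Path, rows=ROWS) -> list[str]:
    """Return annotated output files that are absent from <study_dir>/resources."""
    resources = study_dir / "resources"
    return [name for _protocol, _organism, name in rows
            if not (resources / name).is_file()]
```

For example, `missing_outputs(Path("studies/EnsemblReferences"))` returns an empty list when both FASTA files are in place.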
Integrate the files in your data analysis
Having the dataset files properly annotated and contextualized, you can now integrate them into your ARC just as you would do for any other dataset. For example, you can use the downloaded reference sequences as input for a mapping step in a transcriptomics workflow or assay.
| Input[Data] | ProtocolUri | Output[Data] |
|---|---|---|
| ./assays/RNA-Seq/dataset/Sample1.fastq.gz | ./protocols/RNASeq-mapping.md | Sample1-counts.tsv |
| ./studies/EnsemblReferences/resources/ArabidopsisCDS.fasta | ./protocols/RNASeq-mapping.md | Sample1-counts.tsv |
| ./assays/RNA-Seq/dataset/Sample2.fastq.gz | ./protocols/RNASeq-mapping.md | Sample2-counts.tsv |
| ./studies/EnsemblReferences/resources/ArabidopsisCDS.fasta | ./protocols/RNASeq-mapping.md | Sample2-counts.tsv |
| ./assays/RNA-Seq/dataset/Sample3.fastq.gz | ./protocols/RNASeq-mapping.md | Sample3-counts.tsv |
| ./studies/EnsemblReferences/resources/ArabidopsisCDS.fasta | ./protocols/RNASeq-mapping.md | Sample3-counts.tsv |
| ./assays/RNA-Seq/dataset/Sample4.fastq.gz | ./protocols/RNASeq-mapping.md | Sample4-counts.tsv |
| ./studies/EnsemblReferences/resources/ArabidopsisCDS.fasta | ./protocols/RNASeq-mapping.md | Sample4-counts.tsv |
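In the table above, each output appears in two rows: once with the sample's reads and once with the reference sequence. The following minimal sketch shows how such rows can be grouped to recover all inputs per output. It uses plain Python with rows copied from the first two samples; the function name `inputs_per_output` is an assumption made for illustration.

```python
from collections import defaultdict

REFERENCE = "./studies/EnsemblReferences/resources/ArabidopsisCDS.fasta"
PROTOCOL = "./protocols/RNASeq-mapping.md"

# (Input[Data], ProtocolUri, Output[Data]) for the first two samples of the table.
ROWS = [
    ("./assays/RNA-Seq/dataset/Sample1.fastq.gz", PROTOCOL, "Sample1-counts.tsv"),
    (REFERENCE, PROTOCOL, "Sample1-counts.tsv"),
    ("./assays/RNA-Seq/dataset/Sample2.fastq.gz", PROTOCOL, "Sample2-counts.tsv"),
    (REFERENCE, PROTOCOL, "Sample2-counts.tsv"),
]


def inputs_per_output(rows):
    """Map each output file to the list of inputs that contributed to it."""
    grouped = defaultdict(list)
    for input_path, _protocol, output_name in rows:
        grouped[output_name].append(input_path)
    return dict(grouped)


for output_name, inputs in inputs_per_output(ROWS).items():
    print(output_name, "<-", ", ".join(inputs))
```

Each counts file then maps to its FASTQ file plus the shared reference, which makes the provenance of the external data explicit in the analysis.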