# Describing data with the DataMap
## From structure to meaning

In the previous sections, we focused on improving the structure of our data. We transformed a messy spreadsheet into a clean, partially normalized table in which the key components of the measurement became explicit.
At this point, our data looks like this:
| PlantID | Day | Height |
|---|---|---|
| P1 | 1 | 10 |
| P1 | 2 | 12 |
| P2 | 1 | 8 |
| P2 | 2 | 9 |
This table is already much easier to interpret than the original spreadsheet. However, an important question remains:
How do we make sure that others understand what this data means?
The table tells us how values are arranged, but not fully what they represent, how they were generated, or how they should be interpreted.
To complete the picture, we now introduce two additional layers:
- the assay, which provides experimental context
- the DataMap, which provides semantic annotation
## A small LiDAR assay example

To make this concrete, let us assume that plant height was measured using a LiDAR-based assay.
Within an Annotated Research Context (ARC), this would be represented as an assay folder:
```
assays/
└── LiDARHeightMeasurement/
    ├── dataset/
    │   └── plant_height_lidar.tsv
    ├── protocols/
    │   └── lidar_measurement_protocol.md
    ├── isa.assay.xlsx
    └── README.md
```
The actual measurement table is stored in:
`dataset/plant_height_lidar.tsv`
and contains:
| PlantID | Day | Height |
|---|---|---|
| P1 | 1 | 10 |
| P1 | 2 | 12 |
| P2 | 1 | 8 |
| P2 | 2 | 9 |
This file contains the values, but not their full meaning.
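Parsed naively, such a TSV yields only untyped strings. A quick sketch using Python's standard library illustrates this (the inline string stands in for the actual file):

```python
import csv
import io

# Inline stand-in for dataset/plant_height_lidar.tsv
tsv = "PlantID\tDay\tHeight\nP1\t1\t10\nP1\t2\t12\nP2\t1\t8\nP2\t2\t9\n"

# csv.DictReader parses the rows, but every value arrives as a plain string:
# nothing in the file itself says that Height is numeric, let alone its unit.
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
print(rows[0])  # {'PlantID': 'P1', 'Day': '1', 'Height': '10'}
```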
## The role of the assay table

The assay itself provides the experimental and technical context. A minimal `isa.assay.xlsx` for this example could look like this:
| Sample Name | Protocol REF | Parameter [Day] | Parameter [instrument model] | Raw Data File |
|---|---|---|---|---|
| P1 | LiDAR height acquisition | 1 | LiDAR | dataset/plant_height_lidar.tsv |
| P1 | LiDAR height acquisition | 2 | LiDAR | dataset/plant_height_lidar.tsv |
| P2 | LiDAR height acquisition | 1 | LiDAR | dataset/plant_height_lidar.tsv |
| P2 | LiDAR height acquisition | 2 | LiDAR | dataset/plant_height_lidar.tsv |
This table does not duplicate the measurement values. Instead, it:
- links samples to the measurement protocol
- records relevant parameters (e.g. day, instrument)
- connects the assay to the data file
In other words:
The assay tells us where the data comes from and how it was generated.
However, it still does not fully define what each column in the dataset means.
## The role of the DataMap

This is where the DataMap comes in.
The DataMap provides a way to annotate fragments of data files, such as columns, in a structured and machine-readable way.
Each row in the DataMap refers to a specific fragment of a file and describes its meaning.
For our LiDAR example, a DataMap could look like this:
| Data | Data Format | Data Selector Format | Explication | Term Source REF | Term Accession Number | Unit | Term Source REF | Term Accession Number | Object Type | Term Source REF | Term Accession Number | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| plant_height_lidar.tsv#col=1 | text/tab-separated-values | rfc7111 | plant identifier | | | | | | string | xsd | xsd:string | Identifier of the measured plant |
| plant_height_lidar.tsv#col=2 | text/tab-separated-values | rfc7111 | day of measurement | | | day | UO | UO_0000033 | integer | xsd | xsd:integer | Measurement time point |
| plant_height_lidar.tsv#col=3 | text/tab-separated-values | rfc7111 | plant height | OBA | OBA_VT0001251 | centimeter | UO | UO_0000015 | double | xsd | xsd:double | Plant height measured using LiDAR |
For readability, URIs are shortened to prefixes (e.g. UO, OBA, xsd) in this example.
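The `rfc7111` selector format refers to RFC 7111 fragment identifiers for CSV-like files, in which `#col=3` addresses the third column. A minimal sketch of resolving such a selector against a TSV file (the function name is illustrative, and only the single-column form of the selector is handled):

```python
import csv
import io

def select_column(tsv_text: str, fragment: str) -> list:
    """Resolve a minimal RFC 7111 column selector such as '#col=3'.

    Sketch only: ranges like '#col=2-3' are not handled here.
    """
    col = int(fragment.removeprefix("#col=")) - 1  # RFC 7111 columns are 1-based
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [row[col] for row in rows]

tsv = "PlantID\tDay\tHeight\nP1\t1\t10\nP1\t2\t12\n"
print(select_column(tsv, "#col=3"))  # ['Height', '10', '12']
```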
## Putting the pieces together

At this point, we have three complementary layers:

1. the dataset (`plant_height_lidar.tsv`) → contains the actual measurement values
2. the assay table (`isa.assay.xlsx`) → describes the measurement process and links data files
3. the DataMap → annotates what each column in the dataset represents
Each layer answers a different question:
- dataset → What are the values?
- assay → How were they generated?
- DataMap → What do they mean?
This separation is essential, because:
Not all meaning can or should be encoded directly in the table structure.
## Why this matters for real-world data

In practice, datasets are rarely perfectly normalized or uniform. We often encounter:
- partially normalized tables
- outputs from instruments
- intermediate analysis files
- mixed representations across workflows
Instead of forcing all data into one strict format, the combination of:
- structured tables
- assay context
- DataMap annotation
allows us to work with heterogeneous data in a consistent and interpretable way.
## A note on storage formats

Once data is structured and annotated, it should be stored in a format that preserves these properties.
Plain text tabular formats such as TSV or CSV are recommended because they are:
- easy to process programmatically
- suitable for long-term storage
- compatible with a wide range of tools
- well suited for version control systems such as Git
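Writing such files needs no special tooling. A sketch using Python's standard library (the explicit `lineterminator` keeps line endings consistent across platforms, which produces cleaner diffs under version control):

```python
import csv
import io

rows = [
    {"PlantID": "P1", "Day": 1, "Height": 10},
    {"PlantID": "P1", "Day": 2, "Height": 12},
]

# Write a TSV into an in-memory buffer; in practice this would be a file
# opened with open(path, "w", newline="").
buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=["PlantID", "Day", "Height"],
    delimiter="\t",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```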
## Transition to the final step

At this stage, we have:
- improved the structure of the data
- provided experimental context through the assay
- annotated semantic meaning through the DataMap
The final step is to decide whether we also want to make the variable itself explicit within the table.
This leads to the fully normalized long format, where each row represents one complete measurement.
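As a preview, that reshaping can be sketched in plain Python (the column names `Variable` and `Value` are illustrative choices, not prescribed by any format):

```python
# Wide form: the measured variable is implied by the column name "Height".
rows = [
    {"PlantID": "P1", "Day": 1, "Height": 10},
    {"PlantID": "P2", "Day": 1, "Height": 8},
]

# Long form: the variable becomes explicit, one complete measurement per row.
long_rows = [
    {"PlantID": r["PlantID"], "Day": r["Day"], "Variable": "Height", "Value": r["Height"]}
    for r in rows
]
print(long_rows[0])  # {'PlantID': 'P1', 'Day': 1, 'Variable': 'Height', 'Value': 10}
```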