Describing data with the DataMap

In the previous sections, we focused on improving the structure of our data. We transformed a messy spreadsheet into a clean, partially normalized table in which the key components of the measurement are explicit.

At this point, our data looks like this:

| PlantID | Day | Height |
| ------- | --- | ------ |
| P1      | 1   | 10     |
| P1      | 2   | 12     |
| P2      | 1   | 8      |
| P2      | 2   | 9      |

This table is already much easier to interpret than the original spreadsheet. However, an important question remains:

How do we make sure that others understand what this data means?

The table tells us how values are arranged, but not fully what they represent, how they were generated, or how they should be interpreted.

To complete the picture, we now introduce two additional layers:

  • the assay, which provides experimental context
  • the DataMap, which provides semantic annotation

To make this concrete, let us assume that plant height was measured using a LiDAR-based assay.

Within an Annotated Research Context (ARC), this would be represented as an assay folder:

  • assays/
    • LiDARHeightMeasurement/
      • dataset/
        • plant_height_lidar.tsv
      • protocols/
        • lidar_measurement_protocol.md
      • isa.assay.xlsx
      • README.md

The actual measurement table is stored in:

dataset/plant_height_lidar.tsv

and contains:

| PlantID | Day | Height |
| ------- | --- | ------ |
| P1      | 1   | 10     |
| P1      | 2   | 12     |
| P2      | 1   | 8      |
| P2      | 2   | 9      |
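
Because this is a plain tab-separated file, it can be inspected with any tabular tool. A minimal sketch using pandas, assuming the file lives at the path shown above:

```python
import pandas as pd

# Read the tab-separated measurement table.
df = pd.read_csv(
    "assays/LiDARHeightMeasurement/dataset/plant_height_lidar.tsv",
    sep="\t",
)
print(df)
#   PlantID  Day  Height
# 0      P1    1      10
# 1      P1    2      12
# 2      P2    1       8
# 3      P2    2       9
```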

This file contains the values, but not their full meaning.

The assay itself provides the experimental and technical context. A minimal isa.assay.xlsx for this example could look like this:

| Sample Name | Protocol REF | Parameter [Day] | Parameter [instrument model] | Raw Data File |
| ----------- | ------------------------ | --------------- | ---------------------------- | ------------------------------ |
| P1          | LiDAR height acquisition | 1               | LiDAR                        | dataset/plant_height_lidar.tsv |
| P1          | LiDAR height acquisition | 2               | LiDAR                        | dataset/plant_height_lidar.tsv |
| P2          | LiDAR height acquisition | 1               | LiDAR                        | dataset/plant_height_lidar.tsv |
| P2          | LiDAR height acquisition | 2               | LiDAR                        | dataset/plant_height_lidar.tsv |

This table does not duplicate the measurement values. Instead, it:

  • links samples to the measurement protocol
  • records relevant parameters (e.g. day, instrument)
  • connects the assay to the data file

In other words:

The assay tells us where the data comes from and how it was generated.

However, it still does not fully define what each column in the dataset means.

This is where the DataMap comes in.

The DataMap provides a way to annotate fragments of data files, such as columns, in a structured and machine-readable way.

Each row in the DataMap refers to a specific fragment of a file and describes its meaning.

For our LiDAR example, a DataMap could look like this:

| Data | Data Format | Data Selector Format | Explication | Term Source REF | Term Accession Number | Unit | Term Source REF | Term Accession Number | Object Type | Term Source REF | Term Accession Number | Description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| plant_height_lidar.tsv#col=1 | text/tab-separated-values | rfc7111 | plant identifier | | | | | | string | xsd | xsd:string | Identifier of the measured plant |
| plant_height_lidar.tsv#col=2 | text/tab-separated-values | rfc7111 | day of measurement | | | day | UO | UO_0000033 | integer | xsd | xsd:integer | Measurement time point |
| plant_height_lidar.tsv#col=3 | text/tab-separated-values | rfc7111 | plant height | OBA | OBA_VT0001251 | centimeter | UO | UO_0000015 | double | xsd | xsd:double | Plant height measured using LiDAR |

For readability, URIs are shortened to prefixes (e.g. UO, OBA, xsd) in this example.
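
The Data column uses RFC 7111-style fragment selectors (e.g. #col=2) to point at individual columns. To illustrate how such a selector can be resolved against a data file, here is a minimal sketch; the parsing is deliberately simplified and is not a full RFC 7111 implementation:

```python
import csv

def select_column(path: str, selector: str) -> list[str]:
    # Simplified resolver for fragments like "col=2" against a TSV file.
    # RFC 7111 columns are 1-based; row and range selectors are ignored here.
    if not selector.startswith("col="):
        raise ValueError("this sketch only handles 'col=' selectors")
    col = int(selector.removeprefix("col=")) - 1
    with open(path, newline="") as fh:
        return [row[col] for row in csv.reader(fh, delimiter="\t")]

# "plant_height_lidar.tsv#col=3" -> the Height column, header included
print(select_column("plant_height_lidar.tsv", "col=3"))
# ['Height', '10', '12', '8', '9']
```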

At this point, we have three complementary layers:

  • the dataset (plant_height_lidar.tsv) → contains the actual measurement values

  • the assay table (isa.assay.xlsx) → describes the measurement process and links data files

  • the DataMap → annotates what each column in the dataset represents

Each layer answers a different question:

  • dataset → What are the values?
  • assay → How were they generated?
  • DataMap → What do they mean?

This separation is essential, because:

Not all meaning can or should be encoded directly in the table structure.

In practice, datasets are rarely perfectly normalized or uniform. We often encounter:

  • partially normalized tables
  • outputs from instruments
  • intermediate analysis files
  • mixed representations across workflows

Instead of forcing all data into one strict format, the combination of:

  • structured tables
  • assay context
  • DataMap annotation

allows us to work with heterogeneous data in a consistent and interpretable way.

Once data is structured and annotated, it should be stored in a format that preserves these properties.

Plain text tabular formats such as TSV or CSV are recommended because they are:

  • easy to process programmatically
  • suitable for long-term storage
  • compatible with a wide range of tools
  • well suited for version control systems such as Git
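
For illustration, producing such a file requires nothing beyond the standard library; a minimal sketch:

```python
import csv

rows = [
    ["PlantID", "Day", "Height"],
    ["P1", 1, 10],
    ["P1", 2, 12],
    ["P2", 1, 8],
    ["P2", 2, 9],
]

# Tab-separated plain text: line-based, and therefore diff-friendly under Git.
with open("plant_height_lidar.tsv", "w", newline="") as fh:
    csv.writer(fh, delimiter="\t").writerows(rows)
```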

At this stage, we have:

  • improved the structure of the data
  • provided experimental context through the assay
  • annotated semantic meaning through the DataMap

The final step is to decide whether we also want to make the variable itself explicit within the table.

This leads to the fully normalized long format, where each row represents one complete measurement.
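
As a preview of that step, here is a minimal pandas sketch; the column names are the ones used in this example, while Variable and Value are illustrative names for the new columns:

```python
import pandas as pd

wide = pd.DataFrame({
    "PlantID": ["P1", "P1", "P2", "P2"],
    "Day":     [1, 2, 1, 2],
    "Height":  [10, 12, 8, 9],
})

# Make the measured variable explicit: one row per complete measurement.
long = wide.melt(id_vars=["PlantID", "Day"],
                 var_name="Variable", value_name="Value")
print(long)
#   PlantID  Day Variable  Value
# 0      P1    1   Height     10
# 1      P1    2   Height     12
# 2      P2    1   Height      8
# 3      P2    2   Height      9
```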