Before looking at tables, we begin with a small set of measurements.
We measure the height of two plants (P1, P2) on two days (Day 1, Day 2).
1
Plant:P1 → Day:1 → Height:10
2
Plant:P1 → Day:2 → Height:12
3
Plant:P2 → Day:1 → Height:8
4
Plant:P2 → Day:2 → Height:9
Each line represents one measurement. All components are explicit, and there is no ambiguity in how the data should be interpreted. This representation serves as our ground truth. Every table we will encounter is simply a different way of encoding these same measurements, sometimes clearly, sometimes in a way that obscures their structure.
In practice, data is rarely recorded in such an explicit form. Instead, it is typically entered into spreadsheets designed for human readability. These spreadsheets aim to be easy to understand visually, but they often hide important structural information.
Before we even encounter encoded column names, many datasets already contain implicit structure that depends on layout, formatting, and context.
Consider the following example of a typical spreadsheet:
A
B
C
D
1
Experiment: Growth Study
2
3
Plant
Day 1
Day 2
Notes
4
P1
10
12
healthy
5
P2
8
9
slightly wilted
6
Mean
=AVERAGE(B4:B5)
=AVERAGE(C4:C5)
All of the original measurements are still present in this table, but their components are now distributed across different parts of the layout.
The entity is represented in the first column, the day is encoded in the column headers, and the measured values are stored in individual cells. The variable itself (Height) is not written anywhere and must be inferred from context. In addition, derived values such as the mean are mixed into the same table, and non-data elements such as titles and empty rows are included.
In real-world spreadsheets, this situation is often even more complex. Colors may indicate treatment groups, bold text may distinguish headers from data, empty rows may be used for visual separation, and units may appear in arbitrary places. It is also common to find multiple small tables on a single sheet or data that does not start in cell A1.
The result is a structure that is easy for humans to interpret, but difficult to process computationally.
The spreadsheet does not explicitly store measurements. It stores a visual arrangement from which measurements must be reconstructed.
The first step towards a more structured representation is to separate data from presentation. This means identifying the actual region of the table that contains observations and removing everything that is not part of the measurements.
In practice, this involves determining:
where the data begins and ends
which rows contain actual observations
which rows contain annotations, summaries, or formatting
After removing titles, empty rows, formatting, and computed summary rows, we obtain a cleaned table:
PlantID
Day1
Day2
Notes
P1
10
12
healthy
P2
8
9
slightly wilted
At this point, the table has a clear rectangular structure. The raw measurement values are now isolated, and the data can be interpreted without relying on formatting or layout.
A critical step in this process is the separation of raw measurements from derived results.
In the original spreadsheet, computed values such as means were included directly in the table. While this may be convenient for inspection, it introduces ambiguity and makes it difficult to reproduce results.
For example, the following table mixes observations and results:
PlantID
Day1
Day2
P1
10
12
Mean
9
10.5
Here, it is unclear how the mean was calculated, and the table no longer represents a consistent set of observations.
A more robust approach is to store raw data and derived values separately. The measurement data remains unchanged:
PlantID
Day1
Day2
P1
10
12
and computed results are stored in a separate table:
Statistic
Day
Value
Mean
1
9
Mean
2
10.5
This separation ensures that raw data remains stable and that results can be recomputed transparently.
Within the context of an Annotated Research Context (ARC), this distinction is reflected in the folder structure. Raw measurement tables are typically stored under assays/<assay>/datasets/, while derived results and analyses are stored under runs/<run>/. This mirrors the conceptual difference between observations and results and supports reproducible workflows.
At this stage, we have removed presentation and formatting, identified the data region, and isolated the raw measurements in a clean table. The structure is now explicit enough to be stored, versioned, and processed, but it still contains encoded information.
In particular, parts of the measurement are still hidden in column names or implied by the table structure.
This leads directly to the next step:
making encoded structure, such as time, explicit in the data itself
Self-check questions
Where are measurements hidden in the original spreadsheet?
Which parts of the table are not actual data?
Why should summary statistics not be stored together with raw data?
How would you identify the data region in a messy Excel sheet?