Skip to content

From format-driven to clean wide table(s)

Before looking at tables, we begin with a small set of measurements.

We measure the height of two plants (P1, P2) on two days (Day 1, Day 2).

Plant:P1 → Day:1 → Height:10
Plant:P1 → Day:2 → Height:12
Plant:P2 → Day:1 → Height:8
Plant:P2 → Day:2 → Height:9

Each line represents one measurement. All components are explicit, and there is no ambiguity in how the data should be interpreted. This representation serves as our ground truth. Every table we will encounter is simply a different way of encoding these same measurements, sometimes clearly, sometimes in a way that obscures their structure.

In practice, data is rarely recorded in such an explicit form. Instead, it is typically entered into spreadsheets designed for human readability. These spreadsheets aim to be easy to understand visually, but they often hide important structural information.

Before we even encounter encoded column names, many datasets already contain implicit structure that depends on layout, formatting, and context.

Consider the following example of a typical spreadsheet:

ABCD
1Experiment: Growth Study
2
3PlantDay 1Day 2Notes
4P11012healthy
5P289slightly wilted
6Mean=AVERAGE(B4:B5)=AVERAGE(C4:C5)

All of the original measurements are still present in this table, but their components are now distributed across different parts of the layout.

The entity is represented in the first column, the day is encoded in the column headers, and the measured values are stored in individual cells. The variable itself (Height) is not written anywhere and must be inferred from context. In addition, derived values such as the mean are mixed into the same table, and non-data elements such as titles and empty rows are included.

In real-world spreadsheets, this situation is often even more complex. Colors may indicate treatment groups, bold text may distinguish headers from data, empty rows may be used for visual separation, and units may appear in arbitrary places. It is also common to find multiple small tables on a single sheet or data that does not start in cell A1.

The result is a structure that is easy for humans to interpret, but difficult to process computationally.

The spreadsheet does not explicitly store measurements. It stores a visual arrangement from which measurements must be reconstructed.

The first step towards a more structured representation is to separate data from presentation. This means identifying the actual region of the table that contains observations and removing everything that is not part of the measurements.

In practice, this involves determining:

  • where the data begins and ends
  • which rows contain actual observations
  • which rows contain annotations, summaries, or formatting

After removing titles, empty rows, formatting, and computed summary rows, we obtain a cleaned table:

PlantIDDay1Day2Notes
P11012healthy
P289slightly wilted

At this point, the table has a clear rectangular structure. The raw measurement values are now isolated, and the data can be interpreted without relying on formatting or layout.

Even though the table is now clean, it still does not fully match the measurement model.

We can partially reconstruct the measurements:

  • the entity is represented by PlantID
  • the values are explicitly stored in the cells

However, important parts of the measurement are still implicit:

  • the day is encoded in the column names (Day1, Day2)
  • the variable (Height) is not explicitly represented at all

This means that the table is easier to work with than the original spreadsheet, but still contains hidden structure.

A critical step in this process is the separation of raw measurements from derived results.

In the original spreadsheet, computed values such as means were included directly in the table. While this may be convenient for inspection, it introduces ambiguity and makes it difficult to reproduce results.

For example, the following table mixes observations and results:

PlantIDDay1Day2
P11012
Mean910.5

Here, it is unclear how the mean was calculated, and the table no longer represents a consistent set of observations.

A more robust approach is to store raw data and derived values separately. The measurement data remains unchanged:

PlantIDDay1Day2
P11012

and computed results are stored in a separate table:

StatisticDayValue
Mean19
Mean210.5

This separation ensures that raw data remains stable and that results can be recomputed transparently.

Within the context of an Annotated Research Context (ARC), this distinction is reflected in the folder structure. Raw measurement tables are typically stored under assays/<assay>/datasets/, while derived results and analyses are stored under runs/<run>/. This mirrors the conceptual difference between observations and results and supports reproducible workflows.

At this stage, we have removed presentation and formatting, identified the data region, and isolated the raw measurements in a clean table. The structure is now explicit enough to be stored, versioned, and processed, but it still contains encoded information.

In particular, parts of the measurement are still hidden in column names or implied by the table structure.

This leads directly to the next step:

making encoded structure, such as time, explicit in the data itself

Self-check questions

  • Where are measurements hidden in the original spreadsheet?
  • Which parts of the table are not actual data?
  • Why should summary statistics not be stored together with raw data?
  • How would you identify the data region in a messy Excel sheet?