After extracting a clean wide table from a messy spreadsheet, we have already made significant progress. The data is no longer mixed with formatting or annotations, and the table has a clear rectangular structure that can be interpreted without relying on visual cues.
However, even in this cleaned form, the table still does not fully represent the underlying measurements. Some parts of the measurement are still encoded in the structure of the table itself. The next step is therefore not about storage or formatting, but about making meaning explicit.
At first glance, this table appears simple and usable. But if we interpret it through the lens of the measurement model introduced earlier, an important limitation becomes visible: some parts of the measurement are still encoded in the column names.
Revisiting the measurement components:
Entity is explicit (PlantID)
Value is explicit (cell values)
Context (Day) is encoded in the column names
Variable (Height) is not represented explicitly at all
The table is therefore only partially explicit, even though it looks clean.
Encoding information in column names is convenient for small datasets, but it does not scale well. As soon as the dataset grows, the structure of the table becomes increasingly complex.
If additional time points are added, new columns must be introduced:
| PlantID | Day1 | Day2 | Day3 | Day4 |
| ------- | ---- | ---- | ---- | ---- |
| P1      | 10   | 12   | …    | …    |
| P2      | 8    | 9    | …    | …    |
If additional variables are measured, the encoding becomes more elaborate:
| PlantID | Height_Day1 | Height_Day2 | Weight_Day1 | Weight_Day2 |
| ------- | ----------- | ----------- | ----------- | ----------- |
| P1      | 10          | 12          | …           | …           |
| P2      | 8           | 9           | …           | …           |
At this point, the structure of the table begins to carry a large amount of implicit information. This leads to rapidly growing numbers of columns, inconsistent naming conventions, and increasing difficulty in querying or transforming the data.
The table structure itself starts to encode the experiment, rather than simply representing measurements.
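To make the practical cost of this encoding tangible, here is a minimal Python/pandas sketch (the Weight values are hypothetical, chosen only for illustration): even a simple question such as "all measurements from day 1" turns into string matching on column names.

```python
import pandas as pd

# Wide table with variable and time both encoded in the column names.
# Heights come from the running example; the Weight values are made up.
wide = pd.DataFrame({
    "PlantID": ["P1", "P2"],
    "Height_Day1": [10, 8],
    "Height_Day2": [12, 9],
    "Weight_Day1": [1.2, 0.9],
    "Weight_Day2": [1.5, 1.1],
})

# "All measurements from day 1" becomes string surgery on column names,
# which silently breaks as soon as a column is named Day01 or day1 instead.
day1_cols = [c for c in wide.columns if c.endswith("_Day1")]
print(wide[["PlantID", *day1_cols]])
```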
To address this, we begin by extracting one dimension of the encoded structure and turning it into an explicit column. In this example, we make the time dimension (Day) explicit.
| PlantID | Day | Height |
| ------- | --- | ------ |
| P1      | 1   | 10     |
| P1      | 2   | 12     |
| P2      | 1   | 8      |
| P2      | 2   | 9      |
This transformation is commonly referred to as unpivoting or melting (the reverse direction, from long to wide, is pivoting), and it represents the first concrete step toward normalization.
Even this partial transformation provides significant benefits. The table becomes easier to extend, as new time points can be added without introducing new columns. It also becomes easier to filter and process, for example when selecting all measurements from a specific day.
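As a concrete sketch of what this looks like in practice, the following Python/pandas snippet performs the transformation on the running example; `melt` is pandas' unpivot operation, and the final line demonstrates the day-based filtering just described.

```python
import pandas as pd

# The cleaned wide table: the Day context still lives in the column names.
wide = pd.DataFrame({
    "PlantID": ["P1", "P2"],
    "Day1": [10, 8],
    "Day2": [12, 9],
})

# Unpivot: each (PlantID, Day) combination becomes its own row.
long = wide.melt(id_vars="PlantID", var_name="Day", value_name="Height")
long["Day"] = long["Day"].str.replace("Day", "", regex=False).astype(int)

# Selecting all measurements from a specific day is now a simple row filter.
print(long[long["Day"] == 1])
```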
At the same time, the table remains relatively compact and intuitive, especially for users familiar with spreadsheet-based workflows. This makes partially normalized tables a useful intermediate representation between informal spreadsheets and fully structured data.
It is important to recognize that full normalization is not always required.
In many practical situations, partially normalized tables represent an ideal compromise between machine readability and human readability. They are especially useful when working with a small and well-defined set of variables, performing standard statistical analyses, or preparing intermediate datasets for further processing.
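For instance, continuing the pandas sketch above, a standard summary over the partially normalized table is a one-liner:

```python
# Mean height per day, computed on the `long` frame from the earlier sketch.
print(long.groupby("Day")["Height"].mean())
```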
At this stage, we have already achieved several important improvements:
formatting and visual encoding have been removed
the data region has been clearly identified
structural complexity has been reduced
contextual information (such as time) has been made explicit
This results in a table that is significantly easier to interpret and process, while still remaining intuitive for users familiar with spreadsheet-based workflows. In many real-world scenarios, this level of structure is sufficient.
Despite these improvements, one important component of the measurement remains implicit: the variable itself.
To fully represent measurements in a general and scalable way, the variable must also become an explicit part of the data. This becomes particularly important when we want to extend datasets by stacking different variables into a common structure.
Full normalization therefore requires that each measurement is represented independently, with the variable itself made explicit.
This leads to the fully normalized long format, where each row corresponds to one complete measurement. In this representation, the variable is no longer implied by the table structure but becomes an explicit column.
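As a minimal pandas sketch of this final step (again with hypothetical Weight values), both the variable name and the day are recovered from the encoded column names and turned into explicit columns, so that each row carries one complete measurement.

```python
import pandas as pd

# Wide table with two variables; the Weight values are hypothetical.
wide = pd.DataFrame({
    "PlantID": ["P1", "P2"],
    "Height_Day1": [10, 8],
    "Height_Day2": [12, 9],
    "Weight_Day1": [1.2, 0.9],
    "Weight_Day2": [1.5, 1.1],
})

# One row per measurement: unpivot everything, then split the encoded
# column names into an explicit Variable and an explicit Day column.
long = wide.melt(id_vars="PlantID", var_name="key", value_name="Value")
long[["Variable", "Day"]] = long["key"].str.split("_Day", expand=True)
long["Day"] = long["Day"].astype(int)
long = long[["PlantID", "Day", "Variable", "Value"]]
print(long)
```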
To consistently describe both the structure and the meaning of data across different representations, we need a unifying layer.
In DataPLANT, this role is fulfilled by the DataMap, which allows us to annotate datasets in a structured and interoperable way.
In the next part, we will explore how data is stored, versioned, and annotated using the DataMap, and how this enables us to work with heterogeneous and partially normalized data in a consistent manner.
Self-check questions
Which component of the measurement is still implicit in the partially normalized table?
Why do column names such as Height_Day1 become problematic as data grows?
What advantages does introducing a dedicated “Day” column provide?
In which situations is a partially normalized table an appropriate or even ideal representation?
Why is it not always necessary to fully normalize a dataset?
How could you describe the meaning and usage of a column if it is not explicitly encoded in the table structure?