
Partially normalized tables: making structure explicit

After extracting a clean wide table from a messy spreadsheet, we have already made significant progress. The data is no longer mixed with formatting or annotations, and the table has a clear rectangular structure that can be interpreted without relying on visual cues.

However, even in this cleaned form, the table still does not fully represent the underlying measurements. Some parts of the measurement are still encoded in the structure of the table itself. The next step is therefore not about storage or formatting, but about making meaning explicit.

Consider the cleaned table:

| PlantID | Day1 | Day2 |
|---------|------|------|
| P1      | 10   | 12   |
| P2      | 8    | 9    |

At first glance, this table appears simple and usable. But if we interpret it through the lens of the measurement model introduced earlier, an important limitation becomes visible.

Some parts of the measurement are still encoded in the column names.

Revisiting the measurement components:

  • Entity is explicit (PlantID)
  • Value is explicit (cell values)
  • Context (Day) is encoded in the column names
  • Variable (Height) is not represented explicitly at all

The table is therefore only partially explicit, even though it looks clean.
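
For readers who want to follow along in code, here is a minimal sketch of the cleaned wide table, using pandas as one possible tool (the values are taken from the table above). Note that the variable name, Height, appears nowhere in the data itself:

```python
import pandas as pd

# Cleaned wide table: the Day context lives only in the column names,
# and the measured variable (Height) is not represented in the data at all.
wide = pd.DataFrame({
    "PlantID": ["P1", "P2"],
    "Day1": [10, 8],
    "Day2": [12, 9],
})
print(wide)
```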

Encoding information in column names is convenient for small datasets, but it does not scale well. As soon as the dataset grows, the structure of the table becomes increasingly complex.

If additional time points are added, new columns must be introduced:

| PlantID | Day1 | Day2 | Day3 | Day4 |
|---------|------|------|------|------|

If additional variables are measured, the encoding becomes more elaborate:

| PlantID | Height_Day1 | Height_Day2 | Weight_Day1 | Weight_Day2 |
|---------|-------------|-------------|-------------|-------------|

At this point, the structure of the table begins to carry a large amount of implicit information. This leads to rapidly growing numbers of columns, inconsistent naming conventions, and increasing difficulty in querying or transforming the data.

The table structure itself starts to encode the experiment, rather than simply representing measurements.
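
A small, hypothetical illustration of the problem: with compound column names like those above, both the variable and the day have to be parsed back out of the column name, relying on a naming convention rather than on the data itself.

```python
# Compound column names encode two pieces of information per column.
columns = ["Height_Day1", "Height_Day2", "Weight_Day1", "Weight_Day2"]

for name in columns:
    variable, day = name.split("_Day")  # brittle: breaks if the convention changes
    print(variable, int(day))
# Height 1, Height 2, Weight 1, Weight 2
```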

To address this, we begin by extracting one dimension of the encoded structure and turning it into an explicit column. In this example, we make the time dimension (Day) explicit.

| PlantID | Day | Height |
|---------|-----|--------|
| P1      | 1   | 10     |
| P1      | 2   | 12     |
| P2      | 1   | 8      |
| P2      | 2   | 9      |

This transformation from a wide layout to a longer one is commonly referred to as unpivoting or melting (pivoting is the reverse operation), and it represents the first concrete step toward normalization.
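
As a concrete sketch of this step (pandas again; `wide` refers to the frame from the earlier snippet), the transformation is a single unpivot followed by turning the Day labels into plain numbers:

```python
# Unpivot the Day columns into an explicit Day column.
long_height = wide.melt(id_vars="PlantID", var_name="Day", value_name="Height")

# "Day1"/"Day2" become the numbers 1 and 2, so the context is a value, not a label.
long_height["Day"] = long_height["Day"].str.removeprefix("Day").astype(int)

print(long_height.sort_values(["PlantID", "Day"]))
# PlantID  Day  Height
# P1       1    10
# P1       2    12
# P2       1    8
# P2       2    9
```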

With the introduction of the Day column, the table moves closer to the measurement model.

  • Entity is explicit
  • Context is now explicit
  • Value remains explicit
  • Variable is still implicit (represented only by the column name Height)

Each row now corresponds to a measurement at a specific time point, but the representation is still limited to a single variable.

In other words:

We have made the context explicit, but the measurement is not yet fully generalized.

Even this partial transformation provides significant benefits. The table becomes easier to extend, as new time points can be added without introducing new columns. It also becomes easier to filter and process, for example when selecting all measurements from a specific day.
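
For example, selecting all measurements from day 2 becomes a one-line filter on the long table from the previous sketch, and a new time point (with hypothetical values) is appended as rows rather than as a new column:

```python
# All measurements taken on day 2, regardless of plant.
day2 = long_height[long_height["Day"] == 2]

# A third time point (hypothetical values) is added as rows, not as a column.
day3 = pd.DataFrame({"PlantID": ["P1", "P2"], "Day": [3, 3], "Height": [15, 11]})
long_height = pd.concat([long_height, day3], ignore_index=True)
```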

At the same time, the table remains relatively compact and intuitive, especially for users familiar with spreadsheet-based workflows. This makes partially normalized tables a useful intermediate representation between informal spreadsheets and fully structured data.

It is important to recognize that full normalization is not always required.

In many practical situations, partially normalized tables represent an ideal compromise between machine readability and human readability. They are especially useful when working with a small and well-defined set of variables, performing standard statistical analyses, or preparing intermediate datasets for further processing.

At this stage, we have already achieved several important improvements:

  • formatting and visual encoding have been removed
  • the data region has been clearly identified
  • structural complexity has been reduced
  • contextual information (such as time) has been made explicit

This results in a table that is significantly easier to interpret and process, while still remaining intuitive for users familiar with spreadsheet-based workflows. In many real-world scenarios, this level of structure is sufficient.

Despite these improvements, one important component of the measurement remains implicit: the variable itself.

To fully represent measurements in a general and scalable way, the variable must also become an explicit part of the data. This becomes particularly important when we want to extend datasets by stacking different variables into a common structure.

Full normalization therefore requires that:

each measurement is represented independently by making the variable itself explicit

This leads to the fully normalized long format, where each row corresponds to one complete measurement. In this representation, the variable is no longer implied by the table structure but becomes an explicit column.
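
As a sketch of what full normalization can look like in practice (pandas once more; the Weight values are hypothetical, added only to have a second variable), both the day and the variable are recovered from the compound column names in one unpivot step:

```python
import pandas as pd

# Hypothetical wide table with two variables encoded in the column names.
wide2 = pd.DataFrame({
    "PlantID": ["P1", "P2"],
    "Height_Day1": [10, 8],
    "Height_Day2": [12, 9],
    "Weight_Day1": [1.2, 0.9],  # hypothetical values
    "Weight_Day2": [1.5, 1.1],  # hypothetical values
})

# Unpivot everything except the entity column ...
measurements = wide2.melt(id_vars="PlantID", var_name="VariableDay", value_name="Value")

# ... then split the compound names back into an explicit Variable and Day.
measurements[["Variable", "Day"]] = measurements["VariableDay"].str.split("_Day", expand=True)
measurements["Day"] = measurements["Day"].astype(int)
measurements = measurements[["PlantID", "Day", "Variable", "Value"]]

# Each row is now one complete measurement: entity, context, variable, value.
print(measurements)
```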

In practice, data rarely exists in a single, perfectly normalized form. Instead, real-world datasets are often a mixture of:

  • partially normalized tables
  • fully normalized tables
  • wide tables produced by different tools
  • intermediate representations created during analysis

As a result, it is not always possible or even desirable to enforce a single structure across all data.

This raises an important question:

How can we work with mixed or partially structured data in a consistent and interpretable way?

When structure alone is not sufficient, we need additional mechanisms to describe what the data represents and how it is used.

Within an Annotated Research Context (ARC), this is achieved by separating two complementary aspects of a variable: how it is used, and what it means.

The first aspect concerns usage. A variable can be understood through the process that produced it and through how it is used in downstream analysis.

For example, the same column may play different roles in different workflows:

  • in one workflow, values from column B are used as a feature
  • in another workflow, the same values are used as a response variable
  • in yet another workflow, they are used for normalization

In this sense, the meaning of a variable is partly defined by its usage context.

The second aspect is semantic meaning: independently of how a variable is used, we can also describe what it represents.

For example:

  • values in column C may represent amounts
  • in another dataset, values in column C may represent gene identifiers

This aspect describes what the data is, independent of how it is used.

These two aspects are deliberately kept separate, because:

The same semantic designation can be used in different operational contexts without conflict.

For example, a measurement of gene expression may be used as:

  • a feature in one analysis
  • a response variable in another
  • an input for normalization in a third

The underlying meaning of the data does not change, but its role in analysis does.
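
Purely as an illustration of this separation (this is not DataMap syntax, just a generic sketch), the two aspects could be recorded as independent annotations attached to the same column:

```python
# Hypothetical annotation for a single column: the semantic meaning is stated once,
# while the operational role may differ from workflow to workflow.
column_annotation = {
    "column": "Value",
    # What the data is, independent of any analysis.
    "meaning": "gene expression level",
    # How the same column is used in different workflows.
    "usage": {
        "differential_expression": "response variable",
        "trait_prediction": "feature",
        "normalization": "reference input",
    },
}
```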

To consistently describe both the structure and the meaning of data across different representations, we need a unifying layer.

In DataPLANT, this role is fulfilled by the DataMap, which allows us to annotate datasets in a structured and interoperable way.

In the next part, we will explore how data is stored, versioned, and annotated using the DataMap, and how this enables us to work with heterogeneous and partially normalized data in a consistent manner.

Self-check questions

  • Which component of the measurement is still implicit in the partially normalized table?
  • Why do column names such as Height_Day1 become problematic as data grows?
  • What advantages does introducing a dedicated “Day” column provide?
  • In which situations is a partially normalized table an appropriate or even ideal representation?
  • Why is it not always necessary to fully normalize a dataset?
  • How could you describe the meaning and usage of a column if it is not explicitly encoded in the table structure?