Introduction To Tidy Data Principles

What Is Tidy Data?

As developers, we follow certain principles, heuristics and rules to keep our code consistent, although often their main job is simply to stop opinionated arguments.

The way we structure our data objects is no different. There are clear rules for structuring data so that it is tidy; however, the tidy data design principles are less well known, and potentially more critical if you work with analytics or large data sets.

"Tidy datasets are all alike, but every messy dataset is messy in its own way." – Hadley Wickham

You can read Hadley Wickham's full research paper on tidy data, or a summary of its most crucial principles below:

Each Variable Forms a Column

A variable is a pretty well-understood term in computer science. A data set with a location column might look like this:

| Temperature | Location |
|---|---|
| 21 | New York, USA |
| 19 | London, UK |

However, in this example we have concatenated two separate variables, city and country, into Location. We should treat them as separate columns, like this:

| Temperature | City | Country |
|---|---|---|
| 21 | New York | USA |
| 19 | London | UK |
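The split above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the `rows` list, the lowercase key names, and the `split_location` helper are hypothetical names chosen for this example, and the code assumes the country always follows the last ", " in the location string.

```python
# Hypothetical rows mirroring the untidy table: "location" holds two variables.
rows = [
    {"temperature": 21, "location": "New York, USA"},
    {"temperature": 19, "location": "London, UK"},
]


def split_location(row):
    """Return a new row with location split into city and country."""
    # Assumption: the country is whatever follows the last ", " in the string.
    city, country = row["location"].rsplit(", ", 1)
    return {"temperature": row["temperature"], "city": city, "country": country}


tidy = [split_location(row) for row in rows]

for row in tidy:
    print(row)
```

Once city and country are separate keys, we can filter or group by either one without parsing strings each time.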

Each Observation Forms a Row

An observation is a term used in data analytics meaning a single reading. Suppose we took a temperature reading in the morning and another in the evening. These are two separate observations, but we could store them in a single row like this:

| Date | Morning Temp | Evening Temp |
|---|---|---|
| 01/02/2024 | 21 | 25 |
| 02/02/2024 | 22 | 24 |

Tidy Data principles require us to split each observation (temperature reading) into a separate row.

| Date | Time of Day | Temperature |
|---|---|---|
| 01/02/2024 | Morning | 21 |
| 01/02/2024 | Evening | 25 |
| 02/02/2024 | Morning | 22 |
| 02/02/2024 | Evening | 24 |
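As a rough sketch, the reshaping above could look like this in plain Python. The `wide` list and its key names (`morning_temp`, `evening_temp`) are assumptions made for the example, not part of any particular library.

```python
# Hypothetical wide rows: one row per date, two readings per row.
wide = [
    {"date": "01/02/2024", "morning_temp": 21, "evening_temp": 25},
    {"date": "02/02/2024", "morning_temp": 22, "evening_temp": 24},
]

# Map each reading column to the "time of day" value it represents.
reading_columns = (("morning_temp", "Morning"), ("evening_temp", "Evening"))

tidy = []
for row in wide:
    for column, time_of_day in reading_columns:
        # Each temperature reading becomes its own observation (row).
        tidy.append(
            {
                "date": row["date"],
                "time_of_day": time_of_day,
                "temperature": row[column],
            }
        )

for row in tidy:
    print(row)
```

Two wide rows become four tidy rows, one per reading.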

Column Headers Are Variable Names, Not Values

The column headings should always be variable names and never values. In the example below, we have put three observations in a single row and used values of the year variable as the column headings.

| Product | 2021 | 2022 | 2023 |
|---|---|---|---|
| A | 100 | 150 | 200 |
| B | 80 | 120 | 160 |

These should be split into individual observations, with the variable year added as a new column.

| Product | Year | Sales |
|---|---|---|
| A | 2021 | 100 |
| A | 2022 | 150 |
| A | 2023 | 200 |
| B | 2021 | 80 |
| B | 2022 | 120 |
| B | 2023 | 160 |
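This operation is often called melting. Below is a minimal, hand-rolled sketch of it in plain Python; the `melt` function and its parameter names are hypothetical and written for this example only, and it assumes every column other than the identifier column holds a value of the same variable.

```python
def melt(rows, id_column, variable_name, value_name):
    """Turn every column except id_column into its own (variable, value) row."""
    long_rows = []
    for row in rows:
        for header, value in row.items():
            if header == id_column:
                continue  # the identifier is repeated on each new row instead
            long_rows.append(
                {id_column: row[id_column], variable_name: header, value_name: value}
            )
    return long_rows


# Hypothetical wide data: the year values are being used as column headers.
sales_wide = [
    {"product": "A", 2021: 100, 2022: 150, 2023: 200},
    {"product": "B", 2021: 80, 2022: 120, 2023: 160},
]

sales_tidy = melt(sales_wide, "product", "year", "sales")

for row in sales_tidy:
    print(row)
```

Data libraries usually ship this operation ready-made. In pandas, for example, it is available as `DataFrame.melt`.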

Each Type of Observational Unit Forms a Table

A dataframe or table is a collection of observations. In the following example, observations for books and authors are in a single table.

| Title | Publication Year | Author | Birth Year | Nationality |
|---|---|---|---|---|
| Moby Dick | 1851 | Herman Melville | 1819 | American |
| Pride and Prejudice | 1813 | Jane Austen | 1775 | British |

Books and authors are two different observation types, so we should store them in separate tables. We may want to merge the data to perform some analysis and visualisation, but this will be the final step in working with this data.

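| Title | Publication Year | Author ID |
|---|---|---|
| Moby Dick | 1851 | 1 |
| Pride and Prejudice | 1813 | 2 |

| Author ID | Author | Birth Year | Nationality |
|---|---|---|---|
| 1 | Herman Melville | 1819 | American |
| 2 | Jane Austen | 1775 | British |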
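Here is one way the split, and the later merge, might be sketched in plain Python. The variable names are hypothetical, author IDs are simply assigned in order of first appearance, and the sketch assumes the author's name is kept in the authors table so that nothing is lost when the tables are separated.

```python
# Hypothetical combined table mixing two observational units.
combined = [
    {"title": "Moby Dick", "publication_year": 1851,
     "author": "Herman Melville", "birth_year": 1819, "nationality": "American"},
    {"title": "Pride and Prejudice", "publication_year": 1813,
     "author": "Jane Austen", "birth_year": 1775, "nationality": "British"},
]

authors = {}     # author_id -> author record (the authors table)
author_ids = {}  # author name -> author_id, used to avoid duplicate authors
books = []       # the books table

for row in combined:
    name = row["author"]
    if name not in author_ids:
        # Assumption: IDs are assigned sequentially as new authors appear.
        author_ids[name] = len(author_ids) + 1
        authors[author_ids[name]] = {
            "author": name,
            "birth_year": row["birth_year"],
            "nationality": row["nationality"],
        }
    books.append(
        {
            "title": row["title"],
            "publication_year": row["publication_year"],
            "author_id": author_ids[name],
        }
    )

# Final step only: join the two tables back together for analysis.
merged = [{**book, **authors[book["author_id"]]} for book in books]

for row in merged:
    print(row)
```

Because each author is stored once, correcting an author's details means changing a single record rather than every book row.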

Advantages of Our New Tidy Data

Tidy Data gives us advantages when working with the data:

  • Clarity: Each row now represents a complete picture of a single observation, and each column is a single variable with its name at the top. Each table represents a single type of observation.
  • Compatibility: Most programming languages, tools, and even co-workers will expect data formatted like this, which makes it easier to import, export, visualise, and maintain.
  • Aggregation: Computing aggregate statistics such as the mean, median, or sum becomes easier when the values are all in a single column and each observation is in its own row.
  • Data Manipulation: Pivoting, melting, filtering, and sorting by variables are more straightforward when each variable is in its own column.
  • Data Merging: It's more straightforward to perform joins.
  • Scalability: Adding, updating, or deleting observations and variables is far easier in this format.
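To illustrate the aggregation point, here is a small sketch that computes the mean temperature for each time of day from the tidy temperature readings used earlier. It uses only the Python standard library, and the `readings` list and its key names are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tidy readings: one observation per row, one variable per key.
readings = [
    {"date": "01/02/2024", "time_of_day": "Morning", "temperature": 21},
    {"date": "01/02/2024", "time_of_day": "Evening", "temperature": 25},
    {"date": "02/02/2024", "time_of_day": "Morning", "temperature": 22},
    {"date": "02/02/2024", "time_of_day": "Evening", "temperature": 24},
]

# Group the temperature values by the time_of_day variable.
grouped = defaultdict(list)
for reading in readings:
    grouped[reading["time_of_day"]].append(reading["temperature"])

# All values of one variable live in one place, so aggregating is a one-liner.
averages = {time_of_day: mean(values) for time_of_day, values in grouped.items()}

print(averages)
```

Grouping by date instead would only require changing the key used in the loop, which is the kind of flexibility the tidy layout is meant to give us.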

Formatting our Data

Whenever we import raw data, our first step should be to tidy it so that it follows the rules above. This matters most when we then store the data for future use, so database structures should follow the tidy data principles too. How you do this depends on the language you use to handle your data.

For more detail, read Hadley Wickham's full research paper, Tidy Data.