Tidy Data
Tidy data is a framework to structure data sets so they can be easily analyzed and visualized.
Hadley Wickham, PhD,1 one of the more prominent members of the R community, introduced the concept of tidy data in a Journal of Statistical Software paper.2 It can be thought of as a goal to aim for when cleaning data. Once you understand what tidy data is, your data analysis, visualization, and collection become much easier.
What is tidy data? Hadley Wickham’s paper defines it as meeting the following criteria: (1) each row is an observation, (2) each column is a variable, and (3) each type of observational unit forms a table.
The newer definition from the R4DS book3 focuses on an individual data set (i.e., table):
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
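To make these three rules concrete, here is a minimal sketch using a small made-up table (the country names, years, and counts are hypothetical and do not come from Wickham's paper). In the wide version, the column headers 2019 and 2020 are really values of a "year" variable, so the table breaks the rule that each variable has its own column. One possible way to reshape it is the Pandas melt method:

```python
import pandas as pd

# Hypothetical "wide" table: the headers "2019" and "2020" are values of a
# year variable, not variables in their own right.
wide = pd.DataFrame(
    {
        "country": ["A", "B"],
        "2019": [10, 30],
        "2020": [20, 40],
    }
)

# Reshape so that each variable (country, year, cases) has its own column
# and each country-year observation has its own row.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")

print(tidy)
```

After reshaping, every row is one country-year observation and every cell holds a single value. The column names "year" and "cases" are illustrative choices, not names taken from any data set in this chapter.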
This chapter goes through the various ways to tidy data using examples from Wickham’s paper.
Learning Objectives
The concept map for this chapter can be found in Figure A.4.
Identify the components of tidy data
Identify common data errors
Use functions and methods to process and tidy data
Note About This Chapter
Data used in this chapter will have NaN missing values when they are loaded into Pandas (Chapter 9). In the raw CSV files, they appear as empty values. I typically try to avoid forward referencing in the book, but tidy data is so fundamental to how we should think about data technically (as opposed to ethically) that I moved this chapter toward the front of the book, before the more detailed data processing steps are covered. I could have changed the data sets so that there were no missing values, but opted not to because (1) the data would no longer follow the data used in Wickham’s “Tidy Data” paper, and (2) it would be a less realistic data set.
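As a quick illustration of the point about missing values, the sketch below loads a tiny made-up CSV (the names and numbers are hypothetical, not one of the chapter's data sets) from an in-memory string. The empty field in the first data row is read by Pandas as NaN:

```python
import io

import pandas as pd

# Hypothetical CSV text; the first data row has an empty treatment_a field.
csv_text = (
    "name,treatment_a,treatment_b\n"
    "Person 1,,2\n"
    "Person 2,16,11\n"
)

# read_csv treats an empty field as a missing value by default.
df = pd.read_csv(io.StringIO(csv_text))

print(df)

# Count the missing values in each column.
print(df.isna().sum())
```

Because NaN is a floating-point value, the treatment_a column is stored as floats even though the non-missing entry is a whole number.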