Tidy Data

By Daniel Y. Chen
Mar 5, 2023

Tidy data is a framework to structure data sets so they can be easily analyzed and visualized.

This chapter is from the book Pandas for Everyone: Python Data Analysis, 2nd Edition.

Learn More Buy

Hadley Wickham, PhD,¹ one of the more prominent members of the R community, introduced the concept of tidy data in a Journal of Statistical Software paper.² Tidy data is a framework to structure data sets so they can be easily analyzed and visualized. It can be thought of as a goal one should aim for when cleaning data. Once you understand what tidy data is, that knowledge will make your data analysis, visualization, and collection much easier.

What is tidy data? Hadley Wickham’s paper defines it as meeting the following criteria: (1) each row is an observation, (2) each column is a variable, and (3) each type of observational unit forms a table.

The newer definition from the R4DS book³ focuses on an individual data set (i.e., table):

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
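
As a minimal sketch of what these three rules mean in practice, the example below reshapes a small wide table into a tidy one with the Pandas melt() method. The table, its column names (country, year, cases), and its values are made up for illustration and do not come from Wickham’s paper.

```python
import pandas as pd

# Hypothetical wide table: the headers "2019" and "2020" are values
# of a "year" variable, not variables in their own right.
wide = pd.DataFrame(
    {
        "country": ["A", "B"],
        "2019": [10, 30],
        "2020": [20, 40],
    }
)

# melt() moves the year columns into rows, so that each variable has
# its own column and each observation (country, year) has its own row.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")

print(tidy)
```

The result has one row per country and year pair, with country, year, and cases each in a column of its own.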

This chapter goes through the various ways to tidy data using examples from Wickham’s paper.

Learning Objectives

The concept map for this chapter can be found in Figure A.4.

Identify the components of tidy data
Identify common data errors
Use functions and methods to process and tidy data

Note About This Chapter

The data sets used in this chapter will contain NaN missing values when they are loaded into Pandas (Chapter 9). In the raw CSV files, these appear as empty values. I typically try to avoid forward referencing in the book, but the concept of tidy data is so fundamental to how we should think about data technically (as opposed to ethically) that I moved the chapter toward the front of the book, before the more detailed data processing steps are covered. I could have changed the data sets so that there were no missing values, but opted not to because (1) they would no longer follow the data used in Wickham’s “Tidy Data” paper, and (2) they would be less realistic.
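
To illustrate the note above, here is a small sketch of how an empty value in a CSV file becomes NaN when loaded with Pandas. The CSV text and its column names (name, score) are hypothetical and stand in for the chapter's real data files.

```python
import io

import pandas as pd

# Hypothetical CSV content: the score for "bob" is left empty.
csv_text = "name,score\nalice,90\nbob,\n"

# read_csv() treats the empty field as a missing value (NaN).
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df["score"].isna())
```

Because NaN is a floating-point value, the score column is loaded as floats even though the only value present in the file is an integer.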
