- 6.1 Introduction
- Concept Map
- Objectives
- 6.2 Columns Contain Values, Not Variables
- 6.3 Columns Contain Multiple Variables
- 6.4 Variables in Both Rows and Columns
- 6.5 Multiple Observational Units in a Table (Normalization)
- 6.6 Observational Units Across Multiple Tables
- 6.7 Conclusion
6.5 Multiple Observational Units in a Table (Normalization)
One of the simplest ways of knowing whether multiple observational units are represented in a table is by looking at each of the rows, and taking note of any cells or values that are being repeated from row to row. This is very common in government education administration data, where student demographics are reported for each student for each year the student is enrolled.
Let’s look again at the Billboard data we cleaned in Section 6.2.2.
print(billboard_long.head())
year artist track time date.entered 0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 3 2000 3 Doors Down Loser 4:24 2000-10-21 4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 week rating 0 wk1 87.0 1 wk1 91.0 2 wk1 81.0 3 wk1 76.0 4 wk1 57.0
Suppose we subset (Section 2.4.1) the data based on a particular track:
print(billboard_long[billboard_long.track == 'Loser'].head())
year artist track time date.entered week rating 3 2000 3 Doors Down Loser 4:24 2000-10-21 wk1 76.0 320 2000 3 Doors Down Loser 4:24 2000-10-21 wk2 76.0 637 2000 3 Doors Down Loser 4:24 2000-10-21 wk3 72.0 954 2000 3 Doors Down Loser 4:24 2000-10-21 wk4 69.0 1271 2000 3 Doors Down Loser 4:24 2000-10-21 wk5 67.0
We can see that this table actually holds two types of data: the track information and the weekly ranking. It would be better to store the track information in a separate table. This way, the information stored in the year, artist, track, and time columns would not be repeated in the data set. This consideration is particularly important if the data is manually entered. Repeating the same values over and over during data entry increases the risk of inconsistent data.
What we should do in this case is to place the year, artist, track, time, and date.entered in a new dataframe, with each unique set of values being assigned a unique ID. We can then use this unique ID in a second dataframe that represents a song, date, week number, and ranking. This entire process can be thought of as reversing the steps in concatenating and merging data described in Chapter 4.
billboard_songs = billboard_long[['year', 'artist', 'track', 'time']] print(billboard_songs.shape)
(24092, 4)
We know there are duplicate entries in this dataframe, so we need to drop the duplicate rows.
billboard_songs = billboard_songs.drop_duplicates() print(billboard_songs.shape)
(317, 4)
We can then assign a unique value to each row of data.
billboard_songs['id'] = range(len(billboard_songs)) print(billboard_songs.head(n=10))
year artist track time id 0 2000 2 Pac Baby Don't Cry (Keep... 4:22 0 1 2000 2Ge+her The Hardest Part Of ... 3:15 1 2 2000 3 Doors Down Kryptonite 3:53 2 3 2000 3 Doors Down Loser 4:24 3 4 2000 504 Boyz Wobble Wobble 3:35 4 5 2000 98^0 Give Me Just One Nig... 3:24 5 6 2000 A*Teens Dancing Queen 3:44 6 7 2000 Aaliyah I Don't Wanna 4:15 7 8 2000 Aaliyah Try Again 4:03 8 9 2000 Adams, Yolanda Open My Heart 5:30 9
Now that we have a separate dataframe about songs, we can use the newly created id column to match a song to its weekly ranking.
# Merge the song dataframe to the original data set billboard_ratings = billboard_long.merge( billboard_songs, on=['year', 'artist', 'track', 'time']) print(billboard_ratings.shape)
(24092, 8)
print(billboard_ratings.head())
year artist track time date.entered week 0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk1 1 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk2 2 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk3 3 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk4 4 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 wk5 rating id 0 87.0 0 1 82.0 0 2 72.0 0 3 77.0 0 4 87.0 0
Finally, we subset the columns to the ones we want in our ratings dataframe.
billboard_ratings = billboard_ratings[['id', 'date.entered', 'week', 'rating']] print(billboard_ratings.head())
id date.entered week rating 0 0 2000-02-26 wk1 87.0 1 0 2000-02-26 wk2 82.0 2 0 2000-02-26 wk3 72.0 3 0 2000-02-26 wk4 77.0 4 0 2000-02-26 wk5 87.0