6.2 Columns Contain Values, Not Variables
Data can have columns that contain values instead of variables. This is usually a convenient format for data collection and presentation.
6.2.1 Keep One Column Fixed
We’ll use data on income and religion in the United States from the Pew Research Center to illustrate how to work with columns that contain values, rather than variables.
import pandas as pd pew = pd.read_csv('../data/pew.csv')
When we look at this data set, we can see that not every column is a variable. The values that relate to income are spread across multiple columns. The format shown is a great choice when presenting data in a table, but for data analytics, the table needs to be reshaped so that we have religion, income, and count variables.
# show only the first few columns print(pew.iloc[:, 0:6])
religion <$10k $10-20k $20-30k $30-40k 0 Agnostic 27 34 60 81 1 Atheist 12 27 37 52 2 Buddhist 27 21 30 34 3 Catholic 418 617 732 670 4 Don't know/refused 15 14 15 11 5 Evangelical Prot 575 869 1064 982 6 Hindu 1 9 7 9 7 Historically Black Prot 228 244 236 238 8 Jehovah's Witness 20 27 24 24 9 Jewish 19 19 25 25 10 Mainline Prot 289 495 619 655 11 Mormon 29 40 48 51 12 Muslim 6 7 9 10 13 Orthodox 13 17 23 32 14 Other Christian 9 7 11 13 15 Other Faiths 20 33 40 46 16 Other World Religions 5 2 3 4 17 Unaffiliated 217 299 374 365 $40-50k 0 76 1 35 2 33 3 638 4 10 5 881 6 11 7 197 8 21 9 30 10 651 11 56 12 9 13 32 14 13 15 49 16 2 17 341
This view of the data is also known as “wide” data. To turn it into the “long” tidy data format, we will have to unpivot/melt/gather (depending on which statistical programming language we use) our dataframe. Pandas has a function called melt that will reshape the dataframe into a tidy format. melt takes a few parameters:
id_vars is a container (list, tuple, ndarray) that represents the variables that will remain as is.
value_vars identifies the columns you want to melt down (or unpivot). By default, it will melt all the columns not specified in the id_vars parameter.
var_name is a string for the new column name when the value_vars is melted down. By default, it will be called variable.
value_name is a string for the new column name that represents the values for the var_name. By default, it will be called value.
# we do not need to specify a value_vars since we want to pivot # all the columns except for the 'religion' column pew_long = pd.melt(pew, id_vars='religion') print(pew_long.head())
religion variable value 0 Agnostic <$10k 27 1 Atheist <$10k 12 2 Buddhist <$10k 27 3 Catholic <$10k 418 4 Don't know/refused <$10k 15
print(pew_long.tail())
religion variable value 175 Orthodox Don't know/refused 73 176 Other Christian Don't know/refused 18 177 Other Faiths Don't know/refused 71 178 Other World Religions Don't know/refused 8 179 Unaffiliated Don't know/refused 597
We can change the defaults so that the melted/unpivoted columns are named.
pew_long = pd.melt(pew, id_vars='religion', var_name='income', value_name='count') print(pew_long.head())
religion income count 0 Agnostic <$10k 27 1 Atheist <$10k 12 2 Buddhist <$10k 27 3 Catholic <$10k 418 4 Don't know/refused <$10k 15
print(pew_long.tail())
religion income count 175 Orthodox Don't know/refused 73 176 Other Christian Don't know/refused 18 177 Other Faiths Don't know/refused 71 178 Other World Religions Don't know/refused 8 179 Unaffiliated Don't know/refused 597
6.2.2 Keep Multiple Columns Fixed
Not every data set will have one column to hold still while you unpivot the rest of the columns. As an example, consider the Billboard data set.
billboard = pd.read_csv('../data/billboard.csv') # look at the first few rows and columns print(billboard.iloc[0:5, 0:16])
year artist track time date.entered 0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 3 2000 3 Doors Down Loser 4:24 2000-10-21 4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15
wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9 wk10 wk11 0 87 82.0 72.0 77.0 87.0 94.0 99.0 NaN NaN NaN NaN 1 91 87.0 92.0 NaN NaN NaN NaN NaN NaN NaN NaN 2 81 70.0 68.0 67.0 66.0 57.0 54.0 53.0 51.0 51.0 51.0 3 76 76.0 72.0 69.0 67.0 65.0 55.0 59.0 62.0 61.0 61.0 4 57 34.0 25.0 17.0 17.0 31.0 36.0 49.0 53.0 57.0 64.0
You can see here that each week has its own column. Again, there is nothing wrong with this form of data. It may be easy to enter the data in this form, and it is much quicker to understand what it means when the data is presented in a table. However, there may be a time when you will need to melt the data. For example, if you wanted to create a faceted plot of the weekly ratings, the facet variable would need to be a column in the dataframe.
billboard_long = pd.melt( billboard, id_vars=['year', 'artist', 'track', 'time', 'date.entered'], var_name='week', value_name='rating')
print(billboard_long.head())
year artist track time date.entered 0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 3 2000 3 Doors Down Loser 4:24 2000-10-21 4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 week rating 0 wk1 87.0 1 wk1 91.0 2 wk1 81.0 3 wk1 76.0 4 wk1 57.0
print(billboard_long.tail())
year artist track time 24087 2000 Yankee Grey Another Nine Minutes 3:10 24088 2000 Yearwood, Trisha Real Live Woman 3:55 24089 2000 Ying Yang Twins Whistle While You Tw... 4:19 24090 2000 Zombie Nation Kernkraft 400 3:30 24091 2000 matchbox twenty Bent 4:12 date.entered week rating 24087 2000-04-29 wk76 NaN 24088 2000-04-01 wk76 NaN 24089 2000-03-18 wk76 NaN 24090 2000-09-02 wk76 NaN 24091 2000-04-29 wk76 NaN