6.4 Variables in Both Rows and Columns
At times data will be formatted so that variables are in both rows and columns—that is, in some combination of the formats described in previous sections of this chapter. Most of the methods needed to tidy up such data have already been presented. What is left to show is what happens if a column of data actually holds two variables instead of one variable. In this case, we will have to pivot or cast the variable into separate columns.
weather = pd.read_csv('../data/weather.csv') print(weather.iloc[:5, :11])
id year month element d1 d2 d3 d4 d5 d6 d7 0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN NaN 1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN NaN 2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN NaN 3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN NaN 4 MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN NaN
The weather data include minimum and maximum (tmin and tmax values in the element column, respectively) temperatures recorded for each day (d1, d2, ..., d31)ofthe month (month). The element column contains variables that need to be casted/pivoted to become new columns, and the day variables need to be melted into row values. Again, there is nothing wrong with the data in the current format. It is simply not in a shape amenable to analysis, although this kind of formatting can be helpful when presenting data in reports. Let’s first melt/unpivot the day values.
weather_melt = pd.melt(weather, id_vars=['id', 'year', 'month', 'element'], var_name='day', value_name='temp') print(weather_melt.head())
id year month element day temp 0 MX17004 2010 1 tmax d1 NaN 1 MX17004 2010 1 tmin d1 NaN 2 MX17004 2010 2 tmax d1 NaN 3 MX17004 2010 2 tmin d1 NaN 4 MX17004 2010 3 tmax d1 NaN
print(weather_melt.tail())
id year month element day temp 677 MX17004 2010 10 tmin d31 NaN 678 MX17004 2010 11 tmax d31 NaN 679 MX17004 2010 11 tmin d31 NaN 680 MX17004 2010 12 tmax d31 NaN 681 MX17004 2010 12 tmin d31 NaN
Next, we need to pivot up the variables stored in the element column. This process is referred to as casting or spreading in other statistical languages. One of the main differences between pivot_table and melt is that melt is a function within Pandas, whereas pivot_table is a method we call on a DataFrame object.
weather_tidy = weather_melt.pivot_table( index=['id', 'year', 'month', 'day'], columns='element', values='temp')
Looking at the pivoted table, we notice that each value in the element column is now a separate column. We can leave this table in its current state, but we can also flatten the hierarchical columns.
weather_tidy_flat = weather_tidy.reset_index() print(weather_tidy_flat.head())
element id year month day tmax tmin 0 MX17004 2010 1 d1 NaN NaN 1 MX17004 2010 1 d10 NaN NaN 2 MX17004 2010 1 d11 NaN NaN 3 MX17004 2010 1 d12 NaN NaN 4 MX17004 2010 1 d13 NaN NaN
Likewise, we can apply these methods without the intermediate dataframe:
weather_tidy = weather_melt. pivot_table( index=['id', 'year', 'month', 'day'], columns='element', values='temp'). reset_index()
print(weather_tidy.head())
element id year month day tmax tmin 0 MX17004 2010 1 d1 NaN NaN 1 MX17004 2010 1 d10 NaN NaN 2 MX17004 2010 1 d11 NaN NaN 3 MX17004 2010 1 d12 NaN NaN 4 MX17004 2010 1 d13 NaN NaN