- Learning Objectives
- 4.1 Columns Contain Values, Not Variables
- 4.2 Columns Contain Multiple Variables
- 4.3 Variables in Both Rows and Columns
- Conclusion
4.3 Variables in Both Rows and Columns
At times, data will be formatted so that variables are in both rows and columns – that is, in some combination of the formats described in previous sections of this chapter. Most of the methods needed to tidy up such data have already been presented (.melt() and some string parsing with the .str. accessor attribute). What is left to show is what happens if a column of data actually holds two variables instead of one variable. In this case, we will have to “pivot” the variable into separate columns, i.e., go from long data to wide data.
weather = pd.read_csv('data/weather.csv') print(weather.iloc[:5, :11])
id year month element d1 d2 d3 d4 d5 d6 d7 0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN NaN 1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN NaN 2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN NaN 3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN NaN 4 MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN NaN
The weather data include minimum (tmin) and maximum (tmax) temperatures recorded for each day (d1, d2, …, d31) of the month (month). The element column contains variables that need to be pivoted wider to become new columns, and the day variables need to be melted into row values.
Again, there is nothing wrong with the data in the current format. It is simply not in a shape amenable to analysis, although this kind of formatting can be helpful when presenting data in reports. Let’s first fix the day values.
weather_melt = weather.melt( id_vars=["id", "year", "month", "element"], var_name="day", value_name="temp", )
print(weather_melt)
id year month element day temp 0 MX17004 2010 1 tmax d1 NaN 1 MX17004 2010 1 tmin d1 NaN 2 MX17004 2010 2 tmax d1 NaN 3 MX17004 2010 2 tmin d1 NaN 4 MX17004 2010 3 tmax d1 NaN .. ... ... ... ... ... ... 677 MX17004 2010 10 tmin d31 NaN 678 MX17004 2010 11 tmax d31 NaN 679 MX17004 2010 11 tmin d31 NaN 680 MX17004 2010 12 tmax d31 NaN 681 MX17004 2010 12 tmin d31 NaN [682 rows x 6 columns]
Next, we need to pivot up the variables stored in the element column.
weather_tidy = weather_melt.pivot_table( index=['id', 'year', 'month', 'day'], columns='element', values='temp' )
print(weather_tidy)
element tmax tmin id year month day MX17004 2010 1 d30 27.8 14.5 2 d11 29.7 13.4 d2 27.3 14.4 d23 29.9 10.7 d3 24.1 14.4 ... ... ... 11 d27 27.7 14.2 d26 28.1 12.1 d4 27.2 12.0 12 d1 29.9 13.8 d6 27.8 10.5 [33 rows x 2 columns]
Looking at the pivoted table, we notice that each value in the element column is now a separate column. We can leave this table in its current state, but we can also flatten the hierarchical columns.
weather_tidy_flat = weather_tidy.reset_index() print(weather_tidy_flat)
element id year month day tmax tmin 0 MX17004 2010 1 d30 27.8 14.5 1 MX17004 2010 2 d11 29.7 13.4 2 MX17004 2010 2 d2 27.3 14.4 3 MX17004 2010 2 d23 29.9 10.7 4 MX17004 2010 2 d3 24.1 14.4 .. ... ... ... ... ... ... 28 MX17004 2010 11 d27 27.7 14.2 29 MX17004 2010 11 d26 28.1 12.1 30 MX17004 2010 11 d4 27.2 12.0 31 MX17004 2010 12 d1 29.9 13.8 32 MX17004 2010 12 d6 27.8 10.5 [33 rows x 6 columns]
Likewise, we can apply these methods without the intermediate dataframe:
weather_tidy = ( weather_melt .pivot_table( index=['id', 'year', 'month', 'day'], columns='element', values='temp') .reset_index() ) print(weather_tidy)
element id year month day tmax tmin 0 MX17004 2010 1 d30 27.8 14.5 1 MX17004 2010 2 d11 29.7 13.4 2 MX17004 2010 2 d2 27.3 14.4 3 MX17004 2010 2 d23 29.9 10.7 4 MX17004 2010 2 d3 24.1 14.4 .. ... ... ... ... ... ... 28 MX17004 2010 11 d27 27.7 14.2 29 MX17004 2010 11 d26 28.1 12.1 30 MX17004 2010 11 d4 27.2 12.0 31 MX17004 2010 12 d1 29.9 13.8 32 MX17004 2010 12 d6 27.8 10.5 [33 rows x 6 columns]