Home > Articles

Tidy Data

This chapter is from the book

4.3 Variables in Both Rows and Columns

At times, data will be formatted so that variables are in both rows and columns – that is, in some combination of the formats described in previous sections of this chapter. Most of the methods needed to tidy up such data have already been presented (.melt() and some string parsing with the .str. accessor attribute). What is left to show is what happens if a column of data actually holds two variables instead of one variable. In this case, we will have to “pivot” the variable into separate columns, i.e., go from long data to wide data.

weather = pd.read_csv('data/weather.csv')
print(weather.iloc[:5, :11])
        id  year  month element   d1    d2    d3    d4    d5   d6   d7
0  MX17004  2010      1    tmax  NaN   NaN   NaN   NaN   NaN  NaN  NaN
1  MX17004  2010      1    tmin  NaN   NaN   NaN   NaN   NaN  NaN  NaN
2  MX17004  2010      2    tmax  NaN  27.3  24.1   NaN   NaN  NaN  NaN
3  MX17004  2010      2    tmin  NaN  14.4  14.4   NaN   NaN  NaN  NaN
4  MX17004  2010      3    tmax  NaN   NaN   NaN   NaN  32.1  NaN  NaN

The weather data include minimum (tmin) and maximum (tmax) temperatures recorded for each day (d1, d2, …, d31) of the month (month). The element column contains variables that need to be pivoted wider to become new columns, and the day variables need to be melted into row values.

Again, there is nothing wrong with the data in the current format. It is simply not in a shape amenable to analysis, although this kind of formatting can be helpful when presenting data in reports. Let’s first fix the day values.

weather_melt = weather.melt(
  id_vars=["id", "year", "month", "element"],
  var_name="day",
  value_name="temp",
)
print(weather_melt)
          id  year  month element  day  temp
0    MX17004  2010      1    tmax   d1   NaN
1    MX17004  2010      1    tmin   d1   NaN
2    MX17004  2010      2    tmax   d1   NaN
3    MX17004  2010      2    tmin   d1   NaN
4    MX17004  2010      3    tmax   d1   NaN
..       ...   ...    ...     ...   ...  ...
677  MX17004  2010     10    tmin   d31  NaN
678  MX17004  2010     11    tmax   d31  NaN
679  MX17004  2010     11    tmin   d31  NaN
680  MX17004  2010     12    tmax   d31  NaN
681  MX17004  2010     12    tmin   d31  NaN

[682 rows x 6 columns]

Next, we need to pivot up the variables stored in the element column.

weather_tidy = weather_melt.pivot_table(
    index=['id', 'year', 'month', 'day'],
    columns='element',
    values='temp'
)
print(weather_tidy)
element                 tmax  tmin
id      year month day
MX17004 2010 1     d30  27.8  14.5
             2     d11  29.7  13.4
                    d2  27.3  14.4
                   d23  29.9  10.7
                    d3  24.1  14.4
...                      ...   ...

             11    d27  27.7  14.2
                   d26  28.1  12.1
                    d4  27.2  12.0
             12     d1  29.9  13.8
                    d6  27.8  10.5

[33 rows x 2 columns]

Looking at the pivoted table, we notice that each value in the element column is now a separate column. We can leave this table in its current state, but we can also flatten the hierarchical columns.

weather_tidy_flat = weather_tidy.reset_index()
print(weather_tidy_flat)
element       id  year  month  day  tmax  tmin
0        MX17004  2010      1  d30  27.8  14.5
1        MX17004  2010      2  d11  29.7  13.4
2        MX17004  2010      2   d2  27.3  14.4
3        MX17004  2010      2  d23  29.9  10.7
4        MX17004  2010      2   d3  24.1  14.4
..           ...   ...     ... ...   ...   ...
28       MX17004  2010      11 d27  27.7  14.2
29       MX17004  2010      11 d26  28.1  12.1
30       MX17004  2010      11  d4  27.2  12.0
31       MX17004  2010      12  d1  29.9  13.8
32       MX17004  2010      12  d6  27.8  10.5

[33 rows x 6 columns]

Likewise, we can apply these methods without the intermediate dataframe:

weather_tidy = (
  weather_melt
  .pivot_table(
    index=['id', 'year', 'month', 'day'],
    columns='element',
    values='temp')
  .reset_index()
)

print(weather_tidy)
element       id  year  month  day  tmax  tmin
0        MX17004  2010      1  d30  27.8  14.5
1        MX17004  2010      2  d11  29.7  13.4
2        MX17004  2010      2   d2  27.3  14.4
3        MX17004  2010      2  d23  29.9  10.7
4        MX17004  2010      2   d3  24.1  14.4

..           ...   ...     ... ...   ...   ...
28       MX17004  2010      11 d27  27.7  14.2
29       MX17004  2010      11 d26  28.1  12.1
30       MX17004  2010      11  d4  27.2  12.0
31       MX17004  2010      12  d1  29.9  13.8
32       MX17004  2010      12  d6  27.8  10.5

[33 rows x 6 columns]

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.