12.2 From Columns to Rows: gather()
Sometimes you may want to change the structure of your data—how your data is organized in terms of observations and features. To help you do so, the tidyr package provides elegant functions for transforming between orientations.
For example, to move from wide format (Table 12.1) to long format (Table 12.2), you need to gather all of the prices into a single column. You can do this using the gather() function, which collects data values stored across multiple columns into a single new feature (e.g., “price” in Table 12.2), along with an additional new column representing which feature that value was gathered from (e.g., “band” in Table 12.2). In effect, it creates two columns representing key–value pairs of the feature and its value from the original data frame.
# Reshape by gathering prices into a single feature
band_data_long <- gather(
  band_data_wide, # data frame to gather from
  key = band,     # name for new column listing the gathered features
  value = price,  # name for new column listing the gathered values
  -city           # columns to gather data from, as in dplyr's `select()`
)
The gather() function takes in a number of arguments, starting with the data frame to gather from. The second argument, key, names the new column whose values are the names of the columns the data was gathered from (for example, a new band column that will contain the values "greensky_bluegrass", "trampled_by_turtles", and so on). The third argument, value, names the new column that will contain the gathered values (for example, price to contain the price numbers). Finally, the function takes in arguments specifying which columns to gather data from, using the same syntax as dplyr's select() function; in the preceding example, -city indicates that it should gather from all columns except city. Any columns provided in this final set of arguments will have their names listed in the key column and their values listed in the value column. This process is illustrated in Figure 12.1. The gather() function's syntax can be hard to intuit and remember; try tracing where each value "moves" in the table and diagram.
Figure 12.1 The gather() function takes values from multiple columns (greensky_bluegrass, trampled_by_turtles, etc.) and gathers them into a (new) single column (price). In doing so, it also creates a new column (band) that stores the names of the columns that were gathered (i.e., the column name in which each value was stored prior to gathering).
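To trace the movement of values by hand, it can help to run gather() on a very small data frame. The following is a minimal sketch: the data frame below is hypothetical (two cities, two bands, made-up prices) and is not the contents of Table 12.1.

```r
library(tidyr)

# Hypothetical wide-format data: one row per city, one column per band.
# The prices are illustrative values, not the contents of Table 12.1.
band_data_wide <- data.frame(
  city = c("Seattle", "Portland"),
  greensky_bluegrass = c(40, 40),
  trampled_by_turtles = c(30, 20),
  stringsAsFactors = FALSE
)

# Gather every column except `city` into key-value pairs
band_data_long <- gather(
  band_data_wide, # data frame to gather from
  key = band,     # new column holding the former column names
  value = price,  # new column holding the former cell values
  -city           # gather from all columns except `city`
)

print(band_data_long)

# Each original price cell becomes its own row: 2 cities x 2 bands = 4 rows
print(nrow(band_data_long))
```

The city column is repeated once per gathered column, so each row of the result identifies one city and band combination. Note that in tidyr 1.0.0 and later, gather() is marked as superseded; the roughly equivalent call with the newer interface would be pivot_longer(band_data_wide, cols = -city, names_to = "band", values_to = "price"), although the rows come back in a different order.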
Note that once data is in long format, you can continue to analyze an individual feature (e.g., a specific band) by filtering for that value. For example, filter(band_data_long, band == "greensky_bluegrass") would produce just the prices for a single band.
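As a rough illustration of that filtering step, the sketch below builds a small long-format data frame by hand and filters it. It assumes filter() refers to the dplyr function, and the rows are hypothetical, made-up values rather than the contents of Table 12.2.

```r
library(dplyr)

# Hypothetical long-format data: one row per city-band combination.
# The prices are illustrative values, not the contents of Table 12.2.
band_data_long <- data.frame(
  city = c("Seattle", "Portland", "Seattle", "Portland"),
  band = c(
    "greensky_bluegrass", "greensky_bluegrass",
    "trampled_by_turtles", "trampled_by_turtles"
  ),
  price = c(40, 40, 30, 20),
  stringsAsFactors = FALSE
)

# Keep only the rows for a single band
greensky_prices <- filter(band_data_long, band == "greensky_bluegrass")
print(greensky_prices)

# The filtered rows can then be analyzed like any other data frame
print(mean(greensky_prices$price))
```

Because the band names are now values in a column rather than column names, the same filter() call works for any band by changing only the string being compared.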