Data Reshaping with the R Programming Language
Data Scientist Jared Lander discusses how combining multiple datasets, whether by stacking or joining, is commonly necessary, as is changing the shape of data. The plyr and reshape2 packages offer good functions for accomplishing this, in addition to base tools such as rbind, cbind and merge.
Manipulating data takes a great deal of effort before serious analysis can begin. In this chapter we will consider when the data need to be rearranged from column-oriented to row-oriented (or the opposite) and when the data are in multiple, separate sets and need to be combined into one.
There are base functions to accomplish these tasks, but we will focus on those in plyr, reshape2 and data.table.
While the tools covered in this chapter still form the backbone of data reshaping, newer packages like tidyr and dplyr are starting to supersede them. Chapter 15 is an analog to this chapter using these new packages.
14.1 cbind and rbind
The simplest case is when we have two datasets with either identical columns (both the number and the names) or the same number of rows. In these cases, rbind (for identical columns) or cbind (for the same number of rows) works well.
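As a first trivial example, we build two small datasets, one by combining a few vectors with cbind and one with data.frame, and then stack them using rbind. Strictly speaking, cbind applied to plain vectors returns a matrix rather than a data.frame; rbind still returns a data.frame here because one of its arguments is a data.frame.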
> # make three vectors and combine them as columns with cbind
> sport <- c("Hockey", "Baseball", "Football")
> league <- c("NHL", "MLB", "NFL")
> trophy <- c("Stanley Cup", "Commissioner's Trophy",
+             "Vince Lombardi Trophy")
> trophies1 <- cbind(sport, league, trophy)
>
> # make another dataset, this time a data.frame, using data.frame()
> trophies2 <- data.frame(sport=c("Basketball", "Golf"),
+                         league=c("NBA", "PGA"),
+                         trophy=c("Larry O'Brien Championship Trophy",
+                                  "Wanamaker Trophy"),
+                         stringsAsFactors=FALSE)
>
> # combine them into one data.frame with rbind
> trophies <- rbind(trophies1, trophies2)
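The difference between the two construction routes used here can be checked directly. The following is a minimal sketch, not part of the original listing; the object names asMatrix and asFrame are made up for illustration.

```r
# two short vectors, as in the trophy example
sport  <- c("Hockey", "Baseball", "Football")
league <- c("NHL", "MLB", "NFL")

# cbind on plain vectors returns a matrix (a character matrix here)
asMatrix <- cbind(sport, league)
is.matrix(asMatrix)
is.data.frame(asMatrix)

# data.frame() builds a data.frame with one column per vector
asFrame <- data.frame(sport, league, stringsAsFactors=FALSE)
is.data.frame(asFrame)
```

If a data.frame is wanted from the start, wrapping the vectors in data.frame() rather than cbind avoids coercing every column to a single type.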
Both cbind and rbind can take multiple arguments to combine an arbitrary number of objects. Note that it is possible to assign new column names to vectors in cbind.
> cbind(Sport=sport, Association=league, Prize=trophy)
     Sport      Association Prize
[1,] "Hockey"   "NHL"       "Stanley Cup"
[2,] "Baseball" "MLB"       "Commissioner's Trophy"
[3,] "Football" "NFL"       "Vince Lombardi Trophy"
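To illustrate that both functions accept more than two objects, here is a small sketch under stated assumptions: the three one-row data.frames (part1, part2, part3) and the id column are hypothetical and exist only for this example.

```r
# three small data.frames with identical column names (illustrative values)
part1 <- data.frame(sport="Hockey",   league="NHL", stringsAsFactors=FALSE)
part2 <- data.frame(sport="Baseball", league="MLB", stringsAsFactors=FALSE)
part3 <- data.frame(sport="Football", league="NFL", stringsAsFactors=FALSE)

# rbind stacks any number of objects in the order they are given
allParts <- rbind(part1, part2, part3)

# cbind can add a named column alongside an existing data.frame,
# provided the number of rows matches
withId <- cbind(id=1:3, allParts)
names(withId)
```

Because one of the arguments in each call is a data.frame, both results are data.frames rather than matrices.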