- 14.1 cbind and rbind
- 14.2 Joins
- 14.3 reshape2
- 14.4 Conclusion
14.3 reshape2
The next most common munging need is either melting data (going from column orientation to row orientation) or casting data (going from row orientation to column orientation). As with most other procedures in R, there are multiple functions available to accomplish these tasks, but we will focus on Hadley Wickham’s reshape2 package. We talk about Wickham a lot, but that is because his products have become so fundamental to the R developer’s toolbox.
14.3.1 melt
Looking at the Aid_00s data.frame, we see that each year is stored in its own column. That is, the dollar amount for a given country and program is found in a different column for each year. This is called a cross table, which while nice for human consumption, is not ideal for graphing with ggplot2 or for some analysis algorithms.
> head(Aid_00s) Country.Name Program.Name 1 Afghanistan Child Survival and Health 2 Afghanistan Department of Defense Security Assistance 3 Afghanistan Development Assistance 4 Afghanistan Economic Support Fund/Security Support Assistance 5 Afghanistan Food For Education 6 Afghanistan Global Health and Child Survival FY2000 FY2001 FY2002 FY2003 FY2004 FY2005 FY2006 1 NA NA 2586555 56501189 40215304 39817970 40856382 2 NA NA 2964313 NA 45635526 151334908 230501318 3 NA 4110478 8762080 54538965 180539337 193598227 212648440 4 NA 61144 31827014 341306822 1025522037 1157530168 1357750249 5 NA NA NA 3957312 2610006 3254408 386891 6 NA NA NA NA NA NA NA FY2007 FY2008 FY2009 1 72527069 28397435 NA 2 214505892 495539084 552524990 3 173134034 150529862 3675202 4 1266653993 1400237791 1418688520 5 NA NA NA 6 NA 63064912 1764252
We want it set up so that each row represents a single country-program-year entry with the dollar amount stored in one column. To achieve this we melt the data using melt from reshape2.
> library(reshape2) > melt00 <- melt(Aid_00s, id.vars=c("Country.Name", "Program.Name"), + variable.name="Year", value.name="Dollars") > tail(melt00, 10) Country.Name 24521 Zimbabwe 24522 Zimbabwe 24523 Zimbabwe 24524 Zimbabwe 24525 Zimbabwe 24526 Zimbabwe 24527 Zimbabwe 24528 Zimbabwe 24529 Zimbabwe 24530 Zimbabwe Program.Name Year 24521 Migration and Refugee Assistance FY2009 24522 Narcotics Control FY2009 24523 Nonproliferation, Anti-Terrorism, Demining and Related FY2009 24524 Other Active Grant Programs FY2009 24525 Other Food Aid Programs FY2009 24526 Other State Assistance FY2009 24527 Other USAID Assistance FY2009 24528 Peace Corps FY2009 24529 Title I FY2009 24530 Title II FY2009 Dollars 24521 3627384 24522 NA 24523 NA 24524 7951032 24525 NA 24526 2193057 24527 41940500 24528 NA 24529 NA 24530 174572685
The id.vars argument specifies which columns uniquely identify a row.
After some manipulation of the Year column and aggregating, this is now prime for plotting, as shown in Figure 14.1. The plot uses faceting, enabling us to quickly see and understand the funding for each program over time.
> library(scales) > # strip the "FY" out of the year column and convert it to numeric > melt00$Year <- as.numeric(str_sub(melt00$Year, start=3, 6)) > # aggregate the data so we have yearly numbers by program > meltAgg <- aggregate(Dollars ~ Program.Name + Year, data=melt00, + sum, na.rm=TRUE) > # just keep the first 10 characters of program name > # then it will fit in the plot > meltAgg$Program.Name <- str_sub(meltAgg$Program.Name, start=1, + end=10) > > ggplot(meltAgg, aes(x=Year, y=Dollars)) + + geom_line(aes(group=Program.Name)) + + facet_wrap(~ Program.Name) + + scale_x_continuous(breaks=seq(from=2000, to=2009, by=2)) + + theme(axis.text.x=element_text(angle=90, vjust=1, hjust=0)) + + scale_y_continuous(labels=multiple_format(extra=dollar, + multiple="B"))
Figure 14.1 Plot of foreign assistance by year for each of the programs.
14.3.2 dcast
Now that we have the foreign aid data melted, we cast it back into the wide format for illustration purposes. The function for this is dcast, and it has trickier arguments than melt. The first is the data to be used, in our case melt00. The second argument is a formula where the left side specifies the columns that should remain columns and the right side specifies the columns that should become column names. The third argument is the column (as a character) that holds the values to be populated into the new columns, representing the unique values of the right side of the formula argument.
> cast00 <- dcast(melt00, Country.Name + Program.Name ~ Year, + value.var="Dollars") > head(cast00) Country.Name Program.Name 2000 1 Afghanistan Child Survival and Health NA 2 Afghanistan Department of Defense Security Assistance NA 3 Afghanistan Development Assistance NA 4 Afghanistan Economic Support Fund/Security Support Assistance NA 5 Afghanistan Food For Education NA 6 Afghanistan Global Health and Child Survival NA 2001 2002 2003 2004 2005 2006 1 NA 2586555 56501189 40215304 39817970 40856382 2 NA 2964313 NA 45635526 151334908 230501318 3 4110478 8762080 54538965 180539337 193598227 212648440 4 61144 31827014 341306822 1025522037 1157530168 1357750249 5 NA NA 3957312 2610006 3254408 386891 6 NA NA NA NA NA NA 2007 2008 2009 1 72527069 28397435 NA 2 214505892 495539084 552524990 3 173134034 150529862 3675202 4 1266653993 1400237791 1418688520 5 NA NA NA 6 NA 63064912 1764252