Split data frame, apply function, and return results in a data frame.
For each subset of a data frame, apply function then combine results into a data frame.
To apply a function for each row, use adply with .margins set to 1.
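As a rough illustration of the row-wise case, the sketch below calls adply with .margins = 1 on a small made-up data frame. The data frame `df` and the column name `total` are assumptions chosen for the example, and the plyr package is assumed to be installed.

```r
library(plyr)

# Hypothetical toy data, not taken from the documentation
df <- data.frame(x = 1:3, y = c(10, 20, 30))

# .margins = 1 splits the data frame by rows, so the function
# receives one single-row data frame at a time
row_totals <- adply(df, 1, function(row) data.frame(total = row$x + row$y))

print(row_totals)
```

Each call sees a single row, so any per-row computation that returns a data frame or a fixed-length vector should fit this pattern.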
ddply(.data, .variables, .fun = NULL, ..., .progress = "none", .inform = FALSE, .drop = TRUE, .parallel = FALSE, .paropts = NULL)
.data - data frame to be processed
.variables - variables to split data frame by, as as.quoted variables, a formula or character vector
.fun - function to apply to each piece
... - other arguments passed on to .fun
.progress - name of the progress bar to use, see create_progress_bar
.inform - produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging
.drop - should combinations of variables that do not appear in the input data be preserved (FALSE) or dropped (TRUE, default)
.parallel - if TRUE, apply function in parallel, using parallel backend provided by foreach
.paropts - a list of additional options passed into the foreach function when parallel computation is enabled. This is important if (for example) your code relies on external data or packages: use the .export and .packages arguments to supply them so that all cluster nodes have the correct environment set up for computing.
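The effect of .drop is easiest to see on data where one combination of the grouping variables never occurs. The following is a minimal sketch on made-up data (the data frame `df` and its columns `g`, `h`, `v` are assumptions for illustration); it counts rows per group with nrow, once with the default and once with .drop = FALSE.

```r
library(plyr)

# Hypothetical data: the combination g = "b", h = "y" never occurs
df <- data.frame(
  g = factor(c("a", "a", "b")),
  h = factor(c("x", "y", "x")),
  v = 1:3
)

# Default (.drop = TRUE): only combinations present in the data
dropped <- ddply(df, .(g, h), nrow)

# .drop = FALSE: unobserved combinations are kept as empty pieces,
# so nrow reports a count of 0 for them
kept <- ddply(df, .(g, h), nrow, .drop = FALSE)

print(dropped)
print(kept)
```

Keeping empty combinations is mostly useful when downstream code expects a complete grid of groups, for example when tabulating counts.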
A data frame, as described in the output section.
This function splits data frames by variables.
The most unambiguous behaviour is achieved when .fun returns a data frame; in that case the pieces will be combined with rbind.fill. If .fun returns an atomic vector of fixed length, the results will be rbinded together and converted to a data frame. Any other values will result in an error. If there are no results, then this function will return a data frame with zero rows and columns (data.frame()).
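The three output cases can be sketched on a small made-up data frame. The names `df`, `total`, `lo` and `hi` are assumptions for illustration, and returning NULL from every piece is used here as one way to produce "no results".

```r
library(plyr)

# Hypothetical toy data
df <- data.frame(g = c("a", "a", "b"), v = c(1, 2, 10))

# Case 1: .fun returns a data frame for each piece
by_df <- ddply(df, .(g), function(piece) data.frame(total = sum(piece$v)))

# Case 2: .fun returns an atomic vector of fixed length (here, length 2);
# the vector names become column names
by_vec <- ddply(df, .(g), function(piece) c(lo = min(piece$v), hi = max(piece$v)))

# Case 3: no results at all (every piece yields NULL)
no_res <- ddply(df, .(g), function(piece) NULL)

print(by_df)
print(by_vec)
print(dim(no_res))
```

Returning a data frame from .fun is usually the safer choice, because column names and types are then explicit rather than inferred from the vector.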
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. http://www.jstatsoft.org/v40/i01/.
tapply for similar functionality in the base package
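For comparison, the sketch below computes the same group means with base tapply and with ddply on made-up data (`df`, `g`, `v` and `mean_v` are hypothetical names). The main visible difference is the return type: tapply gives a named array, while ddply gives a data frame with the grouping variable as a column.

```r
library(plyr)

# Hypothetical toy data
df <- data.frame(g = c("a", "a", "b"), v = c(1, 2, 10))

# Base R: named array indexed by the levels of g
base_means <- tapply(df$v, df$g, mean)

# plyr: data frame with one row per group
plyr_means <- ddply(df, .(g), summarise, mean_v = mean(v))

print(base_means)
print(plyr_means)
```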
# Summarize a dataset by two variables
dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)

# Note the use of the '.' function to allow
# group and sex to be used without quoting
ddply(dfx, .(group, sex), summarize,
  mean = round(mean(age), 2),
  sd = round(sd(age), 2))

# An example using a formula for .variables
ddply(baseball[1:100,], ~ year, nrow)

# Applying two functions; nrow and ncol
ddply(baseball, .(lg), c("nrow", "ncol"))

# Calculate mean runs batted in for each year
rbi <- ddply(baseball, .(year), summarise,
  mean_rbi = mean(rbi, na.rm = TRUE))

# Plot a line chart of the result
plot(mean_rbi ~ year, type = "l", data = rbi)

# make new variable career_year based on the
# start year for each player (id)
base2 <- ddply(baseball, .(id), mutate,
  career_year = year - min(year) + 1
)