ddply
Split data frame, apply function, and return results in a data frame.
For each subset of a data frame, apply function then combine results into a
data frame.
To apply a function for each row, use adply with .margins set to 1.
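As a minimal sketch of the row-wise case, assuming a small made-up data frame (the names `scores`, `a`, `b` and `total` are illustrative and not part of plyr):

```r
library(plyr)

# Hypothetical data: two numeric columns
scores <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# .margins = 1 splits the data frame by row, so each piece
# passed to the function is a one-row data frame
row_totals <- adply(scores, 1, function(row) {
  data.frame(total = row$a + row$b)
})
row_totals
```

Vectorised arithmetic on the whole data frame is normally faster for a computation this simple. The row-wise form is mainly useful when the function really needs one row at a time.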
- Keywords
- manip
Usage
ddply(.data, .variables, .fun = NULL, ..., .progress = "none",
.inform = FALSE, .drop = TRUE, .parallel = FALSE, .paropts = NULL)
Arguments
- .data
data frame to be processed
- .variables
variables to split data frame by, as as.quoted variables, a formula or character vector
- .fun
function to apply to each piece
- ...
other arguments passed on to .fun
- .progress
name of the progress bar to use, see create_progress_bar
- .inform
produce informative error messages? This is turned off by default because it substantially slows processing speed, but is very useful for debugging
- .drop
should combinations of variables that do not appear in the input data be preserved (FALSE) or dropped (TRUE, default)
- .parallel
if TRUE, apply function in parallel, using parallel backend provided by foreach
- .paropts
a list of additional options passed into the foreach function when parallel computation is enabled. This is important if (for example) your code relies on external data or packages: use the .export and .packages arguments to supply them so that all cluster nodes have the correct environment set up for computing.
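The effect of .drop is easiest to see with factor variables where one combination of levels never occurs. The following is a sketch under that assumption; the data frame `toy` and its columns are made up for illustration:

```r
library(plyr)

# Hypothetical data: the combination (g = "b", s = "F") never occurs
toy <- data.frame(
  g = factor(c("a", "a", "b")),
  s = factor(c("F", "M", "M")),
  x = c(1, 2, 3)
)

# Default (.drop = TRUE): only combinations present in the data
dropped <- ddply(toy, .(g, s), summarise, n = length(x))

# .drop = FALSE: combinations absent from the data are preserved
kept <- ddply(toy, .(g, s), summarise, n = length(x), .drop = FALSE)

nrow(dropped)
nrow(kept)
```

Preserving empty combinations only makes sense when .fun copes with a zero-row piece, as length() does here.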
Value
A data frame, as described in the output section.
Input
This function splits data frames by variables.
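The three ways of specifying .variables listed under Arguments can be compared side by side. This is a minimal sketch on a made-up data frame (`toy`, `g` and `x` are illustrative names):

```r
library(plyr)

# Hypothetical data
toy <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))

by_quoted  <- ddply(toy, .(g), nrow)  # quoted variables via .()
by_formula <- ddply(toy, ~ g, nrow)   # one-sided formula
by_name    <- ddply(toy, "g", nrow)   # character vector

# All three forms should describe the same split
all.equal(by_quoted, by_formula)
all.equal(by_quoted, by_name)
```

The character-vector form tends to be the most convenient when the grouping columns are held in a variable rather than typed out.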
Output
The most unambiguous behaviour is achieved when .fun returns a data frame - in that case pieces will be combined with rbind.fill. If .fun returns an atomic vector of fixed length, it will be rbinded together and converted to a data frame. Any other values will result in an error.
If there are no results, then this function will return a data frame with zero rows and columns (data.frame()).
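The two supported return types can be illustrated as follows. This is a sketch on made-up data, and the anonymous functions and column names are assumptions chosen for the example:

```r
library(plyr)

# Hypothetical data
toy <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 5))

# .fun returns a data frame: pieces are combined with rbind.fill
as_df <- ddply(toy, .(g), function(piece) {
  data.frame(n = nrow(piece), mean_x = mean(piece$x))
})

# .fun returns a named atomic vector of fixed length:
# the vectors are bound by row and converted to a data frame
as_vec <- ddply(toy, .(g), function(piece) {
  c(min_x = min(piece$x), max_x = max(piece$x))
})

as_df
as_vec
```

Returning a data frame keeps column names and types explicit, which is why it is described as the least ambiguous option.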
References
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. http://www.jstatsoft.org/v40/i01/.
See Also
tapply for similar functionality in the base package
Examples
library(plyr)
# NOT RUN {
# Summarize a dataset by two variables
dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)

# Note the use of the '.' function to allow
# group and sex to be used without quoting
ddply(dfx, .(group, sex), summarize,
  mean = round(mean(age), 2),
  sd = round(sd(age), 2))

# An example using a formula for .variables
ddply(baseball[1:100,], ~ year, nrow)

# Applying two functions; nrow and ncol
ddply(baseball, .(lg), c("nrow", "ncol"))

# Calculate mean runs batted in for each year
rbi <- ddply(baseball, .(year), summarise,
  mean_rbi = mean(rbi, na.rm = TRUE))

# Plot a line chart of the result
plot(mean_rbi ~ year, type = "l", data = rbi)

# make new variable career_year based on the
# start year for each player (id)
base2 <- ddply(baseball, .(id), mutate,
  career_year = year - min(year) + 1
)
# }
Community examples
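An example using ddply to count duplicated rows in a data frame:

```r
library(plyr)

df <- data.frame(
  x1 = c(0, 1, 1, 1, 2, 3, 3, 3),
  x2 = c(0, 1, 1, 3, 2, 3, 3, 2),
  x3 = c(0, 1, 1, 1, 2, 3, 3, 2)
)

# One output row per distinct (x1, x2, x3) combination,
# with the number of times it occurs
ddply(df, .(x1, x2, x3), nrow)
```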