For each subset of a data frame, apply a function, then combine the results
into a data frame.
To apply a function for each row, use adply with .margins set to 1.
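The row-wise case mentioned here can be sketched as follows. This is a minimal illustration, assuming plyr is installed; the `scores` data frame and the `total` column are hypothetical names made up for the example.

```r
library(plyr)

# Hypothetical toy data, not part of the plyr documentation
scores <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# .margins = 1 splits the data frame by row, so .fun sees one row at a time
row_totals <- adply(scores, .margins = 1, .fun = function(row) {
  data.frame(total = row$a + row$b)
})

print(row_totals)
```

Each call to the anonymous function receives a one-row data frame, and the per-row results are combined into a single data frame.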
ddply(
  .data,
  .variables,
  .fun = NULL,
  ...,
  .progress = "none",
  .inform = FALSE,
  .drop = TRUE,
  .parallel = FALSE,
  .paropts = NULL
)
A data frame, as described in the output section.
.data: data frame to be processed
.variables: variables to split the data frame by, given as as.quoted variables, a formula, or a character vector
.fun: function to apply to each piece
...: other arguments passed on to .fun
.progress: name of the progress bar to use; see create_progress_bar
.inform: produce informative error messages? This is turned off by default because it substantially slows processing, but it is very useful for debugging
.drop: should combinations of variables that do not appear in the input data be preserved (FALSE) or dropped (TRUE, default)?
.parallel: if TRUE, apply the function in parallel, using the parallel backend provided by foreach
.paropts: a list of additional options passed into the foreach function when parallel computation is enabled. This is important if (for example) your code relies on external data or packages: use the .export and .packages arguments to supply them so that all cluster nodes have the correct environment set up for computing.
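The effect of .drop is easiest to see on a small data set with a missing combination. The sketch below assumes plyr is installed; the `people` data frame and the `n` column are hypothetical, and the zero count for the unobserved combination relies on the summary function returning a value for an empty piece.

```r
library(plyr)

# Hypothetical data: the combination group = "B", sex = "F" never occurs
people <- data.frame(
  group = factor(c("A", "A", "B")),
  sex   = factor(c("M", "F", "M")),
  x     = 1:3
)

# Default (.drop = TRUE): only combinations present in the data are returned
observed <- ddply(people, .(group, sex), summarise, n = length(x))

# .drop = FALSE: unobserved combinations are kept as empty pieces
all_combos <- ddply(people, .(group, sex), summarise, n = length(x),
                    .drop = FALSE)

print(observed)
print(all_combos)
```

With .drop = FALSE, .fun is also called on zero-row pieces, so it should handle empty input gracefully.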
This function splits data frames by variables.
The most unambiguous behaviour is achieved when .fun returns a data frame: in that case the pieces will be combined with rbind.fill. If .fun returns an atomic vector of fixed length, the vectors will be combined with rbind and converted to a data frame. Any other values will result in an error.
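The two supported return types can be compared side by side. This is a minimal sketch, assuming plyr is installed; the `df` data frame and the column names `mean_x`, `min_x`, and `max_x` are hypothetical.

```r
library(plyr)

# Hypothetical toy data with two groups
df <- data.frame(g = c("a", "a", "b"), x = c(1, 3, 10))

# .fun returns a data frame: pieces are combined with rbind.fill
as_df <- ddply(df, .(g), function(piece) {
  data.frame(mean_x = mean(piece$x))
})

# .fun returns a named atomic vector of fixed length (2 here):
# the vectors are row-bound and converted to a data frame
as_vec <- ddply(df, .(g), function(piece) {
  c(min_x = min(piece$x), max_x = max(piece$x))
})

print(as_df)
print(as_vec)
```

In both cases the splitting variable `g` is added back as a label column alongside the values computed by .fun.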
If there are no results, then this function will return a data frame with zero rows and columns (data.frame()).
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. https://www.jstatsoft.org/v40/i01/.
tapply for similar functionality in the base package
Other data frame input: d_ply(), daply(), dlply()
Other data frame output: adply(), ldply(), mdply()
library(plyr)

# Summarize a dataset by two variables
dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)
# Note the use of the '.' function to allow
# group and sex to be used without quoting
ddply(dfx, .(group, sex), summarize,
  mean = round(mean(age), 2),
  sd = round(sd(age), 2))
# An example using a formula for .variables
ddply(baseball[1:100,], ~ year, nrow)
# Applying two functions; nrow and ncol
ddply(baseball, .(lg), c("nrow", "ncol"))
# Calculate mean runs batted in for each year
rbi <- ddply(baseball, .(year), summarise,
  mean_rbi = mean(rbi, na.rm = TRUE))
# Plot a line chart of the result
plot(mean_rbi ~ year, type = "l", data = rbi)
# Make a new variable career_year based on the
# start year for each player (id)
base2 <- ddply(baseball, .(id), mutate,
  career_year = year - min(year) + 1
)