summarize: Summarize Scalars or Matrices by Cross-Classification

Description

summarize is a fast version of

summary(formula, method="cross", overall=FALSE)

for producing stratified summary statistics and storing them in a data frame for plotting (especially with trellis xyplot and dotplot and Hmisc xYplot). Unlike aggregate, summarize accepts a matrix as its first argument and a multi-valued FUN argument, and it labels the variables in the new data frame using their original names. Unlike methods based on tapply, summarize stores the values of the stratification variables using their original types; for example, a numeric by variable remains numeric in the collapsed data frame. summarize also retains "label" attributes for variables. It works especially well with the Hmisc xYplot function for displaying multiple summaries of a single variable on each panel, such as means with upper and lower confidence limits.
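As a minimal sketch of the behavior described above, the following uses made-up data (the month and temperature vectors are illustrative assumptions, not part of any shipped dataset) and a FUN that returns two named statistics per stratum.

```r
library(Hmisc)

# Hypothetical data: 25 temperature readings for each of 12 months
set.seed(1)
month <- rep(1:12, each = 25)
temperature <- rnorm(length(month), mean = 70, sd = 10)

# A multi-valued FUN: two named statistics per stratum
g <- function(x) c(Mean = mean(x), Median = median(x))

s <- summarize(temperature, month, g)

is.data.frame(s)      # one row per month, one column per statistic
is.numeric(s$month)   # the by variable keeps its numeric type
```

The first statistic takes its name from the X argument unless stat.name is supplied.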

mApply is like tapply except that the first argument can be a matrix, and the output is cleaned up if simplify=TRUE. It uses code adapted from Tony Plate (tplate@blackmesacapital.com) to operate on grouped submatrices.
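To illustrate the matrix case, here is a small sketch in which FUN receives the submatrix of rows belonging to each group. The followup, death, and race values are invented for illustration only.

```r
library(Hmisc)

# Hypothetical follow-up data for two groups
followup <- c(2, 4, 6, 8)          # follow-up time
death    <- c(1, 1, 0, 1)          # 1 = died, 0 = alive
race     <- c("A", "A", "B", "B")

# FUN operates on a two-column submatrix for one group
g <- function(y) sum(y[, 1]) / sum(y[, 2])

r <- mApply(cbind(followup, death), race, g)
r   # group A: 6/2 = 3; group B: 14/1 = 14
```

With tapply the same computation would require splitting row indices by hand, since tapply expects a vector as its first argument.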

As mApply can be much faster than using by, it is often worth the trouble of converting a data frame to a numeric matrix for processing by mApply. asNumericMatrix will do this, and matrix2dataFrame will convert a numeric matrix back into a data frame if attributes and storage modes of the original variables are saved by calling subsAttr. subsAttr saves attributes that are commonly preserved across row subsetting (i.e., it does not save dim, dimnames, or names attributes).

Usage

summarize(X, by, FUN, ..., 
          stat.name=deparse(substitute(X)),
          type=c('variables','matrix'), subset=TRUE)
mApply(X, INDEX, FUN=NULL, ..., simplify=TRUE)
asNumericMatrix(x)
subsAttr(x)
matrix2dataFrame(x, at, restoreAll=TRUE)

Arguments

X

a vector or matrix capable of being operated on by the function specified as the FUN argument

by

one or more stratification variables. If a single variable, by may be a vector; otherwise it should be a list. Using the Hmisc llist function instead of list will result in individual variable names being accessible.

FUN

a function of a single vector argument, used to create the statistical summaries for summarize. FUN may compute any number of statistics.

simplify

set to FALSE to suppress simplification of the result into an array, matrix, etc.

...

extra arguments are passed to FUN

stat.name

the name to use when creating the main summary variable. By default, the name of the X argument is used. Set stat.name to NULL to suppress this name replacement.

type

Specify type="matrix" to store the summary variables (if there is more than one) in a matrix.

subset

a logical vector or integer vector of subscripts used to specify the subset of data to use in the analysis. The default is to use all observations in the data frame.

INDEX

vector or list of vectors to cross-classify on, similar to by. See tapply.

x

a data frame (for asNumericMatrix) or a numeric matrix (for matrix2dataFrame). For subsAttr, x may be a data frame, list, or a vector.

at

result of subsAttr

restoreAll

set to FALSE to restore only the label, units, and levels attributes instead of all attributes
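The stat.name and type arguments can be contrasted with a short sketch. The data frame d and response y below are simulated for illustration, and the names Q2, Q1, and Q3 are arbitrary choices.

```r
library(Hmisc)

# Hypothetical balanced design: 12 months x 2 years x 20 replicates
set.seed(2)
d <- expand.grid(month = 1:12, year = c(1997, 1998), rep = 1:20)
y <- abs(d$month - 6.5) + runif(nrow(d))

# A vector stat.name names every statistic returned by FUN
s1 <- summarize(y, llist(month = d$month, year = d$year),
                quantile, probs = c(.5, .25, .75),
                stat.name = c('Q2', 'Q1', 'Q3'))
names(s1)

# type='matrix' keeps the three quantiles together in one matrix column
s2 <- summarize(y, llist(month = d$month, year = d$year),
                quantile, probs = c(.5, .25, .75), type = 'matrix')
is.matrix(s2$y)
```

The matrix form is convenient when passing all summaries to Cbind in an xYplot formula, whereas separate variables are easier to reference by name.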

Value

For summarize, a data frame containing the by variables and the statistical summaries (the first of which is named the same as the X variable unless stat.name is given). If type="matrix", the summaries are stored in a single variable in the data frame, and this variable is a matrix.

For mApply, the returned value is a vector, matrix, or list. If FUN returns more than one number, the result is an array if simplify=TRUE and is a list otherwise. If a matrix is returned, its rows correspond to unique combinations of INDEX. If INDEX is a list with more than one vector, FUN returns more than one number, and simplify=FALSE, the returned value is a list with array dimensions: the first dimension corresponds to the last vector in INDEX, the second dimension to the next-to-last vector in INDEX, and so on, and the elements of the list-array are the values computed by FUN. In this situation the returned value is a regular array if simplify=TRUE; the order of dimensions is as above, but an additional (last) dimension corresponds to the values computed by FUN.

asNumericMatrix returns a numeric matrix, and matrix2dataFrame returns a data frame. subsAttr returns a list of attribute lists if its argument is a list or data frame, and a list containing the attributes of a single variable otherwise.
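The effect of simplify on the mApply result can be checked with a small sketch. The data are simulated, and a single INDEX vector is used to keep the shapes easy to inspect.

```r
library(Hmisc)

# Hypothetical data: 10 readings for each of 12 months
set.seed(3)
month <- rep(1:12, each = 10)
temperature <- rnorm(length(month), 70, 10)

# FUN returns two named values per group
g <- function(x) c(Mean = mean(x), Median = median(x))

m <- mApply(temperature, month, g)                    # simplify=TRUE (default)
l <- mApply(temperature, month, g, simplify = FALSE)

dim(m)        # rows are months, columns are the values of FUN
colnames(m)
is.list(l)    # one list element per month when not simplified
```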

concept

grouping
stratification
aggregation
cross-classification

Examples

# Assumes variables ap, sz, and bone already exist in the workspace
s <- summarize(ap>1, llist(size=cut2(sz, g=4), bone), mean,
               stat.name='Proportion')
dotplot(Proportion ~ size | bone, data=s)

set.seed(1)
temperature <- rnorm(300, 70, 10)
month <- sample(1:12, 300, TRUE)
year  <- sample(2000:2001, 300, TRUE)
g <- function(x)c(Mean=mean(x,na.rm=TRUE),Median=median(x,na.rm=TRUE))
summarize(temperature, month, g)
mApply(temperature, month, g)

mApply(temperature, month, mean, na.rm=TRUE)
w <- summarize(temperature, month, mean, na.rm=TRUE)
library(lattice)
xyplot(temperature ~ month, data=w) # plot mean temperature by month

w <- summarize(temperature, llist(year,month), 
               quantile, probs=c(.5,.25,.75), na.rm=TRUE, type='matrix')
xYplot(Cbind(temperature[,1],temperature[,-1]) ~ month | year, data=w)
mApply(temperature, llist(year,month),
       quantile, probs=c(.5,.25,.75), na.rm=TRUE)

# Compute the median and outer quartiles.  The outer quartiles are
# displayed using "error bars"
set.seed(111)
dfr <- expand.grid(month=1:12, year=c(1997,1998), reps=1:100)
attach(dfr)
y <- abs(month-6.5) + 2*runif(length(month)) + year-1997
s <- summarize(y, llist(month,year), smedian.hilow, conf.int=.5)
s
mApply(y, llist(month,year), smedian.hilow, conf.int=.5)

xYplot(Cbind(y,Lower,Upper) ~ month, groups=year, data=s, 
       keys='lines', method='alt')
# Can also do:
s <- summarize(y, llist(month,year), quantile, probs=c(.5,.25,.75),
               stat.name=c('y','Q1','Q3'))
xYplot(Cbind(y, Q1, Q3) ~ month, groups=year, data=s, keys='lines')
# To display means and bootstrapped nonparametric confidence intervals
# use for example:
s <- summarize(y, llist(month,year), smean.cl.boot)
xYplot(Cbind(y, Lower, Upper) ~ month | year, data=s)

# For each subject use the trapezoidal rule to compute the area under
# the (time,response) curve using the Hmisc trap.rule function
x <- cbind(time=c(1,2,4,7, 1,3,5,10),response=c(1,3,2,4, 1,3,2,4))
subject <- c(rep(1,4),rep(2,4))
trap.rule(x[1:4,1],x[1:4,2])
summarize(x, subject, function(y) trap.rule(y[,1],y[,2]))

# Another approach would be to properly re-shape the mm array below
# This assumes no missing cells.  There are many other approaches.
# mApply will do this well while allowing for missing cells.
m <- tapply(y, list(year,month), quantile, probs=c(.25,.5,.75))
mm <- array(unlist(m), dim=c(3,2,12), 
            dimnames=list(c('lower','median','upper'),c('1997','1998'),
                          as.character(1:12)))
# aggregate will help but it only allows you to compute one quantile
# at a time; see also the Hmisc mApply function
dframe <- aggregate(y, list(Year=year,Month=month), quantile, probs=.5)

# Compute expected life length by race assuming an exponential
# distribution - can also use summarize
g <- function(y) { # computations for one race group
  futime <- y[,1]; event <- y[,2]
  sum(futime)/sum(event)  # assume event=1 for death, 0=alive
}
mApply(cbind(followup.time, death), race, g)

# To run mApply on a data frame:
m <- mApply(asNumericMatrix(x), race, h)
# Here assume h is a function that returns a matrix similar to x
at <- subsAttr(x)  # get original attributes and storage modes
matrix2dataFrame(m, at)


# Get stratified weighted means
g <- function(y) wtd.mean(y[,1],y[,2])
summarize(cbind(y, wts), llist(sex,race), g, stat.name='y')
mApply(cbind(y,wts), llist(sex,race), g)

# Compare speed of mApply vs. by for computing stratified medians
d <- data.frame(sex=sample(c('female','male'),100000,TRUE),
                country=sample(letters,100000,TRUE),
                y1=runif(100000), y2=runif(100000))
g <- function(x) {
  # x is a data frame (from by) or a numeric matrix (from mApply, summarize)
  c(med.diff=median(x[,'y1']-x[,'y2']),
    med.sum =median(x[,'y1']+x[,'y2']))
}

system.time(by(d, llist(sex=d$sex,country=d$country), g))
system.time({
             x <- asNumericMatrix(d)
             a <- subsAttr(d)
             m <- mApply(x, llist(sex=d$sex,country=d$country), g)
            })
system.time({
             x <- asNumericMatrix(d)
             summarize(x, llist(sex=d$sex, country=d$country), g)
            })

# An example where each subject has one record per diagnosis but sex of
# subject is duplicated for all the rows a subject has.  Get the cross-
# classified frequencies of diagnosis (dx) by sex and plot the results
# with a dot plot

count <- rep(1,length(dx))
d <- summarize(count, llist(dx,sex), sum)
Dotplot(dx ~ count | sex, data=d)
detach('dfr')