describe: Concise Statistical Description of a Vector, Matrix, Data Frame, or Formula

Description

describe is a generic method that invokes describe.data.frame, describe.matrix, describe.vector, or describe.formula. describe.vector is the basic function for handling a single variable. This function determines whether the variable is character, factor, category, binary, discrete numeric, or continuous numeric, and prints a concise statistical summary according to each. A numeric variable is deemed discrete if it has <= 10 unique values. In this case, quantiles are not printed. A frequency table is printed for any non-binary variable if it has no more than 20 unique values. For any variable with at least 20 unique values, the 5 lowest and highest values are printed. describe is especially useful for describing data frames created by sas.get, as SAS labels, formats, value labels, and frequencies of special missing values are printed.
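
The classification rules can be illustrated with a small base R sketch. The helper describe_plan below is hypothetical and is not part of Hmisc. It assumes that "binary" means a numeric variable taking only the values 0 and 1, that the discrete cutoff is 10 unique values, and that the frequency-table and extreme-value cutoffs are both 20 unique values. The "category" type is not modelled.

```r
# Hypothetical helper, not Hmisc code: guesses what kind of summary
# a variable would get under the assumed thresholds (10 and 20).
describe_plan <- function(x) {
  x <- x[!is.na(x)]                      # classify on non-missing values
  n_unique <- length(unique(x))
  # Assumption: binary = numeric variable whose values are all 0 or 1
  is_binary <- n_unique > 0 && is.numeric(x) && all(x %in% c(0, 1))
  if (is.factor(x)) {
    kind <- "factor"
  } else if (is.character(x)) {
    kind <- "character"
  } else if (is_binary) {
    kind <- "binary"
  } else if (is.numeric(x) && n_unique <= 10) {
    kind <- "discrete numeric"
  } else {
    kind <- "continuous numeric"
  }
  list(
    kind = kind,
    quantiles = kind == "continuous numeric",        # skipped for discrete
    frequency_table = !is_binary && n_unique <= 20,  # small value sets only
    extremes = n_unique >= 20                        # lowest/highest values
  )
}

describe_plan(c(0, 1, 1, 0, NA))$kind
describe_plan(1:8)$kind
describe_plan(seq(0, 1, length.out = 50))
```

The sketch only decides which pieces of output would apply; the actual statistics and their formatting are left to describe.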

For a binary variable, the sum (number of 1's) and mean (proportion of 1's) are printed. If the first argument is a formula, a model frame is created and passed to describe.data.frame. If a variable is of class "impute", a count of the number of imputed values is printed. If a date variable has an attribute partial.date (this is set up by sas.get), counts of how many partial dates are actually present (missing month, missing day, missing both) are also presented. If a variable was created by the special-purpose function substi (which substitutes values of a second variable if the first variable is NA), the frequency table of substitutions is also printed.
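
As a rough illustration of the formula path, the following base R sketch builds a model frame from a formula while keeping missing values. It uses na.pass from the stats package as a stand-in for an NA-retaining action; the data frame dat and its columns are made up for the example.

```r
# Made-up data with one NA in each variable
dat <- data.frame(
  age = c(34, NA, 51, 47),
  sex = factor(c("f", "m", NA, "m")),
  y   = c(1, 0, 1, NA)
)

# Keep every row, including those with NAs (stand-in for an
# NA-retaining action), so missingness can still be summarized
mf <- model.frame(y ~ age + sex, data = dat, na.action = na.pass)
nrow(mf)
colSums(is.na(mf))   # missing-value count per variable

# By contrast, na.omit drops any row containing an NA
mf_omit <- model.frame(y ~ age + sex, data = dat, na.action = na.omit)
nrow(mf_omit)
```

A data frame built this way contains only the variables named in the formula, which is what makes the formula interface convenient for describing part of a data frame.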

A latex method exists for converting the describe object to a LaTeX file. For numeric variables having at least 20 unique values, describe saves in its returned object the frequencies of 100 evenly spaced bins running from minimum observed value to the maximum. latex inserts a spike histogram displaying these frequency counts in the tabular material using the LaTeX picture environment. For example output see http://biostat.mc.vanderbilt.edu/twiki/pub/Main/Hmisc/counties.pdf.
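
The binned frequencies behind the spike histogram can be approximated in base R. This is a minimal sketch under the assumption that the 100 bins have equal width and that the maximum falls in the last bin; the exact binning used inside describe may differ, and the name interval_freq is illustrative only.

```r
set.seed(1)
x <- runif(200)

rng <- range(x)
# Map each value to one of 100 equal-width bins spanning [min, max];
# pmin() places the maximum itself in bin 100 rather than bin 101
bin <- pmin(floor((x - rng[1]) / diff(rng) * 100) + 1, 100)
count <- tabulate(bin, nbins = 100)

# Same shape as the component described for the returned object:
# a two-value range plus 100 frequency counts
interval_freq <- list(range = rng, count = count)
length(interval_freq$count)
sum(interval_freq$count)
```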

Sample weights may be specified to any of the functions, resulting in weighted means, quantiles, and frequency tables.
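
To make the effect of frequency weights concrete, the base R sketch below compares a weighted mean and weighted frequency table with the same statistics computed after literally repeating each observation. The vectors x and w are made up, and integer weights are assumed so that the expansion is possible.

```r
x <- c(1, 2, 2, 5)   # observed values
w <- c(3, 1, 2, 4)   # integer frequency weights (assumed for illustration)

# Weighted mean
wmean <- sum(w * x) / sum(w)

# Equivalent view: each observation repeated w times
expanded <- rep(x, times = w)
mean(expanded)

# Weighted frequency table: total weight per distinct value
wfreq <- tapply(w, x, sum)
wfreq
table(expanded)
```

With non-integer sample weights the expansion is no longer possible, but the weighted formulas still apply.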

Usage

## S3 method for class 'vector':
describe(x, descript, exclude.missing=TRUE, digits=4,
         weights, normwt, ...)
## S3 method for class 'matrix':
describe(x, descript, exclude.missing=TRUE, digits=4, ...)
## S3 method for class 'data.frame':
describe(x, descript, exclude.missing=TRUE,
    digits=4, ...)
## S3 method for class 'formula':
describe(x, descript, data, subset, na.action,
    digits=4, weights, ...)
## S3 method for class 'describe':
print(x, condense=TRUE, ...)
## S3 method for class 'describe':
latex(object, title=NULL, condense=TRUE,
      file=paste('describe',first.word(expr=attr(object,'descript')),'tex',sep='.'),
      append=FALSE, size='small', tabular=TRUE, ...)
## S3 method for class 'describe.single':
latex(object, title=NULL, condense=TRUE, vname,
      file, append=FALSE, size='small', tabular=TRUE, ...)

Arguments

x

a data frame, matrix, vector, or formula. For a data frame, the describe.data.frame function is automatically invoked. For a matrix, describe.matrix is called. For a formula, describe.data.frame(model.frame(x)) is invoked.

descript

optional title to print for x. The default is the name of the argument or the "label" attributes of individual variables. When the first argument is a formula, descript defaults to a character representation of the formula.

exclude.missing

set to TRUE to print the names of variables that contain only missing values. This list appears at the bottom of the printout, and no space is taken up for such variables in the main listing.

digits

number of significant digits to print

weights

a numeric vector of frequencies or sample weights. Each observation will be treated as if it were sampled weights times.

normwt

The default, normwt=FALSE, results in the use of weights as weights in computing various statistics. In this case the sample size is assumed to be equal to the sum of weights. Specify normwt=TRUE

object

a result of describe

title

unused

condense

default is TRUE to condense the output with regard to the 5 lowest and highest values and the frequency table

data

subset

na.action

These are used if a formula is specified. na.action defaults to na.retain which does not delete any NAs from the data frame. Use na.action=na.omit or na.delete to drop any observation w

...

arguments passed to describe.default which are passed to calls to format for numeric variables. For example if using R POSIXct or Date date/time formats, specifying describe(d,format='%d%b%y

file

name of output file (should have a suffix of .tex). Default name is formed from the first word of the descript element of the describe object, prefixed by "describe". Set file="" to send LaTeX code to

append

set to TRUE to have latex append text to an existing file named file

size

LaTeX text size ("small", the default, or "normalsize", "tiny", "scriptsize", etc.) for the describe output in LaTeX.

tabular

set to FALSE to use verbatim rather than tabular environment for the summary statistics output. By default, tabular is used if the output is not too wide.

vname

unused argument in latex.describe.single

Value

a list containing elements descript, counts, values. The list is of class describe. If the input object was a matrix or a data frame, the list is a list of lists, one list for each variable analyzed. latex returns a standard latex object. For numeric variables having at least 20 unique values, an additional component intervalFreq is included. This component is a list with two elements, range (containing two values) and count, a vector of 100 integer frequency counts.

Details

If options(na.detail.response=TRUE) has been set and na.action is "na.delete" or "na.keep", summary statistics on the response variable are printed separately for missing and non-missing values of each predictor. The default summary function returns the number of non-missing response values and the mean of the last column of the response values, with a names attribute of c("N","Mean"). When the response is a Surv object and the mean is used, this will result in the crude proportion of events being used to summarize the response. The actual summary function can be designated through options(na.fun.response = "function name").
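
The behaviour of the default summary function can be sketched in base R. The function summarize_response below is a hypothetical stand-in, not the function Hmisc uses internally. A two-column matrix of made-up times and event indicators stands in for a Surv object, so the mean of the last column is the crude proportion of events.

```r
# Hypothetical stand-in for the default summary: number of non-missing
# responses and mean of the last response column
summarize_response <- function(y) {
  y <- as.matrix(y)
  last <- y[, ncol(y)]
  c(N = sum(!is.na(last)), Mean = mean(last, na.rm = TRUE))
}

# Made-up data: a (time, status) response and a predictor with NAs
time   <- c(5, 8, 12, 3, 9, 7)
status <- c(1, 0, 1, 1, 0, 1)
age    <- c(40, NA, 55, NA, 61, 48)
resp   <- cbind(time, status)

# Summarize the response separately for missing and non-missing age
by_missing <- lapply(
  split(seq_along(age), is.na(age)),
  function(i) summarize_response(resp[i, , drop = FALSE])
)
by_missing
```

The list element named "TRUE" holds the summary for observations where age is missing, and "FALSE" holds the summary for the rest.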

Examples

set.seed(1)
describe(runif(200),digits=2) #single variable, continuous
                              #get quantiles .05,.10,...

dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE))
describe(dfr)

d <- sas.get(".","mydata",special.miss=TRUE,recode=TRUE)
describe(d)      #describe entire data frame
attach(d, 1)
describe(relig)  #Has special missing values .D .F .M .R .T
                 #attr(relig,"label") is "Religious preference"

#relig : Religious preference  Format:relig
#    n missing  D  F M R T unique 
# 4038     263 45 33 7 2 1      8
#
#0:none (251, 6%), 1:Jewish (372, 9%), 2:Catholic (1230, 30%) 
#3:Jehovah's Witnes (25, 1%), 4:Christ Scientist (7, 0%) 
#5:Seventh Day Adv (17, 0%), 6:Protestant (2025, 50%), 7:other (111, 3%) 


# Method for describing part of a data frame:
 describe(death.time ~ age*sex + rcs(blood.pressure))
 describe(~ age+sex)
 describe(~ age+sex, weights=freqs)  # weighted analysis

 fit <- lrm(y ~ age*sex + log(height))
 describe(formula(fit))
 describe(y ~ age*sex, na.action=na.delete)   
# report on number deleted for each variable
 options(na.detail.response=TRUE)  
# keep missings separately for each x, report on dist of y by x=NA
 describe(y ~ age*sex)
 options(na.fun.response="quantile")
 describe(y ~ age*sex)   # same but use quantiles of y by x=NA

 d <- describe(my.data.frame)
 d$age                   # print description for just age
 d[c('age','sex')]       # print description for two variables
 d[sort(names(d))]       # print in alphabetic order by var. names
 d2 <- d[20:30]          # keep variables 20-30
 page(d2)                # pop-up window for these variables

# Test date/time formats and suppression of times when they don't vary
 library(chron)
 d <- data.frame(a=chron((1:20)+.1),
                 b=chron((1:20)+(1:20)/100),
                 d=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                               hour=rep(11,20),min=rep(17,20),sec=rep(11,20)),
                 f=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                               hour=1:20,min=1:20,sec=1:20),
                 g=ISOdate(year=2001:2020,month=rep(3,20),day=1:20))
 describe(d)
