descript: Compute univariate descriptive statistics

Description

Returns univariate data summaries for each variable supplied. For presentation purposes, discrete and continuous variables are treated separately: the former are summarised with count/proportion information, while the latter are passed to a (customizable) list of univariate summary functions. As such, quantitative/continuous variable information is kept distinct in the output, while summaries of discrete variables (e.g., factors and character vectors) are requested with the discrete argument.

Usage

descript(df, funs = get_descriptFuns(), discrete = FALSE)
get_descriptFuns()

Arguments

df

typically a data.frame or tibble-like structure containing the variables of interest

Note that factor and character vectors will be treated as discrete observations, and by default are omitted from the computation of the quantitative descriptive statistics specified in funs. However, setting discrete = TRUE will provide count-type information for these discrete variables, in which case the funs argument is ignored.

funs

functions to apply when discrete = FALSE. Can be modified by the user to include or exclude further functions; however, each supplied function must return a scalar. Use get_descriptFuns() to return the full list of functions, which may then be augmented or subsetted based on the user's requirements. The default descriptive statistics returned are:

n

number of non-missing observations

mean

mean

trim

trimmed mean (10%)

sd

standard deviation

skew

skewness (from e1071)

kurt

kurtosis (from e1071)

min

minimum

P25

25th percentile (a.k.a., 1st/lower quartile, Q1), returned from quantile

P50

median (50th percentile)

P75

75th percentile (a.k.a., 3rd/upper quartile, Q3), returned from quantile

max

maximum

Note that by default the na.rm behavior is set to TRUE in each function call

discrete

logical; include summary statistics for discrete variables only? If TRUE then only count and proportion information for the discrete variables will be returned. For greater flexibility in creating cross-tabulated count/proportion information see xtabs
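Because each function in funs must return a scalar, it can help to check candidate functions before supplying them. The following is a minimal sketch in base R only; my_funs and is_scalar_fun are hypothetical names used for illustration and are not part of the package.

```r
# Hypothetical candidate summaries; the names are illustrative only
my_funs <- list(
  mean  = function(x) mean(x, na.rm = TRUE),
  iqr   = function(x) IQR(x, na.rm = TRUE),
  range = function(x) range(x, na.rm = TRUE)  # returns two values, so not a scalar
)

# TRUE when the function returns exactly one value for a small numeric vector
is_scalar_fun <- function(f, x = c(1, 2, 3, NA)) length(f(x)) == 1L

# Keep only the functions that satisfy the scalar requirement
ok <- vapply(my_funs, is_scalar_fun, logical(1))
valid_funs <- my_funs[ok]
names(valid_funs)
```

A list filtered this way could then be combined with elements of get_descriptFuns() and passed through the funs argument.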

Details

The purpose of this function is to provide a more pipe-friendly API for selecting and subsetting variables using the dplyr syntax, where conditional statistics are evaluated internally using the by function (when multiple variables are to be summarised). As a special case, if only a single variable is being summarised then the canonical output from dplyr::summarise will be returned.

Conditioning: As the function is intended to support pipe-friendly code specifications, conditioning/group subset specifications are declared using group_by and subsequently passed to descript.
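For readers unfamiliar with by, the sketch below shows the general idea of evaluating summaries within groups using base R alone. It is an illustration under assumptions, not the internal code of descript; summ is a hypothetical helper.

```r
data(mtcars)

# Hypothetical helper: scalar summaries of mpg for one subset of rows
summ <- function(d) {
  c(n    = sum(!is.na(d$mpg)),
    mean = mean(d$mpg, na.rm = TRUE),
    sd   = sd(d$mpg, na.rm = TRUE))
}

# Apply the helper within each level of cyl
out <- by(mtcars, mtcars$cyl, summ)

# Stack the per-group results into a matrix with one row per cyl level
res <- do.call(rbind, out)
res
```

With descript, the equivalent grouping is declared through group_by() before the call rather than by passing an index to by directly.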

Examples


library(dplyr)

data(mtcars)

if(FALSE){
  # run the following to see behavior with NA values in dataset
  mtcars[sample(1:nrow(mtcars), 3), 'cyl'] <- NA
  mtcars[sample(1:nrow(mtcars), 5), 'mpg'] <- NA
}

fmtcars <- within(mtcars, {
  cyl <- factor(cyl)
  am <- factor(am, labels=c('automatic', 'manual'))
  vs <- factor(vs)
})

# with and without factor variables
mtcars |> descript()
fmtcars |> descript()               # factors/discrete vars omitted
fmtcars |> descript(discrete=TRUE)  # discrete variables only

# for discrete variables, xtabs() is generally nicer as cross-tabs can
# be specified explicitly (though can be cumbersome)
xtabs(~ am, fmtcars)
xtabs(~ am, fmtcars) |> prop.table()
xtabs(~ am + cyl + vs, fmtcars)
xtabs(~ am + cyl + vs, fmtcars) |> prop.table()

# usual pipe chaining
fmtcars |> select(mpg, wt) |> descript()
fmtcars |> filter(mpg > 20) |> select(mpg, wt) |> descript()

# conditioning with group_by()
fmtcars |> group_by(cyl) |> descript()
fmtcars |> group_by(cyl, am) |> descript()
fmtcars |> group_by(cyl, am) |> select(mpg, wt) |> descript()

# with single variables, typical dplyr::summarise() output returned
fmtcars |> select(mpg) |> descript()
fmtcars |> group_by(cyl) |> select(mpg) |> descript()
fmtcars |> group_by(cyl, am) |> select(mpg) |> descript()

# discrete variables also work with group_by(), though again
#  xtabs() is generally more flexible
fmtcars |> group_by(cyl) |> descript(discrete=TRUE)
fmtcars |> group_by(am) |> descript(discrete=TRUE)
fmtcars |> group_by(cyl, am) |> descript(discrete=TRUE)

# only return a subset of summary statistics
funs <- get_descriptFuns()
sfuns <- funs[c('n', 'mean', 'sd')] # subset
fmtcars |> descript(funs=sfuns) # only n, mean, and sd

# add new functions
funs2 <- c(sfuns,
           trim_20 = \(x) mean(x, trim=.2, na.rm=TRUE),
           median= \(x) median(x, na.rm=TRUE))
fmtcars |> descript(funs=funs2)
