univariateTable: Univariate table

Description

Categorical variables are summarized using counts and frequencies.

Usage

univariateTable(formula, data = parent.frame(),
  summary.format = "mean(x) (sd(x))", Q.format = "median(x) [iqr(x)]",
  freq.format = "count(x) (percent(x))", column.percent = TRUE,
  digits = c(1, 1, 3), short.groupnames, compare.groups = TRUE,
  show.totals = TRUE, n = "inNames", outcome = NULL, na.rm = FALSE, ...)

Arguments

formula

Formula specifying the grouping variable (strata) on the left-hand side (can be omitted) and, on the right-hand side, the variables for which to obtain (descriptive) statistics.

data

Data set in which formula is evaluated

summary.format

Format for the numeric (non-factor) variables. Default is mean (SD). If different formats are desired, either the special Q() can be used, or the function can be called multiple times and the results combined with rbind. See examples.

Q.format

Format for the quantile summary of numeric variables. Default is median (interquartile range).

freq.format

Format for categorical variables. Default is count (percentage).

column.percent

Logical, if TRUE and the default freq.format is used then column percentages are given instead of row percentages for categorical variables (factors).

digits

Number of digits

short.groupnames

If TRUE group names are abbreviated.

compare.groups

Method used to compare groups. If "logistic" and there are exactly two groups, logistic regression is used instead of t-tests and Wilcoxon rank tests to compare numeric variables across groups.

show.totals

If TRUE show a column with totals.

n

If TRUE, show the number of subjects as a separate row. If equal to "inNames", show the numbers in parentheses in the column names. If FALSE, do not show the number of subjects.

outcome

Outcome data used to calculate p-values when compare.groups is 'logistic' or 'cox'.

na.rm

If TRUE remove missing values from categorical variables when calculating p-values.

...

Saved as part of the result to be passed on to labelUnits.
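
To make the column.percent argument above concrete, the following base R sketch contrasts column and row percentages on a small invented data set. It uses only table and prop.table and is not how univariateTable computes its output internally; the data frame toy and its variables are assumptions made up for illustration, with the grouping variable placed in the columns.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  group = factor(c("A", "A", "A", "B", "B")),
  sex   = factor(c("female", "male", "male", "female", "female"))
)

# Cross-tabulation: levels of the variable in rows, groups in columns
tab <- table(sex = toy$sex, group = toy$group)

# Column percentages: within each group, the levels of sex sum to 100
col_pct <- 100 * prop.table(tab, margin = 2)

# Row percentages: within each level of sex, the groups sum to 100
row_pct <- 100 * prop.table(tab, margin = 1)

print(round(col_pct, 1))
print(round(row_pct, 1))
```

Column percentages answer "what share of group A is female?", whereas row percentages answer "what share of the females is in group A?".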

Value

A list with one summary table element for each variable on the right-hand side of formula. The summary tables can be combined with rbind. The function summary.univariateTable combines the tables and shows p-values in a custom format.

Details

This function can generate the baseline demographic characteristics that form Table 1 in many publications. It is also useful for generating other tables of univariate statistics.

The result of the function is an object (a list) which contains the various data generated. In most applications the summary function should be applied, which generates a data.frame with a (nearly) publication-ready table. Standard manipulation can be used to modify, add, or remove columns or rows. For users not accustomed to R, the generated table can be exported to a text file that other software can read, e.g., via write.csv(table,file="path/to/results/table.csv")
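
The export workflow described above could look roughly like the following sketch. It assumes the Publish package (which provides univariateTable and the Diabetes data) is installed; the output location under tempdir() and the object names u, tab, and out are placeholders chosen for illustration.

```r
library(Publish)
data(Diabetes)

# Build the table object (a list), then turn it into a data.frame-like summary
u   <- univariateTable(location ~ age + gender, data = Diabetes)
tab <- summary(u, show.pvalues = FALSE)

# Write the summary to a CSV file; tempdir() is only a placeholder location
out <- file.path(tempdir(), "table1.csv")
write.csv(tab, file = out, row.names = FALSE)
```

Note that the base R argument is row.names (with a dot); setting it to FALSE keeps the row index out of the exported file.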

Continuous variables are summarized by means and standard deviations. Deviations from the above defaults are obtained when the arguments summary.format and freq.format are combined with suitable summary functions.
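
As a rough illustration of what the default format strings stand for, the sketch below computes the same kinds of quantities with base R on a small invented data set. It does not call univariateTable and is not its internal implementation; the data frame toy is made up, and the exact layout printed by the package (in particular for the interquartile range) may differ.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  age = c(20, 30, 40, 50),
  sex = factor(c("female", "male", "female", "female"))
)

# Analogue of summary.format = "mean(x) (sd(x))", one digit
mean_sd <- sprintf("%.1f (%.1f)", mean(toy$age), sd(toy$age))

# Analogue of Q.format = "median(x) [iqr(x)]", shown here as [Q1, Q3]
q <- quantile(toy$age, probs = c(0.25, 0.5, 0.75))
median_iqr <- sprintf("%.1f [%.1f, %.1f]", q[2], q[1], q[3])

# Analogue of freq.format = "count(x) (percent(x))"
counts <- table(toy$sex)
count_pct <- sprintf("%d (%.1f)", as.integer(counts),
                     100 * as.integer(counts) / sum(counts))
names(count_pct) <- names(counts)

print(mean_sd)
print(median_iqr)
print(count_pct)
```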

Examples

# NOT RUN {
data(Diabetes)
univariateTable(~age,data=Diabetes)
univariateTable(~gender,data=Diabetes)
univariateTable(~age+gender+ height+weight,data=Diabetes)
## same thing but less typing
utable(~age+gender+ height+weight,data=Diabetes)

## summary by location: 
univariateTable(location~Q(age)+gender+height+weight,data=Diabetes)
## continuous variables marked with Q() are (by default) summarized
## with median (IQR) and kruskal.test (with two groups equivalent to wilcox.test)
## variables not marked with Q() are (by default) summarized
## with mean (sd) and anova.glm(...,test="Chisq")
## the p-value of anova.glm with only two groups is similar
## but not exactly equal to that of a t.test
## categorical variables are (by default) summarized by count
## (percent) and anova.glm(...,family=binomial,test="Chisq")

## export result to csv
table1 = summary(univariateTable(location~age+gender+height+weight,data=Diabetes),
show.pvalues=FALSE)
# write.csv(table1,file="~/table1.csv",row.names=FALSE)

## change labels and values
utable(location~age+gender+height+weight,data=Diabetes,
       age="Age (years)",gender="Sex",
       gender.female="Female",
       gender.male="Male",
       height="Body height (inches)",
       weight="Body weight (pounds)")

## Use quantiles and rank tests for some variables and mean and standard deviation for others
univariateTable(gender~Q(age)+location+Q(BMI)+height+weight,
                data=Diabetes)

## Factor with more than 2 levels
Diabetes$AgeGroups <- cut(Diabetes$age,
                          c(19,29,39,49,59,69,92),
                          include.lowest=TRUE)
univariateTable(location~AgeGroups+gender+height+weight,
                data=Diabetes)

## Row percent
univariateTable(location~gender+age+AgeGroups,
                data=Diabetes,
                column.percent=FALSE)

## change of frequency format
univariateTable(location~gender+age+AgeGroups,
                data=Diabetes,
                column.percent=FALSE,
                freq.format="percent(x) (n=count(x))")

## changing Labels
u <- univariateTable(location~gender+AgeGroups+ height + weight,
                     data=Diabetes,
                     column.percent=TRUE,
                     freq.format="count(x) (percent(x))")
summary(u,"AgeGroups"="Age (years)","height"="Height (inches)")

## more than two groups
Diabetes$frame=factor(Diabetes$frame,levels=c("small","medium","large"))
univariateTable(frame~gender+BMI+age,data=Diabetes)

Diabetes$sex=as.numeric(Diabetes$gender)
univariateTable(frame~sex+gender+BMI+age,
                data=Diabetes,freq.format="count(x) (percent(x))")

## multiple summary formats
## suppose we want for some reason mean (range) for age
## and median (range) for BMI.
## method 1:
univariateTable(frame~Q(age)+BMI,
                data=Diabetes,
                Q.format="mean(x) (range(x))",
                summary.format="median(x) (range(x))")
## method 2:
u1 <- summary(univariateTable(frame~age,
                              data=na.omit(Diabetes),
                              summary.format="mean(x) (range(x))"))
u2 <- summary(univariateTable(frame~BMI,
                              data=na.omit(Diabetes),
                              summary.format="median(x) (range(x))"))
publish(rbind(u1,u2),digits=2)

# }