dfSummary: Data frame Summary

Description

Summary of a data frame consisting of: variable names, labels if any, factor levels, frequencies and/or numerical summary statistics, and valid/missing observation counts.

Usage

dfSummary(
  x,
  round.digits = 1,
  varnumbers = st_options("dfSummary.varnumbers"),
  labels.col = st_options("dfSummary.labels.col"),
  valid.col = st_options("dfSummary.valid.col"),
  na.col = st_options("dfSummary.na.col"),
  graph.col = st_options("dfSummary.graph.col"),
  graph.magnif = st_options("dfSummary.graph.magnif"),
  style = st_options("dfSummary.style"),
  plain.ascii = st_options("plain.ascii"),
  justify = "l",
  col.widths = NA,
  headings = st_options("headings"),
  display.labels = st_options("display.labels"),
  max.distinct.values = 10,
  trim.strings = FALSE,
  max.string.width = 25,
  split.cells = 40,
  split.tables = Inf,
  tmp.img.dir = st_options("tmp.img.dir"),
  silent = st_options("dfSummary.silent"),
  ...
)

Arguments

x

A data frame.

round.digits

Number of significant digits to display. Defaults to 1.

varnumbers

Logical. Should the first column contain variable numbers? Defaults to TRUE. Can be set globally; see st_options, option “dfSummary.varnumbers”.

labels.col

Logical. If TRUE, variable labels (as defined with rapportools, Hmisc or summarytools' label functions) will be displayed. TRUE by default, but the labels column is only shown if at least one column has a defined label. This option can also be set globally; see st_options, option “dfSummary.labels.col”.

valid.col

Logical. Include column indicating count and proportion of valid (non-missing) values. TRUE by default, but can be set globally; see st_options, option “dfSummary.valid.col”.

na.col

Logical. Include column indicating count and proportion of missing (NA) values. TRUE by default, but can be set globally; see st_options, option “dfSummary.na.col”.

graph.col

Logical. Display barplots / histograms column in html reports. TRUE by default, but can be set globally; see st_options, option “dfSummary.graph.col”.

graph.magnif

Numeric. Magnification factor, useful if the graphs show up too large (then use a value < 1) or too small (use a value > 1). Must be positive. Defaults to 1. Can be set globally; see st_options, option “dfSummary.graph.magnif”.

style

Style to be used by pander when rendering the output table. Defaults to “multiline”. The only other valid option is “grid”. Style “simple” is not supported for this particular function, and “rmarkdown” will fall back to “multiline”.

plain.ascii

Logical. pander argument; when TRUE, no markup characters will be used (useful when printing to console). Defaults to TRUE. Set to FALSE when rendering markdown. To change the default value globally, see st_options.

justify

String indicating alignment of columns; one of “l” (left), “c” (center), or “r” (right). Defaults to “l”.

col.widths

Numeric or character. Vector of column widths. If numeric, values are assumed to be numbers of pixels. Otherwise, any CSS-supported units can be used. NA by default, meaning widths are calculated automatically.

headings

Logical. Set to FALSE to omit headings. To change this default value globally, see st_options.

display.labels

Logical. Should data frame label be displayed in the title section? Default is TRUE. To change this default value globally, see st_options.

max.distinct.values

The maximum number of values to display frequencies for. If variable has more distinct values than this number, the remaining frequencies will be reported as a whole, along with the number of additional distinct values. Defaults to 10.

trim.strings

Logical; for character variables, should leading and trailing white space be removed? Defaults to FALSE. See details section.

max.string.width

Limits the number of characters to display in the frequency tables. Defaults to 25.

split.cells

A numeric argument passed to pander. It is the number of characters allowed on a line before splitting the cell. Defaults to 40.

split.tables

pander argument which determines the maximum width of a table. Keeping the default value (Inf) is recommended.

tmp.img.dir

Character. Directory used to store temporary images when rendering dfSummary() with `method = "pander"`, `plain.ascii = FALSE` and `style = "grid"`. See Details.

silent

Logical. Hide console messages. FALSE by default. To change this value globally, see st_options.

…

Additional arguments passed to pander.

Value

A data frame with additional class summarytools containing as many rows as there are columns in x, with attributes to inform print method. Columns in the output data frame are:

No: Number indicating the order in which the column appears in the data frame.
Variable: Name of the variable, along with its class(es).
Label: Label of the variable (if applicable).
Stats / Values: For factors, a list of their values, limited by the max.distinct.values parameter. For character variables, the most common values (in descending frequency order), also limited by max.distinct.values. For numerical variables, common univariate statistics (mean, standard deviation, minimum, median, maximum, IQR and CV).
Freqs (% of Valid): For factors and character variables, the frequencies and proportions of the values listed in the previous column. For numerical vectors, number of distinct values, or frequency of distinct values if their number is not greater than max.distinct.values.
Text Graph: An ASCII histogram for numerical variables, and an ASCII barplot for factors and character variables.
Valid: Number and proportion of valid values.
Missing: Number and proportion of missing (NA and NaN) values.
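
As a quick way to check the structure described above, the following sketch inspects the returned object. It assumes summarytools is installed and uses the built-in iris data as a stand-in for any data frame. Passing graph.col = FALSE is only a choice made here to avoid generating graphs; the exact column names may vary between package versions.

```r
library(summarytools)

# Assumption: iris stands in for any data frame
res <- dfSummary(iris, graph.col = FALSE)

class(res)                 # expected to include "summarytools" and "data.frame"
nrow(res) == ncol(iris)    # one row per column of the input data frame
names(res)                 # column names of the summary table
```

Because the result is itself a data frame, it can be subset or inspected with the usual base R tools before printing.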

Details

The default plain.ascii = TRUE option is there to make results appear cleaner in the console. When rendering with rmarkdown, set this option to FALSE.

When trim.strings is set to TRUE, trimming is done before calculating frequencies, so those will be impacted accordingly.
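
The effect of trimming before counting can be illustrated with base R alone. This is a minimal sketch of the behavior described above, not the package's internal code; the vector x is hypothetical.

```r
# Hypothetical character vector with stray leading/trailing spaces
x <- c("yes", " yes", "yes ", "no", "no ")

# Without trimming, each spelling counts as a separate value
raw_counts <- table(x)

# Trimming first merges values that differ only by surrounding white space
trimmed_counts <- table(trimws(x))

length(raw_counts)       # 5 distinct values
trimmed_counts[["yes"]]  # 3
trimmed_counts[["no"]]   # 2
```

With untrimmed data, a variable can therefore appear to have more distinct values than it really does, which also affects how many of them fit within max.distinct.values.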

Specifying tmp.img.dir makes it possible to produce results consistent with pandoc styling while also showing png graphs. Because Pandoc determines column widths from the length of the cell contents, even when the content is merely a link to an image, the standard R temporary directory cannot be used to store the images: its path is too long. A shorter path is needed. On macOS and Linux, “/tmp” is a sensible choice, since this directory is cleaned up automatically on a regular basis. On Windows, there is no such convenient directory, so the user has to choose a directory and clean up the temporary images manually after the document has been rendered. Providing a relative path such as “img” is recommended. The maximum length for this parameter is 5 characters. It can be set globally using st_options; for example: st_options(tmp.img.dir = ".").

Examples

# NOT RUN {
library(summarytools)
data("tobacco")
saved_x11_option <- st_options("use.x11")
st_options(use.x11 = FALSE)
dfSummary(tobacco)

# Exclude some columns
dfSummary(tobacco, varnumbers = FALSE, valid.col = FALSE)

# Limit number of categories to be displayed for factors / categorical data
dfSummary(tobacco, max.distinct.values = 5, style = "grid")

# Using stby()
stby(tobacco, tobacco$gender, dfSummary)

st_options(use.x11 = saved_x11_option)

# }
# NOT RUN {
# Show in Viewer or browser (view: no capital V!)
view(dfSummary(iris))

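# Rmarkdown-ready: "grid" style with markup enabled (plain.ascii = FALSE);
# png graphs are written to the short relative path given in tmp.img.dir
dfSummary(tobacco, style = "grid", plain.ascii = FALSE,
          varnumbers = FALSE, valid.col = FALSE, tmp.img.dir = "./img")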

# Using group_by() (requires dplyr for group_by() and %>%)
library(dplyr)
tobacco %>% group_by(gender) %>% dfSummary()
# }