rockchalk (version 1.8.111)

summarize: Sorts numeric from discrete variables and returns separate summaries for those types of variables.

Description

The work is done by the functions summarizeNumerics and summarizeFactors. Please see the help pages for those functions for complete details.

Usage

summarize(dat, alphaSort = FALSE, stats = TRUE, digits = 2, ...)

Arguments

dat

A data frame

alphaSort

If TRUE, the columns are re-organized in alphabetical order. If FALSE, they are presented in the original order.

stats

A vector of desired summary statistics. Can be TRUE to select defaults. See summarizeNumerics and summarizeFactors for details. TRUE implies, for numeric variables, c("min", "med", "max", "mean", "sd", "skewness", "kurtosis") and, for discrete variables, c("entropy", "normedEntropy"). All summaries will include "nobs" and "nmiss". "nobs" is the number of observations with non-missing, finite scores (not NA, NaN, -Inf, or Inf). "nmiss" is the number of cases with values of NA.

digits

Number of decimal places to display. Defaults to 2.

...

Optional arguments that are passed to summarizeNumerics and summarizeFactors. For numeric variables, one can specify probs, na.rm and unbiased. If probs is unspecified, the default is probs = c(0, .50, 1.0), which are labeled in output as c("min", "med", "max"). For discrete variables (factors, ordered, logical, character), the argument is maxLevels, which determines the number of levels that will be reported in tables for discrete variables.
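The statistics named above can be approximated in base R. The following is a minimal sketch, not the package's implementation: the toy vectors are made up, and the base-2 logarithm and the division by log2 of the number of levels are assumptions about how "entropy" and "normedEntropy" are scaled. The help pages for summarizeNumerics and summarizeFactors remain the authority.

```r
# Toy numeric vector with one NA and one infinite value (illustrative data).
x <- c(2, 4, 6, NA, Inf, 8)

# "nobs" counts non-missing, finite scores; "nmiss" counts NA values.
# Note that base R's is.na() also returns TRUE for NaN.
n_obs  <- sum(is.finite(x))
n_miss <- sum(is.na(x))

# probs = c(0, 0.5, 1) yields the quantiles labeled "min", "med", "max".
x_finite <- x[is.finite(x)]
qs <- quantile(x_finite, probs = c(0, 0.5, 1), names = FALSE)
names(qs) <- c("min", "med", "max")

# Diversity of a discrete variable: Shannon entropy of the level proportions.
# Assumption: base-2 logs, normalized by the maximum possible entropy.
f <- factor(c("a", "a", "b", "c"))
p <- as.numeric(prop.table(table(f)))
p <- p[p > 0]                       # empty levels contribute nothing
entropy <- -sum(p * log2(p))
normed_entropy <- entropy / log2(nlevels(f))

print(qs)
print(c(nobs = n_obs, nmiss = n_miss))
print(c(entropy = entropy, normedEntropy = normed_entropy))
```

Under these assumptions a normed entropy near 1 means the observations are spread almost evenly across levels, and a value near 0 means one level dominates.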

Value

The on-screen output will have 2 sections: a stylized display of numeric variables and one small display for each factor. The return value is a list with three objects: 1) numerics, a data frame with variable names on rows and summary stats on columns; 2) factors, a list with summary information about each discrete variable; 3) numericsfmt, a character matrix that is the 'beautified' display of the numerics data frame. In order to preserve the style of R's summary function, this character matrix has variable names on the columns and summary stats on the rows.

Details

The major purpose here is to generate a summary data structure that is more useful in subsequent data analysis. The numeric portion of the summaries is a data frame that can be used in plots or other diagnostics.

The term "factors" was used, but "discrete variables" would have been more accurate. The factor summaries will collect all logical, factor, ordered, and character variables.

Other variable types, such as Dates, will be ignored, with a warning.
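The sorting of columns described in these paragraphs can be pictured with a short base R sketch. This is an illustration only, not the code used by summarize: the data frame and the helper names is_num, is_disc and ignored are hypothetical, and the exact class tests the package applies may differ.

```r
# Hypothetical data frame mixing numeric, discrete and Date columns.
dat <- data.frame(
  z   = c(1.5, 2.5, 3.5),
  g   = factor(c("a", "b", "a")),
  ok  = c(TRUE, FALSE, TRUE),
  txt = c("x", "y", "z"),
  day = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
  stringsAsFactors = FALSE
)

# Numeric columns go to the numeric summary.
is_num <- vapply(dat, is.numeric, logical(1))

# Logical, factor, ordered and character columns are treated as discrete.
# is.factor() is TRUE for ordered factors as well.
is_disc <- vapply(
  dat,
  function(v) is.factor(v) || is.logical(v) || is.character(v),
  logical(1)
)

numerics  <- dat[, is_num, drop = FALSE]
discretes <- dat[, is_disc, drop = FALSE]

# Anything else (here, the Date column) is set aside with a warning.
ignored <- names(dat)[!(is_num | is_disc)]
if (length(ignored) > 0) {
  warning("Ignored columns: ", paste(ignored, collapse = ", "), call. = FALSE)
}

print(names(numerics))
print(names(discretes))
```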

Examples

# NOT RUN {
library(rockchalk)


set.seed(23452345)
N <- 100
x1 <- gl(12, 2, labels = LETTERS[1:12])
x2 <- gl(8, 3, labels = LETTERS[12:24])
x1 <- sample(x = x1, size=N, replace = TRUE)
x2 <- sample(x = x2, size=N, replace = TRUE)
z1 <- rnorm(N)
a1 <- rnorm(N, mean = 1.2, sd = 11.7)
a2 <- rpois(N, lambda = 10 + a1)
a3 <- rgamma(N, 0.5, 4)
b1 <- rnorm(N, mean = 211.3, sd = 0.4)
dat <- data.frame(z1, a1, x2, a2, x1, a3, b1)
summary(dat)

summarize(dat)

summarize(dat, digits = 2)

summarize(dat, 
          probs = c(0, 0.20, 0.50),
          stats = c("mean", "entropy"))

## Only quantile values, no summary stats for numeric variables
## Discrete variables get entropy
summarize(dat, 
          probs = c(0, 0.25, 0.50, 0.75, 1.0),
          stats = "entropy", digits = 2)

## Quantiles and the mean for numeric variables.
## No diversity stats for discrete variables (entropy omitted)
summarize(dat, 
          probs = c(0, 0.25, 0.50, 0.75, 1.0),
          stats = "mean")


## Returns unrounded data frame, with
## variable names on rows, summary stats on columns
summarizeNumerics(dat)

summarizeFactors(dat, maxLevels = 5)

## See actual values of factor summaries, without
## beautified printing
unclass(summarizeFactors(dat, maxLevels = 5))

summarize(dat, alphaSort = TRUE) 

summarize(dat, digits = 6, alphaSort = FALSE)


summarize(dat, maxLevels = 2)

datsumm <- summarize(dat, stats = c("mean", "sd", "var"), props = TRUE)

## Unbeautified numeric data frame, variables on the rows
datsumm[["numerics"]]
## Beautified versions. 1. Show the saved version:
datsumm[["numericsfmt"]]
## 2. Run formatNumericSummaries to re-specify digits:
formatNumericSummaries(datsumm[["numerics"]], digits = 10)

datsumm[["factors"]]

datsummNT <- datsumm[["numerics"]]

plot(datsummNT$mean, datsummNT$var, xlab = "The Means",
    ylab = "The Variances")

plot(datsummNT$mean, datsummNT$var, xlab = "The Means",
    ylab = "The Variances", type = "n")
text(datsummNT$mean, jitter(datsummNT$var), labels = rownames(datsummNT))
## problem with name overlap.
# }