summarizeNumerics for a function that handles numeric variables.) It then treats all non-numeric variables as if they were factors and summarizes each. Compared to R's default summary, the main benefits are 1) more summary information is returned for each variable (entropy estimates of dispersion), and 2) the columns in the output are alphabetized. To prevent alphabetization, use alphaSort = FALSE.

summarizeFactors(dat = NULL, maxLevels = 5, alphaSort = TRUE, sumstat = TRUE,
    digits = max(3, getOption("digits") - 3))
Concerning the use of entropy as a diversity index, the user might consult Balch(). For each possible outcome category, let p represent the observed proportion of cases. The diversity contribution of each category is -p * log2(p). Note that if p is either 0 or 1, the diversity contribution is 0 (for p = 0, by the usual convention that 0 * log2(0) is taken to be 0). The sum of those diversity contributions across possible outcomes is the entropy estimate. Entropy has a lower bound of 0, but there is no upper bound that is independent of the number of possible categories. If m is the number of categories, the maximum possible value of entropy is -log2(1/m), which equals log2(m) and is reached when all categories are equally likely.
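As a minimal sketch of the calculation described above, the following base R function computes the entropy of a categorical vector. The name entropy_sketch is hypothetical and is not part of the package; it only illustrates the formula and makes no claim about how summarizeFactors computes its own estimate.

```r
# Hypothetical helper, not part of rockchalk: entropy of a categorical vector.
entropy_sketch <- function(x) {
  p <- prop.table(table(x))   # observed proportion of cases in each category
  p <- p[p > 0]               # a category with p = 0 contributes 0
  -sum(p * log2(p))           # sum of the -p * log2(p) contributions
}

x3 <- factor(c("A", "A", "B", "C"), levels = c("A", "B", "C"))
entropy_sketch(x3)                          # 1.5
entropy_sketch(factor(c("A", "B", "C")))    # log2(3), the maximum for m = 3
entropy_sketch(factor(rep("A", 4)))         # 0, all cases in one category
```

Dropping the empty categories before taking logs avoids the NaN that 0 * log2(0) would otherwise produce in R.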
Because the maximum value of entropy depends on the number of possible categories, some scholars wish to re-scale so as to bring the values into a common numeric scale. The normed entropy is calculated as the observed entropy divided by the maximum possible entropy. Normed entropy takes on values between 0 and 1, so in a sense, its values are more easily comparable. However, the comparison is something of an illusion, since variables with the same number of categories will always be comparable by their entropy, whether it is normed or not.
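The rescaling can be illustrated with a short base R sketch. The function name normed_entropy_sketch is hypothetical and not part of the package, and it assumes that the "possible categories" are the declared levels of the factor, including levels with no observed cases.

```r
# Hypothetical helper, not part of rockchalk: observed entropy divided by
# the maximum possible entropy for the number of declared categories.
normed_entropy_sketch <- function(x) {
  if (!is.factor(x)) x <- factor(x)   # keep declared levels of existing factors
  m <- nlevels(x)                     # number of possible categories
  if (m < 2) return(NA_real_)         # maximum is 0 when m = 1; ratio undefined
  p <- prop.table(table(x))
  p <- p[p > 0]                       # empty categories contribute 0
  h <- -sum(p * log2(p))              # observed entropy
  h / log2(m)                         # log2(m) equals -log2(1/m)
}

normed_entropy_sketch(factor(c("A", "B")))             # 1
normed_entropy_sketch(factor(c("A", "B", "C", "D")))   # 1
```

Both uniform variables reach the normed maximum of 1, although their raw entropies are 1 and 2 respectively, which is the sense in which norming puts variables with different numbers of categories on a common scale.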
Shannon, Claude E. (1949). The Mathematical Theory of Communication. Urbana: University of Illinois Press.
summarizeFactors and summarizeNumerics
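library(rockchalk)  # provides summarizeFactors

set.seed(21234)
x <- runif(1000)
# Cut the uniform draws into three groups coded 0, 1, 2
xn <- ifelse(x < 0.2, 0, ifelse(x < 0.6, 1, 2))
# Factor version of xn with levels labeled A, B, C
xf <- factor(xn, levels = c(0, 1, 2), labels = c("A", "B", "C"))
dat <- data.frame(xf, xn, x)
# Only the non-numeric column xf is summarized
summarizeFactors(dat)
## See help for summarize for more examples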