summarizeNumerics
for a function that
handles numeric variables.) It then treats all non-numeric
variables as if they were factors, and summarizes each. Compared to
R's default summary, the main benefits are 1) more summary
information is returned for each variable (entropy estimates of
dispersion), and 2) the columns in the output are alphabetized. To
prevent alphabetization, use alphaSort = FALSE.

summarizeFactors(dat = NULL, maxLevels = 5, alphaSort = TRUE,
  sumstat = TRUE, digits = max(3, getOption("digits") - 3))
Concerning the use of entropy as a diversity index, the user might consult Balch(). For each possible outcome category, let p represent the observed proportion of cases. The diversity contribution of each category is -p * log2(p). Note that if p is either 0 or 1, the diversity contribution is 0 (for p = 0, by the usual convention that 0 * log2(0) is taken to be 0). The sum of those diversity contributions across possible outcomes is the entropy estimate. Entropy has a lower bound of 0, but it has no upper bound that is independent of the number of possible categories. If m is the number of categories, the maximum possible value of entropy is -log2(1/m), which equals log2(m) and is reached when all m categories are equally frequent.
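The calculation described above can be sketched in a few lines of base R. This is only an illustration of the formula, not the code used inside summarizeFactors; the helper name entropy_sketch is hypothetical, and dropping categories with zero observed proportion is an assumption that implements the "contribution is 0" convention.

```r
# Hypothetical helper, not part of the package: base-2 entropy of a
# categorical variable, following the description above.
entropy_sketch <- function(x) {
  p <- prop.table(table(x))   # observed proportion of cases per category
  p <- p[p > 0]               # categories with p = 0 contribute nothing
  -sum(p * log2(p))           # sum of the -p * log2(p) contributions
}

x_even <- c("A", "B", "C", "D")   # four equally frequent categories
x_skew <- c("A", "A", "A", "B")   # proportions 0.75 and 0.25

entropy_sketch(x_even)            # the maximum for m = 4, -log2(1/4) = 2
entropy_sketch(x_skew)            # smaller, because the cases are concentrated
```

Note that table() drops missing values by default, so in this sketch the proportions are taken over the non-missing cases only.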
Because the maximum value of entropy depends on the number of possible categories, some scholars wish to re-scale it so as to put the values on a common numeric scale. The normed entropy is calculated as the observed entropy divided by the maximum possible entropy. Normed entropy takes on values between 0 and 1, so in a sense, its values are more easily comparable. However, the comparison is something of an illusion, since variables with the same number of categories will always be comparable by their entropy, whether it is normed or not.
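A minimal sketch of the normed entropy, again in base R, may make the re-scaling concrete. The helper name normed_entropy_sketch is hypothetical, and treating the factor levels as the set of possible categories is an assumption of this sketch; it is not necessarily how summarizeFactors counts categories.

```r
# Hypothetical helper, not part of the package:
# normed entropy = observed entropy / maximum possible entropy.
normed_entropy_sketch <- function(x) {
  x <- factor(x)                   # levels define the possible categories
  m <- nlevels(x)                  # number of possible categories
  if (m < 2) return(NA_real_)      # maximum entropy is 0, ratio undefined
  p <- prop.table(table(x))
  p <- p[p > 0]                    # empty categories contribute nothing
  -sum(p * log2(p)) / log2(m)      # log2(m) is the same as -log2(1/m)
}

two_cat  <- c("A", "B")              # even split over 2 categories
four_cat <- c("A", "B", "C", "D")    # even split over 4 categories

normed_entropy_sketch(two_cat)       # raw entropy 1, normed to 1
normed_entropy_sketch(four_cat)      # raw entropy 2, also normed to 1
```

Both variables sit at their maximum, so norming maps them to the same value even though their raw entropies differ. For two variables with the same number of categories, dividing by the same constant does not change their ordering.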
Warning: Variables of class POSIXt will be ignored. This will be fixed in the future. The function works perfectly well with numeric, factor, or character variables. Other more elaborate structures are likely to be trouble.
Shannon, Claude E. (1949). The Mathematical Theory of Communication. Urbana: University of Illinois Press.
summarizeFactors
and summarizeNumerics
set.seed(21234)
x <- runif(1000)
xn <- ifelse(x < 0.2, 0, ifelse(x < 0.6, 1, 2))
xf <- factor(xn, levels = c(0, 1, 2), labels = c("A", "B", "C"))
dat <- data.frame(xf, xn, x)
summarizeFactors(dat)
## See the help for summarize for more examples