This function compares two (estimated) distributions of the same categorical variable(s).
comp.prop(p1, p2, n1, n2=NULL, ref=FALSE)
p1: A vector or an array containing relative or absolute frequencies for one or more categorical variables. Usually it is the output of the function xtabs or table.
p2: A vector or an array containing relative or absolute frequencies for the same categorical variable(s), usually the output of the function xtabs or table. If ref = FALSE, p2 is a further estimate of the distribution of the categorical variable(s) being considered. If ref = TRUE, it is the 'reference' distribution (the distribution considered true or a reliable estimate).
n1: The size of the sample on which p1 has been estimated.
n2: The size of the sample on which p2 has been estimated. Required only when ref = FALSE (p2 is estimated on another sample and is not the reference distribution).
ref: Logical. When ref = TRUE, p2 is the reference distribution (true or a reliable estimate of the distribution). When ref = FALSE, it is an estimate of the distribution derived from another sample of size n2.
A list object with two or three components, depending on the argument ref.
A vector with the measures of similarity/dissimilarity between the distributions: dissimilarity index ("tvd"), overlap ("overlap"), Bhattacharyya coefficient ("Bhatt") and Hellinger's distance ("Hell").
A vector with the following values: Pearson's Chi-square ("Pearson"), the degrees of freedom ("df"), the percentile of a Chi-squared distribution ("q0.05") and the largest admissible value of the generalised design effect that would determine the acceptance of H0 (equality of distributions).
The reference distribution. When ref=FALSE, it is the reference distribution estimated from the two samples and used in deriving the Chi-square statistic. When ref=TRUE, it is set equal to the argument p2.
This function computes some similarity or dissimilarity measures between the marginal (or joint) distributions of one or more categorical variables. The following measures are considered:
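Dissimilarity index or total variation distance:
$$\Delta_{12} = \frac{1}{2} \sum_{j=1}^{J} \left| p_{1,j} - p_{2,j} \right|$$
where $p_{s,j}$ is the relative frequency of category $j$ ($j = 1, \ldots, J$) in distribution $s$ ($s = 1, 2$). The index ranges from 0 (the distributions are equal) to 1 (maximum dissimilarity). When p2 is the reference distribution (true or expected distribution under a given hypothesis) then, following Agresti's rule of thumb (Agresti 2002, pp. 329-330), values of $\Delta_{12} \le 0.03$ indicate that the estimated distribution p1 follows the true or expected pattern quite closely.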
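The dissimilarity index is simple enough to compute by hand in base R. The following is a minimal sketch, not the package's own code; the counts and the names counts1, counts2 and tvd are made up for illustration.

```r
# Hypothetical category counts from two samples (illustrative values only)
counts1 <- c(a = 30, b = 50, c = 20)
counts2 <- c(a = 25, b = 45, c = 30)

# Convert absolute frequencies to relative frequencies
p1 <- counts1 / sum(counts1)
p2 <- counts2 / sum(counts2)

# Dissimilarity index (total variation distance):
# half the sum of the absolute differences between the two distributions
tvd <- 0.5 * sum(abs(p1 - p2))
tvd
```

The result can be read as the fraction of units that would have to change category for the two distributions to coincide.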
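Overlap between two distributions:
$$O_{12} = \sum_{j=1}^{J} \min\left( p_{1,j}, p_{2,j} \right)$$
It is a measure of similarity which ranges from 0 to 1 (the distributions are equal). It is worth noting that $O_{12} = 1 - \Delta_{12}$, where $\Delta_{12}$ is the dissimilarity index.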
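As a quick check of the relationship between overlap and the dissimilarity index, the sketch below computes both in base R on made-up relative frequencies. The vectors p1 and p2 and the names overlap and tvd are illustrative assumptions, not objects produced by comp.prop.

```r
# Hypothetical relative frequencies for the same three categories
p1 <- c(0.30, 0.50, 0.20)
p2 <- c(0.25, 0.45, 0.30)

# Overlap: sum of the category-wise minima
overlap <- sum(pmin(p1, p2))

# Dissimilarity index, computed independently
tvd <- 0.5 * sum(abs(p1 - p2))

# The two measures are complementary (up to floating-point error)
isTRUE(all.equal(overlap, 1 - tvd))
```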
Bhattacharyya coefficient:
$$B_{12} = \sum_{j=1}^{J} \sqrt{p_{1,j} \, p_{2,j}}$$
It is a measure of similarity and ranges from 0 to 1 (the distributions are equal).
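A minimal base R sketch of the Bhattacharyya coefficient follows, assuming both inputs are vectors of relative frequencies over the same categories. The helper name bhatt_coef is hypothetical and not part of any package.

```r
# Hypothetical helper: Bhattacharyya coefficient of two discrete distributions
bhatt_coef <- function(p1, p2) {
  stopifnot(length(p1) == length(p2))
  sum(sqrt(p1 * p2))
}

p <- c(0.50, 0.25, 0.25)

# Identical distributions give the maximum similarity
bhatt_same <- bhatt_coef(p, p)

# Distributions with no category in common give the minimum
bhatt_disjoint <- bhatt_coef(c(1, 0), c(0, 1))
```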
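Hellinger's distance:
$$d_{H,12} = \sqrt{1 - B_{12}}$$
where $B_{12}$ is the Bhattacharyya coefficient. It is a dissimilarity measure ranging from 0 (distributions are equal) to 1 (max dissimilarity). It satisfies all the properties of a distance measure (non-negativity, symmetry and the triangle inequality).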
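Hellinger's distance can be sketched in base R from the Bhattacharyya coefficient. The function name hellinger is hypothetical, and the guard against tiny negative values is an implementation choice of this sketch, not something stated in the documentation.

```r
# Hypothetical helper: Hellinger's distance between two discrete distributions
hellinger <- function(p1, p2) {
  stopifnot(length(p1) == length(p2))
  bhatt <- sum(sqrt(p1 * p2))
  # max() guards against a slightly negative value caused by rounding error
  sqrt(max(0, 1 - bhatt))
}

p1 <- c(0.30, 0.50, 0.20)
p2 <- c(0.25, 0.45, 0.30)

d12 <- hellinger(p1, p2)
d21 <- hellinger(p2, p1)   # symmetry: same value as d12
```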
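Alongside these similarity/dissimilarity measures, Pearson's Chi-squared statistic is computed. Two formulas are considered. When p2 is the reference distribution (true or expected under some hypothesis, ref=TRUE):
$$\chi^2_P = n_1 \sum_{j=1}^{J} \frac{\left( p_{1,j} - p_{2,j} \right)^2}{p_{2,j}}$$
When p2 is a distribution estimated on a second sample (ref=FALSE), then:
$$\chi^2_P = \sum_{i=1}^{2} \sum_{j=1}^{J} n_i \frac{\left( p_{i,j} - p_{+,j} \right)^2}{p_{+,j}}$$
where $p_{+,j}$ is the expected relative frequency for category $j$, obtained as:
$$p_{+,j} = \frac{n_1 p_{1,j} + n_2 p_{2,j}}{n_1 + n_2}$$
with $n_1$ and $n_2$ being the sizes of the two samples.
The Chi-square value can be used to test the hypothesis that the two distributions are equal.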
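The two Pearson Chi-square formulas can be sketched in base R and cross-checked against stats::chisq.test. The helpers chisq_ref and chisq_two and the counts are hypothetical. The sketch assumes simple random samples and strictly positive expected frequencies, and it is not the package's own implementation.

```r
# Hypothetical helper: p2 is the reference distribution (ref = TRUE case)
chisq_ref <- function(p1, p2, n1) {
  n1 * sum((p1 - p2)^2 / p2)
}

# Hypothetical helper: p2 is estimated on a second sample (ref = FALSE case)
chisq_two <- function(p1, p2, n1, n2) {
  # Pooled (expected) relative frequencies
  p_exp <- (n1 * p1 + n2 * p2) / (n1 + n2)
  n1 * sum((p1 - p_exp)^2 / p_exp) + n2 * sum((p2 - p_exp)^2 / p_exp)
}

# Illustrative counts only
counts1 <- c(30, 50, 20)
counts2 <- c(25, 45, 30)
n1 <- sum(counts1)
n2 <- sum(counts2)
p1 <- counts1 / n1
p2 <- counts2 / n2

chi_ref <- chisq_ref(p1, p2, n1)
chi_two <- chisq_two(p1, p2, n1, n2)

# Cross-check with base R: goodness-of-fit test and 2 x J homogeneity test
chk_ref <- unname(chisq.test(counts1, p = p2)$statistic)
chk_two <- unname(chisq.test(rbind(counts1, counts2))$statistic)
```

Under these assumptions the hand-computed values agree with chisq.test. With complex sampling designs the plain statistic would not be appropriate without a correction.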
Agresti A (2002) Categorical Data Analysis. Second Edition. Wiley, New York.
Sarndal CE, Swensson B, Wretman JH (1992) Model Assisted Survey Sampling. Springer-Verlag, New York.
# NOT RUN {
data(quine, package="MASS") #loads quine from MASS
str(quine)
# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(124)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, c("Eth","Sex","Age")]
quine.B <- quine[-lab.A, c("Eth","Sex","Age")]
# compare est. distributions from 2 samples
# 1 variable
tt.A <- xtabs(~Age, data=quine.A)
tt.B <- xtabs(~Age, data=quine.B)
comp.prop(p1=tt.A, p2=tt.B, n1=nrow(quine.A), n2=nrow(quine.B), ref=FALSE)
# joint distr. of more variables
tt.A <- xtabs(~Eth+Sex+Age, data=quine.A)
tt.B <- xtabs(~Eth+Sex+Age, data=quine.B)
comp.prop(p1=tt.A, p2=tt.B, n1=nrow(quine.A), n2=nrow(quine.B), ref=FALSE)
# compare an est. distr. with one considered as the reference
tt.A <- xtabs(~Eth+Sex+Age, data=quine.A)
tt.all <- xtabs(~Eth+Sex+Age, data=quine)
comp.prop(p1=tt.A, p2=tt.all, n1=nrow(quine.A), n2=NULL, ref=TRUE)
# }