This function computes some similarity or dissimilarity measures between the marginal (joint) distributions of categorical variable(s).
The following measures are considered:
Dissimilarity index or total variation distance:
$$\Delta_{12} = \frac{1}{2} \sum_{j=1}^J \left| p_{1,j} - p_{2,j} \right|$$
where $p_{s,j}$ are the relative frequencies ($0 \leq p_{s,j} \leq 1$). The dissimilarity index ranges from 0 (minimum dissimilarity) to 1. It can be interpreted as the smallest fraction of units that need to be reclassified in order to make the distributions equal. When p2 is the reference distribution (true or expected distribution under a given hypothesis) then, following Agresti's rule of thumb (Agresti 2002, pp. 329--330), values of $\Delta_{12} < 0.03$ denote that the estimated distribution p1 follows the true or expected pattern quite closely.
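As a minimal sketch of the formula above (written in Python for illustration, not the function's own implementation), the index is half the sum of the absolute differences between the two vectors of relative frequencies. The name `dissimilarity_index` and the example vectors are assumptions made for this example only.

```python
def dissimilarity_index(p1, p2):
    """Half the sum of absolute differences between two relative-frequency vectors."""
    if len(p1) != len(p2):
        raise ValueError("p1 and p2 must have the same number of categories")
    return 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))


# Hypothetical distributions over J = 3 categories (each sums to 1).
p1 = [0.50, 0.30, 0.20]
p2 = [0.40, 0.35, 0.25]

delta = dissimilarity_index(p1, p2)
print(round(delta, 4))
```

With these illustrative vectors, a tenth of the units would have to be reclassified to make the two distributions equal, which is well above the 0.03 rule of thumb.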
Overlap between two distributions:
$$O_{12} = \sum_{j=1}^J \min(p_{1,j},p_{2,j}) $$
It is a measure of similarity which ranges from 0 to 1 (the distributions are equal). It is worth noting that $O_{12} = 1 - \Delta_{12}$.
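The complement relation between overlap and dissimilarity index can be checked numerically. The following Python sketch is illustrative only; the function names and the example vectors are assumptions, not part of the function described here.

```python
def overlap(p1, p2):
    """Sum over categories of the smaller of the two relative frequencies."""
    return sum(min(a, b) for a, b in zip(p1, p2))


def dissimilarity_index(p1, p2):
    """Half the sum of absolute differences (total variation distance)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))


# Hypothetical distributions over J = 3 categories.
p1 = [0.50, 0.30, 0.20]
p2 = [0.40, 0.35, 0.25]

o12 = overlap(p1, p2)
delta = dissimilarity_index(p1, p2)

# The two measures are complements, up to floating-point rounding.
complement_holds = abs(o12 - (1.0 - delta)) < 1e-12
print(round(o12, 4), round(delta, 4), complement_holds)
```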
Bhattacharyya coefficient:
$$B_{12} = \sum_{j=1}^J \sqrt{p_{1,j} \times p_{2,j}} $$
It is a measure of similarity and ranges from 0 to 1 (the distributions are equal).
Hellinger's distance:
$$d_{H,12} = \sqrt{1-B_{12}} $$
It is a dissimilarity measure which ranges from 0 (distributions are equal) to 1 (max dissimilarity). It satisfies all the properties of a distance measure ($0 \leq d_{H,12} \leq 1$; symmetry and triangle inequality). Hellinger's distance is related to the dissimilarity index, and it is possible to show that:
$$d_{H,12}^2 \leq \Delta_{12} \leq d_{H,12}\sqrt{2} $$
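A short Python sketch can illustrate how the Bhattacharyya coefficient, Hellinger's distance and the bounds above fit together. It is an illustration under assumed names and example data, not the function's own code.

```python
import math


def bhattacharyya(p1, p2):
    """Sum over categories of the geometric mean of the two relative frequencies."""
    return sum(math.sqrt(a * b) for a, b in zip(p1, p2))


def hellinger(p1, p2):
    """Square root of one minus the Bhattacharyya coefficient."""
    # max(0.0, ...) guards against a tiny negative value caused by rounding.
    return math.sqrt(max(0.0, 1.0 - bhattacharyya(p1, p2)))


def dissimilarity_index(p1, p2):
    """Half the sum of absolute differences (total variation distance)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))


# Hypothetical distributions over J = 3 categories.
p1 = [0.50, 0.30, 0.20]
p2 = [0.40, 0.35, 0.25]

d_h = hellinger(p1, p2)
delta = dissimilarity_index(p1, p2)

# Check the bounds: d_H^2 <= Delta <= d_H * sqrt(2)
bounds_hold = d_h ** 2 <= delta <= d_h * math.sqrt(2)
print(round(d_h, 4), round(delta, 4), bounds_hold)
```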
Alongside these similarity/dissimilarity measures, the Pearson's Chi-squared statistic is computed. Two formulas are considered. When p2 is the reference distribution (true or expected under some hypothesis, ref=TRUE):
$$ \chi^2_P = n_1 \sum_{j=1}^J \frac{\left( p_{1,j} - p_{2,j}\right)^2}{p_{2,j}} $$
When p2 is a distribution estimated on a second sample then:
$$ \chi^2_P = \sum_{i=1}^2 \sum_{j=1}^J n_i \frac{\left( p_{i,j} - p_{+,j}\right)^2}{p_{+,j}} $$
where $p_{+,j}$ is the expected relative frequency for category j, obtained as follows:
$$ p_{+,j} = \frac{n_1 p_{1,j} + n_2 p_{2,j}}{n_1+n_2} $$
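The two Chi-squared formulas can be sketched in Python as follows. This is an illustrative sketch only: the function names, the sample sizes and the example vectors are assumptions, and all categories are assumed to have a strictly positive expected frequency.

```python
def chi2_reference(p1, p2, n1):
    """Pearson's Chi-squared when p2 is a reference (true or expected) distribution."""
    return n1 * sum((a - b) ** 2 / b for a, b in zip(p1, p2))


def chi2_two_samples(p1, p2, n1, n2):
    """Pearson's Chi-squared when p1 and p2 are both estimated from samples."""
    # Pooled relative frequency for each category, weighted by sample size.
    pooled = [(n1 * a + n2 * b) / (n1 + n2) for a, b in zip(p1, p2)]
    total = 0.0
    for n, probs in ((n1, p1), (n2, p2)):
        total += n * sum((p - q) ** 2 / q for p, q in zip(probs, pooled))
    return total


# Hypothetical distributions and sample sizes.
p1 = [0.50, 0.30, 0.20]
p2 = [0.40, 0.35, 0.25]
n1, n2 = 200, 300

chi2_ref = chi2_reference(p1, p2, n1)
chi2_two = chi2_two_samples(p1, p2, n1, n2)
print(round(chi2_ref, 4), round(chi2_two, 4))
```

The second form is smaller here because the pooled frequencies lie between p1 and p2, so each sample is compared with a distribution closer to itself.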
The Chi-square value can be used to test the hypothesis that the two distributions are equal (df=J-1). Unfortunately, such a test is not appropriate when the distributions are estimated from samples selected from a finite population using complex selection schemes (stratification, clustering, etc.). In such cases, alternative corrected Chi-square tests are available (cf. Sarndal et al., 1992, Sec. 13.5). One possibility consists in dividing the Pearson's Chi-square statistic by the generalised design effect of both surveys. Its estimation is not simple (the sampling design variables need to be available). The generalised design effect is smaller than 1 in the presence of stratified random sampling designs, while it exceeds 1 in the presence of a two-stage cluster sampling design. For the purposes of analysis, the function reports the value of the generalised design effect g that would lead to accepting the null hypothesis (equality of distributions) with alpha=0.05 (df=J-1), i.e. the values of g such that
$$ \frac{\chi^2_P}{g} \leq \chi^2_{J-1,0.05} $$
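Rearranging the inequality gives $g \geq \chi^2_P / \chi^2_{J-1,0.05}$, so the smallest such g is the ratio of the statistic to the critical value. The Python sketch below illustrates this for J = 3 categories; the function name and the value of the statistic are hypothetical. It relies on the fact that, for 2 degrees of freedom, the upper-tail critical value is $-2\ln(\alpha)$; for other degrees of freedom the critical value would have to be taken from a table or a statistical library.

```python
import math


def min_design_effect(chi2_stat, critical_value):
    """Smallest g for which chi2_stat / g does not exceed the critical value."""
    return chi2_stat / critical_value


# Hypothetical Pearson's Chi-squared statistic for J = 3 categories (df = 2).
chi2_stat = 8.43

# With df = 2 the Chi-squared upper-tail probability is exp(-x / 2),
# so the critical value at alpha = 0.05 is -2 * ln(0.05).
alpha = 0.05
critical_value = -2.0 * math.log(alpha)

g_min = min_design_effect(chi2_stat, critical_value)
print(round(critical_value, 4), round(g_min, 4))
```

In this illustrative case the null hypothesis would be accepted only if the generalised design effect were at least about 1.41, that is, under a design noticeably less efficient than simple random sampling.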