Fbwidths.by.x: Computes the Frechet bounds of cells in a contingency table by considering all the possible subsets of the common variables.

Description

This function computes the bounds for cell probabilities in the contingency table Y vs. Z starting from the marginal tables (X vs. Y), (X vs. Z) and the joint distribution of the X variables, by considering all the possible subsets of the X variables. In this way it is possible to identify which subset of the X variables produces the largest reduction of the uncertainty.

Usage

Fbwidths.by.x(tab.x, tab.xy, tab.xz, compress.sum=FALSE)

Arguments

tab.x

An R table crossing the X variables. This table must be obtained by using the function xtabs or table, e.g. tab.x <- xtabs(~x1+x2+x3, data=data.all).

tab.xy

An R table of the X variables vs. the Y variable. This table must be obtained by using the function xtabs or table, e.g. tab.xy <- xtabs(~x1+x2+x3+y, data=data.A).

A single categorical Y variable is allowed. One or more categorical variables can be considered as X variables (common variables). The same X variables in tab.x must be available in tab.xy. Moreover, it is assumed that the joint distribution of the X variables computed from tab.xy is equal to tab.x; a warning is produced if this is not true.

tab.xz

An R table of the X variables vs. the Z variable. This table must be obtained by using the function xtabs or table, e.g. tab.xz <- xtabs(~x1+x2+x3+z, data=data.B).

A single categorical Z variable is allowed. One or more categorical variables can be considered as X variables (common variables). The same X variables in tab.x must be available in tab.xz. Moreover, it is assumed that the joint distribution of the X variables computed from tab.xz is equal to tab.x; a warning is produced if this is not true.

compress.sum

Logical (default FALSE). If TRUE, reduces the information saved in sum.unc. See Value for further information.
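The consistency assumption on the X margin stated for tab.xy and tab.xz can be checked by hand before calling the function. The following is a minimal sketch in base R, not the check performed internally by Fbwidths.by.x; the data frame data.A, its variables and the object names x.from.xy and same.x are hypothetical and introduced only for illustration.

```r
# Hypothetical toy data: two common variables (x1, x2) and a target y
set.seed(1)
data.A <- data.frame(
  x1 = factor(sample(c("a", "b"), 50, replace = TRUE), levels = c("a", "b")),
  x2 = factor(sample(c("u", "v"), 50, replace = TRUE), levels = c("u", "v")),
  y  = factor(sample(c("no", "yes"), 50, replace = TRUE), levels = c("no", "yes"))
)

tab.x  <- xtabs(~ x1 + x2, data = data.A)
tab.xy <- xtabs(~ x1 + x2 + y, data = data.A)

# Collapse tab.xy over y to recover the X margin it implies
x.from.xy <- margin.table(tab.xy, margin = 1:2)

# Compare the two X distributions on the relative-frequency scale
same.x <- isTRUE(all.equal(as.vector(prop.table(x.from.xy)),
                           as.vector(prop.table(tab.x))))
same.x
```

Here both tables come from the same data, so the margins agree. When tab.x is estimated from a different file than tab.xy or tab.xz, the comparison will generally fail, which is the situation in which the function is documented to issue a warning.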

Value

A list with the estimated bounds for the cells in the table of Y vs. Z for each possible subset of the X variables. The final component sum.unc is a data.frame that summarizes the main findings. In particular, it reports the number of X variables ("x.vars"), the number of cells in each of the input tables and the corresponding number of cells with frequency equal to 0 (columns ending with freq0). It then provides the average width of the uncertainty intervals ("av.width") and its relative value ("rel.av.width") with respect to the average width obtained when no X variables are considered (i.e. the unconditioned "av.width", reported in the first row of the data.frame).

When compress.sum = TRUE the data.frame sum.unc shows a combination of the X variables only if it reduces "av.width" when compared to the preceding one.

Note that many cells with zero frequency in the input contingency tables are an indication of sparseness. This is an unappealing situation when estimating the cells' relative frequencies needed to derive the bounds, and in such cases the corresponding results may be unreliable. A possible alternative consists in estimating the required parameters with a pseudo-Bayes estimator (see pBayes); in practice the input tab.x, tab.xy and tab.xz should be the ones provided by the pBayes function.
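The cell counts and zero-frequency counts reported in sum.unc can be reproduced for any single input table with base R. The sketch below uses a small hypothetical table; the names tab, n.cells, n.zero and share.zero are illustrative and are not column names of sum.unc.

```r
# Hypothetical toy table: category "c" of x1 is rare, so one cell is empty
tab <- table(
  x1 = factor(c("a", "a", "b", "b", "b", "c"), levels = c("a", "b", "c")),
  y  = factor(c("no", "yes", "no", "no", "yes", "no"), levels = c("no", "yes"))
)

n.cells    <- length(tab)        # total number of cells in the table
n.zero     <- sum(tab == 0)      # cells with frequency equal to 0
share.zero <- n.zero / n.cells   # share of empty cells, a rough sparseness indicator
```

A large share of empty cells suggests that the relative frequencies are poorly estimated, which is the case where smoothing the tables with pBayes before computing the bounds may be worth considering.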

Details

This function computes the Frechet bounds for the frequencies in the contingency table of Y vs. Z, starting from the conditional distributions P(Y|X) and P(Z|X) (for details see Frechet.bounds.cat), by considering all the possible subsets of the X variables. In this way it is possible to identify the subset of the X variables, with the highest association with both Y and Z, that most reduces the uncertainty concerning the distribution of Y vs. Z.
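To make "all the possible subsets of the X variables" concrete, the sketch below enumerates the non-empty subsets of three common variables and collapses a table onto each of them. It is an illustration in base R, not the code used by Fbwidths.by.x. It assumes the built-in Titanic table, with Class, Sex and Age playing the role of the X variables and Survived the role of Y; the names x.names, subsets and tabs.xy are hypothetical.

```r
# Built-in 4-way table: Class, Sex, Age (treated as X) by Survived (treated as Y)
tab.xy  <- Titanic
x.names <- c("Class", "Sex", "Age")

# All non-empty subsets of the X variables: 2^3 - 1 = 7 combinations
subsets <- unlist(
  lapply(seq_along(x.names), function(k) combn(x.names, k, simplify = FALSE)),
  recursive = FALSE
)

# For each subset, collapse tab.xy onto (subset of X) vs. Y
tabs.xy <- lapply(subsets, function(v) {
  keep <- match(c(v, "Survived"), names(dimnames(tab.xy)))
  margin.table(tab.xy, margin = keep)
})
names(tabs.xy) <- vapply(subsets, paste, character(1), collapse = "+")

names(tabs.xy)
```

With p common variables there are 2^p - 1 such subsets, plus the unconditioned case with no X variables, so the number of tables to process grows quickly with p.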

The uncertainty is measured by the average of the widths of the bounds for the cells in the table Y vs. Z:

$$ \bar{d} = \frac{1}{J \times K} \sum_{j,k} ( p^{(up)}_{Y=j,Z=k} - p^{(low)}_{Y=j,Z=k} )$$

where J and K are the numbers of categories of Y and Z, respectively. For details see Frechet.bounds.cat.
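The average width can be worked through on a toy example. The sketch below assumes the usual conditional Frechet bounds for categorical variables, i.e. the lower bound sums P(X=i) max(0, P(Y=j|X=i) + P(Z=k|X=i) - 1) over i and the upper bound sums P(X=i) min(P(Y=j|X=i), P(Z=k|X=i)); see Frechet.bounds.cat for the definitions actually used. All probabilities and object names are hypothetical, and the ratio in the last line is only one reading of "rel.av.width".

```r
# Hypothetical inputs: one binary X, binary Y, binary Z
p.x   <- c(0.4, 0.6)                        # P(X = i)
p.y.x <- rbind(c(0.9, 0.1), c(0.2, 0.8))    # row i: P(Y = j | X = i)
p.z.x <- rbind(c(0.8, 0.2), c(0.3, 0.7))    # row i: P(Z = k | X = i)

J <- ncol(p.y.x)
K <- ncol(p.z.x)

# Bounds conditional on X, averaged over the distribution of X
low <- up <- matrix(0, J, K)
for (i in seq_along(p.x)) {
  s   <- outer(p.y.x[i, ], p.z.x[i, ], "+") - 1
  m   <- outer(p.y.x[i, ], p.z.x[i, ], pmin)
  low <- low + p.x[i] * pmax(s, 0)
  up  <- up  + p.x[i] * m
}
av.width <- mean(up - low)                  # average width of the J x K intervals

# Unconditioned bounds: use only the marginal distributions of Y and Z
p.y  <- as.vector(p.x %*% p.y.x)
p.z  <- as.vector(p.x %*% p.z.x)
low0 <- pmax(outer(p.y, p.z, "+") - 1, 0)
up0  <- outer(p.y, p.z, pmin)
av.width0 <- mean(up0 - low0)

# Relative average width, assumed here to be the ratio to the unconditioned case
rel.av.width <- av.width / av.width0
```

In this example conditioning on X shrinks the average width to about one third of the unconditioned value, because X is strongly associated with both Y and Z.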

References

Ballin, M., D'Orazio, M., Di Zio, M., Scanu, M. and Torelli, N. (2009) “Statistical Matching of Two Surveys with a Common Subset”. Working Paper, 124. Dip. Scienze Economiche e Statistiche, Univ. di Trieste, Trieste.

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

Examples

library(StatMatch) # provides Fbwidths.by.x() and pBayes()

data(quine, package="MASS") # loads quine from MASS
str(quine)
quine$c.Days <- cut(quine$Days, c(-1, seq(0,50,10),100))
table(quine$c.Days)


# split quine in two subsets
set.seed(4567)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, 1:4]
quine.B <- quine[-lab.A, c(1:3,6)]

# compute the tables required by Fbwidths.by.x()
freq.x <- xtabs(~Eth+Sex+Age, data=quine.A)
freq.xy <- xtabs(~Eth+Sex+Age+Lrn, data=quine.A)
freq.xz <- xtabs(~Eth+Sex+Age+c.Days, data=quine.B)

# apply Fbwidths.by.x()
bounds.yz <- Fbwidths.by.x(tab.x=freq.x, tab.xy=freq.xy,
        tab.xz=freq.xz)

bounds.yz$sum.unc

# input tables estimated with pBayes()

pf.x <- pBayes(x=freq.x)
pf.xy <- pBayes(x=freq.xy)
pf.xz <- pBayes(x=freq.xz)

bounds.yz.p <- Fbwidths.by.x(tab.x = pf.x$pseudoB,
        tab.xy = pf.xy$pseudoB,
        tab.xz = pf.xz$pseudoB)
