comb.samples: Statistical Matching of data from complex sample surveys

Description

This function cross-tabulates two categorical variables, Y and Z, observed separately in two independent surveys (Y is collected in survey A and Z in survey B) carried out on the same target population. The two surveys share a number of common variables X. When a third survey C, carried out on the same population and collecting both Y and Z, is available, its data are used as a source of auxiliary information. The statistical matching is performed by calibrating the survey weights, as suggested in Renssen (1998).

Usage

comb.samples(svy.A, svy.B, svy.C=NULL, y.lab, z.lab, form.x, 
              estimation=NULL, ...)

Arguments

svy.A

A svydesign R object that stores the data collected in survey A and all the information concerning the sampling design. This type of object can be created with the function svydesign.

svy.B

A svydesign R object that stores the data collected in survey B and all the information concerning the sampling design. This type of object can be created with the function svydesign.

svy.C

A svydesign R object that stores the data collected in survey C and all the information concerning the sampling design. This type of object can be created with the function svydesign.

y.lab

A string providing the name of the Y variable, collected in survey A and in survey C (if available).  The Y variable must be a categorical variable (factor or integer in R).

z.lab

A string providing the name of the Z variable collected in survey B and in survey C (if available).  The Z variable must be a categorical variable (factor or integer in R).

form.x

An R formula specifying which of the X variables, collected in all the surveys, have to be considered, and how they have to be considered, when combining the samples.  For instance 
form.x=~x1+x2 means that the variables x1 and x2 have to be conside

estimation

A character string that identifies the method to be used to estimate the table of Y vs. Z when data from survey C are available.  As suggested in Renssen (1998), two alternative methods are available: (i) Incomplete Two-Way Stratification (estimati

...

Further arguments that may be necessary for calibration.  In particular, the argument calfun specifies the calibration function.  Three alternatives are available: 
(i) calfun="linear" for linear calibration (default); 
(

Value

An R list with the results of the calibration procedure according to the input arguments.  When svy.C=NULL the list contains just two components: yz.CIA (Y vs. Z estimated under the CIA) and call (the call to the function).  When data from svy.C are available, the list contains the following components:

yz.CIA

The table of Y vs. Z estimated under the Conditional Independence Assumption (CIA).

cal.C

The survey object svy.C after the calibration.

yz.est

The table of Y vs. Z estimated with the method specified via the estimation argument.

call

Stores the call to this function with all the values specified for the various arguments (call=match.call()).

Details

This function estimates the contingency table of Y vs. Z by calibrating the survey weights.  In practice the estimation is carried out on the data in survey C by exploiting all the information from surveys A and B.  When survey C is not available the table of Y vs. Z is estimated under the Conditional Independence Assumption (CIA), i.e. $p(Y,Z,\mathbf{X})=p(Y|\mathbf{X}) \times p(Z|\mathbf{X}) \times p(\mathbf{X})$.
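The CIA factorization above can be illustrated with a few lines of base R. This is only a sketch of the arithmetic, not what comb.samples does internally: it ignores survey weights, uses a single X variable, and estimates p(X) by pooling the two samples. The data frames A and B and all their values are made up for illustration.

```r
# Toy data (hypothetical): survey A observes X and Y, survey B observes X and Z
A <- data.frame(
  x = factor(c("a", "a", "b", "b", "b", "a")),
  y = factor(c("y1", "y2", "y1", "y1", "y2", "y1"))
)
B <- data.frame(
  x = factor(c("a", "b", "b", "a", "b", "a", "b", "b")),
  z = factor(c("z1", "z1", "z2", "z2", "z1", "z1", "z2", "z2"))
)

# p(X): estimated by pooling both samples (an assumption of this sketch)
p.x <- prop.table(table(c(as.character(A$x), as.character(B$x))))

# p(Y|X) estimated on A and p(Z|X) estimated on B; rows index the X categories
p.y.x <- prop.table(table(A$x, A$y), margin = 1)
p.z.x <- prop.table(table(B$x, B$z), margin = 1)

# Table of Y vs. Z under the CIA: sum over x of p(Y|x) * p(Z|x) * p(x)
p.yz <- matrix(0, nrow = ncol(p.y.x), ncol = ncol(p.z.x),
               dimnames = list(Y = colnames(p.y.x), Z = colnames(p.z.x)))
for (x in names(p.x)) {
  p.yz <- p.yz + p.x[[x]] * outer(p.y.x[x, ], p.z.x[x, ])
}

p.yz        # joint distribution of Y and Z under the CIA
sum(p.yz)   # the cell probabilities sum to 1
```

Summing the product over the categories of X is what turns the three-way factorization into a two-way table of Y vs. Z.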

When data from survey C are available (Renssen, 1998), the table of Y vs. Z can be estimated by Incomplete Two-Way Stratification (ITWS) or by Synthetic Two-Way Stratification (STWS).  In the first case (ITWS) the weights of the units in survey C are calibrated so that the new weights reproduce the marginal distribution of the Y variable estimated on survey A and that of the Z variable estimated on survey B.  Note that the distributions of the X variables in survey A and in survey B must be harmonized before performing ITWS (see harmonize.x).  
The Synthetic Two-Way Stratification estimates the table of Y vs. Z by also considering the X variables observed in C.  This method consists in correcting the table of Y vs. Z estimated under the CIA according to the relationship between Y and Z observed in survey C (for further details see Renssen, 1998).
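The idea behind ITWS, adjusting survey C so that it reproduces the Y margin from A and the Z margin from B, can be sketched in base R with iterative proportional fitting (raking) on an aggregated table. This is a swapped-in illustration, not the procedure used by comb.samples, which calibrates the unit-level weights (with linear calibration by default). The function rake() and all the numbers below are hypothetical.

```r
# Hypothetical weighted table of Y vs. Z observed in survey C
tab.C <- matrix(c(20, 10,  5,
                  10, 25, 30),
                nrow = 2, byrow = TRUE,
                dimnames = list(Y = c("y1", "y2"), Z = c("z1", "z2", "z3")))

# Hypothetical target margins: Y totals estimated on A, Z totals estimated on B
# (they must share the same grand total, here 100)
tot.y <- c(y1 = 45, y2 = 55)
tot.z <- c(z1 = 35, z2 = 30, z3 = 35)

# Illustrative helper: iterative proportional fitting of a two-way table
rake <- function(tab, row.tot, col.tot, maxit = 100, tol = 1e-8) {
  for (i in seq_len(maxit)) {
    tab <- sweep(tab, 1, row.tot / rowSums(tab), "*")  # match the Y margin
    tab <- sweep(tab, 2, col.tot / colSums(tab), "*")  # match the Z margin
    if (max(abs(rowSums(tab) - row.tot)) < tol) break  # both margins matched
  }
  tab
}

yz.raked <- rake(tab.C, tot.y, tot.z)
addmargins(yz.raked)
```

The adjusted table keeps the association between Y and Z observed in C as far as the two target margins permit, which is the information that the CIA estimate cannot provide.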

References

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

Renssen, R.H. (1998). Use of Statistical Matching Techniques in Calibration Estimation. Survey Methodology, 24, pp. 171-183.

See Also

calibrate, svydesign, harmonize.x

Examples

data(quine, package="MASS") # loads quine from MASS
str(quine)
quine$c.Days <- cut(quine$Days, c(-1, seq(0,20,10),100))
table(quine$c.Days)


# split quine in two subsets
set.seed(124)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, c("Eth","Sex","Age","Lrn")]
quine.B <- quine[-lab.A, c("Eth","Sex","Age","c.Days")]

# create svydesign objects
library(survey)
library(StatMatch) # provides harmonize.x() and comb.samples()
quine.A$f <- 70/nrow(quine) # sampling fraction
quine.B$f <- (nrow(quine)-70)/nrow(quine)
svy.qA <- svydesign(~1, fpc=~f, data=quine.A)
svy.qB <- svydesign(~1, fpc=~f, data=quine.B)

# Harmonization wrt the joint distribution
# of ('Sex' x 'Age' x 'Eth')

# vector of known population totals,
# here computed from the full data set
# note the formula!
tot.m <- colSums(model.matrix(~Eth:Sex:Age-1, data=quine))
tot.m

out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, x.tot=tot.m,
            form.x=~Eth:Sex:Age-1, cal.method="linear")
            
# estimation of 'Lrn' vs. 'c.Days' under the CIA

svy.qA.h <- out.hz$cal.A
svy.qB.h <- out.hz$cal.B

out.1 <- comb.samples(svy.A=svy.qA.h, svy.B=svy.qB.h,
            svy.C=NULL, y.lab="Lrn", z.lab="c.Days",
            form.x=~Eth:Sex:Age-1)

out.1$yz.CIA
addmargins(out.1$yz.CIA)



#
# incomplete two-way stratification

# select a sample C from quine
# and define a survey object

set.seed(4321)
lab.C <- sample(nrow(quine), 50, replace=TRUE)
quine.C <- quine[lab.C, c("Lrn","c.Days")]
quine.C$f <- 50/nrow(quine) # sampling fraction
svy.qC <- svydesign(~1, fpc=~f, data=quine.C)

out.2 <- comb.samples(svy.A=svy.qA.h, svy.B=svy.qB.h,
            svy.C=svy.qC, y.lab="Lrn", z.lab="c.Days",
            form.x=~Eth:Sex:Age-1, estimation="incomplete",
            calfun="linear", maxit=100)

summary(weights(out.2$cal.C))
out.2$yz.est # estimated table of 'Lrn' vs. 'c.Days'
# difference wrt the table 'Lrn' vs. 'c.Days' under CIA
addmargins(out.2$yz.est)-addmargins(out.2$yz.CIA)

# synthetic two-way stratification

quine.C <- quine[lab.C, ]
quine.C$f <- 50/nrow(quine) # sampling fraction
svy.qC <- svydesign(~1, fpc=~f, data=quine.C)

out.3 <- comb.samples(svy.A=svy.qA.h, svy.B=svy.qB.h,
            svy.C=svy.qC, y.lab="Lrn", z.lab="c.Days",
            form.x=~Eth:Sex:Age-1, estimation="synthetic",
            calfun="linear",bounds=c(.5,Inf), maxit=100)

summary(weights(out.3$cal.C))

out.3$yz.est # estimated table of 'Lrn' vs. 'c.Days'
# difference wrt the table of 'Lrn' vs. 'c.Days' under CIA
addmargins(out.3$yz.est)-addmargins(out.3$yz.CIA)
# difference wrt the table of 'Lrn' vs. 'c.Days' under incomplete two-way stratification
addmargins(out.3$yz.est)-addmargins(out.2$yz.est)