Fits the outcome model for the mass imputation estimator using generalized linear
models via the stats::glm
function. The population mean is estimated using either the \(S_B\)
probability sample or known population totals.
method_glm(
y_nons,
X_nons,
X_rand,
svydesign,
weights = NULL,
family_outcome = "gaussian",
start_outcome = NULL,
vars_selection = FALSE,
pop_totals = NULL,
pop_size = NULL,
control_outcome = control_out(),
control_inference = control_inf(),
verbose = FALSE,
se = TRUE
)
An object of class nonprob_method,
which is a list
with the following entries:
fitted model, either a glm.fit
or a cv.ncvreg
object
predicted values for the non-probability sample
predicted values for the probability sample or population totals
coefficients for the model (if available)
an updated surveydesign2
object (new column y_hat_MI
is added)
estimated population mean for the target variable
whether variable selection was performed
variance for the probability sample component (if available)
variance for the non-probability sample component
total variance; where possible this is var_prob + var_nonprob,
otherwise just a scalar
model type (character "glm"
)
family type (character "glm"
)
target variable from non-probability sample
a model.matrix
with auxiliary variables from non-probability sample
a model.matrix
with auxiliary variables from the probability sample
a svydesign object
case / frequency weights from non-probability sample
family for the glm model
start parameters (default NULL
)
whether variable selection should be conducted
population totals from the nonprob
function
population size from the nonprob
function
controls passed by the control_out
function
controls passed by the control_inf
function (currently not used, for further development)
parameter passed from the main nonprob
function
whether standard errors should be calculated
Analytical variance
The variance of the mean is estimated as the sum of two components:
(a) non-probability part (\(S_A\) with size \(n_A\); denoted as var_nonprob
in the result)
$$ \hat{V}_1 = \frac{1}{n_A^2}\sum_{i=1}^{n_A} \hat{e}_i^2 \left\lbrace \boldsymbol{h}(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})^\prime\hat{\boldsymbol{c}}\right\rbrace^2, $$
where \(\hat{e}_i = y_i - m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})\), \(\dot{\boldsymbol{m}}\) denotes the derivative of \(m\) with respect to \(\boldsymbol{\beta}\), and $$\widehat{\boldsymbol{c}}=\left\lbrace n_A^{-1} \sum_{i \in A} \dot{\boldsymbol{m}}\left(\boldsymbol{x}_i ; \widehat{\boldsymbol{\beta}}\right) \boldsymbol{h}\left(\boldsymbol{x}_i ; \widehat{\boldsymbol{\beta}}\right)^{\prime}\right\rbrace^{-1} N^{-1} \sum_{i \in B} w_i \dot{\boldsymbol{m}}\left(\boldsymbol{x}_i ; \widehat{\boldsymbol{\beta}}\right).$$
Under the linear regression model \(\boldsymbol{h}\left(\boldsymbol{x}_i ; \widehat{\boldsymbol{\beta}}\right)=\boldsymbol{x}_i\) and \(\widehat{\boldsymbol{c}}=\left(n_A^{-1} \sum_{i \in A} \boldsymbol{x}_i \boldsymbol{x}_i^{\prime}\right)^{-1} N^{-1} \sum_{i \in B} w_i \boldsymbol{x}_i .\)
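The linear-regression case can be sketched in base R. This is only an illustration on simulated data, not the package's internal code: every object name below (X_A, X_B, w_B, c_hat, V1_hat) is hypothetical, the probability sample is assumed to be a simple random sample with weights \(N/n_B\), and the squared form \(\hat{e}_i^2\lbrace\boldsymbol{x}_i^\prime\hat{\boldsymbol{c}}\rbrace^2\) from Kim et al. (2021) is used.

```r
# Illustrative sketch only: simulated data, hypothetical object names.
set.seed(42)
N <- 5000
X_pop <- cbind(1, rnorm(N))                      # intercept + one covariate
y_pop <- drop(X_pop %*% c(1, 2)) + rnorm(N)      # linear outcome model

# Non-probability sample S_A: inclusion depends on the covariate
idx_A <- which(rbinom(N, 1, plogis(-1.5 + X_pop[, 2])) == 1)
# Probability sample S_B: simple random sample with weights N / n_B (assumption)
n_B <- 400
idx_B <- sample(N, n_B)
w_B <- rep(N / n_B, n_B)

X_A <- X_pop[idx_A, , drop = FALSE]
y_A <- y_pop[idx_A]
X_B <- X_pop[idx_B, , drop = FALSE]
n_A <- nrow(X_A)

# OLS fit on S_A and residuals e_hat_i = y_i - x_i' beta_hat
beta_hat <- solve(crossprod(X_A), crossprod(X_A, y_A))
e_hat <- drop(y_A - X_A %*% beta_hat)

# c_hat = (n_A^{-1} sum_A x_i x_i')^{-1} N^{-1} sum_B w_i x_i
c_hat <- solve(crossprod(X_A) / n_A, colSums(w_B * X_B) / N)

# V1_hat = n_A^{-2} sum_A e_hat_i^2 (x_i' c_hat)^2
V1_hat <- sum(e_hat^2 * drop(X_A %*% c_hat)^2) / n_A^2
V1_hat
```

Here `colSums(w_B * X_B)` multiplies each row of X_B by its weight before summing, which gives the weighted total of the auxiliary variables in \(S_B\).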
(b) probability part (\(S_B\) with size \(n_B\); denoted as var_prob
in the result)
This part uses functionalities of the {survey}
package and the variance is estimated using the following
equation:
$$ \hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}} \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_i} \frac{m(\boldsymbol{x}_j; \hat{\boldsymbol{\beta}})}{\pi_j}. $$
Note that \(\hat{V}_2\) in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
Furthermore, if only population totals/means are known and assumed to be fixed we set \(\hat{V}_2=0\).
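As a worked special case (an illustration, not a description of what the {survey} package computes internally), suppose \(S_B\) were drawn by Poisson sampling, so that units are selected independently. Then \(\pi_{ij} = \pi_i \pi_j\) for \(i \neq j\) and every cross term vanishes. With \(\pi_{ii} = \pi_i\), only the diagonal terms remain:
$$ \hat{V}_2 = \frac{1}{N^2} \sum_{i=1}^{n_B} \frac{\pi_i - \pi_i^2}{\pi_i} \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})^2}{\pi_i^2} = \frac{1}{N^2} \sum_{i=1}^{n_B} \frac{1-\pi_i}{\pi_i^2}\, m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})^2. $$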
Kim, J. K., Park, S., Chen, Y., & Wu, C. (2021). Combining non-probability and probability survey samples through mass imputation. Journal of the Royal Statistical Society Series A: Statistics in Society, 184(3), 941-963.
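library(survey)
library(nonprobsvy)
data(admin)
data(jvs)
# probability sample: stratified design built from the jvs data
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
                     strata = ~ size + nace + region, data = jvs)
# outcome model fitted on admin (non-probability), imputed onto jvs
res_glm <- method_glm(y_nons = admin$single_shift,
                      X_nons = model.matrix(~ region + private + nace + size, admin),
                      X_rand = model.matrix(~ region + private + nace + size, jvs),
                      svydesign = jvs_svy)
res_glm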
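The idea behind the estimator can also be sketched without the package data. The base-R example below is a rough illustration on simulated data, not the internals of method_glm. All object names (pop, S_A, S_B, mu_mi) are hypothetical, and the probability sample is assumed to be a simple random sample. An outcome model is fitted on the non-probability sample with stats::glm, used to impute the target variable for the probability sample, and the imputed values are then averaged with the design weights.

```r
# Illustrative sketch only: simulated data, hypothetical object names.
set.seed(123)
N <- 10000
pop <- data.frame(x = rnorm(N))
pop$y <- 1 + 2 * pop$x + rnorm(N)

# Non-probability sample S_A: inclusion depends on x, y is observed
in_A <- rbinom(N, 1, plogis(-2 + pop$x)) == 1
S_A <- pop[in_A, ]

# Probability sample S_B: simple random sample, only x is used
n_B <- 500
S_B <- pop[sample(N, n_B), "x", drop = FALSE]
S_B$w <- N / n_B                                  # design weights

# Fit the outcome model on S_A and impute y for every unit in S_B
fit <- stats::glm(y ~ x, family = gaussian(), data = S_A)
S_B$y_hat <- predict(fit, newdata = S_B, type = "response")

# Mass imputation estimate of the population mean vs. the naive S_A mean
mu_mi <- sum(S_B$w * S_B$y_hat) / sum(S_B$w)
mu_naive <- mean(S_A$y)
c(mass_imputation = mu_mi, naive = mu_naive, truth = mean(pop$y))
```

Because inclusion in S_A depends on x, the naive mean of y in S_A is biased, while the imputed, design-weighted mean over S_B should land close to the population mean when the outcome model is correctly specified.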