Mass imputation using the nearest neighbours approach described in Yang et al. (2021). The implementation is currently based on the RANN::nn2 function and therefore uses Euclidean distance to match units from \(S_A\) (non-probability sample) to \(S_B\) (probability sample). The mean is estimated using the \(S_B\) sample.
method_nn(
y_nons,
X_nons,
X_rand,
svydesign,
weights = NULL,
family_outcome = NULL,
start_outcome = NULL,
vars_selection = FALSE,
pop_totals = NULL,
pop_size = NULL,
control_outcome = control_out(),
control_inference = control_inf(),
verbose = FALSE,
se = TRUE
)
an object of class nonprob_method, which is a list with the following entries
RANN::nn2 object
predicted values for the non-probability sample (query to itself)
predicted values for the probability sample
coefficients for the model (if available)
an updated survey.design2 object (a new column y_hat_MI is added)
estimated population mean for the target variable
whether variable selection was performed (not implemented, for further development)
variance for the probability sample component (if available)
variance for the non-probability sample component
total variance; var_prob + var_nonprob when both components are available, otherwise just a scalar
model type (character "nn")
placeholder for the NN approach information
y_nons: target variable from the non-probability sample
X_nons: a model.matrix with auxiliary variables from the non-probability sample
X_rand: a model.matrix with auxiliary variables from the probability sample
svydesign: a svydesign object
weights: case / frequency weights from the non-probability sample
family_outcome: a placeholder (not used in method_nn)
start_outcome: a placeholder (not used in method_nn)
vars_selection: whether variable selection should be conducted
pop_totals: a placeholder (not used in method_nn)
pop_size: population size from the nonprob function
control_outcome: controls passed by the control_out function
control_inference: controls passed by the control_inf function
verbose: parameter passed from the main nonprob function
se: whether standard errors should be calculated
Analytical variance
The variance of the mean is estimated as the sum of the two components described below.
(a) non-probability part (\(S_A\) with size \(n_A\); denoted as var_nonprob in the result)
This may be estimated using
$$ \hat{V}_1 = \frac{1}{N^2}\sum_{i \in S_A}\frac{1-\hat{\pi}_B(\boldsymbol{x}_i)}{\hat{\pi}_B(\boldsymbol{x}_i)}\hat{\sigma}^2(\boldsymbol{x}_i), $$
where \(\hat{\pi}_B(\boldsymbol{x}_i)\) is an estimator of the propensity scores, which we currently estimate using the constant \(n_A/N\), and \(\hat{\sigma}^2(\boldsymbol{x}_i)\) is estimated based on the average of \((y_i - y_i^*)^2\).
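As a rough illustration of the formula above, the following base R sketch plugs in the constant propensity estimate \(n_A/N\) and a single pooled \(\hat{\sigma}^2\). The function name v1_hat and its arguments are hypothetical and not part of the package; the package's internal computation may differ.

```r
# Hypothetical helper (not a nonprobsvy function): plug-in estimate of V_1,
# assuming a constant propensity n_A / N and a pooled residual variance.
v1_hat <- function(y, y_star, N) {
  n_A <- length(y)
  pi_hat <- n_A / N                     # constant propensity estimate
  sigma2_hat <- mean((y - y_star)^2)    # average squared difference
  # every summand is identical, so the sum over S_A is n_A times one term
  n_A * (1 - pi_hat) / pi_hat * sigma2_hat / N^2
}

y      <- c(1, 2, 3, 4)        # toy observed values in S_A
y_star <- c(1.5, 2, 2.5, 4)    # toy nearest-neighbour predictions for S_A
v1 <- v1_hat(y, y_star, N = 100)
v1
```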
Chlebicki et al. (2025, Algorithm 2) proposed a non-parametric mini-bootstrap estimator (without assuming that it is consistent) that has good finite population properties. This bootstrap can be applied by setting control_inference = control_inf(nn_exact_se = TRUE) and can be summarized as follows:
1. Sample \(n_A\) units from \(S_A\) with replacement to create \(S_A'\) (if pseudo-weights are present, inclusion probabilities should be proportional to their inverses).
2. Match units from \(S_B\) to \(S_A'\) to obtain predictions \(y^* = k^{-1}\sum_{k} y_k\).
3. Estimate \(\hat{\mu}=\frac{1}{N} \sum_{i \in S_B} d_i y_i^*\).
4. Repeat steps 1-3 \(M\) times (we set \(M=50\) in our simulations; this is hard-coded).
5. Estimate \(\hat{V}_1=\text{var}(\hat{\boldsymbol{\mu}})\) from the \(M\) bootstrap estimates and save it as var_nonprob.
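The bootstrap steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: knn_impute and mini_bootstrap_var are hypothetical helpers, the neighbour search is a brute-force Euclidean search rather than RANN::nn2, and resampling uses equal probabilities (no pseudo-weights).

```r
# Hypothetical helpers (not nonprobsvy functions); base R only.

# Mean of y over the k nearest donors in Euclidean distance
knn_impute <- function(X_donor, y_donor, X_query, k = 1) {
  apply(X_query, 1, function(x) {
    d <- sqrt(colSums((t(X_donor) - x)^2))   # distance to every donor
    mean(y_donor[order(d)[seq_len(k)]])
  })
}

mini_bootstrap_var <- function(y_A, X_A, X_B, d_B, N, k = 1, M = 50) {
  n_A <- length(y_A)
  mu_hat <- replicate(M, {                   # repeat steps 1-3 M times
    idx <- sample.int(n_A, n_A, replace = TRUE)          # resample S_A
    y_star <- knn_impute(X_A[idx, , drop = FALSE], y_A[idx], X_B, k)
    sum(d_B * y_star) / N                    # weighted mean over S_B
  })
  var(mu_hat)                                # variance of the M estimates
}

set.seed(1)
N   <- 1000
X_A <- matrix(rnorm(200 * 2), ncol = 2)      # toy non-probability sample
y_A <- 1 + X_A[, 1] + rnorm(200)
X_B <- matrix(rnorm(50 * 2), ncol = 2)       # toy probability sample
d_B <- rep(N / 50, 50)                       # toy design weights
var_nonprob <- mini_bootstrap_var(y_A, X_A, X_B, d_B, N, k = 5)
var_nonprob
```

The brute-force search costs \(O(n_A n_B)\) per replicate, which is fine for a toy example; a tree-based search would be the natural choice for larger samples.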
(b) probability part (\(S_B\) with size \(n_B\); denoted as var_prob in the result)
This part uses functionalities of the {survey} package and the variance is estimated using the following equation:
$$ \hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}} \frac{y_i^*}{\pi_i} \frac{y_j^*}{\pi_j}, $$
where \(y^*_i\) and \(y_j^*\) are values imputed as the average over the \(k\) nearest neighbours, i.e. \(k^{-1}\sum_{k}y_k\). Note that \(\hat{V}_2\) can, in principle, be estimated in various ways depending on the type of design and on whether the population size is known.
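For intuition, the double sum in \(\hat{V}_2\) can be evaluated directly when the joint inclusion probabilities are available as a matrix. The sketch below is an assumption-laden illustration and not how the package computes var_prob (which relies on the {survey} package): v2_hat is a hypothetical helper, and the toy design is simple random sampling without replacement.

```r
# Hypothetical helper (not a nonprobsvy or survey function): direct
# evaluation of V_2 from a matrix of joint inclusion probabilities whose
# diagonal holds the first-order probabilities pi_i.
v2_hat <- function(y_star, pi_joint, N) {
  pi_i  <- diag(pi_joint)
  z     <- y_star / pi_i                               # y*_i / pi_i
  delta <- (pi_joint - outer(pi_i, pi_i)) / pi_joint   # design factor
  sum(delta * outer(z, z)) / N^2
}

# Toy design: simple random sampling without replacement
N   <- 100
n_B <- 10
set.seed(1)
y_star <- rnorm(n_B)                                   # stand-in imputed values
pi_joint <- matrix(n_B * (n_B - 1) / (N * (N - 1)), n_B, n_B)
diag(pi_joint) <- n_B / N
v2 <- v2_hat(y_star, pi_joint, N)

# Under this design the estimator reduces to the textbook formula
# (1 - n_B / N) * s^2 / n_B for the variance of a sample mean.
c(v2, (1 - n_B / N) * var(y_star) / n_B)
```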
Yang, S., Kim, J. K., & Hwang, Y. (2021). Integration of data from probability surveys and big found data for finite population inference using mass imputation. Survey Methodology, 47(1), 29-58.
Chlebicki, P., Chrostowski, Ł., & Beręsewicz, M. (2025). Data integration of non-probability and probability samples with predictive mean matching. arXiv preprint arXiv:2403.13750.
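library(survey)      # provides svydesign()
library(nonprobsvy)  # provides method_nn() and the admin and jvs data
data(admin)          # non-probability sample
data(jvs)            # probability sample
# Stratified design for the probability sample
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
                     strata = ~ size + nace + region, data = jvs)
# Nearest-neighbour mass imputation of single_shift
res_nn <- method_nn(y_nons = admin$single_shift,
                    X_nons = model.matrix(~ region + private + nace + size, admin),
                    X_rand = model.matrix(~ region + private + nace + size, jvs),
                    svydesign = jvs_svy)
res_nn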
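
The matching idea behind method_nn can also be illustrated without the package. The sketch below is a simplified stand-in on simulated data: knn_impute is a hypothetical helper that does a brute-force Euclidean search (the package uses RANN::nn2), and the design weights are assumed to be equal.

```r
# Hypothetical helper (not a nonprobsvy function); base R only.
# Mean of y over the k nearest donors in Euclidean distance.
knn_impute <- function(X_donor, y_donor, X_query, k = 1) {
  apply(X_query, 1, function(x) {
    d <- sqrt(colSums((t(X_donor) - x)^2))   # distance to every donor
    mean(y_donor[order(d)[seq_len(k)]])
  })
}

set.seed(123)
N <- 500                                              # assumed population size
X_nons <- cbind(x1 = rnorm(100), x2 = rnorm(100))     # toy non-probability sample
y_nons <- 2 + X_nons[, "x1"] - X_nons[, "x2"] + rnorm(100, sd = 0.5)
X_rand <- cbind(x1 = rnorm(40), x2 = rnorm(40))       # toy probability sample
d <- rep(N / 40, 40)                                  # equal design weights

# Impute y for every unit of the probability sample, then estimate the mean
y_hat_MI <- knn_impute(X_nons, y_nons, X_rand, k = 5)
mu_hat <- sum(d * y_hat_MI) / N
mu_hat
```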