
sae.projection (version 0.1.4)

ma_projection: Model-Assisted Projection Estimator

Description

The function addresses the problem of combining information from two or more independent surveys, a common challenge in survey sampling. It focuses on cases where:

  • Survey 1: A large sample collects only auxiliary information.

  • Survey 2: A much smaller sample collects both the variables of interest and the auxiliary variables.

The function implements a model-assisted projection estimation method based on a working model. Supported working models include several machine learning models, listed in the Details section.
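
Conceptually, the estimator fits the working model on the small sample (survey 2), projects the fitted values onto the large sample (survey 1), and then aggregates the projections by domain using the survey weights. A minimal sketch of the idea, using hypothetical data frames small_sample and large_sample with hypothetical columns y, x1, x2, domain, and weight (ma_projection() itself additionally handles model tuning, the survey design, and variance estimation):

# Working model fitted on the small sample (survey 2)
fit <- lm(y ~ x1 + x2, data = small_sample)

# Project fitted values onto the large sample (survey 1)
large_sample$y_hat <- predict(fit, newdata = large_sample)

# Weighted domain means of the projected values
ypr <- with(large_sample,
            tapply(y_hat * weight, domain, sum) / tapply(weight, domain, sum))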

Usage

ma_projection(
  formula,
  cluster_ids,
  weight,
  strata = NULL,
  domain,
  summary_function = "mean",
  working_model,
  data_model,
  data_proj,
  model_metric,
  cv_folds = 3,
  tuning_grid = 10,
  parallel_over = "resamples",
  seed = 1,
  return_yhat = FALSE,
  ...
)

Value

A list containing:

  • model – The fitted working model object.

  • prediction – A vector of predictions from the working model.

  • df_result – A data frame with:

    • domain – Domain identifier.

    • ypr – Projection estimator results for each domain.

    • var_ypr – Estimated variance of the projection estimator.

    • rse_ypr – Relative standard error of the projection estimator (in %).
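
For example, assuming res holds the list returned by ma_projection() (see Examples below), the pieces can be inspected as follows:

head(res$df_result)   # one row per domain: domain, ypr, var_ypr, rse_ypr
res$model             # the fitted working model object
head(res$prediction)  # working-model predictions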

Arguments

formula

A model formula. All variables used must exist in both data_model and data_proj.

cluster_ids

Column name (character) or formula specifying cluster identifiers from highest to lowest level. Use ~0 or ~1 if there are no clusters.
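
For illustration, each of the following is a valid specification (PSU being a cluster column present in the data):

cluster_ids = "PSU"   # character column name
cluster_ids = ~PSU    # formula, as accepted by survey::svydesign()
cluster_ids = ~1      # no clustering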

weight

Column name in data_proj representing the survey weights.

strata

Column name for stratification; use NULL if no strata are used.

domain

Character vector specifying domain variable names in both datasets.

summary_function

A character string specifying the domain-level summary to compute: "mean" (default), "total", or "variance".

working_model

A parsnip model object specifying the working model (see Details).

data_model

Data frame (small sample) containing both target and auxiliary variables.

data_proj

Data frame (large sample) containing only auxiliary variables.

model_metric

A yardstick::metric_set() function, or NULL to use default metrics.
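
For example, to evaluate a regression working model on RMSE and R-squared during tuning (a sketch; any combination of yardstick metrics may be used):

library(yardstick)
reg_metrics <- metric_set(rmse, rsq)
# then: ma_projection(..., model_metric = reg_metrics, ...)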

cv_folds

Number of folds for k-fold cross-validation.

tuning_grid

Either a data frame with tuning parameters or a positive integer specifying the number of grid search candidates.
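
A manual grid is simply a data frame whose columns match the tune() parameters of the working model. A small sketch for a hypothetical boosted tree tuned on trees, min_n, and learn_rate:

manual_grid <- expand.grid(
  trees = c(200, 500),
  min_n = c(5, 10),
  learn_rate = c(0.01, 0.1)
)
# then: ma_projection(..., tuning_grid = manual_grid, ...)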

parallel_over

Specifies the parallelization mode: "resamples", "everything", or NULL.

  • "resamples" – Tuning is performed in parallel over resamples alone. Within each resample, the preprocessor (i.e. recipe or formula) is processed once and then reused across all models that need to be fit.

  • "everything" – Tuning is performed in parallel at two levels: an outer loop iterates over resamples, and an inner loop iterates over all unique combinations of preprocessor and model tuning parameters for that resample. The preprocessor is re-processed multiple times, but this can be faster if that processing is extremely fast.
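
Parallel tuning follows the tune framework, so a parallel backend must be registered beforehand. A minimal sketch using a foreach backend such as doParallel (assuming it is installed; recent versions of tune also support the future framework):

library(doParallel)
cl <- makeCluster(2)    # two worker processes
registerDoParallel(cl)
# ... run ma_projection(..., parallel_over = "resamples") ...
stopCluster(cl)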

seed

Integer seed for reproducibility.

return_yhat

Logical; if TRUE, returns predicted y values for data_model.

...

Additional arguments passed to survey::svydesign().

Details

The following working models are supported via the parsnip interface:

  • linear_reg() – Linear regression

  • logistic_reg() – Logistic regression

  • linear_reg(engine = "stan") – Bayesian linear regression

  • logistic_reg(engine = "stan") – Bayesian logistic regression

  • poisson_reg() – Poisson regression

  • decision_tree() – Decision tree

  • nearest_neighbor() – k-Nearest Neighbors (k-NN)

  • naive_bayes() – Naive Bayes classifier

  • mlp() – Multi-layer perceptron (neural network)

  • svm_linear() – Support vector machine with linear kernel

  • svm_poly() – Support vector machine with polynomial kernel

  • svm_rbf() – Support vector machine with radial basis function (RBF) kernel

  • bag_tree() – Bagged decision tree

  • bart() – Bayesian Additive Regression Trees (BART)

  • rand_forest(engine = "ranger") – Random forest (via ranger)

  • rand_forest(engine = "aorsf") – Accelerated oblique random forest (AORF; Jaeger et al. 2022, 2024)

  • boost_tree(engine = "lightgbm") – Gradient boosting (LightGBM)

  • boost_tree(engine = "xgboost") – Gradient boosting (XGBoost)

For a complete list of supported models and engines, see Tidy Modeling With R.
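
Each entry above is a parsnip model specification; the engine and mode can be set explicitly, and tune() placeholders mark hyperparameters for cross-validated tuning. A sketch for a random forest working model:

library(parsnip)
rf_model <- rand_forest(mtry = tune(), trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")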

References

  1. Kim, J. K., & Rao, J. N. K. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99(1), 85–100.

Examples

library(sae.projection)
library(dplyr)
library(bonsai)

df_svy22_income <- df_svy22 %>% filter(!is.na(income))
df_svy23_income <- df_svy23 %>% filter(!is.na(income))

# Linear regression
lm_proj <- ma_projection(
  income ~ age + sex + edu + disability,
  cluster_ids = "PSU", weight = "WEIGHT", strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  working_model = linear_reg(),
  data_model = df_svy22_income,
  data_proj = df_svy23_income,
  nest = TRUE
)


df_svy22_neet <- df_svy22 %>% filter(between(age, 15, 24))
df_svy23_neet <- df_svy23 %>% filter(between(age, 15, 24))


# LightGBM regression with hyperparameter tuning
show_engines("boost_tree")
lgbm_model <- boost_tree(
  mtry = tune(), trees = tune(), min_n = tune(),
  tree_depth = tune(), learn_rate = tune(),
  engine = "lightgbm"
)

lgbm_proj <- ma_projection(
  formula = neet ~ sex + edu + disability,
  cluster_ids = "PSU",
  weight = "WEIGHT",
  strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  working_model = lgbm_model,
  data_model = df_svy22_neet,
  data_proj = df_svy23_neet,
  cv_folds = 3,
  tuning_grid = 3,
  nest = TRUE
)
