
sae.projection (version 0.1.4)

ma_projection: Model-Assisted Projection Estimator

Description

The function addresses the problem of combining information from two or more independent surveys, a common challenge in survey sampling. It focuses on cases where:

  • Survey 1: A large sample collects only auxiliary information.

  • Survey 2: A much smaller sample collects both the variables of interest and the auxiliary variables.

The function implements a model-assisted projection estimation method based on a working model. Supported working models include several machine learning models, listed in the Details section.
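
Conceptually, the estimator fits the working model on the small sample (survey 2), projects the fitted values onto the large sample (survey 1), and then aggregates the projections by domain using the survey weights. A minimal sketch of the idea, using hypothetical data frames small_sample and large_sample with hypothetical columns y, x1, x2, domain, and weight (ma_projection() itself additionally handles model tuning, the survey design, and variance estimation):

# Working model fitted on the small sample (survey 2)
fit <- lm(y ~ x1 + x2, data = small_sample)

# Project fitted values onto the large sample (survey 1)
large_sample$y_hat <- predict(fit, newdata = large_sample)

# Weighted domain means of the projected values
ypr <- with(large_sample,
            tapply(y_hat * weight, domain, sum) / tapply(weight, domain, sum))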

Usage

ma_projection(
  formula,
  cluster_ids,
  weight,
  strata = NULL,
  domain,
  summary_function = "mean",
  working_model,
  data_model,
  data_proj,
  model_metric,
  cv_folds = 3,
  tuning_grid = 10,
  parallel_over = "resamples",
  seed = 1,
  return_yhat = FALSE,
  ...
)

Value

A list containing:

  • model – The fitted working model object.

  • prediction – A vector of predictions from the working model.

  • df_result – A data frame with:

    • domain – Domain identifier.

    • ypr – Projection estimator results for each domain.

    • var_ypr – Estimated variance of the projection estimator.

    • rse_ypr – Relative standard error of the projection estimator (in %).
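
For example, assuming res holds the list returned by ma_projection() (see Examples below), the pieces can be inspected as follows:

head(res$df_result)   # one row per domain: domain, ypr, var_ypr, rse_ypr
res$model             # the fitted working model object
head(res$prediction)  # working-model predictions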

Arguments

formula

A model formula. All variables used must exist in both data_model and data_proj.

cluster_ids

Column name (character) or formula specifying cluster identifiers from highest to lowest level. Use ~0 or ~1 if there are no clusters.
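
For illustration, each of the following is a valid specification (PSU being a cluster column present in the data):

cluster_ids = "PSU"   # character column name
cluster_ids = ~PSU    # formula, as accepted by survey::svydesign()
cluster_ids = ~1      # no clustering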

weight

Column name in data_proj representing the survey weights.

strata

Column name for stratification; use NULL if no strata are used.

domain

Character vector specifying domain variable names in both datasets.

summary_function

A character string specifying the domain-level summary to compute: "mean" (default), "total", or "variance".

working_model

A parsnip model object specifying the working model (see Details).

data_model

Data frame (small sample) containing both target and auxiliary variables.

data_proj

Data frame (large sample) containing only auxiliary variables.

model_metric

A yardstick::metric_set() function, or NULL to use default metrics.
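
For example, to evaluate a regression working model on RMSE and R-squared during tuning (a sketch; any combination of yardstick metrics may be used):

library(yardstick)
reg_metrics <- metric_set(rmse, rsq)
# then: ma_projection(..., model_metric = reg_metrics, ...)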

cv_folds

Number of folds for k-fold cross-validation.

tuning_grid

Either a data frame with tuning parameters or a positive integer specifying the number of grid search candidates.
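
A manual grid is simply a data frame whose columns match the tune() parameters of the working model. A small sketch for a hypothetical boosted tree tuned on trees, min_n, and learn_rate:

manual_grid <- expand.grid(
  trees = c(200, 500),
  min_n = c(5, 10),
  learn_rate = c(0.01, 0.1)
)
# then: ma_projection(..., tuning_grid = manual_grid, ...)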

parallel_over

Specifies the parallelization mode: "resamples", "everything", or NULL.

  • "resamples" – Tuning is performed in parallel over resamples alone. Within each resample, the preprocessor (i.e. recipe or formula) is processed once and then reused across all models that need to be fit.

  • "everything" – Tuning is performed in parallel at two levels: an outer loop iterates over resamples, and an inner loop iterates over all unique combinations of preprocessor and model tuning parameters for that resample. The preprocessor is re-processed multiple times, but this can be faster if that processing is extremely fast.
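
Parallel tuning follows the tune framework, so a parallel backend must be registered beforehand. A minimal sketch using a foreach backend such as doParallel (assuming it is installed; recent versions of tune also support the future framework):

library(doParallel)
cl <- makeCluster(2)    # two worker processes
registerDoParallel(cl)
# ... run ma_projection(..., parallel_over = "resamples") ...
stopCluster(cl)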

seed

Integer seed for reproducibility.

return_yhat

Logical; if TRUE, returns predicted y values for data_model.

...

Additional arguments passed to survey::svydesign().

Details

The following working models are supported via the parsnip interface:

  • linear_reg() – Linear regression

  • logistic_reg() – Logistic regression

  • linear_reg(engine = "stan") – Bayesian linear regression

  • logistic_reg(engine = "stan") – Bayesian logistic regression

  • poisson_reg() – Poisson regression

  • decision_tree() – Decision tree

  • nearest_neighbor() – k-Nearest Neighbors (k-NN)

  • naive_bayes() – Naive Bayes classifier

  • mlp() – Multi-layer perceptron (neural network)

  • svm_linear() – Support vector machine with linear kernel

  • svm_poly() – Support vector machine with polynomial kernel

  • svm_rbf() – Support vector machine with radial basis function (RBF) kernel

  • bag_tree() – Bagged decision tree

  • bart() – Bayesian Additive Regression Trees (BART)

  • rand_forest(engine = "ranger") – Random forest (via ranger)

  • rand_forest(engine = "aorsf") – Accelerated oblique random forest (AORF; Jaeger et al. 2022, 2024)

  • boost_tree(engine = "lightgbm") – Gradient boosting (LightGBM)

  • boost_tree(engine = "xgboost") – Gradient boosting (XGBoost)

For a complete list of supported models and engines, see Tidy Modeling With R.
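
Each entry above is a parsnip model specification; the engine and mode can be set explicitly, and tune() placeholders mark hyperparameters for cross-validated tuning. A sketch for a random forest working model:

library(parsnip)
rf_model <- rand_forest(mtry = tune(), trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")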

References

  1. Kim, J. K., & Rao, J. N. K. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99(1), 85–100.

Examples

library(sae.projection)
library(dplyr)
library(bonsai)

df_svy22_income <- df_svy22 %>% filter(!is.na(income))
df_svy23_income <- df_svy23 %>% filter(!is.na(income))

# Linear regression
lm_proj <- ma_projection(
  income ~ age + sex + edu + disability,
  cluster_ids = "PSU", weight = "WEIGHT", strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  working_model = linear_reg(),
  data_model = df_svy22_income,
  data_proj = df_svy23_income,
  nest = TRUE
)


df_svy22_neet <- df_svy22 %>% filter(between(age, 15, 24))
df_svy23_neet <- df_svy23 %>% filter(between(age, 15, 24))


# LightGBM regression with hyperparameter tuning
show_engines("boost_tree")
lgbm_model <- boost_tree(
  mtry = tune(), trees = tune(), min_n = tune(),
  tree_depth = tune(), learn_rate = tune(),
  engine = "lightgbm"
)

lgbm_proj <- ma_projection(
  formula = neet ~ sex + edu + disability,
  cluster_ids = "PSU",
  weight = "WEIGHT",
  strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  working_model = lgbm_model,
  data_model = df_svy22_neet,
  data_proj = df_svy23_neet,
  cv_folds = 3,
  tuning_grid = 3,
  nest = TRUE
)
