projection_randomforest: Projection Estimator with Random Forest Algorithm

Description

According to Kim and Rao (2012), synthetic data obtained through the model-assisted projection method can provide a useful tool for efficient domain estimation when the sample size of survey B is much larger than the sample size of survey A.

The function projects estimated values from a small survey (survey A) onto an independent large survey (survey B) using the random forest classification algorithm. The two surveys are statistically independent, but the projection relies on auxiliary variables that both surveys share. The process includes data preprocessing, feature selection, model training, and domain-specific estimation under a "two stages, one phase" survey design. The function applies either standard estimation or bias-corrected estimation, depending on the bias_correction argument.

bias_correction = TRUE can only be used if psu, ssu, and strata are available in data_model. If they are not, the function automatically falls back to bias_correction = FALSE.
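The projection idea described above can be sketched in base R on simulated data. This is only an illustration under stated assumptions, not the function's internal code: logistic regression (glm) stands in for the random forest to avoid extra dependencies, the data frames survey_a and survey_b and the helper wmean are hypothetical, and the correction shown (adding the weighted mean residual from survey A) is one common form in the spirit of Kim and Rao (2012).

```r
# Toy illustration of projection estimation; all data are simulated.
set.seed(1)

# Small survey A: auxiliary variable x, observed binary target y, weights w
n_a <- 200
survey_a <- data.frame(
  x = rnorm(n_a),
  domain = sample(c("d1", "d2"), n_a, replace = TRUE),
  w = runif(n_a, 50, 150)
)
survey_a$y <- rbinom(n_a, 1, plogis(0.5 + survey_a$x))

# Large survey B: same auxiliary variable, but y is not observed
n_b <- 5000
survey_b <- data.frame(
  x = rnorm(n_b),
  domain = sample(c("d1", "d2"), n_b, replace = TRUE),
  w = runif(n_b, 2, 6)
)

# Working model fitted on survey A (assumption: glm as a stand-in
# for the random forest classifier)
fit <- glm(y ~ x, family = binomial, data = survey_a)
survey_a$y_hat <- predict(fit, newdata = survey_a, type = "response")
survey_b$y_hat <- predict(fit, newdata = survey_b, type = "response")

# Hypothetical helper: design-weighted mean
wmean <- function(v, w) sum(w * v) / sum(w)

# Projection estimate per domain: weighted mean of predictions in survey B
proj <- sapply(split(survey_b, survey_b$domain),
               function(d) wmean(d$y_hat, d$w))

# Correction term per domain: weighted mean residual in survey A
resid_a <- sapply(split(survey_a, survey_a$domain),
                  function(d) wmean(d$y - d$y_hat, d$w))

# Bias-corrected projection estimate
proj_corrected <- proj + resid_a[names(proj)]

print(rbind(projection = proj, corrected = proj_corrected))
```

The correction term is close to zero when the working model fits survey A well within each domain, in which case the two rows of the printed table nearly coincide.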

Usage

projection_randomforest(
  data_model,
  target_column,
  predictor_cols,
  data_proj,
  domain1,
  domain2,
  psu,
  ssu = NULL,
  strata = NULL,
  weights,
  split_ratio = 0.8,
  feature_selection = TRUE,
  bias_correction = FALSE
)

Value

A list containing the following elements:

model The trained Random Forest model.
importance Feature importance showing which features contributed most to the model's predictions.
train_accuracy Accuracy of the model on the training set.
validation_accuracy Accuracy of the model on the validation set.
validation_performance Confusion matrix for the validation set, showing performance metrics like accuracy, precision, recall, etc.
data_proj The projection data with predicted values.

If bias_correction = FALSE, the list also contains:

Domain1 Estimations for Domain 1, including estimated values, variance, and relative standard error (RSE).
Domain2 Estimations for Domain 2, including estimated values, variance, and relative standard error (RSE).

If bias_correction = TRUE, the list also contains:

Direct Direct estimations for Domain 1, including estimated values, variance, and relative standard error (RSE).
Domain1_corrected_bias Bias-corrected estimations for Domain 1, including estimated values, variance, and relative standard error (RSE).
Domain2_corrected_bias Bias-corrected estimations for Domain 2, including estimated values, variance, and relative standard error (RSE).
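For readers unfamiliar with the relative standard error, the sketch below shows how it is usually derived from an estimate and its variance. It assumes the conventional definition (standard error divided by the estimate, expressed in percent); the column names estimate, variance, and rse are hypothetical and may not match the columns the function returns.

```r
# Hypothetical domain-level results; values chosen for easy checking
est <- data.frame(
  domain   = c("A", "B"),
  estimate = c(0.40, 0.25),
  variance = c(0.0004, 0.0025)
)

# RSE in percent: 100 * standard error / estimate
est$rse <- 100 * sqrt(est$variance) / est$estimate

print(est)
```

A smaller RSE indicates a more precise domain estimate, which is why it is commonly used to compare the direct and projected results.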

Arguments

data_model: The training dataset, consisting of auxiliary variables and the target variable.
target_column: The name of the target column in the data_model.
predictor_cols: A vector of predictor column names.
data_proj: The data for projection (prediction), which needs to be projected using the trained model. It must contain the same auxiliary variables as the data_model.
domain1: Domain variables for survey estimation (e.g., "province").
domain2: Domain variables for survey estimation (e.g., "regency").
psu: Primary sampling units, representing the structure of the sampling frame.
ssu: Secondary sampling units, representing the structure of the sampling frame (default is NULL).
strata: Stratification variable, ensuring that specific subgroups are represented (default is NULL).
weights: Weights used for the direct estimation from data_model and indirect estimation from data_proj.
split_ratio: Proportion of data used for training (default is 0.8, meaning 80 percent for training and 20 percent for validation).
feature_selection: Logical; whether to perform selection of predictor variables (default is TRUE).
bias_correction: Logical; if TRUE, then bias correction is applied, if FALSE, then bias correction is not applied. Default is FALSE.
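To make split_ratio and the accuracy outputs concrete, here is a minimal base R sketch of an 80/20 split followed by training and validation accuracy. It rests on assumptions: the split is a simple random one (the function may partition differently, for example with stratification), glm is a stand-in for the random forest, and the names toy and predict_class are hypothetical.

```r
# Simulated data with two predictors and a binary factor target
set.seed(42)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0, "yes", "no"))

# Simple random split: split_ratio of the rows go to training
split_ratio <- 0.8
train_idx <- sample(seq_len(n), size = floor(split_ratio * n))
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

# Stand-in classifier (assumption: glm instead of a random forest)
fit <- glm(y ~ x1 + x2, family = binomial, data = train)

# Hypothetical helper: turn predicted probabilities into class labels
predict_class <- function(model, newdata) {
  p <- predict(model, newdata = newdata, type = "response")
  factor(ifelse(p > 0.5, "yes", "no"), levels = levels(toy$y))
}

# Confusion matrix on the validation set and both accuracies
conf_mat <- table(predicted = predict_class(fit, valid), actual = valid$y)
train_accuracy <- mean(predict_class(fit, train) == train$y)
validation_accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(c(train = train_accuracy, validation = validation_accuracy))
```

A large gap between the two accuracies would suggest overfitting, which is the usual reason for holding out a validation share.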

References

Kim, J. K., & Rao, J. N. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99(1), 85-100.

Examples

# \donttest{
library(survey)
library(caret)
library(dplyr)

data_A <- df_svy_A
data_B <- df_svy_B

# Get predictor variables from data_model
x_predictors <- data_A %>% select(5:19) %>% names()

# Run projection_randomforest with bias correction
rf_proj_corrected <- projection_randomforest(
                data_model = data_A,
                target_column = "Y",
                predictor_cols = x_predictors,
                data_proj = data_B,
                domain1 = "province",
                domain2 = "regency",
                psu = "num",
                ssu = NULL,
                strata = NULL,
                weights = "weight",
                feature_selection = TRUE,
                bias_correction = TRUE)

rf_proj_corrected$Direct
rf_proj_corrected$Domain1_corrected_bias
rf_proj_corrected$Domain2_corrected_bias

# }
