projection_xgboost: Projection Estimator with XGBoost Algorithm

Description

Kim and Rao (2012) proposed a model-assisted projection estimation method for two independent surveys, where the first survey (A1) has a large sample that collects only auxiliary variables, while the second survey (A2) has a smaller sample but contains information on both the focal variable and the auxiliary variables. This method uses a Working Model (WM) to relate the focal variable to the auxiliary variables based on data from A2, and then predicts the value of the focal variable for A1. A projection estimator is then obtained from the A1 sample using the resulting synthetic values. This approach produces estimators that are asymptotically unbiased and can improve the efficiency of domain estimation, especially when the sample size in survey 1 is much larger than that in survey 2.
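
In symbols, the projection estimator described above can be sketched as follows. The notation is an assumption chosen for illustration: w_{i1} and w_{i2} denote the design weights in surveys A1 and A2, and m(.) denotes the working model fitted on A2.

```latex
% Synthetic values for every unit of the large survey A1,
% using the working model fitted on the small survey A2
\hat{y}_i = m(x_i; \hat{\beta}), \qquad i \in A_1

% Projection estimator of the population total
\hat{Y}_{p} = \sum_{i \in A_1} w_{i1}\, \hat{y}_i

% Condition on the fitted working model under which the
% projection estimator is asymptotically unbiased
\sum_{i \in A_2} w_{i2}\, \bigl( y_i - \hat{y}_i \bigr) = 0
```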

This function applies the XGBoost algorithm to project estimated values from a small survey onto an independent larger survey. While the two surveys are statistically independent, the projection is based on auxiliary variables common to both. The process in this function involves data preprocessing, feature selection, selecting the best model through hyperparameter tuning, and performing domain-specific estimation following survey design principles.
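
The projection idea itself can be illustrated without XGBoost. The following is a minimal sketch in base R, assuming simulated data and a linear working model in place of the tuned XGBoost model; the names `small`, `large`, and `proj_est` are hypothetical and do not come from the package.

```r
# Minimal sketch of a projection estimator with a linear working model.
# All data are simulated; this is not the package's implementation.
set.seed(1)

# Large survey: auxiliary variable x, a domain, and a survey weight only
large <- data.frame(
  x      = rnorm(500),
  domain = sample(c("a", "b"), 500, replace = TRUE),
  weight = runif(500, 1, 3)
)

# Small survey: auxiliary variable x plus the target y
small <- data.frame(x = rnorm(60))
small$y <- 2 + 1.5 * small$x + rnorm(60, sd = 0.5)

# 1. Fit the working model on the small survey
wm <- lm(y ~ x, data = small)

# 2. Predict synthetic values for every unit of the large survey
large$y_hat <- predict(wm, newdata = large)

# 3. Weighted mean of the synthetic values within each domain
proj_est <- sapply(split(large, large$domain), function(d) {
  weighted.mean(d$y_hat, d$weight)
})
proj_est
```

In this sketch the domain estimates use the weights of the large survey, which mirrors the role of the `weight` column in `data_proj`.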

Usage

projection_xgboost(
  target_col,
  data_model,
  data_proj,
  id,
  STRATA = NULL,
  domain1,
  domain2,
  weight,
  task_type,
  test_size = 0.2,
  nfold = 5,
  corrected_bias = FALSE,
  feature_selection = TRUE
)

Value

A list containing the following components:

metadata

A list of metadata about the modeling process, including:

method: Description of the method used (e.g., "Projection Estimator With XGBoost Algorithm"),
model_type: The type of model, either "classification" or "regression",
feature_selection_used: Logical, whether feature selection was used,
corrected_bias_applied: Logical, whether bias correction was applied,
n_features_used: Number of predictor variables used,
model_params: The hyperparameters and settings of the final XGBoost model,
features_selected (optional): Names of features selected, if feature selection was applied.

estimation

A list of projection estimation results, including:

projected_data: The dataset used for projection (e.g., at the regency/city, or kabupaten/kota, level) with predicted values,
domain1_estimation: Estimated values for domain 1 (e.g., province level), including Estimation, RSE, and var,
domain2_estimation: Estimated values for domain 2 (e.g., regency level), including Estimation, RSE, and var.
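
As a reminder of how the RSE relates to the other two columns, here is a small sketch with made-up numbers. It assumes the usual definition of the relative standard error and that it is expressed as a percentage; whether the function reports it on a percentage scale is not stated in this page.

```r
# Hypothetical values, for illustration only
estimate <- 0.25     # a domain estimate
variance <- 0.0004   # the variance of that estimate

# Relative standard error: standard error divided by the estimate, in percent
rse <- 100 * sqrt(variance) / estimate
rse
```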

performance

(Only if applicable) A list of model performance metrics:

mean_train_accuracy, final_accuracy, confusion_matrix (for classification),
mean_train_rmse, final_rmse (for regression).

bias_correction

(Optional) A list of bias correction results, returned only if corrected_bias = TRUE, including:

direct_estimation: Direct estimation before correction,
corrected_domain1: Bias-corrected estimates for domain 1,
corrected_domain2: Bias-corrected estimates for domain 2.
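
One common form of bias correction for a projection estimator adds the weighted mean residual from the small survey to the projected estimate. The sketch below uses made-up numbers and hypothetical variable names; it illustrates the general idea only and may differ from what the function computes internally.

```r
# Hypothetical values, for illustration only
y     <- c(10, 12, 9, 15, 11)    # observed target in the small survey
y_hat <- c(11, 11, 10, 13, 11)   # working-model predictions for the same units
w     <- c(2, 1, 3, 1, 3)        # small-survey weights

proj_mean <- 11.4                # projected mean from the large survey (assumed)

# The weighted mean residual estimates the bias of the synthetic values
bias_hat <- weighted.mean(y - y_hat, w)

# Bias-corrected estimate
corrected_mean <- proj_mean + bias_hat
corrected_mean
```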

Arguments

target_col: The name of the column that contains the target variable in the data_model.
data_model: A data frame or a data frame extension (e.g., a tibble) representing the training dataset, which consists of auxiliary variables and the target variable. This dataset is characterized by a smaller sample size and provides information on both the variable of interest and the auxiliary variables.
data_proj: A data frame or a data frame extension (e.g., a tibble) representing the projection dataset, which is characterized by a larger sample size that collects only auxiliary information or general-purpose variables. This dataset must contain the same auxiliary variables as the data_model and is used for making predictions based on the trained model.
id: Column name specifying the cluster IDs, ordered from the largest level to the smallest level; use the formula ~0 or ~1 to indicate that there are no clusters.
STRATA: The name of the column that specifies the strata; set to NULL if no stratification is required.
domain1: Domain variable for higher-level survey estimation (e.g., "province").
domain2: Domain variable for more granular survey estimation at a lower administrative level (e.g., "regency").
weight: The name of the column in data_proj that represents the survey weight, usually used for indirect estimation.
task_type: A string that specifies the modeling objective, indicating whether the task is for classification or regression. Use "classification" for tasks where the goal is to categorize data into discrete classes, such as predicting whether an email is spam or not. Use "regression" for tasks where the goal is to predict a continuous outcome, such as forecasting sales revenue or predicting house prices.
test_size: The proportion of data used for testing, with the remaining data used for training (default is 0.2).
nfold: The number of data partitions (folds) used for cross-validation (default is 5).
corrected_bias: A logical value indicating whether to apply bias correction to the estimation results from the modeling process (default is FALSE). When set to TRUE, the estimates are adjusted to account for systematic bias in the model predictions.
feature_selection: A logical value indicating whether to select predictor variables before modeling (default is TRUE).
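
To make the roles of test_size and nfold concrete, the following base R sketch splits a hypothetical set of 100 rows into a test set and a training set, then assigns the training rows to cross-validation folds. The variable names are illustrative and the function's own splitting procedure may differ.

```r
set.seed(42)
n         <- 100   # hypothetical number of rows in data_model
test_size <- 0.2
nfold     <- 5

# Hold out a test set containing test_size * n rows
test_idx  <- sample(n, size = round(test_size * n))
train_idx <- setdiff(seq_len(n), test_idx)

# Assign each training row to one of nfold roughly equal folds
fold_id <- sample(rep(seq_len(nfold), length.out = length(train_idx)))

table(fold_id)
```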

References

Kim, J. K., & Rao, J. N. K. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99(1), 85-100.
According to Kim and Rao (2012), the synthetic data obtained through the model-assisted projection method can provide a useful tool for efficient domain estimation when the sample size in survey 1 is much larger than the sample size in survey 2.

Examples


# \donttest{
library(xgboost)
library(caret)
library(FSelector)
library(glmnet)
library(recipes)

Data_A <- df_svy_A
Data_B <- df_svy_B

# Data_A (with the target Y) trains the model; Data_B receives the projection
hasil <- projection_xgboost(
  target_col = "Y",
  data_model = Data_A,
  data_proj = Data_B,
  id = "num",
  STRATA = NULL,
  domain1 = "province",
  domain2 = "regency",
  weight = "weight",
  nfold = 3,
  test_size = 0.2,
  task_type = "classification",
  corrected_bias = TRUE,
  feature_selection = TRUE
)
# }
