impute_batches: Impute batches and return completed data frame

Description

Imputes each batch individually and returns the joined, completed data frame.

Usage

impute_batches(data, features, batch, pmm_k, n_trees, seed, save)

Value

A completed, imputed data set

Arguments

data: Original data frame or tibble (with missing values).
features: Vector of features ranked by correlation, as returned by flatten_mat().
batch: Numeric. Batch size.
pmm_k: Integer. Number of neighbors considered in imputation. Defaults to 5.
n_trees: Integer. Number of trees used in imputation. Defaults to 15.
seed: Integer. Seed to set for reproducibility.
save: Logical. Whether to save the list of individually imputed batches as an .rds file in the working directory. Defaults to FALSE.

Details

Step 1. Group the data by dividing row_number() by the batch size (batch, set by the user) using integer division.
Step 2. Pass the grouped data through group_split() to return a list of batches.
Step 3. Impute each batch individually and time the imputation.
Step 4. Join the imputed batches back into a single completed data frame.
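
The steps above can be sketched roughly as follows. This is an illustrative sketch only, not the package's internal code: the toy data, the batch_size variable, the .grp column, and the impute_one() helper are all hypothetical, and a simple column-mean fill stands in for the real imputation. It also assumes the groups are formed over rows and rejoined row-wise.

```r
library(dplyr)

# Toy data with a few missing values (illustrative only)
toy <- tibble(
  x = c(1, NA, 3, 4, 5, 6),
  y = c(2, 4, NA, 8, 10, 12)
)
batch_size <- 2  # hypothetical: rows per group

# Step 1: integer division of the zero-based row number gives a group id
# Step 2: group_split() returns a list with one tibble per group
batches <- toy %>%
  mutate(.grp = (row_number() - 1) %/% batch_size) %>%
  group_split(.grp, .keep = FALSE)

# Stand-in imputer (assumption): fill NA with the column mean of the batch
impute_one <- function(df) {
  df %>%
    mutate(across(everything(),
                  ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
}

# Step 3: impute each batch individually and time the whole pass
start <- Sys.time()
imputed <- lapply(batches, impute_one)
elapsed <- Sys.time() - start

# Step 4: join the imputed batches back into one completed data frame
completed <- bind_rows(imputed)
```

Subtracting 1 from row_number() before the integer division keeps every group at exactly batch_size rows, except possibly the last one.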

References

Waggoner, P. D. (2023). A batch process for high dimensional imputation. Computational Statistics, 1-22. doi: <10.1007/s00180-023-01325-9>

Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. doi: <10.1093/bioinformatics/btr597>

Examples

if (FALSE) {
  impute_batches(data = data, features = flat_mat,
                 batch = 2, pmm_k = 5, n_trees = 15, seed = 123,
                 save = FALSE)
}
