pre_process: Data preprocessing

Description

These functions are run in evaluate just prior to model fitting. They extract fitting and test sets from the entire dataset and apply transformations to pre-process the data (e.g. handling missing values, scaling, or compression). They can also be used to adapt the form of the data to a specific fitting function, e.g. pre_pamr, which transposes the dataset to make it compatible with the pamr classification method.

Usage

pre_split(x, y, fold)
pre_convert(data, x_fun, y_fun, ...)
pre_transpose(data)
pre_remove(data, feature)
pre_center(data, y = FALSE, na.rm = TRUE)
pre_scale(data, y = FALSE, na.rm = TRUE, center = TRUE)
pre_remove_constant(data, na.rm = TRUE)
pre_remove_correlated(data, cutoff)
pre_pca(data, ncomponent, scale. = TRUE, ...)

Arguments

Value

A list containing the fitting set (component fit) and the test set, each pre-processed as requested.

Details

When supplied to evaluate, pre-processing functions can be chained (i.e. executed sequentially) after an initiating call to pre_split. This can be done either with the pipe operator defined in the magrittr package or by putting all pre-processing functions in a regular list (see the examples).
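As a rough illustration of why the pipe form and the list form are equivalent, the sketch below applies a list of functions one after another in base R. The helpers toy_split, toy_center, toy_double and run_chain are hypothetical stand-ins invented for this example, not functions of the package, and the logical fold vector is a simplifying assumption about how a fold might be encoded.

```r
# Hypothetical stand-ins for pre-processing steps: the first one splits the
# data, the others take and return a list with `fit` and `test` components.
toy_split <- function(x, fold) {
    list(fit = x[fold, , drop = FALSE], test = x[!fold, , drop = FALSE])
}
toy_center <- function(data) {
    m <- colMeans(data$fit)              # estimated from the fitting set only
    data$fit <- sweep(data$fit, 2, m)    # subtract the column means
    data$test <- sweep(data$test, 2, m)
    data
}
toy_double <- function(data) {
    data$fit <- data$fit * 2
    data$test <- data$test * 2
    data
}

# Call the first function on the raw input, then feed each result to the next.
run_chain <- function(steps, ...) {
    data <- steps[[1]](...)
    for (step in steps[-1]) data <- step(data)
    data
}

x <- matrix(1:12, nrow = 6)
fold <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)  # assumed encoding: TRUE = fitting set

chained <- run_chain(list(toy_split, toy_center, toy_double), x, fold)
nested <- toy_double(toy_center(toy_split(x, fold)))
identical(chained, nested)
```

Here the list is only a compact way of writing the nested call, which is also what a pipe expresses from left to right.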

Note that all transformations are defined based on the fitting data only and then applied to both the fitting set and the test set. It is important that the test data never takes part in the model fitting, including the pre-processing, since that would risk information leakage and biased results!
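The principle can be sketched in base R without the package: scaling parameters are estimated on the fitting rows and then reused unchanged for the test rows. The variable names and the 7/3 split are assumptions made for this illustration only.

```r
set.seed(1)
x <- matrix(rnorm(40, mean = 5, sd = 2), nrow = 10)
is_fit <- rep(c(TRUE, FALSE), times = c(7, 3))   # assumed split: 7 fitting rows, 3 test rows

fit <- x[is_fit, , drop = FALSE]
test <- x[!is_fit, , drop = FALSE]

# Estimate the transformation from the fitting set only
mu <- colMeans(fit)
sigma <- apply(fit, 2, sd)

# Apply the same parameters to both sets
fit_scaled <- scale(fit, center = mu, scale = sigma)
test_scaled <- scale(test, center = mu, scale = sigma)

colMeans(fit_scaled)    # zero up to rounding error
colMeans(test_scaled)   # generally not zero, and that is expected
```

Computing mu and sigma from all of x instead would let the test rows influence the transformation, which is the kind of leakage the note above warns against.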

The imputation functions can also be used outside of evaluate by not supplying a fold to pre_split. See the code of impute_median for an example.

Examples


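# Load the packages used below (assumes the functions documented here come from emil)
library(emil)
library(magrittr)  # provides the %>% pipe

# Set up an example to work on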
x <- as.matrix(iris[-5])
x[sample(600, 6)] <- NA
y <- iris$Species
cv <- resample("crossvalidation", y, nrepeat=3, nfold=4)
procedure <- modeling_procedure("lda")

# Simple dataset splitting
sets <- pre_split(x, y, cv[[1]])

# Chaining using the pipe operator
sets <- pre_split(x, y, cv[[1]]) %>%
    pre_impute_median %>%
    pre_scale

# Integration with `evaluate`
result <- evaluate(procedure, x, y, resample=cv,
    pre_process = function(...){
        pre_split(...) %>%
        pre_impute_median %>%
        pre_scale
    }
)

# or analogously with a list
result <- evaluate(procedure, x, y, resample=cv,
    pre_process = list(pre_split, pre_impute_median, pre_scale))

# Imputing without splitting
x.imputed <- impute_knn(x)

# Using a whole chain without splitting
x.processed <- pre_split(x, y=NULL) %>%
    pre_impute_median %>%
    pre_scale %>%
    (function(data) data$fit$x)
