
DriveML (version 0.1.0)

autoDataprep: Automatic data preparation for ML algorithm

Description

Performs the final data preparation before an ML algorithm is applied. The function returns the final data set together with highlights of the data preparation.

Usage

autoDataprep(data, target = NULL, missimpute = "default",
  auto_mar = FALSE, mar_object = NULL, dummyvar = TRUE,
  char_var_limit = 12, aucv = 0.02, corr = 0.99,
  outlier_flag = FALSE, interaction_var = FALSE,
  frequent_var = FALSE, uid = NULL, onlykeep = NULL, drop = NULL,
  verbose = FALSE)

Arguments

data

[data.frame | Required] dataframe or data.table

target

[character | Required] name of the dependent variable (binary or multiclass)

missimpute

[text | Optional] missing value imputation using the mlr impute function. See more methods in Details

auto_mar

[logical | Optional] identify whether variables with missing values are missing completely at random or not (default FALSE). If TRUE, this will call autoMAR()

mar_object

[character | Optional] object created from autoMAR function

dummyvar

[logical | Optional] categorical feature engineering i.e. one hot encoding (default is TRUE)

char_var_limit

[integer | Optional] limit on the number of levels of a categorical variable for dummy variable preparation (default is 12). Ex: if a gender variable has two different values, "M" and "F", then gender has 2 levels

aucv

[numeric | Optional] cut-off value for AUC-based variable selection (default is 0.02)

corr

[numeric | Optional] cut-off value for correlation-based variable selection (default is 0.99)

outlier_flag

[logical | Optional] add outlier features (default is FALSE)

interaction_var

[logical | Optional] bulk interactions transformer for numerical features

frequent_var

[logical | Optional] Frequent transformer for categorical features

uid

[character | Optional] unique identifier column if any to keep in the final data set

onlykeep

[character | Optional] only consider selected variables for data preparation

drop

[character | Optional] exclude variable list from the data preparation

verbose

[logical | Optional] display execution steps on the console (default is FALSE)

Value

A list containing the following objects:

complete_data

complete data set, including the new features engineered from the functional understanding of the dataset

master_data

filtered data set based on the input parameter

final_var_list

list of master variables

auc_var

list of auc variables

cor_var

list of correlation variables

overall_var

all variables in the dataset

zerovariance

zero variance variables in the dataset

Details

Missing value imputation using the impute function from MLR

The MLR package provides several methods for imputing missing values. The default methods are listed below:

  • mean value for integer variable

  • median value for numeric variable

  • mode value for character or factor variable
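
The default rules above can be sketched in base R. This is only an illustration, not the MLR or DriveML implementation: the helper names impute_default and stat_mode and the toy data are hypothetical, and rounding the mean so that an integer column stays integer is an assumption made for this example.

```r
# Most frequent non-missing value (table() ignores NA by default)
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Hypothetical helper that mirrors the listed defaults:
# mean for integer, median for numeric, mode for character or factor
impute_default <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (!anyNA(x)) next
    if (is.integer(x)) {
      # assumption: round so the column keeps its integer type
      fill <- as.integer(round(mean(x, na.rm = TRUE)))
    } else if (is.numeric(x)) {
      fill <- median(x, na.rm = TRUE)
    } else if (is.factor(x) || is.character(x)) {
      fill <- stat_mode(x)
    } else {
      next
    }
    x[is.na(x)] <- fill
    df[[col]] <- x
  }
  df
}

# Toy data, invented for the example
toy <- data.frame(
  age  = c(40L, NA, 60L, 50L),
  chol = c(200.5, 180.0, NA, 1000.0),
  sex  = c("M", "F", NA, "M"),
  stringsAsFactors = FALSE
)
imputed <- impute_default(toy)
```

The integer check comes before the numeric check because integer vectors also satisfy is.numeric() in R.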

Optional: You might be interested in imputing missing values using an ML method. The algorithms in the MLR package that can handle missing values can be listed with listLearners("classif", check.packages = TRUE, properties = "missings")[c("class", "package")]

Feature engineering

  • Variables that are not missing completely at random, identified using the autoMAR function

  • Date transformer like year, month, quarter, week

  • Frequent transformer counts each categorical value in the dataset

  • Interaction transformer using multiplication

  • one hot dummy coding for categorical value

  • outlier flag and capping variable for numerical value
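
As a rough illustration of what these transformers produce, the sketch below builds each kind of feature by hand in base R. It does not reproduce the DriveML internals: the toy data and column names are invented, and the 1.5 * IQR rule used for the outlier flag and capping is an assumption, since the outlier method is not stated here.

```r
# Toy data, invented for the example
toy <- data.frame(
  visit = as.Date(c("2020-01-15", "2020-04-02", "2020-11-30", "2020-04-20")),
  sex   = factor(c("M", "F", "M", "M")),
  age   = c(40, 52, 61, 47),
  chol  = c(200, 180, 950, 210)
)

# Date transformer: derive year, month and quarter from a Date column
toy$visit_year    <- as.integer(format(toy$visit, "%Y"))
toy$visit_month   <- as.integer(format(toy$visit, "%m"))
toy$visit_quarter <- (toy$visit_month - 1L) %/% 3L + 1L

# Frequent transformer: how often each categorical value occurs
freq <- table(toy$sex)
toy$sex_freq <- as.integer(freq[as.character(toy$sex)])

# Interaction transformer: product of two numeric features
toy$age_x_chol <- toy$age * toy$chol

# One hot dummy coding: one 0/1 column per level of the factor
dummies <- model.matrix(~ sex - 1, data = toy)

# Outlier flag and capping (assumption: 1.5 * IQR fences)
q     <- quantile(toy$chol, c(0.25, 0.75), names = FALSE)
lower <- q[1] - 1.5 * (q[2] - q[1])
upper <- q[2] + 1.5 * (q[2] - q[1])
toy$chol_outlier <- as.integer(toy$chol < lower | toy$chol > upper)
toy$chol_capped  <- pmin(pmax(toy$chol, lower), upper)
```

In this toy data only the chol value 950 falls outside the fences, so it is the only row flagged and capped.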

Feature reduction

  • Zero variance using nearZeroVar caret function

  • Pearson's Correlation value

  • AUC with target variable
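
The three reduction steps can be sketched in base R as follows. Several points are assumptions rather than documented DriveML behaviour: a plain zero variance check stands in for caret's nearZeroVar, the later variable of a highly correlated pair is the one dropped, and the AUC cut-off is read as a minimum distance from 0.5. The helper auc_rank and the toy data are hypothetical.

```r
# AUC of one numeric feature against a 0/1 target, computed from ranks
# (equivalent to the Mann-Whitney U statistic; average ranks handle ties)
auc_rank <- function(x, y) {
  r  <- rank(x)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data, invented for the example
y  <- rep(c(0, 1), each = 5)
df <- data.frame(
  signal   = c(1, 2, 3, 4, 6, 5, 7, 8, 9, 10),
  noise    = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  constant = rep(1, 10)
)
df$signal_copy <- df$signal * 2

# 1. Zero variance: drop columns with a single distinct value
#    (simplified stand-in for caret::nearZeroVar)
is_const <- vapply(df, function(x) length(unique(x)) < 2, logical(1))
zero_var <- names(df)[is_const]
kept     <- df[setdiff(names(df), zero_var)]

# 2. Pearson correlation: drop the later variable of any pair above 0.99
cm <- abs(cor(kept))
cm[lower.tri(cm, diag = TRUE)] <- 0
cor_drop <- colnames(cm)[apply(cm, 2, function(col) any(col > 0.99))]
kept     <- kept[setdiff(names(kept), cor_drop)]

# 3. AUC with the target (assumption: keep if |AUC - 0.5| > 0.02)
auc      <- vapply(kept, auc_rank, numeric(1), y = y)
auc_keep <- names(auc)[abs(auc - 0.5) > 0.02]
```

Removing the constant column first avoids the NA correlations that cor() returns for a variable with zero standard deviation.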

See Also

impute

Examples

# NOT RUN {
library(DriveML)

# Auto data prep on the heart data set
traindata <- autoDataprep(heart, target = "target_var", missimpute = "default",
                          dummyvar = TRUE, aucv = 0.02, corr = 0.98,
                          outlier_flag = TRUE, interaction_var = TRUE,
                          frequent_var = TRUE)

# Filtered data set (see Value), referenced by its full name
train <- traindata$master_data

# Print auto data prep object
printautoDataprep(traindata)

# }
