Final data preparation before an ML algorithm. The function returns the final data set and highlights of the data preparation.
autoDataprep(data, target = NULL, missimpute = "default",
auto_mar = FALSE, mar_object = NULL, dummyvar = TRUE,
char_var_limit = 12, aucv = 0.02, corr = 0.99,
outlier_flag = FALSE, interaction_var = FALSE,
frequent_var = FALSE, uid = NULL, onlykeep = NULL, drop = NULL,
verbose = FALSE)
data
[data.frame | Required] data.frame or data.table
target
[integer | Required] dependent variable (binary or multiclass)
missimpute
[text | Optional] missing value imputation using the mlr impute function. See more methods in details
auto_mar
[logical | Optional] identify whether any variables with missing values are missing completely at random or not (default FALSE). If TRUE this will call autoMAR()
mar_object
[character | Optional] object created by the autoMAR function
dummyvar
[logical | Optional] categorical feature engineering, i.e. one hot encoding (default TRUE)
char_var_limit
[integer | Optional] limit on the number of levels for dummy variable preparation (default 12). Ex: if a gender variable has two different values, "M" and "F", then gender has 2 levels
aucv
[numeric | Optional] cutoff value for AUC based variable selection
corr
[numeric | Optional] cutoff value for correlation based variable selection
outlier_flag
[logical | Optional] add outlier features (default FALSE)
interaction_var
[logical | Optional] bulk interaction transformer for numerical features
frequent_var
[logical | Optional] frequent transformer for categorical features
uid
[character | Optional] unique identifier column, if any, to keep in the final data set
onlykeep
[character | Optional] only consider the selected variables for data preparation
drop
[character | Optional] list of variables to exclude from the data preparation
verbose
[logical | Optional] display execution steps on the console (default FALSE)
The output is a list containing the objects below
complete_data
complete data set, including new features derived from the functional understanding of the data set
master_data
filtered data set based on the input parameter
final_var_list
list of master variables
auc_var
list of variables from the AUC based selection
cor_var
list of variables from the correlation based selection
overall_var
all variables in the dataset
zerovariance
zero variance variables in the dataset
Missing value imputation using the impute function from mlr
The mlr package can impute missing values using multiple methods. The default methods are listed below
mean value for integer variable
median value for numeric variable
mode value for character or factor variable
Optional: you may want to impute missing values using an ML method. The learners in the mlr package that can handle missing values are listed by listLearners("classif", check.packages = TRUE, properties = "missings")[c("class", "package")]
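The default rules above (mean for integer, median for numeric, mode for character or factor) can be illustrated with a minimal base R sketch. This is not the mlr or autoDataprep implementation: impute_default and the toy data are hypothetical, and rounding the mean back to an integer is an assumption made only for this example.

```r
# Hypothetical helper: a base R illustration of the default imputation rules.
# It does not call mlr and is not the package's own code.
impute_default <- function(df) {
  # Most frequent non-missing value (table() ignores NA by default)
  stat_mode <- function(x) {
    counts <- table(x)
    names(counts)[which.max(counts)]
  }
  for (col in names(df)) {
    x <- df[[col]]
    miss <- is.na(x)
    if (!any(miss)) next
    if (is.integer(x)) {
      # Assumption: the mean is rounded so the column stays integer
      x[miss] <- as.integer(round(mean(x, na.rm = TRUE)))
    } else if (is.numeric(x)) {
      x[miss] <- median(x, na.rm = TRUE)
    } else if (is.factor(x) || is.character(x)) {
      x[miss] <- stat_mode(x)
    }
    df[[col]] <- x
  }
  df
}

# Toy data with one missing value per column
toy <- data.frame(
  age  = c(40L, NA, 60L, 50L),
  chol = c(200.5, 180.0, NA, 220.0),
  sex  = c("M", "F", NA, "M"),
  stringsAsFactors = FALSE
)
imputed <- impute_default(toy)
print(imputed)
```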
Feature engineering
Missing-not-completely-at-random variables using the autoMAR function
Date transformer such as year, month, quarter, week
Frequent transformer counts each categorical value in the dataset
Interaction transformer using multiplication
One hot dummy coding for categorical values
Outlier flag and capping variable for numerical values
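The transformers listed above can be sketched in base R on a toy data frame. This is an illustration only, not the code autoDataprep runs: the column names, the derived variable names, and the 1.5 x IQR rule used for the outlier flag and capping are all assumptions chosen for the example.

```r
# Toy data; all column names are hypothetical
toy <- data.frame(
  visit = as.Date(c("2020-01-15", "2020-04-02", "2020-07-20",
                    "2020-10-05", "2020-12-31")),
  cp    = c("a", "b", "a", "c", "a"),
  age   = c(40, 50, 60, 45, 55),
  chol  = c(200, 180, 220, 210, 600),
  stringsAsFactors = FALSE
)

# Date transformer: year, month and quarter extracted from a Date column
toy$visit_year    <- as.integer(format(toy$visit, "%Y"))
toy$visit_month   <- as.integer(format(toy$visit, "%m"))
toy$visit_quarter <- quarters(toy$visit)

# Frequent transformer: how often each categorical value occurs
freq <- table(toy$cp)
toy$cp_freq <- as.integer(freq[toy$cp])

# Interaction transformer: product of two numerical features
toy$age_x_chol <- toy$age * toy$chol

# One hot dummy coding: one 0/1 column per level of cp
dummies <- model.matrix(~ cp - 1, data = toy)

# Outlier flag and capping (assumption: 1.5 * IQR fences)
q     <- quantile(toy$chol, c(0.25, 0.75))
lower <- q[[1]] - 1.5 * (q[[2]] - q[[1]])
upper <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
toy$chol_outlier <- as.integer(toy$chol < lower | toy$chol > upper)
toy$chol_capped  <- pmin(pmax(toy$chol, lower), upper)

print(cbind(toy, dummies))
```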
Feature reduction
Zero variance using the nearZeroVar function from caret
Pearson's correlation value
AUC with target variable
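The three reduction steps can be sketched in base R as follows. This is a hedged illustration, not the package's implementation: a simple "fewer than two distinct values" check stands in for caret's nearZeroVar, the rule for which variable of a correlated pair is dropped is an assumption, and so is the way the aucv cutoff is applied (here, a variable is kept when its AUC differs from 0.5 by more than the cutoff).

```r
# Toy data with a binary target; all names are hypothetical
toy <- data.frame(
  x1     = c(1, 2, 3, 4, 5, 6),
  x2     = c(2, 4, 6, 8, 10, 12.5),  # almost collinear with x1
  x3     = c(5, 1, 4, 2, 6, 3),
  const  = rep(1, 6),                # zero variance
  target = c(0, 0, 0, 1, 1, 1)
)
predictors <- setdiff(names(toy), "target")

# Step 1: zero variance (stand-in for caret::nearZeroVar)
is_constant <- vapply(toy[predictors],
                      function(x) length(unique(x)) < 2, logical(1))
zero_var <- predictors[is_constant]
kept     <- setdiff(predictors, zero_var)

# Step 2: Pearson correlation filter.
# Assumption: of each pair above the cutoff, the later column is dropped.
corr_cut <- 0.99
cm <- abs(cor(toy[kept]))
cm[upper.tri(cm, diag = TRUE)] <- 0
high_corr <- kept[apply(cm > corr_cut, 1, any)]
kept      <- setdiff(kept, high_corr)

# Step 3: AUC of each remaining variable against the target,
# computed from ranks (equivalent to the Mann-Whitney U statistic)
auc_rank <- function(x, y) {
  r  <- rank(x)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc <- vapply(toy[kept], auc_rank, numeric(1), y = toy$target)

# Assumption: keep variables whose AUC is more than aucv away from 0.5
aucv     <- 0.02
auc_keep <- names(auc)[abs(auc - 0.5) > aucv]

print(list(zero_var = zero_var, high_corr = high_corr,
           auc = auc, auc_keep = auc_keep))
```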
# NOT RUN {
# Auto data prep
traindata <- autoDataprep(heart, target = "target_var", missimpute = "default",
                          dummyvar = TRUE, aucv = 0.02, corr = 0.98,
                          outlier_flag = TRUE, interaction_var = TRUE,
                          frequent_var = TRUE)
# Extract the filtered data set (full element name, no partial matching)
train <- traindata$master_data
# Print auto data prep object
printautoDataprep(traindata)
# }