The data_cleansing function is a simple wrapper around several data cleaning functions. It checks the format of dat and target; deletes variables whose values are all NA; deletes low variance variables; replaces null, NULL, or blank values with NA; and encodes variables whose NA and missing value rate is more than 95%.
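The steps listed above can be illustrated with a small base R sketch. This is not the creditmodel implementation: `toy_dat`, `clean_sketch`, and `max_same_rate` are hypothetical names, and treating a column as "low variance" when its most frequent value covers more than 95% of non-missing rows is an assumption made only for this example.

```r
# Illustrative sketch only; not the code used inside data_cleansing().
# toy_dat, clean_sketch and max_same_rate are hypothetical names.
toy_dat <- data.frame(
  id     = 1:5,
  income = c("1000", "", "NULL", "2500", "null"),
  empty  = c(NA, NA, NA, NA, NA),
  const  = c("a", "a", "a", "a", "a"),
  target = c(0, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

clean_sketch <- function(dat, max_same_rate = 0.95) {
  # 1. Replace "null", "NULL" and blank strings with NA in character columns.
  for (col in names(dat)) {
    if (is.character(dat[[col]])) {
      x <- dat[[col]]
      x[!is.na(x) & trimws(x) %in% c("", "null", "NULL")] <- NA
      dat[[col]] <- x
    }
  }
  # 2. Drop columns whose values are all NA.
  all_na <- vapply(dat, function(x) all(is.na(x)), logical(1))
  dat <- dat[, !all_na, drop = FALSE]
  # 3. Drop low variance columns (assumed rule: the most frequent value
  #    covers more than max_same_rate of the non-missing rows).
  low_var <- vapply(dat, function(x) {
    x <- x[!is.na(x)]
    length(x) > 0 && max(table(x)) / length(x) > max_same_rate
  }, logical(1))
  dat[, !low_var, drop = FALSE]
}

cleaned <- clean_sketch(toy_dat)
print(names(cleaned))
```

In this toy data the `empty` column is removed because it is entirely NA, and `const` is removed because it holds a single value. The real function applies further steps (format checks, outlier and missing value processing) that the sketch does not attempt to reproduce.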
data_cleansing(dat, target = NULL, x_list = NULL, obs_id = NULL,
occur_time = NULL, pos_flag = NULL, miss_values = NULL,
ex_cols = NULL, outlier_proc = TRUE, missing_proc = TRUE,
low_var = TRUE, one_hot = FALSE, parallel = FALSE, note = FALSE,
save_data = FALSE, file_name = NULL, dir_path = tempdir())
dat: A data frame with x and target.
target: The name of the target variable.
x_list: A list of x variables.
obs_id: The name of the ID of observations. Default is NULL.
occur_time: The name of the occur time of observations. Default is NULL.
pos_flag: The value of the positive class of the target variable, default: "1".
miss_values: Other extreme values that might be used to represent missing values, e.g. -9999, -9998. These miss_values will be encoded to -1 or "Missing".
ex_cols: A list of excluded variables. Default is NULL.
outlier_proc: Logical, process outliers or not. Default is TRUE.
missing_proc: Logical, process NAs or not. Default is TRUE.
low_var: Logical, delete low variance variables or not. Default is TRUE.
one_hot: Logical. If TRUE, one-hot encoding of category variables. Default is FALSE.
parallel: Logical, parallel computing or not. Default is FALSE.
note: Logical. Outputs info. Default is FALSE.
save_data: Logical, save the result or not. Default is FALSE.
file_name: The name for the periodically saved data file. Default is NULL.
dir_path: The path for the periodically saved data file. Default is tempdir().
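The encoding described for miss_values can be sketched in base R as follows. The helper `encode_miss_values` is hypothetical and is not part of creditmodel; the sketch assumes that numeric columns receive -1 and character columns receive "Missing", and it leaves existing NA values untouched.

```r
# Hypothetical helper for illustration; not the package's own code.
# Assumption: numeric vectors get -1, character vectors get "Missing".
encode_miss_values <- function(x, miss_values) {
  if (is.numeric(x)) {
    x[x %in% miss_values] <- -1
  } else {
    x <- as.character(x)
    x[x %in% as.character(miss_values)] <- "Missing"
  }
  x
}

amount  <- c(120, -9999, 300, NA, -9998)
channel <- c("web", "-9999", NA, "store")

amount_enc  <- encode_miss_values(amount,  miss_values = c(-9999, -9998))
channel_enc <- encode_miss_values(channel, miss_values = c(-9999, -9998))
print(amount_enc)
print(channel_enc)
```

Mapping sentinel codes to a single marker keeps them from being read as genuine extreme values by later steps such as outlier processing.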
A preprocessed data.frame
remove_duplicated, null_blank_na, entry_rate_na, low_variance_filter, process_nas, process_outliers
# NOT RUN {
library(creditmodel)  # provides data_cleansing() and the UCICreditCard data set

# data cleaning on the first 2000 rows of UCICreditCard
dat_cl <- data_cleansing(dat = UCICreditCard[1:2000, ],
                         target = "default.payment.next.month",
                         x_list = NULL,
                         obs_id = "ID",
                         occur_time = "apply_date",
                         ex_cols = c("PAY_6|BILL_"),
                         outlier_proc = TRUE,
                         missing_proc = TRUE,
                         one_hot = FALSE,
                         low_var = TRUE,
                         save_data = FALSE)
# }