Takes the original data frame of covariates as an input (which may or
may not be numeric), and converts it into a numericized data frame by
applying either Binary or Numeric Encoding.
Binary Encoding of categorical features is recommended for
tree ensembles when the cardinality of a categorical feature is >= 1000;
Numeric Encoding of categorical features is recommended for
tree ensembles when the cardinality of a categorical feature is < 1000.
For more information about Binary and Numeric Encoding and their
effectiveness at different cardinalities, please visit:
https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
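To make the difference between the two encodings concrete, here is a minimal base R sketch. It is an illustration only and is not taken from the forestRK source: the toy vector `colour` and the bit-column names are made up, and x.organizer may order or name its output differently.

```r
# Illustrative only: a base R sketch of the two encodings, not forestRK's code.
colour <- factor(c("red", "green", "blue", "green", "red"))

# Numeric Encoding: each category is replaced by a single integer code.
numeric_code <- as.integer(colour)

# Binary Encoding: write each integer code in base 2, one column per bit.
n_bits <- max(1L, ceiling(log2(nlevels(colour) + 1)))
bits <- lapply(numeric_code, function(code) {
  as.integer(intToBits(code))[n_bits:1]  # most significant bit first
})
binary_code <- matrix(unlist(bits), ncol = n_bits, byrow = TRUE)
colnames(binary_code) <- paste0("colour_bit", seq_len(n_bits))

print(numeric_code)
print(binary_code)
```

Numeric Encoding always yields one column per feature, while Binary Encoding yields roughly log2(cardinality) columns, so the number of columns grows slowly even for features with very many categories.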
NOTE: To use the other functions in the forestRK package, the
numericized data frame of covariates (the x.organizer object) must
contain no missing records; that is, you must remove every record
containing NA or NaN before applying the x.organizer function.
The data cleaning process with x.organizer() is summarized below:
1. Remove all records containing NA or NaN from the data at hand.
2. Split the data into a data frame that contains the covariates of
ALL data points (BOTH training and test observations) and a
vector that contains the class types of the training observations.
3. Apply x.organizer to the full data frame of covariates
of all observations.
4. Split the x.organizer output into a training set and
a test set, as needed.
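The four steps above can be sketched in base R as follows. This is a rough outline under stated assumptions: the toy data, the column names, the choice of training rows, and the helper `numericize()` are all hypothetical. `numericize()` is only a stand-in that mimics Numeric Encoding so the sketch runs without the package; in real use, step 3 is where x.organizer would be called (see ?x.organizer for its actual arguments).

```r
# Toy data: column names and values are made up for illustration.
dat <- data.frame(
  height = c(1.2, 3.4, NA, 2.2, 5.1, 4.0),
  colour = c("red", "blue", "red", NA, "green", "blue"),
  class  = c("a", "b", "a", "b", "a", "b"),
  stringsAsFactors = TRUE
)

# Step 1: drop every record that contains NA or NaN.
dat <- dat[complete.cases(dat), ]

# Step 2: covariates of ALL observations, and class labels of the training rows.
train_idx <- c(1, 2, 3)  # hypothetical choice of training rows
x_all <- dat[, c("height", "colour")]
y_train <- dat$class[train_idx]

# Step 3: numericize the full covariate data frame.
# Hypothetical stand-in for x.organizer(); it applies a simple Numeric Encoding.
numericize <- function(x) {
  as.data.frame(lapply(x, function(col) {
    if (is.numeric(col)) col else as.integer(as.factor(col))
  }))
}
x_num <- numericize(x_all)

# Step 4: split the numericized covariates into training and test sets.
x_train <- x_num[train_idx, ]
x_test <- x_num[-train_idx, ]
```

Encoding the covariates of all observations together, before splitting, means the training and test sets share the same category-to-number mapping.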
PROPER DATA CLEANING IS ABSOLUTELY NECESSARY FOR forestRK FUNCTIONS TO WORK!