data_process: Process tabular data into the format for the PIE model

Description

This function takes a tabular dataset and its metadata (which columns are numerical and which are categorical) and outputs k-fold cross-validation datasets, with splines applied to the numerical features in order to capture non-linear relationships among them. Within this function, the numerical features and the target variable are normalized, and the columns are reorganized into the order (numerical features, categorical features, target).
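The normalize-then-spline step described above can be sketched for a single numerical column as follows. This is only an illustration, not the package's internal code: it assumes z-score standardization via `scale()` and a natural cubic spline basis from `splines::ns()`, and the feature `x` is made-up data.

```r
library(splines)  # ships with base R

set.seed(1)
x <- rnorm(100, mean = 10, sd = 3)  # hypothetical numerical feature

# Assumption: "normalized" means zero mean and unit variance
x_scaled <- as.numeric(scale(x))

# Expand the column into a natural spline basis with spline_num
# degrees of freedom, giving spline_num columns per feature
spline_num <- 5
x_spl <- ns(x_scaled, df = spline_num)

dim(x_spl)  # 100 rows, spline_num columns
```

Each numerical feature would contribute `spline_num` columns to the splined datasets in this sketch, while the original (non-splined) datasets keep one column per feature.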

Usage

data_process(
  X,
  y,
  num_col,
  cat_col,
  y_col,
  k = 5,
  validation_rate = 0.2,
  spline_num = 5,
  random_seed = 1
)

Value

A list containing:

spl_train_X: A list of splined training datasets in which every numerical feature is expanded into `spline_num` columns. The list has `k` elements, one per fold.
orig_train_X: A list of original training datasets in which the numerical features keep their original format. The list has `k` elements, one per fold.
train_y: A list of vectors holding the target variable for the training datasets. The list has `k` elements, one per fold.
spl_validation_X: A list of splined validation datasets in which every numerical feature is expanded into `spline_num` columns. The list has `k` elements, one per fold. It may be None when `validation_rate == 0`.
orig_validation_X: A list of original validation datasets in which the numerical features keep their original format. The list has `k` elements, one per fold. It may be None when `validation_rate == 0`.
validation_y: A list of vectors holding the target variable for the validation datasets. The list has `k` elements, one per fold. It may be None when `validation_rate == 0`.
spl_test_X: A list of splined testing datasets in which every numerical feature is expanded into `spline_num` columns. The list has `k` elements, one per fold.
orig_test_X: A list of original testing datasets in which the numerical features keep their original format. The list has `k` elements, one per fold.
test_y: A list of vectors holding the target variable for the testing datasets. The list has `k` elements, one per fold.
lasso_group: A vector of consecutive integers describing the grouping of the coefficients.
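To make the idea of a group vector concrete, here is one plausible layout. It is an assumption, not something this page confirms: the sketch supposes that the `spline_num` spline columns of each numerical feature share one group id and that each categorical column forms its own group. The counts `n_num` and `n_cat` are hypothetical.

```r
n_num <- 3        # hypothetical number of numerical features
n_cat <- 2        # hypothetical number of categorical columns
spline_num <- 5   # spline columns per numerical feature

# Assumption: all spline columns of one feature belong to the same group
num_groups <- rep(seq_len(n_num), each = spline_num)

# Assumption: each categorical column gets the next unused group id
cat_groups <- n_num + seq_len(n_cat)

lasso_group_sketch <- c(num_groups, cat_groups)
lasso_group_sketch
# 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 5
```

Under these assumptions, the vector has one entry per column of the splined design matrix, so a group lasso penalty can keep or drop all spline terms of a feature together.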

Arguments

X: Feature columns of the dataset.
y: Target column of the dataset.
num_col: Indices of the columns that are numerical features.
cat_col: Indices of the columns that are categorical features.
y_col: Index of the column that is the response.
k: Number of folds for the cross-validation setup. By default `k = 5`.
validation_rate: Proportion of the training dataset held out for validation. By default `validation_rate = 0.2`.
spline_num: Degrees of freedom for the natural splines. By default `spline_num = 5`.
random_seed: Random seed for the cross-validation data split. By default `random_seed = 1`.

Details

The function generates cross-validation datasets suitable for the PIE model. The output contains training, validation, and testing datasets, together with a group indicator for group lasso. When `k = 5`, each fold splits the data 80/20 into training and testing sets. When `validation_rate = 0.2`, 20% of the training data is held out for validation. Setting `validation_rate = 0` generates only training and testing data, with no validation data.
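The split proportions can be checked with a small base R sketch. It is not the package's implementation: the random fold assignment and the rounding with `floor()` are assumptions, and `n = 200` is chosen only to match the demonstration size used in the example.

```r
set.seed(1)
n <- 200                 # number of rows
k <- 5                   # number of folds
validation_rate <- 0.2   # share of the training rows held out

# Assign every row to one of k equally sized folds at random
fold_id <- sample(rep(seq_len(k), length.out = n))

# Take fold 1 as the testing set; the rest is the training pool
test_idx <- which(fold_id == 1)
rest_idx <- which(fold_id != 1)

# Hold out validation_rate of the training pool for validation
n_val <- floor(validation_rate * length(rest_idx))
val_idx <- sample(rest_idx, n_val)
train_idx <- setdiff(rest_idx, val_idx)

c(train = length(train_idx),
  validation = length(val_idx),
  test = length(test_idx))
# train 128, validation 32, test 40
```

With these settings, 40 of the 200 rows (20%) go to testing, and 32 of the remaining 160 rows (20%) go to validation, leaving 128 rows for training in each fold.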

Examples

# Load the training data
data("winequality")

# Which columns are numerical?
num_col <- 1:11
# Which columns are categorical?
cat_col <- 12
# Which column is the response?
y_col <- ncol(winequality)

# Data processing (only the first 200 rows are used for demonstration)
dat <- data_process(X = as.matrix(winequality[1:200, -y_col]),
  y = winequality[1:200, y_col],
  num_col = num_col, cat_col = cat_col, y_col = y_col)
