PIE (version 1.0.0)

data_process: Process tabular data into the format for the PIE model

Description

This function takes a tabular dataset and its metadata (which columns are numerical and which are categorical), then outputs a k-fold cross-validation dataset in which the numerical features are expanded with splines in order to capture non-linear relationships. Within this function, the numerical features and the target variable are normalized, and the columns are reorganized into the order (numerical features, categorical features, target).
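The preprocessing described above can be sketched in a few lines of base R. This is only an illustration, not the package's actual code: the toy data frame, the use of `scale()` for normalization, and the use of `splines::ns()` for the natural-spline basis are all assumptions.

```r
library(splines)  # distributed with R; provides ns()

set.seed(1)
# Hypothetical toy data: two numerical features, one categorical, one target
toy <- data.frame(
  x1  = rnorm(50),
  x2  = runif(50, 0, 10),
  grp = sample(0:1, 50, replace = TRUE),
  y   = rnorm(50, mean = 3)
)
spline_num <- 5

# Normalize numerical features and the target (z-score scaling is assumed)
num_scaled <- scale(toy[, c("x1", "x2")])
y_scaled   <- as.numeric(scale(toy$y))

# Expand each numerical feature into spline_num natural-spline basis columns
spl_list <- lapply(colnames(num_scaled), function(nm) {
  basis <- ns(num_scaled[, nm], df = spline_num)
  colnames(basis) <- paste0(nm, "_s", seq_len(spline_num))
  basis
})
spl_X <- cbind(do.call(cbind, spl_list), grp = toy$grp)

# Original-format data in the order (numerical, categorical, target)
orig <- cbind(num_scaled, grp = toy$grp, y = y_scaled)

dim(spl_X)  # 50 rows, 2 * 5 + 1 = 11 columns
```

Each numerical column becomes `spline_num` basis columns, while the categorical column is passed through unchanged.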

Usage

data_process(
  X,
  y,
  num_col,
  cat_col,
  y_col,
  k = 5,
  validation_rate = 0.2,
  spline_num = 5,
  random_seed = 1
)

Value

A list containing:

spl_train_X

A list of splined training datasets in which every numerical feature is expanded into `spline_num` columns. The list has `k` elements, one per fold.

orig_train_X

A list of original training datasets in which the numerical features remain in their original format. The list has `k` elements, one per fold.

train_y

A list of vectors holding the target variable for the training datasets. The list has `k` elements, one per fold.

spl_validation_X

A list of splined validation datasets in which every numerical feature is expanded into `spline_num` columns. The list has `k` elements, one per fold. It may be None when `validation_rate == 0`.

orig_validation_X

A list of original validation datasets in which the numerical features remain in their original format. The list has `k` elements, one per fold. It may be None when `validation_rate == 0`.

validation_y

A list of vectors holding the target variable for the validation datasets. The list has `k` elements, one per fold. It may be None when `validation_rate == 0`.

spl_test_X

A list of splined testing datasets in which every numerical feature is expanded into `spline_num` columns. The list has `k` elements, one per fold.

orig_test_X

A list of original testing datasets in which the numerical features remain in their original format. The list has `k` elements, one per fold.

test_y

A list of vectors holding the target variable for the testing datasets. The list has `k` elements, one per fold.

lasso_group

A vector of consecutive integers describing the grouping of the coefficients
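As a rough illustration of what such a grouping vector could look like, the sketch below assumes that the `spline_num` columns derived from one numerical feature share a group id and that each categorical column forms its own group. This layout and the helper names `n_num` and `n_cat` are assumptions for illustration; the page does not confirm them.

```r
# Hypothetical layout: 3 numerical features, 2 categorical features
n_num <- 3
n_cat <- 2
spline_num <- 5

# Spline columns from the same numerical feature share one group id;
# each categorical column is assumed to be a group of its own
lasso_group <- c(
  rep(seq_len(n_num), each = spline_num),
  n_num + seq_len(n_cat)
)
lasso_group
```

Under these assumptions the vector has `n_num * spline_num + n_cat` entries, one per column of the splined design matrix.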

Arguments

X

Feature columns in dataset

y

Target column in dataset

num_col

Index of the columns that are numerical features

cat_col

Index of the columns that are categorical features.

y_col

Index of the column that is the response.

k

Number of folds for the cross-validation dataset setup. By default `k = 5`.

validation_rate

Proportion of the training dataset held out for validation. By default `validation_rate = 0.2`.

spline_num

The degrees of freedom for the natural splines. By default `spline_num = 5`.

random_seed

Random seed for the cross-validation data split. By default `random_seed = 1`.

Details

The function generates a cross-validation dataset suitable for the PIE model. The output contains the training, validation, and testing datasets, together with the group indicator for group lasso. When `k = 5`, each fold splits the data 80/20 into training and testing sets. When `validation_rate = 0.2`, 20% of the training data is held out for validation. Setting `validation_rate = 0` generates only training and testing data, without validation data.
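The split proportions described here can be reproduced with a minimal base R sketch. It is not the package's implementation: the random fold assignment, the rounding of the validation size, and the names `fold_id` and `splits` are assumptions.

```r
set.seed(1)
n <- 200
k <- 5
validation_rate <- 0.2

# Assign each row to one of k folds at random
fold_id <- sample(rep(seq_len(k), length.out = n))

splits <- lapply(seq_len(k), function(i) {
  test_idx <- which(fold_id == i)
  rest_idx <- which(fold_id != i)
  # Hold out a share of the non-test rows for validation
  n_val   <- floor(validation_rate * length(rest_idx))
  val_idx <- if (n_val > 0) sample(rest_idx, n_val) else integer(0)
  list(
    train      = setdiff(rest_idx, val_idx),
    validation = val_idx,
    test       = test_idx
  )
})

# Rows: train / validation / test sizes; columns: folds
sapply(splits, lengths)
```

With 200 rows, each fold has 40 test rows and 160 remaining rows, of which 32 go to validation and 128 to training. With `validation_rate <- 0` the validation index is empty and all 160 remaining rows are used for training.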

Examples

# \donttest{
# Load the training data
data("winequality")

# Which columns are numerical?
num_col <- 1:11
# Which columns are categorical?
cat_col <- 12
# Which column is the response?
y_col <- ncol(winequality)

# Data Processing (the first 200 rows are sampled for demonstration)
dat <- data_process(X = as.matrix(winequality[1:200, -y_col]), 
  y = winequality[1:200, y_col], 
  num_col = num_col, cat_col = cat_col, y_col = y_col)
# }
