splits_selection: Split dataset and select variables

Description

Split dataset into training data and testing data and select variables based on relative importance.

Usage

splits_selection(data,split_ratio,split_seed,
feature_model,imbalance,nfolds,
RAN_type,RAN.seed,smote.seed,
xcol_enter,distribution)

Arguments

data

A data.frame used to build models

split_ratio

A numeric value indicating the proportion of total rows contained in each split. Must be less than 1.

split_seed

Random seed for splitting

feature_model

Name of the model used for feature selection. Currently only "gbm" (gradient boosted trees) and "rf" (random forest) are allowed.

imbalance

Logical, or "SMOTE" (for a categorical response). TRUE balances the training data class counts via over/under-sampling when building the model. "SMOTE" applies SMOTE and returns the SMOTE training data.

nfolds

Number of folds for K-fold cross-validation. Default: 5.

RAN_type

"both", "binominal" or "normal". "both" generates both binomial and normal random terms for feature selection; "binominal" or "normal" generates only that one type of random term. Categorical or continuous variables whose relative importance is greater than that of the corresponding random term(s) are selected.

RAN.seed

Random seed for random term(s)

smote.seed

Random seed for SMOTE. Only used if argument "imbalance"="SMOTE"

xcol_enter

A character vector naming the variables that are required to enter the model, also called "forced entry". If xcol_enter contains the names of all independent variables, random terms are not used to select variables.

distribution

Distribution type. Must be one of: "AUTO", "bernoulli", "quasibinomial", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber", "custom". Defaults to AUTO.
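
To illustrate how split_ratio, split_seed and imbalance = TRUE might interact, here is a minimal base-R sketch. It is not the package's implementation. The helpers split_by_ratio and oversample_minority are hypothetical, split_ratio is assumed to be the fraction of rows that go to the training set, and plain random oversampling stands in for whatever balancing the function performs internally (it is not SMOTE).

```r
# Hypothetical helper: seeded train/test split.
# Assumption: split_ratio is the fraction of rows assigned to training.
split_by_ratio <- function(data, split_ratio = 0.8, split_seed = 123) {
  stopifnot(split_ratio > 0, split_ratio < 1)
  set.seed(split_seed)
  all_idx   <- seq_len(nrow(data))
  train_idx <- sample(all_idx, size = floor(split_ratio * nrow(data)))
  test_idx  <- setdiff(all_idx, train_idx)
  list(train = data[train_idx, , drop = FALSE],
       test  = data[test_idx, , drop = FALSE])
}

# Hypothetical helper: randomly re-draw rows of the smaller classes
# until every class has as many rows as the largest one.
oversample_minority <- function(train, ycol, seed = 123) {
  set.seed(seed)
  # as.character() drops factor levels that are absent from the training set
  idx_by_class <- split(seq_len(nrow(train)), as.character(train[[ycol]]))
  target <- max(lengths(idx_by_class))
  balanced_idx <- unlist(lapply(idx_by_class, function(idx) {
    extra <- target - length(idx)
    c(idx, idx[sample.int(length(idx), extra, replace = TRUE)])
  }), use.names = FALSE)
  train[balanced_idx, , drop = FALSE]
}

# Toy data with an 80/20 class imbalance
toy <- data.frame(y = factor(rep(c("a", "b"), times = c(80, 20))),
                  x = seq_len(100))
parts    <- split_by_ratio(toy, split_ratio = 0.8, split_seed = 1)
balanced <- oversample_minority(parts$train, "y")
table(balanced$y)
```

Only the training portion is balanced here; the test rows are left untouched so that evaluation reflects the original class distribution.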

Value

importance

A data.frame containing the relative importance scores of selected variables.

train_data

Training dataset. If "imbalance"="SMOTE", it returns the SMOTE training set.

test_data

Testing dataset.

raw_traindata

The original training dataset. If "imbalance"="SMOTE", it returns the training set as it was before SMOTE was applied.

Details

This function uses random terms to select variables. Random terms are added to the predictors, and variables whose relative importance is greater than that of the random term are considered truly important.
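
The idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the helper select_by_random_term and the column names RAN_normal and RAN_binomial are hypothetical, absolute correlation with the response stands in for the relative importance that a "gbm" or "rf" model would report, and the rule that 0/1 predictors are compared with the binomial term while all others are compared with the normal term is an assumed reading of "corresponding random term(s)".

```r
# Sketch of random-term feature selection (hypothetical helper).
select_by_random_term <- function(data, ycol, xcol_enter = character(0),
                                  RAN.seed = 123) {
  set.seed(RAN.seed)
  n <- nrow(data)
  # Random terms are unrelated to the response by construction.
  data$RAN_normal   <- rnorm(n)
  data$RAN_binomial <- rbinom(n, size = 1, prob = 0.5)

  # Stand-in importance: absolute correlation with the response.
  xcols <- setdiff(names(data), ycol)
  importance <- vapply(xcols,
                       function(v) abs(cor(data[[v]], data[[ycol]])),
                       numeric(1))

  # Compare each real predictor with the random term of matching type.
  candidates <- setdiff(xcols, c("RAN_normal", "RAN_binomial"))
  is_binary  <- vapply(candidates,
                       function(v) all(data[[v]] %in% c(0, 1)),
                       logical(1))
  threshold  <- ifelse(is_binary,
                       importance[["RAN_binomial"]],
                       importance[["RAN_normal"]])
  keep <- candidates[which(importance[candidates] > threshold)]

  # Forced-entry variables are retained regardless of importance.
  list(importance = sort(importance, decreasing = TRUE),
       selected   = union(xcol_enter, keep))
}

# Toy data: x_signal drives y, x_noise does not, flag is forced in.
set.seed(42)
toy <- data.frame(x_signal = rnorm(500),
                  x_noise  = rnorm(500),
                  flag     = rbinom(500, 1, 0.5))
toy$y <- 2 * toy$x_signal + rnorm(500)
res <- select_by_random_term(toy, ycol = "y", xcol_enter = "flag")
res$selected
```

A variable that is pure noise can still beat the random term by chance, so the random term acts as a data-driven threshold rather than a guarantee.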

Examples


# NOT RUN {
library(survival)
library(h2o)
library(performanceEstimation)
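data("lung")
# Transform the raw data before modelling
data <- datatrans(lung,factor_dummy = 'dummy',rescale = TRUE)
# Move column 3 of the transformed data to the front
data <- data[,c(3,1,2,4:14)]
# Start a local H2O cluster
h2o.init()
selection <- splits_selection(data,imbalance = 'SMOTE')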
h2o.shutdown(prompt=FALSE)
Sys.sleep(2)
# }
