HHDecisionTreeCore: Common function for all hhcartr model functions.

Description

This internal function provides a common interface for all hhcartr model functions. At the time of writing these are HHDecisionTreeClassifier and HHDecisionTreeRegressor. The following parameters are supported; not all of them are common to both the classifier and regressor models, so consult the documentation for each model.

Usage

HHDecisionTreeCore(
  response = "classify",
  n_min = 2,
  min_node_impurity = 0.2,
  n_trees = 1,
  n_folds = 5,
  sample_size = 1,
  testSize = 0.2,
  sampleWithReplacement = FALSE,
  useIdentity = FALSE,
  dataDescription = "Unknown",
  max_features = "None",
  pruning = FALSE,
  parallelize = FALSE,
  number_cpus = 1,
  show_progress = FALSE,
  seed = seed,
  control = control,
  prune_control = prune_control,
  debug_msgs = FALSE
)

Arguments

response

The response parameter is used to specify what type of model to build, either 'classify' for a classification tree model or 'regressor' for a regression tree model. The default is 'classify'.

n_min

The n_min parameter is used to stop splitting a node when a minimum number of samples at that node has been reached. The default value is 2.

min_node_impurity

The min_node_impurity parameter is used to stop splitting a node if the node impurity at that node is less than this value. The node impurity is calculated using the hyperplane Gini index. The default value is 0.2.
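The two stopping parameters n_min and min_node_impurity can be pictured with a small base R sketch. It uses the standard Gini impurity as a stand-in, since the hyperplane Gini index used by hhcartr may be computed differently. The helper names gini_impurity and should_stop are hypothetical, and treating "reached" as "less than or equal to n_min" is an assumption.

```r
# Standard Gini impurity of a vector of class labels (illustrative only;
# hhcartr's hyperplane Gini index may be computed differently).
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions at the node
  1 - sum(p^2)
}

# Hypothetical helper: TRUE when a node should not be split any further.
# Assumption: the node stops once it holds n_min samples or fewer.
should_stop <- function(labels, n_min = 2, min_node_impurity = 0.2) {
  length(labels) <= n_min || gini_impurity(labels) < min_node_impurity
}

balanced <- c(rep("a", 4), rep("b", 4))   # impurity 0.5, 8 samples
skewed   <- c(rep("a", 9), "b")           # impurity 0.18, 10 samples

should_stop(balanced)   # FALSE: enough samples and impurity above 0.2
should_stop(skewed)     # TRUE: impurity is below 0.2
```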

n_trees

The n_trees parameter is used to specify the number of trees to grow per fold or trial. The default value is 1.

n_folds

The n_folds parameter is used to specify the number of folds to use, i.e. the input data is split into n_folds portions of equal size. In each of n_folds rounds, one portion of the input data is used as the test dataset and the remaining n_folds - 1 portions as the training dataset, and the model is trained using these training and test datasets. Once training is complete, the next fold is treated as the test dataset and the remainder as the training dataset, and the model is trained again. This process is repeated until every fold of the input dataset has been used as a test dataset. The default value is 5.
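The fold rotation described for n_folds can be sketched in base R as follows. This only illustrates the splitting scheme and is not hhcartr's internal code; the variable names fold_id, test_idx and train_idx, the toy row count, and the random assignment of rows to folds are all assumptions.

```r
set.seed(42)     # arbitrary seed, for reproducibility of the sketch only
n <- 20          # toy number of rows in the input data
n_folds <- 5

# Assign each row to one of n_folds folds of equal size, in random order.
fold_id <- sample(rep(seq_len(n_folds), length.out = n))

for (k in seq_len(n_folds)) {
  test_idx  <- which(fold_id == k)   # fold k is the test dataset
  train_idx <- which(fold_id != k)   # the other n_folds - 1 folds are training data
  cat(sprintf("fold %d: %d train rows, %d test rows\n",
              k, length(train_idx), length(test_idx)))
}
```

Each row appears in exactly one test fold, so every portion of the data is used for testing once.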

sample_size

The sample_size parameter determines how much of the training dataset is actually used during training. A value of 1.0 allows all of the current training dataset to be used for training. A value of less than one means that this proportion of the training dataset is selected at random and then used for training. The sampleWithReplacement parameter determines whether the random sampling of the training dataset is performed with replacement or not. The default value is 1.0.
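As a rough illustration of how sample_size and sampleWithReplacement interact, the base R sketch below draws row indices from a training set. The helper sample_training_rows is hypothetical, and rounding the sample count down with floor() is an assumption; hhcartr may round differently.

```r
# Hypothetical helper: choose which training rows are actually used.
sample_training_rows <- function(n_train, sample_size = 1,
                                 sampleWithReplacement = FALSE) {
  n_used <- floor(sample_size * n_train)   # rounding rule is an assumption
  sample(seq_len(n_train), size = n_used, replace = sampleWithReplacement)
}

set.seed(1)      # arbitrary seed, for reproducibility of the sketch only
idx_without <- sample_training_rows(100, sample_size = 0.5)
idx_with    <- sample_training_rows(100, sample_size = 1,
                                    sampleWithReplacement = TRUE)

length(idx_without)          # 50 rows drawn
anyDuplicated(idx_without)   # 0: no row is drawn twice without replacement
length(idx_with)             # 100 draws, where rows may repeat
```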

testSize

The testSize parameter determines how much of the input dataset is to be used as the test dataset. The remainder is used as the training dataset. This parameter is only used when the parameter n_folds=1. For values of n_folds greater than one, the computed fold size will govern the test dataset size used (see the n_folds parameter for more details). The default value is 0.2.
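For the n_folds = 1 case, a single hold-out split governed by testSize might look like the base R sketch below. The variable names and the use of round() to turn the proportion into a row count are assumptions made for illustration.

```r
set.seed(7)      # arbitrary seed, for reproducibility of the sketch only
n <- 50          # toy number of rows in the input data
testSize <- 0.2

n_test    <- round(testSize * n)                 # rounding rule is an assumption
test_idx  <- sample(seq_len(n), n_test)          # rows held out as the test dataset
train_idx <- setdiff(seq_len(n), test_idx)       # the remainder is the training dataset

length(test_idx)    # 10
length(train_idx)   # 40
```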

sampleWithReplacement

The sampleWithReplacement parameter is used in conjunction with the sample_size parameter. It determines whether sampling from the training dataset is done with or without replacement. The default value is FALSE.

useIdentity

The useIdentity parameter when set TRUE will result in hhcartr using the original training data to find the optimal splits rather than using the reflected data. The default value is FALSE.

dataDescription

The dataDescription parameter is a short description of the dataset being modelled. It is used in output displays and plots as documentation. The default value is "Unknown".

max_features

The max_features parameter determines the number of features to consider when looking for the best split. It takes one of a fixed set of values, such as "sqrt" or "None". The default value shown in the Usage section is "None".

pruning

The pruning parameter when set TRUE specifies that the induced tree is to be pruned after induction. The default value is FALSE.

parallelize

The parallelize parameter when set TRUE will allow selected loops to be run in parallel. (This functionality has yet to be fully tested). The default value is FALSE.

number_cpus

The number of available CPUs to use when the parallelize parameter is set to TRUE. The maximum number of CPUs used will be the number of physical CPUs available (as returned by the detectCores() function of the parallel package) minus one. The default value is 1.
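The cap on number_cpus can be sketched with the parallel package as follows. The helper cpus_to_use is hypothetical, and falling back to a single CPU when detectCores() returns NA or reports only one core is an assumption rather than documented hhcartr behaviour.

```r
library(parallel)

# Hypothetical helper: cap the requested CPU count at (physical cores - 1).
cpus_to_use <- function(number_cpus = 1) {
  physical <- detectCores(logical = FALSE)   # physical cores, where supported
  if (is.na(physical)) physical <- 1         # detectCores() can return NA
  max(1, min(number_cpus, physical - 1))     # never drop below one CPU
}

cpus_to_use(1)     # 1
cpus_to_use(64)    # at most the number of physical cores minus one
```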

show_progress

The show_progress parameter when set TRUE will cause progress messages to be displayed as trees are induced. A value of FALSE will result in no progress messages being displayed. The default value is FALSE.

seed

The seed used to initialise the random number generator (RNG). Acceptable values are 1-9999. If no value is specified, a random integer in the range 1-9999 is used.
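The seed handling described above could be written in base R along these lines. The helper resolve_seed is hypothetical, as is the choice of NULL to mean "no value specified"; the range check mirrors the documented 1-9999 limits.

```r
# Hypothetical helper: return the seed to use, drawing one if none is given.
resolve_seed <- function(seed = NULL) {
  if (is.null(seed)) {
    seed <- sample(1:9999, 1)   # random integer in the documented range
  }
  stopifnot(is.numeric(seed), length(seed) == 1, seed >= 1, seed <= 9999)
  as.integer(seed)
}

s <- resolve_seed()   # no seed supplied, so one is drawn at random
set.seed(s)           # seed the RNG with the resolved value
resolve_seed(123)     # 123
```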

control

Default value mni.control(n_folds = 5). The control parameter is used to specify parameters for the mni.control function. See documentation for mni.control for supported parameters.

prune_control

Default value prune.control(prune_type = "all", prune_fatbears_max_nodes = 10, prune_fatbears_iterations = 1000). The prune_control parameter is used to specify parameters for the prune.control function. This parameter is only used when 'pruning = TRUE'. See documentation for prune.control for supported parameters.

debug_msgs

Not fully implemented yet but will turn on debug messages.

classify

The classify parameter when set TRUE indicates that the data is for building a classification model. When set FALSE, a regression model will be induced.

Value

A list of the trees induced during training. These are saved in the global environment variable pkg.env$folds_trees.