
dendroTools (version 1.0.0)

compare_methods

Description

Calculates performance metrics for training and test data of different regression methods: multiple linear regression (MLR), artificial neural networks with the Bayesian regularization training algorithm (BRNN), M5P model trees (MT), model trees with bagging (BMT) and random forest of regression trees (RF). With the methods argument, a subset of methods of interest can be specified. The calculated performance metrics are the correlation coefficient (r), the root mean squared error (RMSE), the root relative squared error (RRSE), the index of agreement (d), the reduction of error (RE), the coefficient of efficiency (CE), the detrended efficiency (DE) and mean bias.

Usage

compare_methods(formula, dataset, k = 10, repeats = 2, optimize = TRUE,
  dataset_complete = NULL, BRNN_neurons = 1, MT_M = 4, MT_N = FALSE,
  MT_U = FALSE, MT_R = FALSE, BMT_P = 100, BMT_I = 100, BMT_M = 4,
  BMT_N = FALSE, BMT_U = FALSE, BMT_R = FALSE, RF_P = 100, RF_I = 100,
  RF_depth = 0, seed_factor = 5, digits = 3, blocked_CV = FALSE,
  PCA_transformation = FALSE, log_preprocess = TRUE,
  components_selection = "automatic", eigenvalues_threshold = 1,
  N_components = 2, round_bias_cal = 15, round_bias_val = 4,
  n_bins = 30, edge_share = 0.1, MLR_stepwise = FALSE,
  stepwise_direction = "backward",
  methods = c("MLR", "BRNN", "MT", "BMT", "RF"),
  tuning_metric = "RMSE", BRNN_neurons_vector = c(1, 2, 3),
  MT_M_vector = c(4, 8, 16, 25), MT_N_vector = c(TRUE, FALSE),
  MT_U_vector = c(TRUE, FALSE), MT_R_vector = c(FALSE),
  BMT_P_vector = c(100), BMT_I_vector = c(100),
  BMT_M_vector = c(4, 8, 16, 25), BMT_N_vector = c(TRUE, FALSE),
  BMT_U_vector = c(TRUE, FALSE), BMT_R_vector = c(FALSE),
  RF_P_vector = c(100), RF_I_vector = c(100),
  RF_depth_vector = c(0, 2), holdout = NULL, holdout_share = 0.1,
  holdout_manual = NULL, total_reproducibility = FALSE)

Arguments

formula

an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted.

dataset

a data frame with dependent and independent variables as columns and (optional) years as row names.

k

number of folds for cross-validation

repeats

number of cross-validation repeats; should be 2 or more.

optimize

if set to TRUE, the package caret will be used to tune parameters for regression methods

dataset_complete

optional, a data frame with the full length of tree-ring parameter, which will be used to reconstruct the climate variable specified with the formula argument.

BRNN_neurons

number of neurons to be used for the brnn method

MT_M

minimum number of instances used by model trees

MT_N

unpruned (argument for model trees)

MT_U

unsmoothed (argument for model trees)

MT_R

use regression trees (argument for model trees)

BMT_P

bagSizePercent (argument for bagging of model trees)

BMT_I

number of iterations (argument for bagging of model trees)

BMT_M

minimum number of instances used by model trees

BMT_N

unpruned (argument for bagging of model trees)

BMT_U

unsmoothed (argument for bagging of model trees)

BMT_R

use regression trees (argument for bagging of model trees)

RF_P

bagSizePercent (argument for random forest)

RF_I

number of iterations (argument for random forest)

RF_depth

maxDepth (argument for random forest)

seed_factor

an integer that will be used to change the seed options for different repeats.

digits

integer, the number of digits to be displayed in the final result tables

blocked_CV

default is FALSE, if changed to TRUE, blocked cross-validation will be used to compare regression methods.

PCA_transformation

if set to TRUE, all independent variables will be transformed using PCA.

log_preprocess

if set to TRUE, variables will be log-transformed before being used in PCA

components_selection

character string specifying how to select the Principal Components used as predictors. There are three options: "automatic", "manual" and "plot_selection". If set to "automatic", all scores with eigenvalues above 1 will be selected; this threshold can be changed with the eigenvalues_threshold argument. If set to "manual", the user should set the number of components with the N_components argument. If set to "plot_selection", a scree plot will be shown and the user must manually enter the number of components to be used as predictors.
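A minimal sketch of the two non-automatic selection modes, assuming the package's example_dataset_1 and a deliberately small fold/repeat setup to keep the run short:

```r
library(dendroTools)
data(example_dataset_1)

# PCA on all predictors; keep the first two components manually instead of
# relying on the eigenvalue threshold
pca_experiment <- compare_methods(formula = MVA ~ .,
  dataset = example_dataset_1, k = 5, repeats = 2,
  PCA_transformation = TRUE, components_selection = "manual",
  N_components = 2, methods = c("MLR", "BRNN"))

pca_experiment$PCA_output  # the underlying princomp object
```

With components_selection = "plot_selection" the same call would pause, display a scree plot and ask for the number of components interactively.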

eigenvalues_threshold

threshold for automatic selection of Principal Components

N_components

number of Principal Components used as predictors

round_bias_cal

number of digits for bias in the calibration period. Affects the appearance of the final ggplot of mean bias for calibration data (element 5 of the output list)

round_bias_val

number of digits for bias in the validation period. Affects the appearance of the final ggplot of mean bias for validation data (element 6 of the output list)

n_bins

number of bins used for the histograms of mean bias

edge_share

the share of the data to be considered as the edge (extreme) data. This argument can be between 0.10 and 0.50. If set to 0.10, the lowest 5% and the highest 5% of values are considered to be the edge data.

MLR_stepwise

if set to TRUE, stepwise selection of predictors will be used for the MLR method

stepwise_direction

the mode of stepwise search, can be one of "both", "backward", or "forward", with a default of "backward".

methods

a vector of strings specifying the methods to be compared. The full vector is methods = c("MLR", "BRNN", "MT", "BMT", "RF"). To compare only a subset, pass a vector with the methods of interest.

tuning_metric

a string that specifies what summary metric will be used to select the optimal value of tuning parameters. By default, the argument is set to "RMSE". It is also possible to use "RSquared".

BRNN_neurons_vector

a vector of possible values for BRNN_neurons argument optimization

MT_M_vector

a vector of possible values for MT_M argument optimization

MT_N_vector

a vector of possible values for MT_N argument optimization

MT_U_vector

a vector of possible values for MT_U argument optimization

MT_R_vector

a vector of possible values for MT_R argument optimization

BMT_P_vector

a vector of possible values for BMT_P argument optimization

BMT_I_vector

a vector of possible values for BMT_I argument optimization

BMT_M_vector

a vector of possible values for BMT_M argument optimization

BMT_N_vector

a vector of possible values for BMT_N argument optimization

BMT_U_vector

a vector of possible values for BMT_U argument optimization

BMT_R_vector

a vector of possible values for BMT_R argument optimization

RF_P_vector

a vector of possible values for RF_P argument optimization

RF_I_vector

a vector of possible values for RF_I argument optimization

RF_depth_vector

a vector of possible values for RF_depth argument optimization

holdout

this argument is used to define observations that are excluded from cross-validation and hyperparameter optimization. The holdout argument must be a character string, one of "early", "late" or "manual". If "early" or "late" is specified, the early or late years will be used as holdout data; how many of them are used is controlled by the holdout_share argument. If holdout is set to "manual", supply a vector of years (or row names) to the holdout_manual argument; the defined years will be used as the holdout. For the holdout data, the same statistical measures are calculated as for the cross-validation. The holdout metrics are given in the output element $holdout_results.
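A minimal sketch of the manual holdout mode (the year range below is purely illustrative; use years that actually appear as row names in your dataset):

```r
library(dendroTools)
data(example_dataset_1)

# Keep six (hypothetical) years out of cross-validation and tuning entirely;
# they are used only for the final holdout evaluation
holdout_experiment <- compare_methods(formula = MVA ~ .,
  dataset = example_dataset_1, k = 5, repeats = 2,
  methods = c("MLR", "BRNN"),
  holdout = "manual", holdout_manual = 2005:2010)

holdout_experiment$holdout_results  # same metrics as for cross-validation
```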

holdout_share

the share of the whole dataset to be used as a holdout. Default is 0.10.

holdout_manual

a vector of years (or row names) to be used as a holdout.

total_reproducibility

logical, default is FALSE. This argument ensures total reproducibility despite the inclusion or exclusion of different methods. By default, optimization is performed only for the methods included in the methods vector. If a method is removed or added, the optimization phase differs, which affects all final cross-validation results. With total_reproducibility = TRUE, all methods are optimized, even those not included in the methods vector, and the final results are subset based on the methods vector. Setting total_reproducibility to TRUE results in a longer optimization phase as well.
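For example, a run restricted to two methods can be made to match the optimization of a full five-method run (a sketch, assuming example_dataset_1 is available):

```r
library(dendroTools)
data(example_dataset_1)

# All five methods are tuned behind the scenes, but only MLR and BRNN are
# reported; the cross-validation results therefore stay identical if other
# methods are later added to the methods vector
repro_experiment <- compare_methods(formula = MVA ~ .,
  dataset = example_dataset_1, k = 5, repeats = 2,
  methods = c("MLR", "BRNN"), total_reproducibility = TRUE)

repro_experiment$mean_std
```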

Value

a list with twelve elements:

1 $mean_std data frame with calculated metrics for the selected regression methods. For each regression method and each calculated metric, mean and standard deviation are given
2 $ranks data frame with ranks of calculated metrics: mean rank and share of rank_1 are given
3 $edge_results data frame with calculated performance metrics for the central-edge test. The central part of the data represents the calibration data, while the edge data, i.e. extreme values, represent the test/validation data. Different regression models are calibrated using the central data and validated for the edge (extreme) data. This test is particularly important to assess the performance of models for the predictions of the extreme data. The share of the edge (extreme) data is defined with the edge_share argument
4 $holdout_results calculated metrics for the holdout data
5 $bias_cal ggplot object of mean bias for calibration data
6 $bias_val ggplot object of mean bias for validation data
7 $transfer_functions ggplot or plotly object with transfer functions of methods
8 $transfer_functions_together ggplot or plotly object with transfer functions of methods plotted together
9 $parameter_values a data frame with specifications of parameters used for different regression methods
10 $PCA_output princomp object: the result output of the PCA analysis
11 $reconstructions ggplot object: reconstructed dependent variable based on the dataset_complete argument, facet is used to split plots by methods
12 $reconstructions_together ggplot object: reconstructed dependent variable based on the dataset_complete argument, all methods are plotted together

References

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Inc. 482 pp.

Breiman, L., 1996. Bagging predictors. Machine Learning 24, 123-140.

Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.

Burden, F., Winkler, D., 2008. Bayesian Regularization of Neural Networks, in: Livingstone, D.J. (ed.), Artificial Neural Networks: Methods and Applications, vol. 458. Humana Press, Totowa, NJ, pp. 23-42.

Hastie, T., Tibshirani, R., Friedman, J.H., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York, 745 pp.

Ho, T.K., 1995. Random decision forests, Proceedings of the Third International Conference on Document Analysis and Recognition Volume 1. IEEE Computer Society, pp. 278-282.

Hornik, K., Buchta, C., Zeileis, A., 2009. Open-source machine learning: R meets Weka. Comput. Stat. 24, 225-232.

Perez-Rodriguez, P., Gianola, D., 2016. brnn: Bayesian Regularization for Feed-forward Neural Networks. R package version 0.6.

Quinlan, J.R., 1992. Learning with Continuous Classes, Proceedings of the 5th Australian Joint Conference on Artificial Intelligence (AI '92). World Scientific, Hobart, pp. 343-348.

Examples

# An example with default settings of machine learning algorithms
experiment_1 <- compare_methods(formula = MVA ~ .,
  dataset = example_dataset_1, k = 10, repeats = 10, blocked_CV = TRUE,
  PCA_transformation = FALSE, components_selection = "automatic",
  optimize = TRUE, methods = c("MLR", "BRNN"), tuning_metric = "RSquared")
experiment_1$mean_std
experiment_1$ranks
experiment_1$bias_cal
experiment_1$bias_val
experiment_1$transfer_functions
experiment_1$transfer_functions_together
experiment_1$PCA_output
experiment_1$parameter_values

experiment_2 <- compare_methods(formula = MVA ~ T_APR,
  dataset = example_dataset_1, k = 5, repeats = 10, BRNN_neurons = 1,
  MT_M = 4, MT_N = FALSE, MT_U = FALSE, MT_R = FALSE, BMT_P = 100,
  BMT_I = 100, BMT_M = 4, BMT_N = FALSE, BMT_U = FALSE, BMT_R = FALSE,
  RF_P = 100, RF_I = 100, RF_depth = 0, seed_factor = 5)
experiment_2$mean_std
experiment_2$ranks
experiment_2$bias_cal
experiment_2$transfer_functions
experiment_2$transfer_functions_together
experiment_2$PCA_output

experiment_3 <- compare_methods(formula = MVA ~ .,
  dataset = example_dataset_1, k = 2, repeats = 5,
  methods = c("MLR", "BRNN", "MT", "BMT"),
  optimize = TRUE, MLR_stepwise = TRUE)
experiment_3$mean_std
experiment_3$ranks
experiment_3$bias_val
experiment_3$transfer_functions
experiment_3$transfer_functions_together
experiment_3$parameter_values

library(dendroTools)
library(ggplot2)
data(dataset_TRW)
comparison_TRW <- compare_methods(formula = T_Jun_Jul ~ TRW,
  dataset = dataset_TRW, k = 3, repeats = 10, optimize = TRUE,
  methods = c("MLR", "MT", "BMT", "BRNN"), seed_factor = 5,
  dataset_complete = dataset_TRW_complete, MLR_stepwise = TRUE,
  stepwise_direction = "backward")
comparison_TRW$mean_std
comparison_TRW$bias_val
comparison_TRW$transfer_functions + xlab(expression(paste('TRW'))) +
  ylab("June-July Mean Temperature [°C]")
comparison_TRW$reconstructions
comparison_TRW$reconstructions_together
