- formula
an object of class "formula" (or one that can be coerced
to that class): a symbolic description of the model to be fitted.
- dataset
a data frame with dependent and independent variables as
columns and (optional) years as row names.
- k
number of folds for cross-validation
- repeats
number of cross-validation repeats. Must be at least 1.
- optimize
if set to TRUE (default), the optimal values for the tuning
parameters will be selected in a preliminary cross-validation procedure
- dataset_complete
optional, a data frame covering the full length of the tree-ring
parameter, which will be used to reconstruct the climate variable specified
in the formula argument
- BRNN_neurons
number of neurons to be used for the brnn method
- MT_committees
an integer: how many committee models (e.g. boosting
iterations) should be used?
- MT_neighbors
how many, if any, neighbors should be used to correct the
model predictions
- MT_rules
an integer (or NA): define an explicit limit to the number of
rules used (NA lets Cubist decide).
- MT_unbiased
a logical: should unbiased rules be used?
- MT_extrapolation
a number between 0 and 100: since Cubist uses linear models,
predictions can fall outside of the range seen in the training set. This
parameter controls how much rule predictions are adjusted to be consistent with the
training set.
- MT_sample
a number between 0 and 99.9: this is the percentage of the dataset
to be randomly selected for model building (not for out-of-bag type evaluation)
- RF_ntree
number of trees to grow. This should not be set to too small
a number, to ensure that every input row gets predicted at least a few times
- RF_maxnodes
maximum number of terminal nodes trees in the forest can
have
- RF_mtry
number of variables randomly sampled as candidates at each
split
- RF_nodesize
minimum size of terminal nodes. Setting this number larger
causes smaller trees to be grown (and thus take less time).
- seed_factor
an integer that will be used to change the seed options
for different repeats.
- digits
an integer: the number of digits to be displayed in the final
result tables
- blocked_CV
default is FALSE, if changed to TRUE, blocked cross-validation
will be used to compare regression methods.
- PCA_transformation
if set to TRUE, all independent variables will be
transformed using PCA transformation.
- log_preprocess
if set to TRUE, variables will be transformed with a
logarithmic transformation before being used in PCA
- components_selection
character string specifying how to select the Principal
Components used as predictors.
There are three options: "automatic", "manual" and "plot_selection". If the
argument is set to "automatic", all scores with eigenvalues above 1 will be
selected. This threshold can be changed with the
eigenvalues_threshold argument. If the argument is set to "manual", the user should
set the number of components with the N_components argument. If the argument
is set to "plot_selection", a scree plot will be shown and the user must
manually enter the number of components to be used as predictors.
- eigenvalues_threshold
threshold for automatic selection of Principal Components
- N_components
number of Principal Components used as predictors
- round_bias_cal
number of digits for bias in the calibration period. Affects
the appearance of the final ggplot of mean bias for calibration data (element 3 of
the output list)
- round_bias_val
number of digits for bias in the validation period. Affects
the appearance of the final ggplot of mean bias for validation data (element 4 of
the output list)
- n_bins
number of bins used for the histograms of mean bias
- edge_share
the share of the data to be considered as the edge (extreme) data.
This argument can be set between 0.10 and 0.50. If the argument is set to 0.10, then
the highest 5% and the lowest 5% of the values are
considered to be the edge data.
- MLR_stepwise
if set to TRUE, stepwise selection of predictors will be used
for the MLR method
- stepwise_direction
the mode of stepwise search, can be one of "both",
"backward", or "forward", with a default of "backward".
- methods
a character vector naming the methods to be compared. The full
method vector is methods = c("MLR", "BRNN", "MT", "RF").
To compare only a subset of methods, pass a vector with just those methods.
- tuning_metric
a string that specifies what summary metric will be used to select
the optimal value of tuning parameters. By default, the argument is set to "RMSE". It is
also possible to use "RSquared".
- BRNN_neurons_vector
a vector of possible values for BRNN_neurons argument optimization
- MT_committees_vector
a vector of possible values for MT_committees argument optimization
- MT_neighbors_vector
a vector of possible values for MT_neighbors argument optimization
- MT_rules_vector
a vector of possible values for MT_rules argument optimization
- MT_unbiased_vector
a vector of possible values for MT_unbiased argument optimization
- MT_extrapolation_vector
a vector of possible values for MT_extrapolation argument optimization
- MT_sample_vector
a vector of possible values for MT_sample argument optimization
- RF_ntree_vector
a vector of possible values for RF_ntree argument optimization
- RF_maxnodes_vector
a vector of possible values for RF_maxnodes argument optimization
- RF_mtry_vector
a vector of possible values for RF_mtry argument optimization
- RF_nodesize_vector
a vector of possible values for RF_nodesize argument optimization
- holdout
this argument defines the observations that are excluded
from the cross-validation and hyperparameter optimization. The holdout argument must be
a character string with one of the following values: "early", "late" or "manual". If
"early" or "late" is specified, then the early or late years will be
used as holdout data. How many of the early or late years are used as a holdout
is specified with the argument holdout_share. If the argument holdout is set to "manual",
then supply a vector of years (or row names) to the argument holdout_manual. The specified
years will be used as a holdout. For the holdout data, the same statistical measures are
calculated as for the cross-validation. The results for holdout metrics are given in the
output element $holdout_results.
- holdout_share
the share of the whole dataset to be used as a holdout.
Default is 0.10.
- holdout_manual
a vector of years (or row names) which will be used as a holdout.
- total_reproducibility
logical, default is FALSE. This argument ensures total
reproducibility regardless of which methods are included or excluded. By default, the
optimization is done only for the methods that are included in the methods vector. If
a method is removed or added, the optimization phase is different, and this affects
all the final cross-validation results. With total_reproducibility = TRUE,
all methods are optimized, even those not included in the methods vector,
and the final results are subset based on the methods vector. Setting
total_reproducibility to TRUE also results in a longer optimization phase.
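
Several of the arguments above describe simple selection rules (holdout with holdout_share, edge_share, and "automatic" component selection with eigenvalues_threshold). The base R sketch below illustrates one plausible reading of those rules on made-up data. It is not the package's implementation: the data frame `toy`, its column names, the rounding of the holdout size, and the use of quantiles on the dependent variable for the edge data are all assumptions made for illustration.

```r
# Hypothetical toy data: 50 years as row names, one dependent variable
# (temp) and three independent tree-ring variables.
set.seed(1)
toy <- data.frame(
  temp = rnorm(50),
  trw  = rnorm(50),
  mvd  = rnorm(50),
  dens = rnorm(50),
  row.names = as.character(1961:2010)
)

# holdout = "early" with holdout_share = 0.10: set aside the earliest years.
# Assumption: the number of holdout years is the rounded share of all rows.
holdout_share <- 0.10
n_holdout <- round(nrow(toy) * holdout_share)
years <- sort(as.numeric(row.names(toy)))
early_years <- as.character(years[seq_len(n_holdout)])
holdout_data <- toy[row.names(toy) %in% early_years, ]
cv_data <- toy[!row.names(toy) %in% early_years, ]

# edge_share = 0.10: treat the lowest 5% and highest 5% of the dependent
# variable as edge (extreme) data. Assumption: cut-offs are plain quantiles.
edge_share <- 0.10
lower_cut <- quantile(cv_data$temp, edge_share / 2)
upper_cut <- quantile(cv_data$temp, 1 - edge_share / 2)
edge_data <- cv_data[cv_data$temp <= lower_cut | cv_data$temp >= upper_cut, ]

# components_selection = "automatic": keep principal components whose
# eigenvalues exceed eigenvalues_threshold (1 in this sketch).
eigenvalues_threshold <- 1
pca <- prcomp(cv_data[, c("trw", "mvd", "dens")], center = TRUE, scale. = TRUE)
eigenvalues <- pca$sdev^2
n_components <- sum(eigenvalues > eigenvalues_threshold)
pc_scores <- pca$x[, seq_len(n_components), drop = FALSE]
```

With holdout = "late", the same idea would apply to the last years instead of the first, and with holdout = "manual" the vector passed to holdout_manual would take the place of `early_years`.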