Estimate the prediction error of a model via (repeated) \(K\)-fold cross-validation, (repeated) random splitting (also known as random subsampling or Monte Carlo cross-validation), or the bootstrap. It is thereby possible to supply an object returned by a model fitting function, a model fitting function itself, or an unevaluated function call to a model fitting function.
perryFit(object, ...)

# S3 method for default
perryFit(object, data = NULL, x = NULL, y, splits = foldControl(),
  predictFun = predict, predictArgs = list(), cost = rmspe,
  costArgs = list(), names = NULL, envir = parent.frame(),
  ncores = 1, cl = NULL, seed = NULL, ...)

# S3 method for function
perryFit(object, formula, data = NULL, x = NULL, y, args = list(),
  splits = foldControl(), predictFun = predict, predictArgs = list(),
  cost = rmspe, costArgs = list(), names = NULL, envir = parent.frame(),
  ncores = 1, cl = NULL, seed = NULL, ...)

# S3 method for call
perryFit(object, data = NULL, x = NULL, y, splits = foldControl(),
  predictFun = predict, predictArgs = list(), cost = rmspe,
  costArgs = list(), names = NULL, envir = parent.frame(),
  ncores = 1, cl = NULL, seed = NULL, ...)
object: the fitted model for which to estimate the prediction error, a function for fitting a model, or an unevaluated function call for fitting a model (see call for the latter). In the case of a fitted model, the object is required to contain a component call that stores the function call used to fit the model, which is typically the case for objects returned by model fitting functions.
formula: a formula describing the model.
data: a data frame containing the variables required for fitting the models. This is typically used if the model in the function call is described by a formula.
x: a numeric matrix containing the predictor variables. This is typically used if the function call for fitting the models requires the predictor matrix and the response to be supplied as separate arguments.
y: a numeric vector or matrix containing the response.
args: a list of additional arguments to be passed to the model fitting function.
splits: an object of class "cvFolds" (as returned by cvFolds) or a control object of class "foldControl" (see foldControl) defining the folds of the data for (repeated) \(K\)-fold cross-validation, an object of class "randomSplits" (as returned by randomSplits) or a control object of class "splitControl" (see splitControl) defining random data splits, or an object of class "bootSamples" (as returned by bootSamples) or a control object of class "bootControl" (see bootControl) defining bootstrap samples.
predictFun: a function to compute predictions for the test data. It should expect the fitted model to be passed as the first argument and the test data as the second argument, and must return either a vector or a matrix containing the predicted values. The default is to use the predict method of the fitted model.
predictArgs: a list of additional arguments to be passed to predictFun.
cost: a cost function measuring prediction loss. It should expect the observed values of the response to be passed as the first argument and the predicted values as the second argument, and must return either a non-negative scalar value, or a list with the first component containing the prediction error and the second component containing the standard error. The default is to use the root mean squared prediction error (see cost).
costArgs: a list of additional arguments to be passed to the prediction loss function cost.
names: an optional character vector giving names for the arguments containing the data to be used in the function call (see “Details”).
envir: the environment in which to evaluate the function call for fitting the models (see eval).
ncores: a positive integer giving the number of processor cores to be used for parallel computing (the default is 1 for no parallelization). If this is set to NA, all available processor cores are used.
cl: a parallel cluster for parallel computing as generated by makeCluster. If supplied, this is preferred over ncores.
seed: optional initial seed for the random number generator (see .Random.seed). Note that even in the case of parallel computing, resampling is performed on the manager process rather than the worker processes. On the parallel worker processes, random number streams are used, and the seed is set via clusterSetRNGStream for reproducibility in case the model fitting function involves randomness.
...: additional arguments to be passed down.
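To illustrate the interfaces expected for cost and predictFun, the following base-R sketch defines a hypothetical cost function and a hypothetical prediction function of the required form (the names mape and myPredict are made up for this example, and lm() stands in for an arbitrary model fitting function):

```r
# hypothetical cost function: observed values first, predictions second,
# returning a single non-negative loss (mean absolute prediction error)
mape <- function(y, yHat) mean(abs(y - yHat))

# hypothetical predictFun: fitted model first, test data second,
# returning a vector of predicted values
myPredict <- function(fit, newdata) {
  as.numeric(predict(fit, newdata = newdata))
}

# quick check with a linear model on the built-in 'cars' data
fit <- lm(dist ~ speed, data = cars)
mape(cars$dist, myPredict(fit, cars))
```

Functions of this shape could then be passed via the cost and predictFun arguments, with any extra arguments supplied through costArgs and predictArgs.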
An object of class "perry" with the following components:

pe: a numeric vector containing the respective estimated prediction errors. In the case of more than one replication, these are average values over all replications.

se: a numeric vector containing the respective estimated standard errors of the prediction loss.

reppe: a numeric matrix in which each column contains the respective estimated prediction errors from all replications. This is only returned in the case of more than one replication.

splits: an object giving the data splits used to estimate the prediction error.

y: the response.

yHat: a list containing the predicted values from all replications.

call: the matched function call.
(Repeated) \(K\)-fold cross-validation is performed in the following way. The data are first split into \(K\) previously obtained blocks of approximately equal size (as given by splits). Each of the \(K\) data blocks is left out once to fit the model, and predictions are computed for the observations in the left-out block with predictFun. Thus a prediction is obtained for each observation. The response variable and the obtained predictions for all observations are then passed to the prediction loss function cost to estimate the prediction error. For repeated \(K\)-fold cross-validation (as indicated by splits), this process is replicated and the estimated prediction errors from all replications are returned.
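The procedure just described can be sketched in a few lines of base R (a minimal illustration only, with lm() as a stand-in model and root mean squared prediction error as the cost; perryFit() handles all of this internally):

```r
set.seed(1)
n <- 20
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n)

K <- 5
folds <- sample(rep(seq_len(K), length.out = n))   # assign blocks at random
yHat <- numeric(n)
for (k in seq_len(K)) {
  test <- folds == k
  fit <- lm(y ~ x, data = d[!test, ])              # fit without block k
  yHat[test] <- predict(fit, newdata = d[test, ])  # predict block k
}
sqrt(mean((d$y - yHat)^2))  # cost computed over all observations
```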
(Repeated) random splitting is performed similarly. In each replication, the data are split into a training set and a test set at random. The training data are then used to fit the model, and predictions are computed for the test data. Hence only the response values from the test data and the corresponding predictions are passed to the prediction loss function cost.
For the bootstrap estimator, each bootstrap sample is used as training data to fit the model. The out-of-bag estimator uses the observations that do not enter the bootstrap sample as test data and computes the prediction loss function cost for those out-of-bag observations. The 0.632 estimator is computed as a linear combination of the out-of-bag estimator and the prediction loss of the fitted values of the model computed from the full sample.
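The out-of-bag idea can likewise be sketched in base R: fit on a bootstrap sample and evaluate on the observations that did not enter it (once more with lm() as a stand-in model):

```r
set.seed(1)
n <- 20
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n)

idx <- sample(n, replace = TRUE)          # bootstrap sample (with replacement)
oob <- setdiff(seq_len(n), idx)           # out-of-bag observations
fit <- lm(y ~ x, data = d[idx, ])
yHat <- predict(fit, newdata = d[oob, ])
sqrt(mean((d$y[oob] - yHat)^2))           # out-of-bag prediction error
```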
In any case, if the response is a vector but predictFun returns a matrix, the prediction error is computed for each column. A typical use case for this behavior would be if predictFun returns predictions from an initial model fit and stepwise improvements thereof.
If formula or data are supplied, all variables required for fitting the models are added as one argument to the function call, which is the typical behavior of model fitting functions with a formula interface. In this case, the accepted values for names depend on the method. For the function method, a character vector of length two should be supplied, with the first element specifying the argument name for the formula and the second element specifying the argument name for the data (the default is to use c("formula", "data")). Note that names for both arguments should be supplied even if only one is actually used. For the other methods, which do not have a formula argument, a character string specifying the argument name for the data should be supplied (the default is to use "data").
If x is supplied, on the other hand, the predictor matrix and the response are added as separate arguments to the function call. In this case, names should be a character vector of length two, with the first element specifying the argument name for the predictor matrix and the second element specifying the argument name for the response (the default is to use c("x", "y")). It should be noted that the formula or data arguments take precedence over x.
See also: perrySelect, perryTuning, cvFolds, randomSplits, bootSamples, cost
# load the required packages: perryFit(), foldControl(), and rtmspe()
# come from perry; lmrob() and the coleman data come from robustbase
library("perry")
library("robustbase")
data("coleman")
set.seed(1234) # set seed for reproducibility
## via model fit
# fit an MM regression model
fit <- lmrob(Y ~ ., data=coleman)
# perform cross-validation
perryFit(fit, data = coleman, y = coleman$Y,
splits = foldControl(K = 5, R = 10),
cost = rtmspe, costArgs = list(trim = 0.1),
seed = 1234)
## via model fitting function
# perform cross-validation
# note that the response is extracted from 'data' in
# this example and does not have to be supplied
perryFit(lmrob, formula = Y ~ ., data = coleman,
splits = foldControl(K = 5, R = 10),
cost = rtmspe, costArgs = list(trim = 0.1),
seed = 1234)
## via function call
# set up function call
call <- call("lmrob", formula = Y ~ .)
# perform cross-validation
perryFit(call, data = coleman, y = coleman$Y,
splits = foldControl(K = 5, R = 10),
cost = rtmspe, costArgs = list(trim = 0.1),
seed = 1234)