predab.resample
Predictive Ability using Resampling
predab.resample is a general-purpose function that is used by functions for specific models. It computes estimates of optimism of, and bias-corrected estimates of, a vector of indexes of predictive accuracy for a model with a specified design matrix, with or without fast backward step-down of predictors. If bw=TRUE, the design matrix x must have been created by ols, lrm, or cph. If bw=TRUE, predab.resample stores as the kept attribute a logical matrix encoding which factors were selected at each repetition.
- Keywords
- models
Usage
predab.resample(fit.orig, fit, measure,
method=c("boot","crossvalidation",".632","randomization"),
bw=FALSE, B=50, pr=FALSE, prmodsel=TRUE,
rule="aic", type="residual", sls=.05, aics=0,
tol=1e-12, force=NULL, estimates=TRUE,
non.slopes.in.x=TRUE, kint=1,
cluster, subset, group=NULL,
allow.varying.intercepts=FALSE, debug=FALSE, ...)
Arguments
- fit.orig
object containing the original full-sample fit, with the x=TRUE and y=TRUE options specified to the model fitting function. This model should be the FULL model including all candidate variables ever excluded because of poor associations with the response.
- fit
a function to fit the model, either the original model fit or a fit in a sample. fit has as arguments x, y, iter, penalty, penalty.matrix, xcol, and other arguments passed to predab.resample. If you don't want iter as an argument inside the definition of fit, add ... to the end of its argument list. iter is passed to fit to inform the function of the sampling repetition number (0=original sample). If bw=TRUE, fit should allow for the possibility of selecting no predictors, i.e., it should fit an intercept-only model if the model has intercept(s). fit must return objects coef and fail (fail=TRUE if fit failed due to singularity or non-convergence; these cases are excluded from summary statistics). fit must add design attributes to the returned object if bw=TRUE. The penalty.matrix parameter is not used if penalty=0. The xcol vector is a vector of columns of X to be used in the current model fit. For ols and psm it includes a 1 for the intercept position. xcol is not defined if iter=0 unless the initial fit had been from a backward step-down. xcol is used to select the correct rows and columns of penalty.matrix for the current variables selected, for example.
- measure
a function to compute a vector of indexes of predictive accuracy for a given fit. For method=".632" or method="crossval", it will make the most sense for measure to compute only indexes that are independent of sample size. The measure function should take the following arguments or use ...: xbeta (X beta for current fit), y, evalfit, fit, iter, and fit.orig. iter is as in fit. evalfit is set to TRUE by predab.resample if the fit is being evaluated on the sample used to make the fit, FALSE otherwise; fit.orig is the fit object returned by the original fit on the whole sample. Using evalfit will sometimes save computations. For example, in bootstrapping the area under an ROC curve for a logistic regression model, lrm already computes the area if the fit is on the training sample. fit.orig is used to pass computed configuration parameters from the original fit such as quantiles of predicted probabilities that are used as cut points in other samples. The vector created by measure should have names() associated with it.
- method
The default is "boot" for ordinary bootstrapping (Efron, 1983, Eq. 2.10). Use ".632" for Efron's .632 method (Efron, 1983, Section 6 and Eq. 6.10), "crossvalidation" for grouped cross-validation, "randomization" for the randomization method. May be abbreviated down to any level, e.g. "b", ".", "cross", "rand".
- bw
Set to TRUE to do fast backward step-down for each training sample. Default is FALSE.
- B
Number of repetitions, default=50. For method="crossvalidation", this is also the number of groups the original sample is split into.
- pr
TRUE to print results for each sample. Default is FALSE.
- prmodsel
set to FALSE to suppress printing of model selection output such as that from fastbw.
- rule
Stopping rule for fastbw, "aic" or "p". Default is "aic" to use Akaike's information criterion.
- type
Type of statistic to use in stopping rule for fastbw, "residual" (the default) or "individual".
- sls
Significance level for stopping in fastbw if rule="p". Default is .05.
- aics
Stopping criterion for rule="aic". Stops deleting factors when chi-square - 2 times d.f. falls below aics. Default is 0.
- tol
Tolerance for singularity checking. Is passed to fit and fastbw.
- force
see fastbw
- estimates
see print.fastbw
- non.slopes.in.x
set to FALSE if the design matrix x does not have columns for intercepts and these columns are needed
- kint
For multiple intercept models such as the ordinal logistic model, you may specify which intercept to use as kint. This affects the linear predictor that is passed to measure.
- cluster
Vector containing cluster identifiers. This can be specified only if method="boot". If it is present, the bootstrap is done using sampling with replacement from the clusters rather than from the original records. If this vector is not the same length as the number of rows in the data matrix used in the fit, an attempt will be made to use naresid on fit.orig to conform cluster to the data. See bootcov for more about this.
- subset
specify a vector of positive or negative integers or a logical vector when you want to have the measure function compute measures of accuracy on a subset of the data. The whole dataset is still used for all model development. For example, you may want to validate or calibrate a model by assessing the predictions on females when the fit was based on males and females. When you use cr.setup to build extra observations for fitting the continuation ratio ordinal logistic model, you can use subset to specify which cohort or observations to use for deriving indexes of predictive accuracy. For example, specify subset=cohort=="all" to validate the model for the first layer of the continuation ratio model (Prob(Y=0)).
- group
a grouping variable used to stratify the sample upon bootstrapping. This allows one to handle k-sample problems, i.e., each bootstrap sample will be forced to select the same number of observations from each level of group as the number appearing in the original dataset.
- allow.varying.intercepts
set to TRUE to not throw an error if the number of intercepts varies from fit to fit
- debug
set to TRUE to print subscripts of all training and test samples
- ...
The user may add other arguments here that are passed to fit and measure.
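The interface described for fit and measure can be sketched in base R as follows. This is only an illustration: sketch_fit and sketch_measure are hypothetical names, the fitter is ordinary least squares via lm.fit, and the design matrix is assumed to hold slope columns only (so the function adds the intercept column itself). The element name coef follows the wording above. The pair has not been run through predab.resample.

```r
# Hypothetical fit function with the argument list described above.
# Unused arguments (iter, penalty, penalty.matrix, xcol) are accepted
# so that the caller may pass them.
sketch_fit <- function(x, y, iter = 0, penalty = 0, penalty.matrix = NULL,
                       xcol = NULL, ...) {
  # Assumption: x contains slopes only, so add an intercept column here.
  X <- cbind(Intercept = rep(1, NROW(x)), x)
  f <- lm.fit(X, y)
  # lm.fit marks aliased (singular) columns with NA coefficients.
  if (anyNA(f$coefficients)) return(list(fail = TRUE))
  list(coef = f$coefficients, fail = FALSE)
}

# Hypothetical measure function: returns a named vector of indexes.
# evalfit is accepted but not used in this simple sketch.
sketch_measure <- function(xbeta, y, evalfit = FALSE, ...) {
  c(Rsq = cor(xbeta, y)^2, MSE = mean((y - xbeta)^2))
}

# Small simulated example (made-up data).
set.seed(1)
x <- matrix(rnorm(200), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
y <- 1 + 2 * x[, 1] - x[, 2] + rnorm(100)
f <- sketch_fit(x, y)
xbeta <- drop(cbind(1, x) %*% f$coef)   # linear predictor X beta
m <- sketch_measure(xbeta, y, evalfit = TRUE)
```

Both indexes here are averages or ratios, so they do not scale with the number of observations, which matters for method=".632" and cross-validation.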
Details
For method=".632", the program stops with an error unless every observation is omitted at least once from a bootstrap sample. Efron's ".632" method was developed for measures that are formulated in terms of per-observation contributions. In general, error measures (e.g., ROC areas) cannot be written in this way, so this function uses a heuristic extension to Efron's formulation in which it is assumed that the average error measure omitting the ith observation is the same as the average error measure omitting any other observation. Then weights are derived for each bootstrap repetition and weighted averages over the B repetitions can easily be computed.
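The condition that every observation be omitted from at least one bootstrap sample can be checked with a few lines of base R. The sketch below is not the code used inside predab.resample. The sample size n, the variable names, and the seed are made up for illustration.

```r
set.seed(2)
n <- 20   # hypothetical number of observations
B <- 50   # number of bootstrap repetitions

# Each column holds the row indexes drawn for one bootstrap sample.
idx <- replicate(B, sample(n, n, replace = TRUE))

# omitted[i, b] is TRUE when observation i does not appear in sample b.
omitted <- sapply(seq_len(B), function(b) !(seq_len(n) %in% idx[, b]))

# The .632 computation needs every row to contain at least one TRUE.
every_obs_omitted <- all(rowSums(omitted) >= 1)

# Fraction of observations left out of a bootstrap sample, averaged over B.
mean_omitted <- mean(omitted)
```

With B small relative to n, some observation may appear in every sample, which is the case that triggers the error described above.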
Value
a matrix of class "validate" with rows corresponding to indexes computed by measure, and the following columns:
- index.orig
indexes in original overall fit
- training
average indexes in training samples
- test
average indexes in test samples
- optimism
average training-test, except for method=".632", where it is .632 times (index.orig - test)
- index.corrected
index.orig-optimism
- n
number of successful repetitions with the given index non-missing
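The arithmetic relating these quantities can be worked through with made-up numbers. The values below are hypothetical and do not come from any fitted model. The variable names simply mirror the quantities described above.

```r
# Hypothetical averaged values for a single index (e.g. R-squared).
index.orig <- 0.40   # index in the original overall fit
training   <- 0.45   # average index in training samples
test       <- 0.37   # average index in test samples

# Ordinary bootstrap: optimism is the average of training minus test.
optimism <- training - test                 # 0.08
index.corrected <- index.orig - optimism    # 0.32

# For method=".632" the optimism is .632 times (index.orig - test).
optimism.632 <- 0.632 * (index.orig - test)         # 0.01896
index.corrected.632 <- index.orig - optimism.632    # 0.38104
```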
References
Efron B, Tibshirani R (1997). Improvements on cross-validation: The .632+ bootstrap method. JASA 92:548-560.
See Also
Examples
# NOT RUN {
# See the code for validate.ols for an example of the use of
# predab.resample
# }