forward: Iterative bias reduction smoothing

Description

Performs forward variable selection for iterative bias reduction using kernel smoothers, thin plate splines or low rank splines. Missing values are not allowed.

Usage

forward(formula,data,subset,criterion="gcv",df=1.5,Kmin=1,Kmax=1e+06,
   smoother="k",kernel="g",rank=NULL,control.par=list(),cv.options=list(),
   varcrit=criterion)

Value

Returns an object of class forwardibr, which is a matrix with p columns, one per explanatory variable. In the first row, entry j contains the value of the chosen criterion for the univariate smoother using the jth explanatory variable. The variable that attains the minimum of the first row is included in the model, and every entry of its column below the first row is set to Inf. In the second row, entry j contains the value of the criterion for the bivariate smoother using the jth explanatory variable and the variable already included. The variable that attains the minimum of the second row is included in the model, and every entry of its column below the second row is set to Inf. The forward selection continues in this way until the chosen criterion increases.
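The layout described above can be read back with a few lines of base R. The sketch below uses a small made-up matrix in place of a real forwardibr object (the values, the column names x1 to x3 and the assumption that the last row records the step at which the criterion increased are all illustrative, not taken from the package):

```r
# Toy stand-in for a forward() result with p = 3 variables (made-up values).
res <- rbind(
  c(4.2, 3.1, 5.0),  # step 1: univariate criteria, variable 2 attains the minimum
  c(2.8, Inf, 3.0),  # step 2: variable 2 is already included, variable 1 wins
  c(Inf, Inf, 2.9)   # step 3: adding variable 3 raises the criterion (2.9 > 2.8)
)
colnames(res) <- c("x1", "x2", "x3")

# Column attaining the minimum at each step, and the minimum itself.
chosen   <- apply(res, 1, which.min)
best_val <- apply(res, 1, min)

# Keep the steps before the first increase of the criterion.
up     <- which(diff(best_val) > 0)
n_keep <- if (length(up) > 0) up[1] else length(best_val)
selected <- colnames(res)[chosen[seq_len(n_keep)]]
print(selected)
```

On a real result, the same apply(res, 1, which.min) call gives the column index of the best candidate at each step.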

Arguments

formula

An object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted.

data

An optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which forward is called.

subset

An optional vector specifying a subset of observations to be used in the fitting process.

criterion

Character string. If the number of iterations (iter, see control.par) is missing or NULL, the number of iterations is chosen using criterion. The criteria available are GCV (default, "gcv"), AIC ("aic"), corrected AIC ("aicc"), BIC ("bic"), gMDL ("gmdl"), map ("map") or rmse ("rmse"). The last two are designed for cross-validation.

df

A numeric vector of either length 1 or length equal to the number of explanatory variables. If smoother="k", it indicates the desired degrees of freedom (trace) of the smoothing matrix for each variable or for the initial smoother (see control.par$dftotal); df is recycled when it has length 1. If smoother="tps", the minimum df of thin plate splines is multiplied by df. This argument has no effect if bandwidth is supplied (non-NULL).

Kmin

The minimum number of bias correction iterations of the search grid considered by the model selection procedure for selecting the optimal number of iterations.

Kmax

The maximum number of bias correction iterations of the search grid considered by the model selection procedure for selecting the optimal number of iterations.
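To illustrate how Kmin and Kmax bound the search, the sketch below compares a search over every integer of the grid with a continuous search using optimize, the two strategies selected by the exhaustive component of control.par. The criterion function crit is a hypothetical stand-in, and Kmax is lowered from its default so that the grid stays small. This is not the package's internal code.

```r
# Hypothetical smooth criterion, minimal near k = exp(3) (about 20.09).
crit <- function(k) (log(k) - 3)^2

Kmin <- 1
Kmax <- 200  # lowered from the 1e+06 default to keep the grid small

# Exhaustive search: evaluate the criterion at every integer of Kmin:Kmax.
grid <- Kmin:Kmax
k_exhaustive <- grid[which.min(crit(grid))]

# Continuous search with optimize(), rounded to an integer iteration count.
k_optimize <- round(optimize(crit, interval = c(Kmin, Kmax))$minimum)

print(c(exhaustive = k_exhaustive, optimize = k_optimize))
```

The exhaustive search costs one criterion evaluation per grid point, which is why a continuous search is attractive when Kmax is large.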

smoother

Character string selecting the smoother: thin plate splines ("tps") or kernel ("k").

kernel

Character string selecting the kernel: Gaussian ("g"), Epanechnikov ("e"), uniform ("u") or quartic ("q"). The default (Gaussian kernel) is strongly advised.

rank

Numeric value that controls the rank of low rank splines (denoted as k in the mgcv package; see also choose.k for further details, or gam for another smoothing approach with reduced rank smoothers).

control.par

A named list that controls optional parameters. The components are bandwidth (default to NULL), iter (default to NULL), really.big (default to FALSE), dftobwitmax (default to 1000), exhaustive (default to FALSE), m (default to NULL), dftotal (default to FALSE), accuracy (default to 0.01), ddlmaxi (default to 2n/3) and fraction (default to c(100, 200, 500, 1000, 5000, 10^4, 5e+04, 1e+05, 5e+05, 1e+06)).

bandwidth: a vector of either length 1 or length equal to the number of explanatory variables. If smoother="k", it indicates the bandwidth used for each variable; bandwidth is recycled when it has length 1. If smoother="tps", it indicates the amount of penalty (coefficient lambda). The default (missing) indicates, for smoother="k", that the bandwidth of each variable is chosen such that each univariate kernel smoother (one per explanatory variable) has df degrees of freedom and, for smoother="tps", that lambda is chosen such that the df of the smoothing matrix is df times the minimum df.

iter: the number of iterations. If NULL or missing, the optimal number of iterations is chosen from the search grid (integers from Kmin to Kmax) by minimizing the criterion.

really.big: a boolean. If TRUE, it overrides the limitation to 500 observations. Expect long computation times if TRUE.

dftobwitmax: when the bandwidth is chosen by specifying the degrees of freedom (see df), a search is done by uniroot. This argument specifies the maximum number of iterations passed to the uniroot function.

exhaustive: a boolean. If TRUE, an exhaustive search for the optimal number of iterations is performed on the grid Kmin:Kmax. If FALSE, the minimum of the criterion is searched using optimize between Kmin and Kmax.

m: the order of thin plate splines. This integer m must satisfy 2m/d>1, where d is the number of explanatory variables. If missing or NULL (the default), m is chosen as the first integer such that 2m/d>1.

dftotal: a boolean. When FALSE (the default), the argument df is the target df of each univariate kernel, calculated for each explanatory variable. When TRUE, df is the target df of the overall (product) kernel, that is, the base smoother.

accuracy: the tolerance used when searching for bandwidths that lead to a chosen overall initial df.

ddlmaxi: the maximum degrees of freedom allowed for the iterated bias reduction smoother.

fraction: the subdivision of the interval [Kmin, Kmax] if a non-exhaustive search is performed (see also iterchoiceA or iterchoiceS1).
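As a rough illustration of the components above, the sketch below builds a control.par list and derives the default order m from the rule 2m/d>1. The helper default_tps_order, the values of d and the commented call (with hypothetical names y and mydata) are assumptions for illustration only; the component names are those listed in this section.

```r
# Hypothetical helper: smallest integer m with 2m/d > 1, that is, m > d/2.
default_tps_order <- function(d) floor(d / 2) + 1

d <- 3  # assumed number of explanatory variables

# Only the components that differ from the documented defaults need to be set.
ctrl <- list(
  exhaustive = TRUE,                  # search every integer of Kmin:Kmax
  m          = default_tps_order(d),  # explicit order, same as the default rule
  accuracy   = 0.01
)

# Guard against misspelled component names.
documented <- c("bandwidth", "iter", "really.big", "dftobwitmax", "exhaustive",
                "m", "dftotal", "accuracy", "ddlmaxi", "fraction")
stopifnot(all(names(ctrl) %in% documented))

# The list would then be passed as, for example:
# forward(y ~ ., data = mydata, smoother = "tps", Kmax = 500, control.par = ctrl)
print(ctrl$m)
```

Checking the names before the call is a cheap safeguard, since a misspelled component in a named list may otherwise go unnoticed.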

cv.options

A named list that controls how cross-validation is done, with components bwchange, ntest, ntrain, Kfold, type, seed, method and npermut. bwchange is a boolean (default to FALSE) that indicates whether the bandwidth has to be recomputed each time. ntest is the number of observations in the test set and ntrain is the number of observations in the training set. Only one of these is needed; the other can be NULL or missing. Kfold is a boolean or an integer. If Kfold is TRUE, the number of folds is deduced from ntest (or ntrain). type is a character string, one of random, timeseries, consecutive or interleaved, and gives the type of segments. seed controls the seed of the random generator. method is either "inmemory" or "outmemory"; "inmemory" moves some calculations outside the loop, which saves computation time but increases the required memory. npermut is the number of random draws. If cv.options is list(), then ntest is set to floor(nrow(x)/10), type is random, npermut is 20, method is "inmemory" and the other components are NULL.
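The defaults described above can be written out explicitly and combined with user choices. The sketch below is only an illustration in base R: the sample size n and the user settings are made up, and the merge with modifyList is not necessarily how the package fills in missing components.

```r
n <- 330  # hypothetical number of observations

# Defaults as described for cv.options = list().
cv_defaults <- list(
  bwchange = FALSE,
  ntest    = floor(n / 10),
  type     = "random",
  npermut  = 20,
  method   = "inmemory"
)

# Illustrative user choices: consecutive segments, folds deduced from ntest.
user <- list(type = "consecutive", Kfold = TRUE)

# modifyList() overrides matching components and appends new ones.
cv_opts <- modifyList(cv_defaults, user)
str(cv_opts)

# The list would then accompany a cross-validation criterion, for example:
# forward(y ~ ., data = mydata, criterion = "rmse", cv.options = cv_opts)
```

In the commented call, y and mydata are placeholder names.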

varcrit

Character string. Criterion used for variable selection. The criteria available are GCV ("gcv"), AIC ("aic"), corrected AIC ("aicc"), BIC ("bic") and gMDL ("gmdl").

Author

Pierre-Andre Cornillon, Nicolas Hengartner and Eric Matzner-Lober.

References

Cornillon, P.-A.; Hengartner, N.; Jegou, N. and Matzner-Lober, E. (2012) Iterative bias reduction: a comparative study. Statistics and Computing, 23, 777-791.

Cornillon, P.-A.; Hengartner, N. and Matzner-Lober, E. (2013) Recursive bias estimation for multivariate regression smoothers. ESAIM: Probability and Statistics, 18, 483-502.

Cornillon, P.-A.; Hengartner, N. and Matzner-Lober, E. (2017) Iterative Bias Reduction Multivariate Smoothing in R: The ibr Package. Journal of Statistical Software, 77, 1-26.

Examples

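if (FALSE) {
data(ozone, package = "ibr")
# forward() takes a formula and a data frame (see Usage); the response is
# assumed to be the first column of ozone, as in the original positional call.
resp <- names(ozone)[1]
res.ibr <- forward(reformulate(".", response = resp), data = ozone, df = 1.2)
# Column index of the variable minimizing the criterion at each step
apply(res.ibr, 1, which.min)
}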