test: Bootstrap based test for covariate selection

Description

Function that applies a bootstrap based test for covariate selection. It helps to determine the number of variables to be included in the model.

Usage

test(x, y, method = "lm", family = "gaussian", nboot = 50,
  speedup = TRUE, qmin = NULL, unique = FALSE, q = NULL,
  bootseed = NULL, cluster = TRUE, ncores = NULL)

Arguments

x

A data frame containing all the covariates.

y

A vector with the response values.

method

A character string specifying which regression method is used, e.g., linear models ("lm") or generalized additive models.

family

A description of the error distribution and link function to be used in the model: "gaussian", "binomial" or "poisson".

nboot

Number of bootstrap repeats.

speedup

A logical value. If TRUE (default), the testing procedure is computationally efficient since it considers one more variable to fit the alternative model than the number of variables used to fit the null. If FALSE, the fit of th

qmin

By default NULL. If speedup is FALSE, qmin is an integer number selected by the user. To help you select this argument, it is recommended to visualize the graphical output of the plot functi

unique

A logical value. By default FALSE. If TRUE, the test is performed only for one null hypothesis, given by the argument q.

q

By default NULL. If unique is TRUE, q is the size of the subset of variables to be tested.

bootseed

Seed to be used in the bootstrap procedure.

cluster

A logical value. If TRUE (default), the testing procedure is parallelized.

ncores

An integer value specifying the number of cores to be used in the parallelized procedure. If NULL (default), the number of cores used is one less than the number of cores on the machine.
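
The behaviour described for bootseed and ncores can be illustrated with a minimal base R sketch. It is not the package's internal code: `resolve_ncores` is a hypothetical helper, and the assumption that bootseed acts like a call to `set.seed()` before resampling is ours, made only for illustration.

```r
library(parallel)

# Hypothetical helper (not part of FWDselect): mimics the documented default
# of using one core fewer than the machine has when ncores is NULL.
resolve_ncores <- function(ncores = NULL) {
  if (is.null(ncores)) {
    detected <- detectCores()            # may return NA on some platforms
    if (is.na(detected)) detected <- 2L  # fallback chosen for this sketch
    ncores <- detected - 1L
  }
  max(1L, as.integer(ncores))            # never go below one core
}

resolve_ncores(4)   # an explicit value is used as given
resolve_ncores()    # default: detected cores minus one, at least 1

# Assumption: a bootstrap seed behaves like set.seed(), so the same seed
# yields the same resampled row indices.
set.seed(123)
idx1 <- sample(10, replace = TRUE)
set.seed(123)
idx2 <- sample(10, replace = TRUE)
identical(idx1, idx2)
```

Fixing the seed in this way is what makes two runs with the same nboot comparable.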

Value

A list with two objects. The first one is a table containing:

Hypothesis: Number of the null hypothesis tested
Statistic: Value of the T statistic
pvalue: p-value obtained in the testing procedure
Decision: Result of the test for a significance level of 0.05

The second object, nvar, indicates the number of variables that have to be included in the model.

Details

In a regression framework, let $X_1, X_2, \ldots, X_p$ be a set of $p$ initial variables and $Y$ the response variable. We propose a procedure to test the null hypothesis that the model contains at most $q$ significant variables ($q$ effects not equal to zero) versus the alternative that it contains more than $q$ variables. Based on the general model

$$Y=m(\textbf{X})+\varepsilon \quad {\rm{where}} \quad m(\textbf{X})= m_{1}(X_{1})+m_{2}(X_{2})+\ldots+m_{p}(X_{p})$$

the following strategy is considered: for a subset of size $q$, we test the null hypothesis

$$H_{0}(q): \sum_{j=1}^p I_{\{m_j \ne 0\}} \le q$$

against the general alternative

$$H_{1}: \sum_{j=1}^p I_{\{m_j \ne 0\}} > q$$
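
To make the indicator sum in the hypotheses concrete, here is a small worked sketch in base R. The effect values in `beta` are hypothetical and chosen only for illustration; in practice the effects are unknown, which is why a test is needed.

```r
# Hypothetical true effects for p = 5 covariates (illustrative values only)
beta <- c(2, 0, -1, 0, 0)

# Number of nonzero effects: the sum of the indicators I{m_j != 0}
n_nonzero <- sum(beta != 0)

# H0(q) states that at most q effects are nonzero
h0_holds <- function(q) n_nonzero <= q

# H0(0) and H0(1) are false here; H0(2) and H0(3) are true
sapply(0:3, h0_holds)
```

With two nonzero effects, the smallest q for which the null hypothesis holds is 2, which is the number of variables one would want in the model.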

References

Sestelo, M., Villanueva, N. M. and Roca-Pardinas, J. (2013). FWDselect: an R package for selecting variables in regression models. Discussion Papers in Statistics and Operation Research, University of Vigo, 13/01.

Examples

library(FWDselect)
data(diabetes)
x = diabetes[ ,2:11]
y = diabetes[ ,1]
test(x, y, method = "lm", cluster = FALSE, nboot = 5)

## for speedup = FALSE
# obj2 = qselection(x, y, qvector = c(1:9), method = "lm",
# cluster = FALSE)
# plot(obj2) # we choose q = 7 for the argument qmin
# test(x, y, method = "lm", cluster = FALSE, nboot = 5,
# speedup = FALSE, qmin = 7)
