Usage

selectModel(X, y, Xm = NULL, stay = NULL, intercept = TRUE, minpv = 0.15,
  multif = TRUE, maxf = min(ncol(X), 70), minb = 0, crit = mbic,
  crit.multif = bic, maxp = 1e+07, verbose = TRUE, file.out = NULL, ...)
Arguments

X: a numeric matrix or an object of class big.matrix. The rows of X contain the samples, the columns of X contain the observed variables. If you have variables in rows, see 'Details'.

y: a numeric vector of responses. The length of y must equal the number of rows of X.

Xm: a numeric matrix of additional variables which should be included in all model selection steps.

intercept: a logical. If TRUE, the intercept will be included.

minpv: a numeric. Variables with p-values for the Pearson correlation tests (see 'Details') higher than minpv will be excluded from the model selection procedure. If you do not want to do this, set minpv = 1 (not recommended if you have big data).

multif: a logical. If TRUE, the multi-forward step will be performed (see 'Details').

maxf: a numeric. The maximal number of variables that can be added in the multi-forward step (see 'Details').

crit: a function defining the model selection criterion: one of bic, mbic, mbic2, aic, maic, maic2, or your own function (see 'Examples').

crit.multif: a function defining the model selection criterion in the multi-forward step, as for crit (but only criteria with small penalties are recommended).

maxp: a numeric. If X is big, it will be split into parts with maxp elements. This does not change the results, but it is necessary if your computer does not have enough RAM. Set it to a lower value if you still have problems.

verbose: a logical. Set to FALSE if you do not want to see any information during the selection procedure.

file.out: a character string. If not NULL (and minpv < 1), the variables with p-values < minpv will be saved to file.out (a txt file).

...: optional arguments passed to crit (you cannot send optional arguments to crit.multif).
Details

In the first step, the Pearson correlation coefficients between y and all columns of X are calculated, and columns with p-values for the correlation tests higher than minpv will be excluded from the model selection procedure.
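This screening can be previewed by hand. A rough sketch, assuming that plain cor.test() p-values match what selectModel computes internally:

pvals <- apply(X, 2, function(x) cor.test(x, y)$p.value)  # Pearson by default
keep <- which(pvals < 0.15)                               # 0.15 is the default minpv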
In the second step (multi-forward) we start with the null model and add variables which decrease crit.multif (in order of increasing p-value). The step is finished after we add maxf variables or when none of the remaining variables improves crit.multif. Then the classical backward selection is performed (with crit). When there are no more variables to remove, the last step, the classical stepwise procedure, is performed (with crit).
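For intuition, an illustrative sketch of the multi-forward idea (not the package's code; it fits plain lm() models and uses BIC where selectModel uses crit.multif):

multiForwardSketch <- function(X, y, pvals, maxf = 70) {
  selected <- integer(0)
  best <- BIC(lm(y ~ 1))                  # start from the null model
  for (j in order(pvals)) {               # candidates in order of increasing p-value
    cand <- BIC(lm(y ~ X[, c(selected, j)]))
    if (cand < best) {                    # add any variable that improves the criterion
      selected <- c(selected, j)
      best <- cand
    }
    if (length(selected) >= maxf) break   # stop after maxf additions
  }
  selected
}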
Results from this four-step procedure should be very similar to those of the classical stepwise procedure (starting with the null model and not omitting variables with high p-values), but the former is much quicker. The most time-consuming part is the forward step of the stepwise selection (in the multi-forward step we do not add the best variable but any variable which decreases crit.multif), and it is performed less often when we start with a reasonable model (sometimes you can find the best model without using the stepwise selection at all). You can omit the first three steps by setting multif = FALSE and minpv = 1 (see the example below). Skipping the multi-forward step can be reasonable when you expect the final model to be very small (a few variables).
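For example, this call reduces selectModel to the classical stepwise procedure:

fit <- selectModel(X, y, multif = FALSE, minpv = 1)  # no screening, no multi-forward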
If your data are too big to store in RAM, you should read them with the read.big.matrix function from the bigmemory package. The selectModel function will recognize that X is not an ordinary matrix and will split your data into smaller parts. This will not change the results but is necessary when working with big data.
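A sketch of this workflow, assuming a comma-separated file "X.csv" (hypothetical name) with samples in rows and a header line:

library(bigmemory)
Xbig <- read.big.matrix("X.csv", sep = ",", header = TRUE, type = "double")
fit <- selectModel(Xbig, y)  # selectModel splits Xbig into parts of maxp elements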
The default criterion in the model selection procedure is a modification of the Bayesian Information Criterion, mBIC [1]. It was constructed to control the so-called Family-Wise Error Rate (FWER) at a level near 0.05 when you have a lot of explanatory variables and only a few of them should stay in the final model. If you are interested in controlling the so-called False Discovery Rate (FDR) in such types of data, you can change crit to mbic2 [2], which controls the FDR at a level near 0.05. There are more criteria to choose from, or you can easily define your own (see 'Examples' and the sketch below).
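For reference, a minimal standalone sketch of the mBIC value of a fitted linear model, assuming the common penalty constant c = 4 from [1]; it is not tied to selectModel's internal crit interface (inspect args(mbic) in your installed version before writing a drop-in criterion):

mbicValue <- function(fit, p, const = 4) {
  n <- nobs(fit)                      # sample size
  k <- length(coef(fit)) - 1          # selected variables, without the intercept
  -2 * as.numeric(logLik(fit)) + k * log(n) + 2 * k * log(p / const)
}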
If you have variables in rows, you have to transpose X. This can be problematic if your data are big, so you can use the transposeBigMatrix function from this package (see below).
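For an ordinary matrix, t(X) is enough; for a big.matrix, the call below assumes transposeBigMatrix takes the matrix as its first argument (check its help page):

Xt <- t(X)                              # ordinary matrix
# Xbig.t <- transposeBigMatrix(Xbig)    # big.matrix; assumed signature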
References

[1] M. Bogdan, J. K. Ghosh, R. W. Doerge (2004), "Modifying the Schwarz Bayesian Information Criterion to locate multiple interacting quantitative trait loci", Genetics 167: 989-999.

[2] F. Frommlet, A. Chakrabarti, M. Murawska, M. Bogdan (2011), "Asymptotic Bayes optimality under sparsity for generally distributed effect sizes under the alternative", technical report, arXiv:1005.4753.

[3] F. Frommlet, F. Ruhaltinger, P. Twarog, M. Bogdan (2012), "A model selection approach to genome wide association studies", Computational Statistics and Data Analysis 56: 1038-1051.
Examples

set.seed(1)
n <- 100
M <- 10
X <- matrix(rnorm(M * n), ncol = M)
y <- X[, 2] - X[, 3] + X[, 6] - X[, 10] + rnorm(n)
fit <- selectModel(X, y, p = M)  # p is passed via '...' to the default criterion, mbic
summary(fit)
# more examples: type ?bigstep