snpRFcv: Random Forest Cross-Validation for feature selection

Description

This function reports the cross-validated prediction performance of models fitted with a sequentially reduced number of predictors (ranked by variable importance), using a nested cross-validation procedure.

Usage

snpRFcv(trainx.autosome=NULL, trainx.xchrom=NULL, trainx.covar=NULL, trainy,
        cv.fold=5, scale="log", step=0.5,
        mtry=function(p) max(1, floor(sqrt(p))), recursive=FALSE, ...)

Arguments

trainx.autosome

A matrix of autosomal markers, with each column corresponding to a SNP coded as the count of a particular allele (i.e. 0, 1 or 2) and each row corresponding to a sample/individual.

trainx.xchrom

A matrix of X chromosome markers, with each marker coded as two adjacent columns. Each column is coded as 0 or 1 according to whether it carries a particular allele. Although males have only one X chromosome, their markers are also coded as two columns, the second being a duplicate of the first. Each row of this matrix corresponds to a sample/individual. These data must be phased in chromosomal order.

trainx.covar

A matrix of covariates, each column being a different covariate, and each row, a sample/individual.

trainy

vector of responses; must be a factor with length equal to the number of rows in trainx.*

cv.fold

number of folds in the cross-validation

scale

if "log", remove a fixed proportion (step) of the variables at each step; otherwise remove step variables at a time

step

if scale="log", the fraction of variables to remove at each step; otherwise the number of variables to remove at a time

mtry

a function of the number of remaining predictor variables, whose value is used as the mtry parameter in the snpRF call

recursive

whether variable importance is (re-)assessed at each step of variable reduction

...

other arguments passed on to snpRF
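
The interplay of scale, step and mtry can be illustrated with a small base-R sketch. The helper n_var_schedule below is hypothetical and not part of snpRF; it is modelled on the usual way such a reduction schedule is built, and whether snpRFcv computes its sizes in exactly this way is an assumption. In particular, the sketch treats step as the fraction of variables kept under scale="log" (identical to the fraction removed at the default of 0.5) and always subtracts step in the linear case.

```r
# Hypothetical helper (not part of snpRF): number of predictors kept at each step.
n_var_schedule <- function(p, scale = "log", step = 0.5) {
  if (scale == "log") {
    # Shrink the predictor set geometrically by a factor of `step`.
    k <- floor(log(p, base = 1 / step))
    n.var <- round(p * step^(seq_len(k) - 1))
    n.var <- n.var[c(TRUE, diff(n.var) != 0)]  # drop sizes repeated after rounding
    if (!1 %in% n.var) n.var <- c(n.var, 1)    # finish with a single predictor
  } else {
    # Remove a fixed number of predictors at each step.
    n.var <- seq(from = p, to = 1, by = -abs(step))
  }
  n.var
}

# The default mtry function from the Usage section, applied to each model size.
default_mtry <- function(p) max(1, floor(sqrt(p)))

n.var <- n_var_schedule(100)                       # log scale, step = 0.5
mtry.vals <- vapply(n.var, default_mtry, numeric(1))
print(data.frame(n.var = n.var, mtry = mtry.vals))

# Linear scale: remove 25 predictors at a time.
print(n_var_schedule(100, scale = "linear", step = 25))
```

Because mtry is supplied as a function rather than a fixed number, it can shrink along with the predictor set, so it never exceeds the number of variables remaining.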

Value

A list with the following components:

n.var: vector of the number of variables used at each step
error.cv: corresponding vector of error rates or MSEs at each step
predicted: list of n.var components, each containing the predicted values from the cross-validation
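
As a rough illustration of how these components relate, the sketch below builds a toy stand-in for the returned list and picks the model size with the lowest cross-validated error. The response y and the predictions are invented placeholders, not output from snpRFcv, and computing error.cv as the misclassification rate of each predicted component is an assumption about the classification case.

```r
# Toy response and toy stand-in for the list returned by snpRFcv().
# All values are invented to show the structure; they are not real results.
y <- factor(c("case", "case", "control", "control", "case", "control"))
result <- list(
  n.var = c(4, 2, 1),
  predicted = list(
    factor(c("case", "case", "control", "control", "case", "case"),
           levels = levels(y)),
    y,
    factor(c("control", "control", "control", "case", "case", "control"),
           levels = levels(y))
  )
)

# Assumed relationship: error rate = share of cross-validated predictions
# that disagree with the observed response.
result$error.cv <- sapply(result$predicted, function(pred) mean(pred != y))

# Number of variables giving the smallest cross-validated error.
best.n.var <- result$n.var[which.min(result$error.cv)]
print(best.n.var)
```

When several sizes tie, which.min returns the first, which with n.var in decreasing order is the largest of the tied models; a different tie-breaking rule may be preferable when a sparser model is the goal.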

References

Svetnik, V., Liaw, A., Tong, C. and Wang, T., "Application of Breiman's Random Forest to Modeling Structure-Activity Relationships of Pharmaceutical Molecules", MCS 2004, Roli, F. and Windeatt, T. (Eds.), pp. 334-343.

Examples

set.seed(647)
data(snpRFexample)
result <- snpRFcv(trainx.autosome=autosome.snps,trainx.xchrom=xchrom.snps,
                  trainx.covar=covariates, trainy=phenotype)
with(result, plot(n.var, error.cv, log="x", type="o", lwd=2))

## The following can take a while to run, so if you really want to try
## it, copy and paste the code into R.

## Not run: 
# result <- replicate(5,snpRFcv(trainx.autosome=autosome.snps,
#                               trainx.xchrom=xchrom.snps,
#                               trainx.covar=covariates, trainy=phenotype), 
#                     simplify=FALSE)
# error.cv <- sapply(result, "[[", "error.cv")
# matplot(result[[1]]$n.var, cbind(rowMeans(error.cv), error.cv), type="l",
#         lwd=c(2, rep(1, ncol(error.cv))), col=1, lty=1, log="x",
#         xlab="Number of variables", ylab="CV Error")
## End(Not run)