snpRF implements Breiman's random forest algorithm (based on
Breiman and Cutler's original Fortran code) for classification and
regression. It can also be used in unsupervised mode for assessing
proximities among data points. This is a modified version of the
randomForest function in the randomForest package that addresses
X-chromosome SNP importance bias by simulating the process of
X-inactivation.
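The default values of mtry and sampsize in the usage below depend only on the dimensions of the three predictor inputs, with X-chromosome columns counted at half weight (ncol(x.xchrom)/2). As a rough illustration, the base R sketch below evaluates those default expressions on made-up matrices. The matrix sizes and the names autosome, xchrom and covar are assumptions chosen for the example, not data shipped with the package.

```r
# Hypothetical inputs (sizes are made up for illustration):
# 50 samples, 100 autosomal columns, 40 X-chromosome columns, 4 covariates.
autosome <- matrix(0, nrow = 50, ncol = 100)
xchrom   <- matrix(0, nrow = 50, ncol = 40)
covar    <- matrix(0, nrow = 50, ncol = 4)

# Default mtry as written in the usage: X-chromosome columns count at half weight.
mtry_default <- floor(sqrt(sum(c(ncol(autosome), ncol(xchrom) / 2, ncol(covar)))))

# Default sampsize as written in the usage, with and without replacement.
n <- max(c(nrow(autosome), nrow(xchrom), nrow(covar)))
sampsize_with    <- n
sampsize_without <- ceiling(0.632 * n)

print(c(mtry = mtry_default, with = sampsize_with, without = sampsize_without))
```

With these sizes the sum under the square root is 100 + 20 + 4 = 124, so the sketch only mirrors the arithmetic of the defaults; it does not call snpRF itself.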
snpRF(x.autosome=NULL, x.xchrom=NULL, xchrom.names=NULL, x.covar=NULL, y,
      xtest.autosome=NULL, xtest.xchrom=NULL, xtest.covar=NULL, ytest=NULL,
      ntree=500,
      mtry=floor(sqrt(sum(c(ncol(x.autosome), ncol(x.xchrom)/2,
                            ncol(x.covar))))),
      replace=TRUE, classwt=NULL, cutoff, strata,
      sampsize=if (replace) max(c(nrow(x.autosome), nrow(x.xchrom),
                                  nrow(x.covar)))
               else ceiling(.632*max(c(nrow(x.autosome), nrow(x.xchrom),
                                       nrow(x.covar)))),
      nodesize=1, maxnodes=NULL, importance=FALSE, localImp=FALSE,
      proximity, oob.prox=proximity, norm.votes=TRUE, do.trace=FALSE,
      keep.forest=!is.null(y) && (is.null(xtest.autosome) &
                  is.null(xtest.xchrom) & is.null(xtest.covar)),
      keep.inbag=FALSE, ...)
## S3 method for class 'snpRF'
print(x, ...)
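x (print method): a snpRF object.
y: if omitted, snpRF will run in unsupervised mode.
xtest.autosome: data (like x.autosome) containing
predictors for the test set.
xtest.xchrom: data (like x.xchrom) containing
predictors for the test set.
xtest.covar: data (like x.covar) containing
predictors for the test set.
mtry: by default, the floor of the square root of the number of
variables (columns of x.autosome, half of x.xchrom, and x.covar).
maxnodes: maximum number of terminal nodes a tree can have. If not
given, trees are grown to the maximum possible (subject to limits by
nodesize). If set larger than maximum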
possible, a warning is issued.
localImp: whether casewise importance measures should be computed.
(Setting this to TRUE will override importance.)
norm.votes: if TRUE (default), the final result of votes
is expressed as fractions. If FALSE, raw vote counts are
returned (useful for combining results from different runs).
Ignored for regression.
do.trace: if set to TRUE, give a more verbose output as
snpRF is run. If set to some integer, then running
output is printed for every do.trace trees.
keep.forest: if set to FALSE, the forest will not be
retained in the output object. If xtest is given, defaults
to FALSE.
keep.inbag: should an n by ntree matrix be
returned that keeps track of which samples are "in-bag" in which
trees (but not how many times, if sampling with replacement)?
...: further arguments passed on to snpRF.

An object of class snpRF, which is a list with the
following components:
call: the original call to snpRF.
type: one of classification or unsupervised.
importance: a matrix with nclass + 2 columns. The first
nclass columns are the class-specific measures computed as
mean decrease in accuracy. The (nclass + 1)st column is the
mean decrease in accuracy over all classes. The last column is the
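mean decrease in Gini index.
importanceSD: the standard errors of the permutation-based importance
measure. For classification, a p by nclass + 1
matrix corresponding to the first nclass + 1 columns
of the importance matrix. For regression, a length p vector.
localImp: the casewise importance measures; NULL if localImp=FALSE.
forest: the fitted forest; NULL if
snpRF is run in unsupervised mode or if keep.forest=FALSE.
proximity: if proximity=TRUE when
snpRF is called, a matrix of proximity measures among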
the input (based on the frequency that pairs of data points are in
the same terminal nodes).
test: if a test set is given (through the xtest or additionally
ytest arguments), this component is a list which contains the
corresponding predicted, err.rate, confusion, and
votes for the test set. If proximity=TRUE, there is also a
component, proximity, which contains the proximity among the test
set as well as proximity between test and training data.

Breiman, L. (2002), "Manual On Setting Up, Using, And Understanding Random Forests V3.1", http://oz.berkeley.edu/users/breiman/Using_random_forests_V3.1.pdf.
Jenkins, G., Biernacka, J., Winham, S. Random forest for genetic analysis: Integrating the X chromosome (Abstract #1853). Presented at the 64th Annual Meeting of The American Society of Human Genetics, October 21, 2014, in San Diego, CA.
predict.snpRF, varImpPlot
## Classification:
data(snpRFexample)
set.seed(71)
eg.rf <- snpRF(x.autosome=autosome.snps, x.xchrom=xchrom.snps,
               xchrom.names=xchrom.snps.names, x.covar=covariates,
               y=phenotype, importance=TRUE, proximity=TRUE)
print(eg.rf)
## Look at variable importance:
round(importance(eg.rf), 2)
## Do MDS on 1 - proximity:
eg.mds <- cmdscale(1 - eg.rf$proximity, eig=TRUE)
print(eg.mds$GOF)
## Grow no more than 4 nodes per tree:
(treesize(snpRF(x.autosome=autosome.snps, x.xchrom=xchrom.snps,
                xchrom.names=xchrom.snps.names, x.covar=covariates,
                y=phenotype, maxnodes=4, ntree=30)))
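The proximity component is described above as being based on how often pairs of data points fall in the same terminal node. The base R sketch below illustrates that idea on a small, made-up matrix of terminal-node assignments. The nodes matrix and its values are hypothetical, and this is not the package's internal code; it only shows the counting that the description implies.

```r
# Hypothetical terminal-node assignments: rows are 4 samples, columns are
# 3 trees; each entry is the id of the terminal node the sample lands in.
nodes <- matrix(c(1, 1, 2, 2,
                  3, 3, 3, 4,
                  5, 6, 5, 6),
                nrow = 4, ncol = 3)

n <- nrow(nodes)
prox <- matrix(0, n, n)
for (t in seq_len(ncol(nodes))) {
  # outer() flags every pair of samples that share a terminal node in tree t
  prox <- prox + outer(nodes[, t], nodes[, t], "==")
}
# Fraction of trees in which each pair shares a terminal node
prox <- prox / ncol(nodes)
print(round(prox, 2))

# Classical MDS on 1 - proximity, mirroring the cmdscale() call in the example
mds <- cmdscale(1 - prox, k = 2)
```

Here samples 1 and 2 share a node in two of the three trees, while samples 1 and 4 never do, so their proximities differ accordingly.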