JRF: Joint Random Forest for the simultaneous estimation of multiple related networks

Description

Algorithm for the simultaneous estimation of multiple related networks. This file is a modified version of function RF contained in the R package randomForest.

Usage

JRF(x, y = NULL, xtest = NULL, ytest = NULL, ntree, sampsize,
  totsize = if (replace) ncol(x) else ceiling(0.632 * ncol(x)), mtry = if
  (!is.null(y) && !is.factor(y)) max(floor(nrow(x)/3), 1) else
  floor(sqrt(nrow(x))), replace = TRUE, classwt = NULL, cutoff, strata,
  nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1, maxnodes = NULL,
  importance = FALSE, localImp = FALSE, nPerm = 1, proximity,
  oob.prox = proximity, norm.votes = TRUE, do.trace = FALSE,
  keep.forest = !is.null(y) && is.null(xtest), corr.bias = FALSE,
  keep.inbag = FALSE, nclasses, ...)
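
The default expressions in the signature depend on the shape of x, which has predictors in rows and samples in columns. The following is a minimal sketch that evaluates those defaults by hand on a stand-in matrix. The dimensions and the variable names ending in `_default` are illustrative assumptions, and nothing here calls JRF itself.

```r
# Stand-in inputs: x has C * r rows (predictors) and n columns (samples).
C <- 2; r <- 99; n <- 50
x <- matrix(0, C * r, n)
y <- matrix(0, C, n)   # numeric response, so the regression defaults apply

replace <- TRUE

# Default expressions copied from the Usage signature above.
totsize_default  <- if (replace) ncol(x) else ceiling(0.632 * ncol(x))
mtry_default     <- if (!is.null(y) && !is.factor(y)) max(floor(nrow(x) / 3), 1) else floor(sqrt(nrow(x)))
nodesize_default <- if (!is.null(y) && !is.factor(y)) 5 else 1

totsize_default    # 50
mtry_default       # 66
nodesize_default   # 5
```

With these dimensions the default mtry is floor(198 / 3) = 66, which is larger than the round(sqrt(p - 1)) used in the example further down, so setting mtry explicitly may be worth considering.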

Arguments

x

numeric matrix with C * r rows and n columns, where C is the number of networks to estimate, r the number of predictors and n the maximum sample size across classes. Therefore, rows correspond to predict

y

numeric matrix C by n, where C is the number of networks to estimate and n the maximum sample size across classes. Therefore, rows correspond to response variables for each class and columns correspond t

nclasses

numeric value: the total number of classes C.

ntree

numeric value: number of trees.

sampsize

numeric vector C by 1: number of samples for each class of data.

importance

Should importance of predictors be assessed?

totsize

numeric value: maximum number of samples across classes.

mtry

numeric value: number of predictors to be sampled at each node.

xtest

a data frame or matrix (like x) containing predictors for the test set.

ytest

response for the test set.

replace

Should sampling of cases be done with or without replacement?

classwt

Priors of the classes. Need not add up to one. Ignored for regression.

cutoff

(Classification only) A vector of length equal to the number of classes. The winning class for an observation is the one with the maximum ratio of proportion of votes to cutoff. Default is 1/k, where k is the number of classes.

strata

A (factor) variable that is used for stratified sampling.

nodesize

Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). Note that the default values are different for classification (1) and regression (5).

maxnodes

Maximum number of terminal nodes that trees in the forest can have. If not given, trees are grown to the maximum possible size (subject to the limits set by nodesize). If set larger than the maximum possible, a warning is issued.

localImp

Should casewise importance measure be computed?

nPerm

Number of times the out-of-bag (OOB) data are permuted per tree for assessing variable importance. A number larger than 1 gives a slightly more stable estimate, but is not very effective. Currently only implemented for regression.

proximity

Should proximity measure among the rows be calculated?

oob.prox

Should proximity be calculated only on out-of-bag data?

norm.votes

If TRUE (default), the final votes are expressed as fractions. If FALSE, raw vote counts are returned (useful for combining results from different runs). Ignored for regression.

do.trace

If set to TRUE, give a more verbose output as randomForest is run. If set to some integer, then running output is printed for every do.trace trees.

keep.forest

If set to FALSE, the forest will not be retained in the output object. If xtest is given, defaults to FALSE.

corr.bias

Perform bias correction for regression? Note: Experimental. Use at your own risk.

keep.inbag

Should an n by ntree matrix be returned that keeps track of which samples are in-bag in which trees (but not how many times, if sampling with replacement)?

...

optional parameters to be passed to the low level function
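
The layout of x, y and sampsize described above can be sketched in base R for classes with unequal sample sizes. The helper `build_jrf_inputs` is hypothetical (it is not part of the JRF package), and padding the unused columns with zeros is an assumption carried over from the example below, which initializes both matrices with `matrix(0, ...)`.

```r
# Hypothetical helper, not part of the JRF package: stack per-class data sets
# (each a variables-by-samples matrix) into the x / y layout described above.
build_jrf_inputs <- function(data_list, j) {
  C <- length(data_list)
  p <- nrow(data_list[[1]])
  stopifnot(p >= 2, all(vapply(data_list, nrow, integer(1)) == p))

  sampsize <- vapply(data_list, ncol, integer(1))  # samples per class
  n_max <- max(sampsize)

  x <- matrix(0, C * (p - 1), n_max)  # predictors: every variable except target j
  y <- matrix(0, C, n_max)            # one response row per class

  for (k in seq_len(C)) {
    rows <- seq((k - 1) * (p - 1) + 1, k * (p - 1))
    cols <- seq_len(sampsize[k])
    x[rows, cols] <- data_list[[k]][-j, , drop = FALSE]
    y[k, cols] <- data_list[[k]][j, ]
  }
  list(x = x, y = y, sampsize = sampsize)
}

# Usage with two small stand-in data sets of 8 and 6 samples.
set.seed(1)
d1 <- matrix(rnorm(5 * 8), 5, 8)
d2 <- matrix(rnorm(5 * 6), 5, 6)
inp <- build_jrf_inputs(list(d1, d2), j = 1)
dim(inp$x)     # 8 8
dim(inp$y)     # 2 8
inp$sampsize   # 8 6
```

The returned `sampsize` vector records how many columns of each class block hold real observations, which is the information the sampsize argument is described as carrying.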

Value

An object of class JRF.

References

Petralia, F., Song, WM., Tu, Z. and Wang, P., A New Method for Joint Network Analysis Reveals Common and Different Co-Expression Patterns Among Genes and Proteins in Breast Cancer, submitted

A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2, 18--22.

Examples

library(JRF)

# --- Derive weighted networks via JRF

nclasses=2               # number of data sets / classes
n1<-n2<-50               # sample size for each data sets
p<-100                   # number of variables (genes)

  # --- Generate data sets

data1<-matrix(rnorm(p*n1),p,n1)       # generate data1
data2<-matrix(rnorm(p*n2),p,n2)       # generate data2

  # --- Standardize variables to mean 0 and variance 1

 data1 <- t(apply(data1, 1, function(x) { (x - mean(x)) / sd(x) } ))
 data2 <- t(apply(data2, 1, function(x) { (x - mean(x)) / sd(x) } ))

  # --- Initialize variables

 imp1<-imp2<-matrix(0,p,p)   # matrix to store importance scores
 ntree=1000;                 # number of trees
 nsample<-c(n1,n2)           # vector containing sample size for each class

# --- run JRF for each target gene
for (j in 1:2){   # loop over target genes (first two only here; use 1:p for all genes)

  #--- create matrix (classes by max(n1,n2)) of response variable
  y<-matrix(0,2,max(n1,n2));  
  y[1,seq(1,n1)]<-as.matrix(data1[j,])
  y[2,seq(1,n2)]<-as.matrix(data2[j,])

  x<-matrix(0,p*2-2,max(n1,n2)) #--- matrix of covariates 
  x[seq(1,p-1),seq(1,n1)]<-as.matrix(data1[-j,])
  x[seq(p,2*p-2),seq(1,n2)]<-as.matrix(data2[-j,])

   jrf.out<-JRF(x=x,y=y,mtry=round(sqrt(p-1)),importance=TRUE,
   sampsize=nsample,nclasses=nclasses,ntree=ntree)

   imp1[-j,j]<-importance(jrf.out,scale=FALSE)[seq(1,p-1)]      #- importance for net1
   imp2[-j,j]<-importance(jrf.out,scale=FALSE)[seq(p,(p-1)*2)]  #- importance for net2

}
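
The importance matrices filled in by the loop above are directed: entry [i, j] is the importance of predictor i for target j. One way to turn such a matrix into a ranked list of undirected edges is sketched below in base R. The random matrix `imp` stands in for imp1, and averaging the two directions is an assumption made for illustration, not necessarily how the JRF package scores edges.

```r
# Stand-in for imp1 from the example: p-by-p, zero diagonal,
# entry [i, j] = importance of predictor i for target j.
set.seed(1)
p <- 5
imp <- matrix(runif(p * p), p, p)
diag(imp) <- 0

# Assumption: score each undirected edge by averaging both directions.
sym <- (imp + t(imp)) / 2

# Keep each pair once (upper triangle) and rank by score.
idx <- which(upper.tri(sym), arr.ind = TRUE)
edges <- data.frame(gene1 = idx[, "row"],
                    gene2 = idx[, "col"],
                    score = sym[upper.tri(sym)])
edges <- edges[order(edges$score, decreasing = TRUE), ]
head(edges)
```

Because the example loop only runs over the first two target genes, only the first two columns of imp1 and imp2 are filled; a ranking like this is meaningful once every column has been computed.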