
htree (version 1.0.0)

hrf: Random forest for longitudinal data

Description

Fits a random forest of historical regression trees to longitudinal data.

Usage

hrf(x,
    time=NULL,
    id=NULL,
    yindx,
    ntrees = 100,
    method="freqw",
    mtry=NULL,
    se=FALSE,
    B=100,
    R=10,
    nsplit=NULL,
    nsamp=5,
    historical=TRUE,
    keep_data=TRUE,
    vh=NULL,
    vc=NULL,
    delta=NULL,
    control=list(nmax=10,nodesize=1))

Arguments

x

a data frame containing response and predictors

time

vector of observation times

id

Subject identifier; if NULL, observations are assumed independent

yindx

Name of the response variable, or alternatively its column number in x

ntrees

Number of trees in ensemble

method

Historical summary method: one of "freq", "frac", "mean0", "freqw", "fracw" or "mean0w"

mtry

Number of predictors sampled at each split

se

If TRUE then bootstrap standard errors are computed. The total number of trees fit for the bootstrap is B*R.

B

Only used if se=TRUE; the number of bootstrap samples. Defaults to 100.

R

Only used if se=TRUE; the forest size for each bootstrap sample. Defaults to 10.

nsplit

Number of splits in the regression trees. Defaults to NULL, in which case tree size is determined by requiring each terminal node to contain at least 5 observations.

nsamp

Number of sampled delta values at each historical split attempt; see Details

historical

If TRUE then historical splitting is done, otherwise only standard (i.e. concurrent-predictor) splitting.

keep_data

If TRUE, the training data are returned in the fit object.

vh

Optional vector of indexes giving the historical predictors. Indexes correspond to column numbers of x.

vc

Optional vector of indexes giving concurrent predictors.

delta

A vector of history lags to be used (see Details). Defaults to NULL, in which case all observed lags are available for splitting.

control

A list of control parameters.

Value

Returns a list with elements: trees giving the forest, error the OOB mse (one entry per forest size), pred the OOB predictions, boot the bootstrapped models (when se=TRUE), as well as the arguments of the function call.
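
For example, a minimal sketch of inspecting these elements (treating the rows of the built-in mtcars data as independent, since time and id are omitted):

library(htree)
fit=hrf(x=mtcars,yindx="mpg",ntrees=50)
fit$error[fit$ntrees]   # OOB mse of the full ensemble
head(fit$pred)          # OOB predictions for the training rows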

Details

The hrf function fits a random forest model to longitudinal data. The data are assumed to be of the form \(z_{ij}=(y_{ij},t_{ij},x_{ij})\) for \(i=1,\ldots,n\) and \(j=1,\ldots,n_i\), with \(y_{ij}\) being the response of the \(i\)-th subject at the \(j\)-th observation time \(t_{ij}\), and \(x_{ij}\) the vector of predictors at time \(t_{ij}\). The number of observations can vary across subjects, and the sampling in time can be irregular.
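
In R terms this is a data frame in "long" format, one row per observation time. A toy layout (illustrative only):

# three subjects with irregular, unequal numbers of observation times
df=data.frame(id=c(1,1,1,2,2,3),
	time=c(1,2,5,1,3,2),
	y=c(.2,.5,.1,1,.8,.3),
	x1=c(10,12,11,9,9,14))
# a corresponding call: hrf(x=df,time=df$time,id=df$id,yindx="y")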

hrf estimates a model for the response \(y_{ij}\) using both \((t_{ij},x_{ij})\) (the observations concurrent with \(y_{ij}\)) and all preceding observations of the \(i\)-th subject up to (but not including) time \(t_{ij}\). The model is fit using historical regression (alternatively, classification) trees. Here a predictor is one of two types, either concurrent or historical. The concurrent predictors for \(y_{ij}\) are the elements of the vector \((t_{ij},x_{ij})\), while a historical predictor is the set of all values of a given element of \((y_{ij},t_{ij},x_{ij})\) observed before time \(t_{ij}\) for subject \(i\). In a historical regression tree, node splitting on a concurrent predictor follows the approach in standard regression (classification) trees. For historical predictors the splitting is modified since, associated with each observed response \(y_{ij}\), the number (and observation times) of observations of a historical predictor will vary according to \(i\) and \(j\). For these, the splitting is done by first transforming the preceding values of a predictor using a summary function. This summary is invertible, in the sense that knowledge of it is equivalent to knowledge of the covariate history. Letting \(\bar{z}_{ijk}\) denote the set of historical values of the \(k\)-th element of \(z_{ij}\), the summary function is denoted \(s(\eta;\bar{z}_{ijk})\), where \(\eta\) is the argument vector of the summary function. Node splitting based on a historical predictor is done by solving $$\mbox{argmin}\sum_{(ij)\in Node} (y_{ij}-\mu_L I(s(\eta;\bar{z}_{ijk})<c)-\mu_R I(s(\eta;\bar{z}_{ijk})\geq c))^2,$$ where the minimization is over the vector \((k,\mu_L,\mu_R,c,\eta)\). Each node of a historical regression tree is split using the best split among all possible splits based on both concurrent and historical predictors.
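
For fixed \(k\), \(c\) and \(\eta\), the minimizing \(\mu_L\) and \(\mu_R\) are the mean responses in the two child nodes, so the search reduces to scanning candidate cutpoints \(c\). A minimal sketch of that scan (illustrative only, not the package's internal code), given responses y and summary values s of the observations in a node:

best_split=function(y,s){
	best=list(c=NA,sse=Inf)
	for(cc in sort(unique(s))){
		left=(s<cc)                      # send s < cc to the left child
		if(!any(left)||all(left)) next   # skip splits with an empty child
		sse=sum((y[left]-mean(y[left]))^2)+sum((y[!left]-mean(y[!left]))^2)
		if(sse<best$sse) best=list(c=cc,sse=sse)
	}
	best
}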

Different summary functions are available in hrf, specified by the argument method:

method="freq": $$s(\eta;\bar{z}_{ijk})=\sum_{h: t_{ij}-\eta_1\leq t_{ih}<t_{ij}} I(z_{ihk}<\eta_2)$$

method="frac": $$s(\eta;\bar{z}_{ijk})=\sum_{h: t_{ij}-\eta_1\leq t_{ih}<t_{ij}} I(z_{ihk}<\eta_2)/n_{ij}(\eta)$$

method="mean0": $$s(\eta;\bar{z}_{ijk})=\sum_{h: t_{ij}-\eta_1\leq t_{ih}<t_{ij}} z_{ihk}/n_{ij}(\eta)$$

method="freqw": $$s(\eta;\bar{z}_{ijk})=\sum_{h: t_{ij}-\eta_1\leq t_{ih}<t_{ij}-\eta_2} I(z_{ihk}<\eta_3)$$

method="fracw": $$s(\eta;\bar{z}_{ijk})=\sum_{h: t_{ij}-\eta_1\leq t_{ih}<t_{ij}-\eta_2} I(z_{ihk}<\eta_3)/n_{ij}(\eta)$$

method="mean0w": $$s(\eta;\bar{z}_{ijk})=\sum_{h: t_{ij}-\eta_1\leq t_{ih}<t_{ij}-\eta_2} z_{ihk}/n_{ij}(\eta)$$

Here \(n_{ij}(\eta)\) denotes the number of observations of subject \(i\) in the relevant time window: \([t_{ij}-\eta_1,t_{ij})\) for "freq", "frac" and "mean0", and \([t_{ij}-\eta_1,t_{ij}-\eta_2)\) for the windowed variants. When \(n_{ij}(\eta)=0\), the summary function is set to zero, i.e. \(s(\eta;\bar{z}_{ijk})=0\). The default is method="freqw". The possible values of \(\eta_1\) in the summary function can be set by the argument delta. If not supplied, the set of possible values of \(\eta_1\) is determined by the differences in time between successive observations in the data. When a split is attempted on a historical predictor, a random sample of this set is taken, and the best split among the sampled values is selected. The size of this sample is given by the nsamp argument.
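
As an illustration, the default "freqw" summary for a single observation could be computed as follows (a sketch, not the package internals; z and t are the predictor history and its observation times for subject \(i\), tij the current observation time):

s_freqw=function(z,t,tij,eta){
	w=(t>=tij-eta[1])&(t<tij-eta[2])   # history window [tij-eta1, tij-eta2)
	sum(z[w]<eta[3])                   # count of past values below eta3
}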

Setting se=TRUE performs standard error estimation. The number of bootstrap samples (sampling subjects with replacement) is given by B. For each bootstrap sample a random forest with R trees is built, where R=10 by default. The bias induced by using smaller bootstrap ensemble sizes is corrected for in the estimate. Fitting with se=TRUE influences summaries of the fitted model, providing approximate confidence intervals for partial dependence plots (from partdep_hrf) and standard errors for predictions (from predict_hrf).
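
For example (mirroring the mscm example below; with B=50 and R=10, a total of B*R=500 bootstrap trees are grown):

data(mscm)
mscm=as.data.frame(na.omit(mscm))
ff=hrf(x=mscm,id=mscm$id,time=mscm$day,yindx="illness",se=TRUE,B=50,R=10)
pd=partdep_hrf(ff,xindx="stress")   # approximate C.I. from the bootstrap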

References

L. Breiman (2001). “Random Forests,” Machine Learning 45(1):5-32.

Zhang and Singer (2010). "Recursive Partitioning and Applications," Springer.

Sexton and Laake (2009). "Standard errors for bagged and random forest estimators," Computational Statistics and Data Analysis.

See Also

quantile_hrf, predict_hrf, partdep_hrf, varimp_hrf.

Examples

# ----------------------------------------------------------------------------------------- ##
# Mother's stress on child illness:
# 	Investigate whether mother's stress is (Granger) causal on child illness 
#	The 'hrf' model is fit using previous observations of mother's stress to predict 
#	the child's illness at a given time point, but not mother's stress at that time point.
#
#	Predictor variables are classified into "historical" and "concurrent".
#
#	A predictor is "historical" if its prior realizations can be used to predict 
#	the outcome.
#
#	A predictor is "concurrent" if its realization at the same time point as the outcome
#	can be used to predict the outcome at that time point.
#
#	A predictor can be both "concurrent" and "historical"; the status of the predictors 
#	can be set by the 'vh' and 'vc' arguments of 'hrf' 
#	(if not set, these are determined automatically).
#	 
# ------------------------------------------------------------------------------------------- ##

data(mscm) 
mscm=as.data.frame(na.omit(mscm))

# -- first two weeks
mscm=mscm[mscm$day<=14,]

# -- set concurrent and historical predictors 
historical_predictors=match(c("stress","illness"),names(mscm))
concurrent_predictors=which(names(mscm)!="stress")

# -- fit historical random forest
#	(NOTE: response is 0/1 so a regression tree is equivalent 
#	 to a classification tree with Gini-index splitting) 
ff=hrf(x=mscm,id=mscm$id,time=mscm$day,yindx="illness",
	vh=historical_predictors,vc=concurrent_predictors)

# -- variable importance table
vi=varimp_hrf(ff)
vi


# -- fit historical random forest with 'se=TRUE'
ff=hrf(x=mscm,id=mscm$id,time=mscm$day,yindx="illness",
	vh=historical_predictors,vc=concurrent_predictors,se=TRUE,B=50)

# -- partial dependence with top 4 predictors
par(mfrow=c(2,2))
for(k in 1:4)
	pd=partdep_hrf(ff,xindx=rownames(vi)[k])

par(mfrow=c(1,1))




# -------------------------------------------- ##
# Simulated data from random intercept model
# -------------------------------------------- ##

## -- simulated data -------------------------------------------------- #
p=5;sigma_e=.5;sigma_a=.5;v=rep(1,p);n=500;pnoise=2
random_intercept=rnorm(n,sd=sigma_a^.5)
random_intercept=as.numeric(matrix(random_intercept,nrow=p,ncol=n,byrow=TRUE))
x=random_intercept+rnorm(n*p,sd=sigma_e^.5)
id=sort(rep(1:n,p))
time=rep(1:p,n)
znoise=matrix(rnorm(n*p*pnoise),ncol=pnoise)
xx=as.data.frame(cbind(time,x,znoise))
xx$rfact=as.factor(sample(letters,nrow(xx),replace=TRUE))
## --------------------------------------------------------------------- #

# fit historical random forest
ff=hrf(x=xx,time=time,id=id,yindx=2,ntrees=100)

# plot oob-error 
plot(1:ff$ntrees,ff$error,type="l",col="blue")


# fit forest with standard error estimation
ff=hrf(x=xx,time=time,id=id,yindx=2,ntrees=100,se=TRUE)

# plot partial dependence of response on its past 
pd=partdep_hrf(ff,xindx=2)

# variable importance table
varimp_hrf(ff)




# -------------------------------------------- ##
# Examples with non-longitudinal data
# -------------------------------------------- ##



# -------------------- ##
# Boston Housing data 
# -------------------- ##
library(mlbench)
library(randomForest)

data(BostonHousing)
dat=as.data.frame(na.omit(BostonHousing))


## omitting arguments time/id assumes rows are iid 
h=hrf(x=dat,yindx="medv",ntrees=500)

## randomForest comparison
r=randomForest(medv~.,data=dat)

## plot oob-error for both
plot(1:length(r$mse),r$mse,type="l",ylim=c(min(r$mse,h$error),max(r$mse,h$error)),
	main="BostonHousing",xlab="forest size",ylab="out-of-bag mse")
points(1:length(h$error),h$error,type="l",col="blue")

## -- variable importance table
vi=varimp_hrf(h)
vi

## -- partial dependence plots with approximate 95% C.I.
h=hrf(x=dat,yindx="medv",ntrees=500,se=TRUE)

par(mfrow=c(2,2))
for(k in 1:4)
	pd=partdep_hrf(h,xindx=rownames(vi)[k])

par(mfrow=c(1,1))

# ------------------------------------- ##
# Ionosphere data (classification)  
# ------------------------------------- ##
data(Ionosphere)
dat=as.data.frame(na.omit(Ionosphere))


h=hrf(x=dat,yindx="Class",ntrees=500)
r=randomForest(Class~.,data=dat)

plot(1:nrow(r$err.rate),r$err.rate[,1],type="l",
	ylim=c(min(r$err.rate[,1],h$error),max(r$err.rate[,1],h$error)),
	ylab="mis-classification rate",xlab="forest size",main="Out-of-bag mis-classification")
points(1:length(h$error),h$error,type="l",col="blue")

# -- variable importance table
vi=varimp_hrf(h)


par(mfrow=c(2,2))
for(k in 1:4)
	pd=partdep_hrf(h,xindx=rownames(vi)[k])


par(mfrow=c(1,1))



