htree (version 0.1.1)

hrf: Random forest for longitudinal data

Description

Fits a random forest of historical regression trees to longitudinal data.

Usage

hrf(x,
    time,
    id,
    yindx,
    ntrees = 100,
    mtry=NULL,
    se=FALSE,
    B=100,
    R=10,
    nsplit=NULL,
    nsamp=5,
    rsplit=FALSE,
    historical=TRUE,
    keep_data=TRUE,
    vh=NULL,
    vc=NULL,
    order.cat=FALSE)

Arguments

x

a data frame containing response and predictors

time

vector of observation times

id

subject identifier

yindx

Column number in x corresponding to the response variable

ntrees

Number of trees in ensemble

mtry

Number of predictors sampled at each split

se

If TRUE then bootstrap standard errors are computed. The total number of trees fit for the bootstrap is B*R.

B

Only used if se==TRUE; the number of bootstrap samples. Defaults to 100.

R

Only used if se==TRUE; the forest size for each bootstrap sample. Defaults to R=10.

nsplit

Number of splits in the regression trees. Defaults to NULL, in which case tree size is determined by requiring each terminal node to contain at least 5 observations.

nsamp

Number of sampled delta values, see below

rsplit

If TRUE, randomized historical splitting is done. Defaults to FALSE.

historical

If TRUE, historical splitting is done; otherwise only standard (i.e. concurrent predictor) splitting.

keep_data

If TRUE, the training data are returned in the fit object.

vh

Optional vector of indexes giving the historical predictors. Indexes correspond to column numbers of x.

vc

Optional vector of indexes giving concurrent predictors.

order.cat

If TRUE, categorical predictors are converted into ordinal predictors by ordering the categories according to their response means; otherwise the categories are ordered by their order of appearance in the data (the default).

Value

Returns a list with elements: trees giving the forest, error the out-of-bag mean squared error, pred the out-of-bag predictions, boot giving the bootstrapped models (when se=TRUE), as well as the arguments of the function call.

Details

The hrf function fits a random forest model to longitudinal data. In particular, it considers data of the form: \((y_{ij},t_{ij},x_{ij})\) for \(i=1,..,n\) and \(j=1,..,n_i\). Here \(y_{ij}\) is the response for the \(i\)-th subject at the \(j\)-th observation time \(t_{ij}\). The predictors at time \(t_{ij}\) are \(x_{ij}\). The number of observations can vary across subjects, and the sampling in time can be irregular (vary in intensity both within and across subjects).

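For concreteness, a minimal sketch of the data layout hrf expects: one row per observation, with separate id and time vectors aligned to the rows of x. The data and column names below are made up for illustration.

# two subjects with irregular observation times; column 1 of d ("y")
# will serve as the response
d=data.frame(y=c(0.3,1.2,0.8,2.1,1.7),x1=c(10,12,11,9,8))
id=c(1,1,1,2,2)      # subject identifiers
time=c(1,2,5,1,4)    # observation times t_ij, irregular within subject
# fit=hrf(x=d,time=time,id=id,yindx=1)  # column 1 of d is the response
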
The novelty of hrf is that it can construct a predictive model for the response \(y_{ij}\) that makes use of \((t_{ij},x_{ij})\) (the observations concurrent with \(y_{ij}\)) as well as all preceding observations of the \(i\)-th subject up to time \(t_{ij}\). This is accomplished using a modification of the standard tree-growing algorithm. The modification allows nodes to be split based on all previous observations of the subject corresponding to each observation in the node. This is done by creating (invertible) summaries of the observation history. In particular, for the \(k\)-th predictor the summary is $$n_{ij}(\delta,\tau,k)=\sum_{h: t_{ij}-\delta\leq t_{ih}<t_{ij}} I(x_{ihk}<\tau).$$ Knowledge of \(n_{ij}(\delta,\tau,k)\) for all \(\delta\) and \(\tau\) is equivalent to knowing all values of predictor \(k\) up to time \(t_{ij}\).

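As a concrete illustration of this summary, the following sketch computes \(n_{ij}(\delta,\tau,k)\) directly for a single observation; it is a naive reference implementation, not the package's internal code.

# count earlier observations of the same subject within the window
# [t_ij - delta, t_ij) whose k-th predictor value is below tau
n_ij=function(x,time,id,i,delta,tau,k){
  h=(id==id[i]) & (time>=time[i]-delta) & (time<time[i])
  sum(x[h,k]<tau)
}
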
A node in a regression tree can be split using the \(n_{ij}()\) summaries. In particular, a potential historical split can be determined as $$\mbox{argmin}\sum_{(ij)\in Node} (y_{ij}-\mu_L I(n_{ij}(\delta,\tau,k)<c)-\mu_R I(n_{ij}(\delta,\tau,k)\geq c))^2$$

where the minimization is over \(\delta,\tau,\mu_L,\mu_R,c,k\). For a fixed value of \(\delta\), the minimization is solved using a single pass through the vector of historical values of predictor \(k\). The hrf function samples nsamp values of \(\delta\). The best split among all possible historical splits, as well as concurrent splits (i.e. those determined by splitting on measurements made concurrently with the observations in the node), is chosen as the partition. Proceeding recursively, the historical regression tree is built. The random forest model is fit using historical regression trees rather than standard regression trees.

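A simplified sketch of scoring candidate historical splits for a fixed \(\delta\), \(\tau\) and predictor \(k\): given the summary values of the observations in a node, each cutpoint \(c\) is scored by the resulting sum of squared errors (the node means are the minimizing \(\mu_L,\mu_R\)). This illustrates the criterion only; the actual search is the single-pass algorithm described above.

# sse of splitting node responses y at cutpoint c on summary values nvals
split_sse=function(y,nvals,c){
  left=nvals<c
  sse=0
  if(any(left))  sse=sse+sum((y[left]-mean(y[left]))^2)
  if(any(!left)) sse=sse+sum((y[!left]-mean(y[!left]))^2)
  sse
}
# best cutpoint among the observed summary values
best_c=function(y,nvals){
  cand=sort(unique(nvals))
  cand[which.min(sapply(cand,function(c) split_sse(y,nvals,c)))]
}
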
Setting se=TRUE performs standard error estimation. The number of bootstrap samples (sampling subjects with replacement) is determined by B. For each bootstrap sample a random forest with R trees is built (defaulting to R=10). The bias induced by using smaller bootstrap ensemble sizes is corrected for as described in Sexton and Laake (2009). Setting se=TRUE influences summaries from the fitted model, providing approximate confidence intervals for partial dependence plots (when running partdep_hrf) and standard errors for predictions when predict_hrf is used.

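In simplified form, the idea behind the correction is that with only R trees per bootstrap sample, the naive variance across bootstrap predictions is inflated by Monte Carlo noise of order 1/R, which can be estimated and subtracted. The sketch below assumes a B x R matrix of per-tree predictions at a fixed point; it is purely illustrative, not the package's internal code.

# row b holds the R tree predictions from bootstrap sample b
noisy_boot_var=function(preds){
  boot_means=rowMeans(preds)                     # per-sample forest predictions
  mc_noise=mean(apply(preds,1,var))/ncol(preds)  # estimated Monte Carlo noise
  max(var(boot_means)-mc_noise,0)                # bias-corrected variance
}
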
References

L. Breiman (2001). “Random Forests,” Machine Learning 45(1):5-32.

J. Sexton and P. Laake (2009). “Standard errors for bagged and random forest estimators,” Computational Statistics and Data Analysis 53(3):801-811.

See Also

quantile_hrf, predict_hrf, partdep_hrf, varimp_hrf.

Examples

# load the mscm data (mothers' stress and children's morbidity study)
data(mscm)
mscm=na.omit(mscm)

# fit a historical random forest: columns 2 and 3 of x are used as
# historical predictors (vh), the remaining columns as concurrent (vc)
ff=hrf(x=mscm[,-1],id=mscm$id,time=mscm$day,yindx=3,ntrees=100,nsamp=5,vh=c(2,3),vc=c(1,4:14))

# plot out-of-bag mean squared error by number of trees
plot(1:length(ff$error),ff$error,type="l",col="blue")


# -- random intercept example --- #

# simulate from a random intercept model: n subjects, p observations
# each, with pnoise pure-noise predictors added
p=5;sigma_e=.5;sigma_a=.5;n=500;pnoise=2
# subject-specific intercepts, replicated across each subject's p observations
random_intercept=rnorm(n,sd=sigma_a^.5)
random_intercept=as.numeric(matrix(random_intercept,nrow=p,ncol=n,byrow=TRUE))
x=random_intercept+rnorm(n*p,sd=sigma_e^.5)
id=sort(rep(1:n,p))   # subject identifiers
time=rep(1:p,n)       # observation times 1,...,p for each subject
znoise=matrix(rnorm(n*p*pnoise),ncol=pnoise)
xx=cbind(time,x,znoise)

# fit historical random forest
ff=hrf(x=xx,time=time,id=id,yindx=2,ntrees=100,mtry=4,nsamp=5)

# plot oob-error 
plot(1:ff$ntrees,ff$error,type="l",col="blue")


# fit forest, also do noisy-bootstrap for standard error estimates
ff=hrf(x=xx,time=time,id=id,yindx=2,ntrees=100,mtry=4,nsamp=5,se=TRUE)

# plot partial dependence of response on its past 
pd=partdep_hrf(ff,xindx=2,ngrid=25,subsample=.1)

# variable importance z-scores (barplot)
barplot(varimp_hrf(ff,nperm=20),main="Importance z-scores")

