
Fits a boosted ensemble of historical regression trees to longitudinal data.
htb(x,
time=NULL,
id=NULL,
yindx,
ntrees = 100,
method="freq",
nsplit=1,
lambda=.05,
family="gaussian",
cv.fold=0,
cv.rep=NULL,
nsamp=5,
historical=TRUE,
vh=NULL,
vc=NULL,
delta=NULL,
control=list())
x: A data frame containing the response and predictors.
time: A vector of length nrow(x) of observation times.
id: A vector of subject identifiers (length equal to nrow(x)); if NULL, observations are assumed independent.
yindx: Name of the response variable, or alternatively its column number in x.
ntrees: Number of trees in the ensemble.
method: Historical summary method; one of "freq", "frac", "mean0", "freqw", "fracw" and "mean0w".
nsplit: Number of splits in each regression tree.
lambda: Shrinkage parameter applied to each tree.
family: Either "gaussian" (default) or "bernoulli".
cv.fold: Number of cross-validation folds; if cv.fold<=1, no cross-validation is run.
cv.rep: Number of times to repeat the cross-validation; if not given, set to cv.fold.
historical: If TRUE, historical splitting is done; else standard splitting.
nsamp: Number of sampled delta values; see below.
vh: Optional vector of indexes giving the historical predictors.
vc: Optional vector of indexes giving strictly concurrent effects.
delta: A vector of history lags to be used (see below); defaults to NULL, in which case all possible observed lags are available for splitting.
control: A list of control parameters (see below). All arguments, except those describing the data, can also be set in control; arguments in control are used if both are given.
Returns a list whose most important elements are: trees, giving the tree ensemble; cv (if cv.fold>1), the cross-validation model estimates; cv_error, the cross-validated error (mean-squared error when family="gaussian", negative log-likelihood when family="bernoulli"); control, a list controlling the fit; and x, id, time, giving the training data.
The htb function fits a boosted tree ensemble to longitudinal data. Data are assumed to be of the form z_ij = (y_ij, t_ij, x_ij), for subjects i = 1, ..., n and within-subject observations j = 1, ..., n_i, where y_ij is the response of subject i at observation time t_ij and x_ij is the corresponding predictor vector.
htb estimates a model for the response using historical regression (alt. classification) trees. Here a predictor is one of two types, either concurrent or historical. The concurrent predictors for y_ij are the current values (t_ij, x_ij), and splitting on a concurrent predictor follows the approach in standard regression (classification) trees. A historical predictor for y_ij consists of all values of a given variable observed for subject i prior to time t_ij. For historical predictors the splitting is modified since, associated with each observed response, the number (and timing) of prior observations of a historical predictor varies from response to response. Such predictors are split by first reducing the covariate history to a scalar with a summary function, and then splitting on the value of that summary. The summary is invertible, in the sense that knowledge of it (across all its arguments) is equivalent to knowledge of the covariate history. At each node, the historical regression tree is split using the best split among all splits of concurrent and historical predictors.
Different summary functions are available in htb, specified by the argument method. Setting method="freq" summarizes the history of a predictor z by the count of its prior observations falling below a threshold tau within a lag eta of the response time, i.e. the number of observations z_ih with t_ij - eta <= t_ih < t_ij and z_ih < tau. method="frac" divides this count by the number of observations in the lag window, giving the fraction below tau. method="mean0" is the mean of the observations in the lag window, defined as 0 when the window is empty. The variants method="freqw", method="fracw" and method="mean0w" compute the same summaries over a two-sided window t_ij - eta_1 <= t_ih < t_ij - eta_2, so that both window endpoints are searched during splitting. The default is method="freqw". The possible values of the lags (eta, or eta_1 and eta_2) are controlled by the delta argument; if delta is not supplied, candidate lags are taken from the observed data, with the number sampled at each split governed by the nsamp argument. See below on control for further arguments governing the historical splitting.
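As an illustration of the basic summaries, they can be sketched in plain base R (an illustrative sketch with invented names hist_freq, hist_frac and hist_mean0; this is not the htree internal implementation, and the windowed variants would simply replace the window below with a two-sided one):

```r
# Observations of one predictor for one subject, restricted to the lag
# window [t0 - delta, t0) preceding the response time t0.
in_window <- function(t, t0, delta) (t < t0) & (t >= t0 - delta)

# method="freq": count of windowed prior values below threshold tau
hist_freq <- function(z, t, t0, delta, tau) {
  w <- in_window(t, t0, delta)
  sum(z[w] < tau)
}

# method="frac": fraction of windowed prior values below tau
hist_frac <- function(z, t, t0, delta, tau) {
  w <- in_window(t, t0, delta)
  if (!any(w)) return(0)
  mean(z[w] < tau)
}

# method="mean0": mean of windowed prior values, defined as 0 when empty
hist_mean0 <- function(z, t, t0, delta) {
  w <- in_window(t, t0, delta)
  if (!any(w)) return(0)
  mean(z[w])
}

# toy history: values z observed at times t
z <- c(2, 5, 1, 4)
t <- c(1, 2, 3, 4)
hist_freq(z, t, t0 = 5, delta = 3, tau = 4.5)  # window holds 5, 1, 4; two below 4.5
hist_frac(z, t, t0 = 5, delta = 3, tau = 4.5)  # 2/3
hist_mean0(z, t, t0 = 5, delta = 3)            # 10/3
```

Splitting a historical predictor then amounts to searching over (delta, tau) pairs and a cutpoint on the resulting summary value.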
When cv.fold>1, cross-validation is performed. In subsequent summaries of the model, such as the partial dependence plots from partdep_htb,
these cross-validation model fits are used to estimate standard errors. This is done using the delete-d jackknife estimator (where deletion refers to
subjects, rather than rows of the training data). Each cross-validation model fit is performed by removing a random sample of 1/cv.fold
of the subjects. The number of cross-validation runs is determined by cv.rep, which defaults to the value of cv.fold.
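The delete-d jackknife behind these standard errors can be sketched in base R. This is an illustrative toy, not the package's code: the subject ids, the response y, and the use of a simple mean as the model summary are all invented for illustration.

```r
# Illustrative delete-d jackknife standard error over subjects
set.seed(1)

ids <- rep(1:20, each = 5)      # 20 subjects, 5 rows each
y   <- rnorm(length(ids))

n      <- length(unique(ids))   # number of subjects
cvfold <- 10
d      <- floor(n / cvfold)     # subjects deleted per replicate
nfold  <- 25                    # number of jackknife replicates

theta_hat <- mean(y)            # full-data estimate (stand-in for, e.g., a
                                # partial-dependence value from the model)

theta_rep <- replicate(nfold, {
  drop <- sample(unique(ids), d)    # delete a random 1/cvfold of the subjects
  mean(y[!(ids %in% drop)])         # recompute the statistic on the rest
})

# delete-d jackknife variance (Shao & Wu form, with random subsets)
jack_var <- ((n - d) / d) * mean((theta_rep - mean(theta_rep))^2)
se <- sqrt(jack_var)
```

In htree, each replicate is a full model refit on the retained subjects, so increasing the number of replicates (control$nfold) reduces noise in the standard-error estimate at the cost of more fitting.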
All arguments (except those describing the data: x, yindx, time and id) can be set in the control
list. The arguments supplied in control are used if both are supplied, so if ntrees=300 and control=list(ntrees=500)
then 500 trees are fit. Besides the arguments described above, a number of other parameters can be set in control. These are: nodesize,
giving the minimum number of training observations in a terminal node;
sample_fraction, giving the fraction of data sampled to train each tree; dtry,
the number of sampled delta values used when splitting a variable
based on a covariate history (note this is alternatively controlled by the argument nsamp above);
ndelta, the number of delta values to use if delta is not supplied, these being taken as quantiles from the
distribution of observed delta values in the data; qtry,
the number of sampled threshold (tau) values used when method=freq/frac or method=freqw/fracw;
quantiles, a vector of probabilities, used when method=frac
or method=fracw, i.e. when covariate histories are split on their windowed empirical distributions (splitting is restricted to these quantiles). The fold-size for cross-validation can be set by control$cvfold,
and the number of repetitions by control$nfold.
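The precedence rule for control can be illustrated with a small base-R sketch (htb_args is an invented helper, not part of htree; it mimics how values in control override directly supplied arguments):

```r
# Sketch of the control-list precedence: entries present in 'control'
# replace the corresponding directly supplied arguments.
htb_args <- function(ntrees = 100, lambda = 0.05, nsplit = 1, control = list()) {
  args <- list(ntrees = ntrees, lambda = lambda, nsplit = nsplit)
  modifyList(args, control)   # control entries win where both are given
}

htb_args(ntrees = 300, control = list(ntrees = 500))$ntrees  # 500: control wins
htb_args(lambda = 0.1)$lambda                                # 0.1: no override
```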
J.H. Friedman (2001). “Greedy Function Approximation: A Gradient Boosting Machine,” Annals of Statistics 29(5):1189-1232.
J.H. Friedman (2002). “Stochastic Gradient Boosting,” Computational Statistics and Data Analysis 38(4):367-378.
# ----------------------------------------------------------------------------------------- ##
# Mother's stress on child illness:
# Investigate whether mother's stress is (Granger) causal of child illness
# 'htb' model is fit using previous observations of mother's stress to predict
# child's illness at given time point, but not mother's stress at that time point
#
# Predictor variables are classified into "historical" and "concurrent"
#
# A predictor is "historical" if its prior realizations can be used to predict
# the outcome.
#
# A predictor is "concurrent" if its realization at the same timepoint as the outcome
# can be used to predict the outcome at that timepoint
#
# A predictor can be both "concurrent" and "historical", the status of the predictors
# can be set by the 'vh' and 'vc' arguments of 'htb'.
# (if not set these are automatically determined)
#
# ------------------------------------------------------------------------------------------- ##
data(mscm)
mscm=as.data.frame(na.omit(mscm))
# -- set concurrent and historical predictors
historical_predictors=match(c("illness","stress"),names(mscm))
concurrent_predictors=which(names(mscm)!="stress")
control=list(vh=historical_predictors,vc=concurrent_predictors,
ntrees=200,family="bernoulli",cvfold=10,lambda=.1)
# logistic regression
ff=htb(x=mscm,id=mscm$id,time=mscm$day,yindx="illness",control=control)
# cross-validated negative log-likelihood
plot(1:ff$ntrees,ff$cv_error,type="l",
	xlab="iterations",ylab="cross-validation error")
# -- variable importance table
vi=varimp_htb(ff)
vi
# -- plot partial dependence (with +/-2 approximate standard errors)
par(mfrow=c(2,2))
for(k in 1:4)
partdep_htb(ff,xindx=as.character(vi$Predictor[k]))
par(mfrow=c(1,1))
# -- Standard errors are based on cross-validation runs (using delete-d
# (subjects) jackknife estimator, d="number-of-subjects"/cvfold)
# -- increasing nfold (which defaults to equal cvfold) gives less
# noisy standard error estimates ...
control$nfold=50
ff=htb(x=mscm,id=mscm$id,time=mscm$day,yindx="illness",control=control)
par(mfrow=c(2,2))
for(k in 1:4)
partdep_htb(ff,xindx=as.character(vi$Predictor[k]))
par(mfrow=c(1,1))
# -------------------------------- #
# Data w/irregular observation times
# ------------------------------- #
data(cd4)
control=list(cvfold=10,lambda=.1,nsplit=3,ntrees=200)
ff=htb(x=cd4,id=cd4$id,time=cd4$time,yindx="count",control=control)
vi=varimp_htb(ff)
# -- partial dependence for top 4 predictors (with +/-2 SE estimates)
par(mfrow=c(2,2))
for(k in 1:4)
pd=partdep_htb(ff,xindx=as.character(vi$Predictor[k]))
par(mfrow=c(1,1))
# -- cv error
plot(1:ff$control$ntrees,ff$cv_error,xlab="boosting iteration",
ylab="cross-validated mse",type="l")
# estimated boosting iteration
abline(v=ff$nboost_cv,lty=2)
## by default, the number of delta values (parameter 'eta_1' above) is 20
## can set this using 'ndelta'
control$ndelta=50
ff=htb(x=cd4,id=cd4$id,time=cd4$time,yindx="count",control=control)
points(1:ff$control$ntrees,ff$cv_error,type="l",lty=2)
## note: differences in cv_error can be due (in part) to randomness
## in out-of-sample selection
# the grid of delta values
ff$control$delta
# ------------------------------------------ #
# Boston Housing data (not longitudinal)
# ------------------------------------------ #
# library(htree)
library(mlbench)
data(BostonHousing)
dat=as.data.frame(na.omit(BostonHousing))
# omitting arguments 'time' and 'id' assumes rows are iid
control=list(cvfold=10,ntrees=500)
h=htb(x=dat,yindx="medv",control=control)
# -- plot cross-validated Mean-squared error --- #
plot(1:h$ntrees,h$cv_error,type="l",xlab="Boosting iterations",
ylab="MSE",main="Cross-validated Mean-squared error")
# -- variable importance table
vi=varimp_htb(h,nperm=20)
vi
# -- partial dependence of top 4 predictors with +/-2 S.E.
par(mfrow=c(2,2))
for(k in 1:4)
partdep_htb(h,xindx=as.character(vi$Predictor[k]))
par(mfrow=c(1,1))