varimp_hrf: Variable importance

Description

Z-score variable importance for hrf and htb

Usage

varimp_hrf(object,nperm=20,parallel=TRUE)
varimp_htb(object,nperm=20)

Arguments

object

The list returned by hrf or htb.

nperm

Number of permutations.

parallel

If TRUE, run in parallel (varimp_hrf only).

Value

A data.frame with the following columns: Predictor, the predictor being marginalized; Marginalized error, the prediction error of the model with Predictor marginalized out; Model error, the prediction error of the original model; Relative change, the relative change in prediction error due to marginalization; Z-value, the Z value from the test comparing the prediction errors of the original and marginalized models.

Details

To measure the importance of a predictor, varimp_hrf and varimp_htb compare the prediction error of the estimated model with the prediction error obtained after integrating the predictor out of the model. If \(F\) denotes the estimated model, the model obtained by integrating out predictor \(k\) is \(F_k(x)=\int F(x)\,dP(x_k)\), where \(P(x_k)\) is the marginal distribution of \(x_k\). In practice, the integral is approximated by averaging multiple predictions from \(F\), each obtained using a random permutation of the observed values of \(x_k\). The number of permutations is set by nperm.

Letting \(L(y,\hat{y})\) be the loss of predicting \(y\) with \(\hat{y}\), the differences \(w_i=L(y_i,F_k(x_i))-L(y_i,F(x_i))\), for \(i=1,\ldots,n\), give the change in prediction error between the marginalized and the original model. The z-score \(z=\mathrm{mean}(w_i)/\mathrm{se}(w_i)\) corresponds to a paired test of the equality of the two prediction errors; under that null hypothesis it is approximately distributed as N(0,1). Large positive z-scores indicate that the prediction error increases when \(x_k\) is marginalized out, and thus that \(x_k\) is useful. Large negative z-scores indicate that the integrated model is more accurate.

For longitudinal data, the \(w_i\) are computed by averaging across all observations from the i-th subject. For htb, the prediction error is calculated from the cross-validation model estimates; for hrf, out-of-bag predictions are used.
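The permutation scheme described above can be illustrated with a small base-R sketch. This is not the code used by varimp_hrf or varimp_htb: the helper perm_zscore is a hypothetical name introduced here, the model is an ordinary lm fit on the built-in mtcars data, the loss is assumed to be squared error, and in-sample predictions stand in for the out-of-bag or cross-validation predictions that the package uses.

```r
# Illustrative sketch only (base R); not the htree implementation.
set.seed(1)

# Hypothetical helper: z-score importance of predictor `k` for a fitted model.
perm_zscore <- function(fit, data, y, k, nperm = 20,
                        loss = function(y, yhat) (y - yhat)^2) {
  # Predictions from the original model F
  pred_full <- predict(fit, newdata = data)

  # Approximate F_k by averaging predictions over random permutations of x_k
  pred_perm <- rowMeans(sapply(seq_len(nperm), function(b) {
    d <- data
    d[[k]] <- sample(d[[k]])
    predict(fit, newdata = d)
  }))

  # w_i = L(y_i, F_k(x_i)) - L(y_i, F(x_i))
  w <- loss(y, pred_perm) - loss(y, pred_full)

  # z = mean(w) / se(w), taking se(w) = sd(w) / sqrt(n)
  c(marginalized_error = mean(loss(y, pred_perm)),
    model_error        = mean(loss(y, pred_full)),
    z                  = mean(w) / (sd(w) / sqrt(length(w))))
}

# In-sample predictions are an assumption made to keep the sketch short.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
res <- t(sapply(c("wt", "hp", "qsec"), function(k)
  perm_zscore(fit, mtcars, mtcars$mpg, k)))
res
```

Each row of res corresponds to one predictor; a large positive z suggests that permuting that predictor degrades the fit. For longitudinal data, the w values would first be averaged within subject before forming the z-score, as noted above.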

References

L. Breiman (2001). “Random Forests,” Machine Learning 45(1):5-32.

Examples


# NOT RUN {
data(mscm) 
mscm=as.data.frame(na.omit(mscm))


# -- set concurrent and historical predictors 
historical_predictors=match(c("stress","illness"),names(mscm))
concurrent_predictors=which(names(mscm)!="stress")
control=list(vh=historical_predictors,vc=concurrent_predictors,nodesize=20)

## -- fit model
ff=hrf(x=mscm,id=mscm$id,time=mscm$day,yindx="illness",control=control)

# -- variable importance table
vi=varimp_hrf(ff)
vi


## same with htb

control=list(vh=historical_predictors,vc=concurrent_predictors,
	lambda=.1,ntrees=200,nsplit=3,family="bernoulli")
control$cvfold=10 ## need cross-validation runs to run varimp_htb
ff=htb(x=mscm,id=mscm$id,time=mscm$day,yindx="illness",control=control)

# -- variable importance table
vi=varimp_htb(ff)
vi




# --------------------------------------------------------------------------------------------- ##
# Boston Housing data 
#	Comparison of Z-score variable importance with coefficient Z-scores from linear model
# --------------------------------------------------------------------------------------------- ##

# Boston Housing data 
library(mlbench)
data(BostonHousing)
dat=as.data.frame(na.omit(BostonHousing))
dat$chas=as.numeric(dat$chas)

# -- random forest 
h=hrf(x=dat,yindx="medv")


# -- tree boosting
hb=htb(x=dat,yindx="medv",ntrees=1000,cv.fold=10,nsplit=3)


# -- Comparison of variable importance Z-scores and Z-scores from linear model 
vi=varimp_hrf(h)
vb=varimp_htb(hb)
dvi=data.frame(var=as.character(vi$Predictor),Z_hrf=vi$Z)
dvb=data.frame(var=as.character(vb$Predictor),Z_htb=vb$Z)

dlm=summary(lm(medv~.,dat))$coefficients
dlm=data.frame(var=rownames(dlm),Z_lm=round(abs(dlm[,3]),3))
dlm=merge(dlm[-1,],dvi,by="var",all.x=TRUE)

# -- Z-scores of lm, hrf and htb for predictor variables 
merge(dlm,dvb,by="var",all.x=TRUE)



# }
