threeboost: Thresholded EEBoost

Description

Run the thresholded EEBoost procedure.

Usage

threeboost(Y, X, EE.fn, b.init = rep(0, ncol(X)), eps = 0.01, maxit = 1000, itertrack = FALSE, reportinterval = 1, stop.rule = "on.repeat", thresh = 1)

Arguments

Y

Vector of outcomes.

X

Matrix of predictors. Will be automatically scaled using the scale function.

EE.fn

Estimating function taking arguments Y, X, and parameter vector b.

b.init

Initial parameter values. For variable selection, typically start with a vector of zeroes (the default).

eps

Step length. Default is 0.01; the value should be relatively small.

maxit

Maximum number of iterations. Default is 1000.

itertrack

Indicates whether or not diagnostic information should be printed out at each iteration. Default is FALSE.

reportinterval

If itertrack is TRUE, how many iterations the algorithm should wait between each diagnostic report.

stop.rule

Rule for stopping the iterations before maxit is reached. Possible values are "on.repeat" and "pct.change". See 'Details' for more information.

thresh

Threshold parameter for ThrEEBoost. The default, 1, is equivalent to EEBoost.
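
As an illustration of the EE.fn argument, the following is a minimal sketch of a user-supplied estimating function with the (Y, X, b) signature. The name EE.fn.ls and the demo objects are hypothetical, and the sign and scaling convention that threeboost expects is an assumption here; compare against ee.GEE before relying on it.

```r
# Hypothetical least-squares estimating function with the (Y, X, b) signature.
# Assumption: one value is returned per column of X; the sign and scaling
# expected by threeboost are not documented on this page.
EE.fn.ls <- function(Y, X, b) {
  resid <- as.vector(Y) - as.vector(X %*% b)  # residuals at the current b
  as.vector(crossprod(X, resid)) / nrow(X)    # averaged score, length ncol(X)
}

# Self-contained check on simulated data
set.seed(1)
X.demo <- scale(matrix(rnorm(40 * 3), nrow = 40))
Y.demo <- as.vector(X.demo %*% c(1, 0, -1)) + rnorm(40, sd = 0.1)

# Value at the zero start vector and at the least-squares solution
ee.at.zero <- EE.fn.ls(Y.demo, X.demo, rep(0, 3))
b.ols <- as.vector(solve(crossprod(X.demo), crossprod(X.demo, Y.demo)))
ee.at.ols <- EE.fn.ls(Y.demo, X.demo, b.ols)
```

The function evaluates to (numerically) zero at the least-squares solution, which is the property an estimating function should have at its root.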

Value

A matrix with maxit rows and ncol(X) columns, with each row containing the parameter vector from an iteration of ThrEEBoost.

Details

threeboost implements a thresholded version of the EEBoost algorithm described in Wolfson (2011, JASA). EEBoost is a general-purpose method for variable selection which can be applied whenever inference would be based on an estimating equation. The package currently implements variable selection based on the Generalized Estimating Equations, but can also accommodate user-provided estimating functions. Thresholded EEBoost is a generalization which allows multiple variables to enter the model at each boosting step. Thresholded EEBoost with thresh = 1 is equivalent to EEBoost.
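
The thresholding idea can be sketched as a single update step. This is an illustrative sketch, not the package's code: the helper thresh_step is hypothetical, and it assumes that a coordinate is updated when the magnitude of its estimating-function element is at least thresh times the largest magnitude, and that each selected coefficient moves by eps in the direction of that element's sign.

```r
# Hypothetical helper: one thresholded boosting update.
# Assumptions: 'ee' is the estimating-function vector at the current b, and
# coefficient j moves in the direction sign(ee[j]); the package may use a
# different sign convention.
thresh_step <- function(b, ee, eps = 0.01, thresh = 1) {
  mag <- abs(ee)
  active <- which(mag >= thresh * max(mag))  # coordinates passing the threshold
  b[active] <- b[active] + eps * sign(ee[active])
  b
}

b0 <- rep(0, 4)
ee <- c(0.9, -1.0, 0.2, 0.95)

# thresh = 1: only the largest-magnitude coordinate moves (EEBoost-like step)
step.eeboost <- thresh_step(b0, ee, thresh = 1)

# thresh = 0.8: every coordinate within 80% of the maximum moves
step.thr <- thresh_step(b0, ee, thresh = 0.8)
```

With thresh = 1 a single coefficient changes per step; lowering thresh lets several coefficients change at once.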

Typically, the boosting procedure is run for maxit iterations, producing maxit models, each defined by a set of regression coefficients. An additional step (e.g. model scoring, cross-validated estimate of prediction error) is needed to select a final model. However, an alternative is to stop the iterations before maxit is reached. The user can request this feature by setting stop.rule to one of the following options:

"on.repeat": Sometimes, ThrEEBoost will alternate between stepping on the same two directions, usually indicating numerical problems. Setting stop.rule="on.repeat" will terminate the algorithm if this happens.
"pct.change": Stop if, for consecutive iterations, the sum of the magnitudes of the elements of the estimating equation changes by < 1%.

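The two stopping rules can be sketched as simple checks. Both helpers below are hypothetical and reflect one possible reading of the descriptions above: the relative-change check treats "changes by < 1%" as a relative change in the summed magnitudes, and the repeat check treats alternation as the coefficient vector returning to its value from two iterations earlier. The package's internal tests may differ.

```r
# Hypothetical check for the "pct.change" rule: relative change in the sum of
# absolute estimating-function values between consecutive iterations.
pct_change_stop <- function(ee.prev, ee.curr, tol = 0.01) {
  s.prev <- sum(abs(ee.prev))
  s.curr <- sum(abs(ee.curr))
  if (s.prev == 0) return(TRUE)  # already at a root; nothing left to change
  abs(s.curr - s.prev) / s.prev < tol
}

# Hypothetical check for the "on.repeat" rule: the coefficient vector is back
# where it was two iterations ago (assumed interpretation of "alternating").
on_repeat_stop <- function(b.two.back, b.curr) {
  isTRUE(all.equal(b.two.back, b.curr))
}

stop.small <- pct_change_stop(c(1, -1), c(1.001, -1))   # 0.05% change
stop.large <- pct_change_stop(c(1, -1), c(0.5, -1))     # 25% change
rep.yes <- on_repeat_stop(c(0, 0.01), c(0, 0.01))
rep.no <- on_repeat_stop(c(0, 0.01), c(0.01, 0.01))
```
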
Examples

library(Matrix)
library(threeboost) # provides eeboost, geeboost, ee.GEE and coef_traceplot

# Generate some test data - uses 'mvtnorm' package
n <- 30
n.var <- 50
clust.size <- 4
B <- c(rep(2,5),rep(0.2,5),rep(0.05,10),rep(0,n.var-20))
mn.X <- rep(0,n.var)
sd.X <- 0.5
rho.X <- 0.3
cov.sig.X <- sd.X^2*((1-rho.X)*diag(rep(1,10)) + rho.X*matrix(data=1,nrow=10,ncol=10))
sig.X <- as.matrix( Matrix::bdiag(lapply(1:(n.var/10),function(x) { cov.sig.X } ) ) )
sd.Y <- 0.5
rho.Y <- 0.3
indiv.Sig <- sd.Y^2*( (1-rho.Y)*diag(rep(1,clust.size)) + rho.Y*matrix(data=1,nrow=clust.size,ncol=clust.size) )
# One within-cluster covariance block per cluster
sig.list <- vector("list", n)
for(i in 1:n) { sig.list[[i]] <- indiv.Sig }
Sig <- Matrix::bdiag(sig.list)
indiv.index <- rep(1:n,each=clust.size)
sig.Y <- as.matrix(Sig)
if(require(mvtnorm)) {
  X <- mvtnorm::rmvnorm(n*clust.size,mean=mn.X,sigma=sig.X)
  mn.Y <- X %*% B
  ## Correlated continuous outcome
  Y <- mvtnorm::rmvnorm(1,mean=mn.Y,sigma=sig.Y)
} else { stop('Need mvtnorm package to generate correlated example data.') }

## Define the Gaussian GEE estimating function with independence working correlation
mu.Lin <- function(eta){eta}
g.Lin <- function(m){m}
v.Lin <- function(eta){rep(1,length(eta))}

EE.fn.ind <- function(Y,X,b) {
  ee.GEE(Y,X,b,
         mu.Y=mu.Lin,
         g.Y=g.Lin,
         v.Y=v.Lin,
         aux=function(...) { ee.GEE.aux(...,mu.Y=mu.Lin,g.Y=g.Lin,v.Y=v.Lin) },
         id=indiv.index,
         corstr="ind")
}

## These two give the same result
coef.mat <- eeboost(Y,X,EE.fn.ind,maxit=250)
coef.mat2 <- geeboost(Y,X,id=indiv.index,family="gaussian",corstr="ind",maxit=250)$coefmat

par(mfrow=c(1,2))
coef_traceplot(coef.mat)
coef_traceplot(coef.mat2)
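
As noted in Details, an additional step is needed to select a final model from the coefficient path. The following is a minimal sketch of one such step, choosing the iteration with the smallest mean squared prediction error on held-out data. The helper select_by_mse and the toy path are hypothetical and assume a Gaussian outcome; they are not part of the package.

```r
# Hypothetical helper: pick the row of a coefficient path with the smallest
# mean squared prediction error on held-out data (Gaussian outcome assumed).
select_by_mse <- function(coef.path, X.val, Y.val) {
  pred <- X.val %*% t(coef.path)                # one column of predictions per iteration
  mse <- colMeans((as.vector(Y.val) - pred)^2)  # held-out error per iteration
  best <- which.min(mse)
  list(best.iter = best, mse = mse, coef = coef.path[best, ])
}

# Toy validation data with true coefficients (1, 0) and no noise
set.seed(2)
X.val <- matrix(rnorm(50 * 2), nrow = 50)
Y.val <- as.vector(X.val %*% c(1, 0))

# Toy path: the first coefficient grows from 0 to 1.5 in steps of 0.25
toy.path <- cbind(seq(0, 1.5, by = 0.25), 0)
sel <- select_by_mse(toy.path, X.val, Y.val)
```

Because X is scaled automatically (see Arguments), held-out predictors would need to be placed on the same scale before applying a path such as coef.mat to them.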