tsvs2: Variable selection with NFT BART models.

Description

The tsvs2() and tsvs() functions perform Thompson sampling variable selection with NFT BART.

Usage

tsvs2(
               ## data
               xftrain, xstrain, times, delta=NULL, 
               rm.const=TRUE, rm.dupe=TRUE,
               ##tsvs args
               K=20, a.=1, b.=0.5, C=0.5,
               rds.file='tsvs2.rds', pdf.file='tsvs2.pdf',
               ## multi-threading
               tc=getOption("mc.cores", 1), ##OpenMP thread count
               ##MCMC
               nskip=1000, ndpost=2000, 
               nadapt=1000, adaptevery=100, 
               chvf=NULL, chvs=NULL,
               method="spearman", use="pairwise.complete.obs",
               pbd=c(0.7, 0.7), pb=c(0.5, 0.5),
               stepwpert=c(0.1, 0.1), probchv=c(0.1, 0.1),
               minnumbot=c(5, 5),
               ## BART and HBART prior parameters
               ntree=c(10, 2), numcut=100,
               xifcuts=NULL, xiscuts=NULL,
               power=c(2, 2), base=c(0.95, 0.95),
               ## f function
               fmu=NA, k=5, tau=NA, dist='weibull', 
               ## s function
               total.lambda=NA, total.nu=10, mask=0.95,
               ## survival analysis 
               ##K=100, events=NULL, 
               ## DPM LIO
               drawDPM=1L, 
               alpha=1, alpha.a=1, alpha.b=0.1, alpha.draw=1,
               neal.m=2, constrain=1, 
               m0=0, k0.a=1.5, k0.b=7.5, k0=1, k0.draw=1,
               a0=3, b0.a=2, b0.b=1, b0=1, b0.draw=1,
               ## misc
               na.rm=FALSE, probs=c(0.025, 0.975), printevery=100,
               transposed=FALSE
)
tsvs(
               ## data
               x.train, times, delta=NULL, 
               rm.const=TRUE, rm.dupe=TRUE,
               ##tsvs args
               K=20, a.=1, b.=0.5, C=0.5,
               rds.file='tsvs.rds', pdf.file='tsvs.pdf',
               ## multi-threading
               tc=getOption("mc.cores", 1), ##OpenMP thread count
               ##MCMC
               nskip=1000, ndpost=2000, 
               nadapt=1000, adaptevery=100, 
               chv=NULL,
               method="spearman", use="pairwise.complete.obs",
               pbd=c(0.7, 0.7), pb=c(0.5, 0.5),
               stepwpert=c(0.1, 0.1), probchv=c(0.1, 0.1),
               minnumbot=c(5, 5),
               ## BART and HBART prior parameters
               ntree=c(10, 2), numcut=100, xicuts=NULL,
               power=c(2, 2), base=c(0.95, 0.95),
               ## f function
               fmu=NA, k=5, tau=NA, dist='weibull', 
               ## s function
               total.lambda=NA, total.nu=10, mask=0.95,
               ## survival analysis 
               ##K=100, events=NULL, 
               ## DPM LIO
               drawDPM=1L, 
               alpha=1, alpha.a=1, alpha.b=0.1, alpha.draw=1,
               neal.m=2, constrain=1, 
               m0=0, k0.a=1.5, k0.b=7.5, k0=1, k0.draw=1,
               a0=3, b0.a=2, b0.b=1, b0=1, b0.draw=1,
               ## misc
               na.rm=FALSE, probs=c(0.025, 0.975), printevery=100,
               transposed=FALSE
)

Arguments

xftrain: n x pf matrix of predictor variables for the training data.
xstrain: n x ps matrix of predictor variables for the training data.
x.train: n x p matrix of predictor variables for the training data.
times: n x 1 vector of the observed times for the training data.
delta: n x 1 vector of the time type for the training data: 0 for right-censoring, 1 for an event, and 2 for left-censoring.
rm.const: To remove constant variables or not.
rm.dupe: To remove duplicate variables or not.
K: The number of Thompson sampling steps to take. Not to be confused with the size of the time grid for survival distribution estimation.
a.: The prior parameter for successes of a Beta distribution.
b.: The prior parameter for failures of a Beta distribution.
C: The probability cut-off for variable selection.
rds.file: File name to store RDS object containing Thompson sampling parameters.
pdf.file: File name to store PDF graphic of variables selected.
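
The arguments a., b. and C point to a Beta-Bernoulli scheme in the spirit of Liu and Rockova (2021). The following is a minimal base R sketch of that general idea, not the nftbart implementation. It assumes that each variable carries its own Beta(A, B) distribution started at (a., b.), that a variable is tried when its drawn inclusion probability exceeds C, and that a 0/1 reward updates the counts. The names P, signal and reward are hypothetical; in the actual method the reward would come from a fitted model rather than from a known truth.

```r
## Sketch only: a generic Beta-Bernoulli Thompson sampling loop,
## not the code used inside tsvs2()/tsvs().
set.seed(1)
P  <- 6                       ## hypothetical number of predictors
a. <- 1; b. <- 0.5; C <- 0.5  ## defaults from the Usage section
K  <- 20                      ## number of Thompson sampling steps
A  <- rep(a., P)              ## running "success" parameter per variable
B  <- rep(b., P)              ## running "failure" parameter per variable
signal <- c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE) ## hypothetical truth

for (k in seq_len(K)) {
    theta  <- rbeta(P, A, B)   ## draw one inclusion probability per variable
    picked <- which(theta > C) ## variables tried at this step
    ## hypothetical reward: 1 if a picked variable proved useful, else 0
    reward <- as.integer(signal[picked] & runif(length(picked)) < 0.9)
    A[picked] <- A[picked] + reward
    B[picked] <- B[picked] + 1 - reward
}

post.mean <- A / (A + B)       ## mean inclusion probability after K steps
selected  <- which(post.mean > C)
print(round(post.mean, 2))
print(selected)
```

In this sketch a larger K gives each variable more chances to accumulate evidence, while C sets how strong that evidence must be before a variable is reported as selected.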

tc: Number of OpenMP threads to use.
nskip: Number of MCMC iterations to burn-in and discard.
ndpost: Number of MCMC iterations kept after burn-in.
nadapt: Number of MCMC iterations for adaptation prior to burn-in.
adaptevery: Adapt MCMC proposal distributions every adaptevery iterations.
chvf,chvs,chv: Predictor correlation matrix used as a pre-conditioner for MCMC change-of-variable proposals.
method,use: Correlation options for change-of-variable proposal pre-conditioner.
pbd: Probability of performing a birth/death proposal, otherwise perform a rotate proposal.
pb: Probability of performing a birth proposal given that we choose to perform a birth/death proposal.
stepwpert: Initial width of proposal distribution for perturbing cut-points.
probchv: Probability of performing a change-of-variable proposal. Otherwise, only do a perturb proposal.
minnumbot: Minimum number of observations required in leaf (terminal) nodes.
ntree: Vector of length two for the number of trees used for the mean model and the number of trees used for the variance model.
numcut: Number of cutpoints to use for each predictor variable.
xifcuts,xiscuts,xicuts: More detailed construction of cut-points can be specified by the xicuts function and provided here.
power: Power parameter in the tree depth penalizing prior.
base: Base parameter in the tree depth penalizing prior.
fmu: Prior parameter for the center of the mean model.
k: Prior parameter for the mean model.
tau: Desired SD/ntree for f function leaf prior if known.
dist: Distribution to be passed to intercept-only AFT model to center y.train.
total.lambda: A rudimentary estimate of the process standard deviation. Used in calibrating the variance prior.
total.nu: Shape parameter for the variance prior.
mask: If a proportion is provided, then said quantile of max.i sd(x.i) is used to mask non-stationary departures (with respect to convergence) above this threshold.
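
The base and power arguments are described above only as parameters of the tree depth penalizing prior. In the standard BART formulation, a node at depth d splits with prior probability base * (1 + d)^(-power). Assuming that this form also applies here (this page does not state the formula), the defaults imply the following split probabilities. The names depth and p.split are illustrative.

```r
## Sketch assuming the standard BART depth prior:
## P(split at depth d) = base * (1 + d)^(-power)
base  <- 0.95
power <- 2
depth <- 0:4
p.split <- base * (1 + depth)^(-power)
print(data.frame(depth, p.split = round(p.split, 4)))
```

Under this assumption, a larger power makes deep trees less likely, and a smaller base makes any split less likely.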

drawDPM: Whether to utilize DPM or not.
alpha: Initial value of DPM concentration parameter.
alpha.a: Gamma prior parameter setting for DPM concentration parameter where E[alpha]=alpha.a/alpha.b.
alpha.b: See alpha.a above.
alpha.draw: Whether to draw alpha or fix it at its initial value.
neal.m: The number of additional atoms for Neal 2000 DPM algorithm 8.
constrain: Whether to perform constrained DPM or unconstrained.
m0: Center of the error distribution: defaults to zero.
k0.a: First Gamma prior argument for k0.
k0.b: Second Gamma prior argument for k0.
k0: Initial value of k0.
k0.draw: Whether to fix k0 or draw it from the DPM LIO prior hierarchy: k0~Gamma(k0.a, k0.b), i.e., E[k0]=k0.a/k0.b.
a0: First Gamma prior argument for tau.
b0.a: First Gamma prior argument for b0.
b0.b: Second Gamma prior argument for b0.
b0: Initial value of b0.
b0.draw: Whether to fix b0 or draw it from the DPM LIO prior hierarchy: b0~Gamma(b0.a, b0.b), i.e., E[b0]=b0.a/b0.b.
na.rm: Value to be passed to the predict function.
probs: Value to be passed to the predict function.
printevery: Outputs MCMC algorithm status every printevery iterations.
transposed: tsvs handles all of the pre-processing for x.train/x.test (including transposing) for computational efficiency.
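
The Gamma priors above are stated with means of the form a/b, which corresponds to a shape and rate parameterization. As a small worked check, the sketch below computes the prior means implied by the defaults for alpha, k0 and b0, and compares them with Monte Carlo draws from base R's rgamma(). The names prior, prior.mean and mc.mean are illustrative.

```r
## Prior means implied by the defaults, using E[X] = shape / rate
prior <- rbind(alpha = c(shape = 1,   rate = 0.1),  ## alpha.a, alpha.b
               k0    = c(shape = 1.5, rate = 7.5),  ## k0.a, k0.b
               b0    = c(shape = 2,   rate = 1))    ## b0.a, b0.b
prior.mean <- prior[, "shape"] / prior[, "rate"]
print(prior.mean)

## Monte Carlo check with rgamma(), which accepts shape and rate
set.seed(1)
mc.mean <- apply(prior, 1, function(p)
    mean(rgamma(1e5, shape = p["shape"], rate = p["rate"])))
print(round(mc.mean, 2))
```

With the defaults, the implied prior means are 10 for alpha, 0.2 for k0 and 2 for b0.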

Author

Rodney Sparapani: rsparapa@mcw.edu

Details

tsvs2() and tsvs() perform variable selection. Each returns a fit object, an S3 object of type list, and also stores it in rds.file while sampling is in progress.
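
Because the object is written to rds.file during sampling, a long run can be checked from a second R session. The sketch below uses only base R. The internal structure of the stored object is not documented on this page, so it is inspected with str() rather than assumed. The name progress is illustrative.

```r
## Sketch: peek at the stored Thompson sampling state, if it exists
rds.file <- 'tsvs2.rds'  ## default from the Usage section
if (file.exists(rds.file)) {
    progress <- readRDS(rds.file)
    str(progress, max.level = 1)  ## top-level components only
} else {
    message("No file named ", rds.file, " in ", getwd())
    progress <- NULL
}
```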

References

Sparapani R., Logan B., Maiers M., Laud P., McCulloch R. (2023) Nonparametric Failure Time: Time-to-event Machine Learning with Heteroskedastic Bayesian Additive Regression Trees and Low Information Omnibus Dirichlet Process Mixtures. Biometrics (ahead of print). <doi:10.1111/biom.13857>.

Liu Y., Rockova V. (2021) Variable selection via Thompson sampling. Journal of the American Statistical Association. Jun 29:1-8.

Examples



library(nftbart)
data(lung)
N=length(lung$status)

##lung$status: 1=censored, 2=dead
##delta: 0=censored, 1=dead
delta=lung$status-1

## this study reports time in days rather than weeks or months
times=lung$time
times=times/7  ## weeks

## matrix of covariates
x.train=cbind(lung[ , -(1:3)])
## lung$sex:        Male=1 Female=2

# \donttest{
##vars=tsvs2(x.train, x.train, times, delta)
vars=tsvs2(x.train, x.train, times, delta, K=0) ## K=0 just returns 0
# }
