VCBART (version 1.2.4)

VCBART_ind: Fit a VCBART model with independent error structure

Description

Fit a varying coefficient model to panel data. Assumes residual errors are independent within and between subjects. See Deshpande et al. (2024) for details about the model and MCMC sampler.

Usage

VCBART_ind(Y_train, subj_id_train, ni_train, X_train,
           Z_cont_train = matrix(0, nrow = 1, ncol = 1),
           Z_cat_train = matrix(0L, nrow = 1, ncol = 1),
           X_test = matrix(0, nrow = 1, ncol = 1),
           Z_cont_test = matrix(0, nrow = 1, ncol = 1),
           Z_cat_test = matrix(0, nrow = 1, ncol = 1),
           unif_cuts = rep(TRUE, times = ncol(Z_cont_train)),
           cutpoints_list = NULL,
           cat_levels_list = NULL,
           edge_mat_list = NULL,
           graph_split = rep(FALSE, times = ncol(Z_cat_train)),
           sparse = TRUE,
           M = 50,
           mu0 = NULL, tau = NULL, nu = NULL, lambda = NULL,
           nd = 1000, burn = 1000, thin = 1,
           save_samples = TRUE, save_trees = TRUE,
           verbose = TRUE, print_every = floor( (nd*thin + burn)/10))

Value

A list containing

y_mean

Mean of the training observations (needed by predict_VCBART)

y_sd

Standard deviation of the training observations (needed by predict_VCBART)

x_mean

Vector of means of columns of X_train, including the intercept (needed by predict_VCBART).

x_sd

Vector of standard deviations of the columns of X_train, including the intercept (needed by predict_VCBART).

yhat.train.mean

Vector containing the posterior mean of the regression function E[y|x,z] evaluated on the training data.

betahat.train.mean

Matrix with length(Y_train) rows and ncol(X_train)+1 columns containing the posterior mean of each coefficient function evaluated on the training data. Each row corresponds to a training set observation and each column corresponds to a coefficient function. Note the first column is for the intercept function.

yhat.train

Matrix with nd rows and length(Y_train) columns. Each row corresponds to a posterior sample of the regression function E[y|x,z] and each column corresponds to a training set observation. Only returned if save_samples == TRUE.

betahat.train

Array of dimension nd x length(Y_train) x (ncol(X_train)+1) containing posterior samples of evaluations of the coefficient functions. The first dimension corresponds to posterior samples/MCMC iterations, the second dimension corresponds to individual training set observations, and the third dimension corresponds to coefficient functions. Only returned if save_samples == TRUE.
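
For example, pointwise credible intervals can be computed directly from this array. A minimal sketch, assuming fit is an object returned by VCBART_ind with save_samples = TRUE (as in the Examples below):

# 95% pointwise credible intervals for every coefficient function;
# beta_ci[, i, j] bounds the j-th coefficient (j = 1 is the intercept)
# at the i-th training observation
beta_ci <- apply(fit$betahat.train, c(2, 3), quantile, probs = c(0.025, 0.975))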

yhat.test.mean

Vector containing the posterior mean of the regression function E[y|x,z] evaluated on the testing data.

betahat.test.mean

Matrix with nrow(X_test) rows and ncol(X_test)+1 columns containing the posterior mean of each coefficient function evaluated on the testing data. Each row corresponds to a testing set observation and each column corresponds to a coefficient function. Note the first column is for the intercept function.

yhat.test

Matrix with nd rows and nrow(X_test) columns. Each row corresponds to a posterior sample of the regression function E[y|x,z] and each column corresponds to a testing set observation. Only returned if save_samples == TRUE.

betahat.test

Array of size nd x nrow(X_test) x (ncol(X_test)+1) containing posterior samples of evaluations of the coefficient functions. The first dimension corresponds to posterior samples/MCMC iterations, the second dimension corresponds to individual testing set observations, and the third dimension corresponds to coefficient functions. Only returned if save_samples == TRUE.

sigma

Vector containing ALL samples of the residual standard deviation, including warmup.
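
Post-warmup inference therefore requires discarding the warmup draws first. A minimal sketch, assuming the defaults nd = 1000, burn = 1000, thin = 1 were used and that fit$sigma stores one draw per iteration:

nd <- 1000; burn <- 1000; thin <- 1
sigma_post <- tail(fit$sigma, nd * thin)                         # drop burn-in draws
sigma_post <- sigma_post[seq(1, length(sigma_post), by = thin)]  # thin
mean(sigma_post)  # posterior mean of the residual standard deviation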

varcounts

Array of size nd x R x (ncol(X_train)+1) that counts the number of times each modifier was used in a decision rule in each posterior sample of each ensemble. Here R is the total number of potential modifiers (i.e. R = ncol(Z_cont_train) + ncol(Z_cat_train)).
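
A common summary is the proportion of posterior samples in which each modifier is used by each ensemble. A minimal sketch:

# split_probs[r, j]: share of samples in which modifier r appears in a
# decision rule of the ensemble for the j-th coefficient function
split_probs <- apply(fit$varcounts > 0, c(2, 3), mean)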

theta

If sparse = TRUE, an array of size nd x R x (ncol(X_train)+1) containing samples of the variable splitting probabilities.

trees

A list (of length nd) of lists (of length ncol(X_train)+1) of character vectors (of length M) containing textual representations of the regression trees. The string for the s-th sample of the m-th tree in the j-th ensemble is contained in trees[[s]][[j]][m]. These strings are parsed by predict_VCBART to reconstruct the C++ representations of the sampled trees.

Arguments

Y_train

Vector of continuous responses for training data.

subj_id_train

Vector of length length(Y_train) that records which subject contributed each observation. Subjects should be numbered sequentially from 1 to length(ni_train).

ni_train

Vector containing the number of observations per subject in the training data.
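
If subject labels are not already sequential integers, both arguments can be built from the raw labels. A minimal sketch, where raw_id is a hypothetical vector of subject labels, one per observation:

raw_id <- c("s2", "s2", "s7", "s7", "s7", "s1")  # hypothetical labels
subj_f <- factor(raw_id, levels = unique(raw_id))
subj_id <- as.integer(subj_f)    # sequential ids: 1, 1, 2, 2, 2, 3
ni <- as.integer(table(subj_f))  # observations per subject: 2, 3, 1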

X_train

Matrix of covariates for training observations. Do not include an intercept as the first column.

Z_cont_train

Matrix of continuous modifiers for training data. Note, modifiers must be rescaled to lie in the interval [-1,1]. Default is a 1x1 matrix, which signals that there are no continuous modifiers in the training data.
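
A minimal rescaling sketch, where z_raw is a hypothetical unscaled modifier:

z_raw <- c(12, 30, 18, 25)  # hypothetical raw values
z_scaled <- 2 * (z_raw - min(z_raw)) / (max(z_raw) - min(z_raw)) - 1
range(z_scaled)  # now within [-1, 1]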

Z_cat_train

Integer matrix of categorical modifiers for training data. Note categorical levels should be 0-indexed. That is, if a categorical modifier has 10 levels, the values should run from 0 to 9. Default is a 1x1 matrix, which signals that there are no categorical modifiers in the training data.
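
A minimal sketch converting a factor to the required 0-indexed integer coding, where region is a hypothetical categorical modifier:

region <- factor(c("north", "south", "east", "south"))  # hypothetical factor
Z_cat <- matrix(as.integer(region) - 1L, ncol = 1)      # levels run 0, 1, 2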

X_test

Matrix of covariates for testing observations. Default is a 1x1 matrix, which signals that testing data is not provided.

Z_cont_test

Matrix of continuous modifiers for testing data. Default is a 1x1 matrix, which signals that testing data is not provided.

Z_cat_test

Integer matrix of categorical modifiers for testing data. Default is a 1x1 matrix, which signals that testing data is not provided.

unif_cuts

Vector of logical values indicating whether cutpoints for each continuous modifier should be drawn from a continuous uniform distribution (TRUE) or from a discrete set (FALSE) specified in cutpoints_list. Default is TRUE for each variable in Z_cont_train.

cutpoints_list

List of length ncol(Z_cont_train) containing a vector of cutpoints for each continuous modifier. By default, this is set to NULL so that cutpoints are drawn uniformly from a continuous distribution.
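
A minimal sketch restricting a single continuous modifier to a fixed grid of cutpoints:

unif_cuts <- FALSE                            # use the discrete set below
cutpoints_list <- list(seq(-1, 1, by = 0.1))  # grid for the lone modifier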

cat_levels_list

List of length ncol(Z_cat_train) containing a vector of levels for each categorical modifier. If the j-th categorical modifier contains L levels, cat_levels_list[[j]] should be the vector 0:(L-1). Default is NULL, which corresponds to the case that no categorical modifiers are available.
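
For instance, with two categorical modifiers having 4 and 10 levels, respectively:

cat_levels_list <- list(0:3, 0:9)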

edge_mat_list

List of adjacency matrices if any of the categorical modifiers are network-structured. Default is NULL, which corresponds to the case that there are no network-structured categorical modifiers.

graph_split

Vector of logicals indicating whether each categorical modifier is network-structured. Default is rep(FALSE, times = ncol(Z_cat_train)).

sparse

Logical, indicating whether or not to perform variable selection in each tree ensemble based on a sparse Dirichlet prior rather than a uniform prior; see Linero (2018). Default is TRUE.

M

Number of trees in each ensemble. Default is 50.

mu0

Prior mean for jumps/leaf parameters. Default is 0 for each beta function. If supplied, must be a vector of length 1 + ncol(X_train).

tau

Prior standard deviation for jumps/leaf parameters. Default is 1/sqrt(M) for each beta function. If supplied, must be a vector of length 1 + ncol(X_train).
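
A minimal sketch placing a tighter prior on the intercept function while keeping the default elsewhere, assuming M = 50 trees and p = 2 covariates:

M <- 50; p <- 2
tau <- c(0.5 / sqrt(M), rep(1 / sqrt(M), p))  # length 1 + ncol(X_train)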

nu

Degrees of freedom for scaled-inverse chi-square prior on sigma^2. Default is 3.

lambda

Scale hyperparameter for scaled-inverse chi-square prior on sigma^2. Default places 90% prior probability that sigma is less than sd(Y_train).

nd

Number of posterior draws to return. Default is 1000.

burn

Number of MCMC iterations to be treated as "warmup" or "burn-in". Default is 1000.

thin

Number of post-warmup MCMC iterations by which to thin. Default is 1.

save_samples

Logical, indicating whether to return all posterior samples. Default is TRUE. If FALSE, only posterior mean is returned.

save_trees

Logical, indicating whether or not to save a text-based representation of the tree samples. This representation can be passed to predict_VCBART to make predictions at a later time. Default is TRUE.

verbose

Logical, indicating whether to print progress to the R console. Default is TRUE.

print_every

As the MCMC runs, a message is printed every print_every iterations. Default is floor( (nd*thin + burn)/10) so that only 10 messages are printed.
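
With the default settings the arithmetic works out as follows:

nd <- 1000; burn <- 1000; thin <- 1
total_iter <- nd * thin + burn  # 2000 iterations in total
floor(total_iter / 10)          # default print_every = 200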

Details

Given \(p\) covariates \(X_{1}, \ldots, X_{p}\) and \(r\) effect modifiers \(Z_{1}, \ldots, Z_{r}\), the varying coefficient model asserts that

\(E[Y \vert X = x, Z = z] = \beta_0(z) + \beta_1(z)x_1 + \cdots + \beta_p(z)x_p.\)

That is, for any r-vector \(z\), the relationship between \(X\) and \(Y\) is linear. However, the specific relationship is allowed to vary with respect to \(Z\). VCBART approximates the covariate effect functions \(\beta_0(Z), \ldots, \beta_p(Z)\) using ensembles of regression trees. This function assumes that the residual errors are independent within and between subjects.
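
Because the model is linear in the covariates given \(z\), the fitted regression function can be reassembled from the coefficient evaluations. A minimal sketch, assuming fit and X_all are as in the Examples below:

# intercept column of 1s times beta0, plus covariate-weighted betas
yhat_manual <- rowSums(cbind(1, X_all) * fit$betahat.train.mean)
all.equal(unname(yhat_manual), unname(fit$yhat.train.mean))  # TRUE up to numerical error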

References

Deshpande, S.K., Bai, R., Balocchi, C., Starling, J., and Weiss, J. (2024). VCBART: Bayesian trees for varying coefficients. Bayesian Analysis. doi:10.1214/24-BA1470

Linero, A.R. (2018). Bayesian regression trees for high-dimensional prediction and variable selection. Journal of the American Statistical Association, 113(522), 626-636.

Examples

############
# True beta functions
beta0_true <- function(Z){
  tmp_Z <- (Z+1)/2
  return( 3 * tmp_Z[,1] + 
  (2 - 5 * (tmp_Z[,2] > 0.5)) * sin(pi * tmp_Z[,1]) - 
  2 * (tmp_Z[,2] > 0.5))
}
beta1_true <- function(Z){
  tmp_Z <- (Z+1)/2
  return(sin(2*tmp_Z[,1] + 0.5)/(4*tmp_Z[,1] + 1) + (2*tmp_Z[,1] - 0.5)^3)
}
beta2_true <- function(Z){
  tmp_Z <- (Z+1)/2
  return( (3 - 3*cos(6*pi*tmp_Z[,1]) * tmp_Z[,1]^2) * (tmp_Z[,1] > 0.6) - 
  (10 * sqrt(tmp_Z[,1])) * (tmp_Z[,1] < 0.25) )
}


################
# Set problem dimensions
###############

set.seed(417)
n_all <- 500
ni_all <- rep(4, times = n_all) # 4 observations per subject
subj_id_all <- rep(1:n_all, each = 4) # give every subject an id number
N_all <- sum(ni_all) # total number of observations

p <- 2 # number of covariates
R_cont <- 20 # number of continuous modifiers
R_cat <- 0 # number of categorical modifiers
R <- R_cont + R_cat
################
# Generate covariates & modifiers
################

X_all <- 
  matrix(rnorm(N_all*p, mean = 0, sd = 1), nrow = N_all, ncol = p)
Z_cont_all <- 
  matrix(runif(N_all * R_cont, min = -1, max = 1), nrow = N_all, ncol = R_cont)

################
# Define true coefficient functions & noise level
###############
beta0_all <- beta0_true(Z_cont_all)
beta1_all <- beta1_true(Z_cont_all)
beta2_all <- beta2_true(Z_cont_all)
beta_all <- cbind(beta0_all, beta1_all, beta2_all)
sigma <- 0.1

################
# Generate response surface & outcomes
###############
mu_all <- beta0_all + X_all[,1] * beta1_all + X_all[,2] * beta2_all
Y_all <- mu_all + sigma * rnorm(n = N_all, mean = 0, sd = 1)


## Token run to ensure installation works

fit <- 
  VCBART_ind(Y_train = Y_all,
             subj_id_train = subj_id_all,
             ni_train = ni_all,
             X_train = X_all,
             Z_cont_train = Z_cont_all,
             nd = 5, burn = 5,
             verbose = FALSE)
             
# \donttest{
## Longer example
  fit <- 
    VCBART_ind(Y_train = Y_all,
               subj_id_train = subj_id_all,
               ni_train = ni_all,
               X_train = X_all,
               Z_cont_train = Z_cont_all,
               verbose = FALSE)

oldpar <- par(no.readonly = TRUE)
par(mar = c(3,3,2,1), mgp = c(1.8, 0.5, 0), mfrow = c(1,2))
plot(beta_all, fit$betahat.train.mean, 
     pch = 16, cex = 0.5,
     xlab = "Actual", ylab = "Posterior Mean",
     main = "Coefficients")
abline(a = 0, b = 1, col = 'blue')
plot(mu_all, fit$yhat.train.mean,
     pch = 16, cex = 0.5,
     xlab = "Actual", ylab = "Posterior Mean",
     main = "Regression Function E[Y|X,Z]")
abline(a = 0, b = 1, col = 'blue')

par(oldpar)
# }
             
