VCBART_cs: Fit a VCBART model with compound symmetry error structure

Description

Fit a varying coefficient model to panel data. Assumes a compound symmetry error structure in which the residual errors for a given subject are equally correlated. This is equivalent to assuming that there is a normally distributed random effect per subject.
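
As a small illustration of this equivalence, the sketch below builds the compound symmetry covariance matrix for one subject and checks that it matches the covariance implied by a subject-level normal random effect plus independent noise. The values of sigma, rho, and n_i are arbitrary. The variance split (rho * sigma^2 for the random effect, (1 - rho) * sigma^2 for the noise) is the usual parameterization and is assumed here; it is not taken from the package source.

```r
# Illustrative values (assumptions; only rho = 0.9 mirrors a documented default)
sigma <- 2      # residual standard deviation
rho   <- 0.9    # within-subject correlation
n_i   <- 4      # number of observations for one subject

# Compound symmetry: sigma^2 on the diagonal, rho * sigma^2 everywhere else
Sigma_cs <- sigma^2 * ((1 - rho) * diag(n_i) + rho * matrix(1, n_i, n_i))

# Random-effect form: y_it = f_it + a_i + e_it with
#   a_i  ~ N(0, rho * sigma^2)        (shared by all observations of subject i)
#   e_it ~ N(0, (1 - rho) * sigma^2)  (independent across observations)
var_a <- rho * sigma^2
var_e <- (1 - rho) * sigma^2
Sigma_re <- var_a * matrix(1, n_i, n_i) + var_e * diag(n_i)

# The two covariance matrices coincide
all.equal(Sigma_cs, Sigma_re)

# Every off-diagonal correlation equals rho
cov2cor(Sigma_cs)
```

The random-effect form requires a non-negative rho, which is consistent with the documented restriction that rho lies between 0 and 1.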

Usage

VCBART_cs(Y_train, subj_id_train, ni_train, X_train,
          Z_cont_train = matrix(0, nrow = 1, ncol = 1),
          Z_cat_train = matrix(0L, nrow = 1, ncol = 1),
          X_test = matrix(0, nrow = 1, ncol = 1),
          Z_cont_test = matrix(0, nrow = 1, ncol = 1),
          Z_cat_test = matrix(0, nrow = 1, ncol = 1),
          unif_cuts = rep(TRUE, times = ncol(Z_cont_train)),
          cutpoints_list = NULL,
          cat_levels_list = NULL,
          edge_mat_list = NULL,
          graph_split = rep(FALSE, times = ncol(Z_cat_train)),
          sparse = TRUE,
          rho = 0.9,
          M = 50,
          mu0 = NULL, tau = NULL, nu = NULL, lambda = NULL,
          nd = 1000, burn = 1000, thin = 1,
          save_samples = TRUE, save_trees = TRUE,
          verbose = TRUE, print_every = floor( (nd*thin + burn)/10))

Value

A list containing

y_mean: Mean of the training observations (needed by predict_VCBART)
y_sd: Standard deviation of the training observations (needed by predict_VCBART)
x_mean: Vector of means of columns of X_train, including the intercept (needed by predict_VCBART).
x_sd: Vector of standard deviations of columns of X_train, including the intercept (needed by predict_VCBART).
yhat.train.mean: Vector containing posterior mean of evaluations of regression function E[y|x,z] on training data.
betahat.train.mean: Matrix with length(Y_train) rows and ncol(X_train)+1 columns containing the posterior mean of each coefficient function evaluated on the training data. Each row corresponds to a training set observation and each column corresponds to a coefficient function. Note that the first column is for the intercept function.
yhat.train: Matrix with nd rows and length(Y_train) columns. Each row corresponds to a posterior sample of the regression function E[y|x,z] and each column corresponds to a training set observation. Only returned if save_samples == TRUE.
betahat.train: Array of dimension nd x length(Y_train) x (ncol(X_train)+1) containing posterior samples of evaluations of the coefficient functions. The first dimension corresponds to posterior samples/MCMC iterations, the second dimension corresponds to individual training set observations, and the third dimension corresponds to coefficient functions. Only returned if save_samples == TRUE.
yhat.test.mean: Vector containing posterior mean of evaluations of regression function E[y|x,z] on testing data.
betahat.test.mean: Matrix with nrow(X_test) rows and ncol(X_test)+1 columns containing the posterior mean of each coefficient function evaluated on the testing data. Each row corresponds to a testing set observation and each column corresponds to a coefficient function. Note that the first column is for the intercept function.
yhat.test: Matrix with nd rows and nrow(X_test) columns. Each row corresponds to a posterior sample of the regression function E[y|x,z] and each column corresponds to a testing set observation. Only returned if save_samples == TRUE.
betahat.test: Array of size nd x nrow(X_test) x (ncol(X_test)+1) containing posterior samples of evaluations of the coefficient functions. The first dimension corresponds to posterior samples/MCMC iterations, the second dimension corresponds to individual testing set observations, and the third dimension corresponds to coefficient functions. Only returned if save_samples == TRUE.
sigma: Vector containing ALL samples of the residual standard deviation, including warmup.
rho: Vector containing ALL samples of the auto-correlation parameter rho, including warmup.
varcounts: Array of size nd x R x (ncol(X_train)+1) that counts the number of times a variable was used in a decision rule in each posterior sample of each ensemble. Here R is the total number of potential modifiers (i.e. R = ncol(Z_cont_train) + ncol(Z_cat_train)).
theta: If sparse=TRUE, an array of size nd x R x (ncol(X_train)+1) containing samples of the variable splitting probabilities.
trees: A list (of length nd) of lists (of length ncol(X_train)+1) of character vectors (of length M) containing textual representations of the regression trees. The string for the s-th sample of the m-th tree in the j-th ensemble is contained in trees[[s]][[j]][m]. These strings are parsed by predict_VCBART to reconstruct the C++ representations of the sampled trees.
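
The sketch below illustrates how the returned components might be post-processed. It does not fit a model: `fit` is a mock list with hypothetical values that only mimics the documented shapes. Two assumptions are made that the documentation does not state: that sigma has length burn + nd*thin, and that its entries are stored in iteration order, so that the retained draws sit at positions burn + thin * (1:nd).

```r
# Mock object standing in for the output of VCBART_cs (hypothetical values);
# only the documented shapes are mimicked, nothing is actually fit here.
set.seed(1)
nd <- 20; burn <- 10; thin <- 1   # small values for illustration
n_train <- 6; p <- 2              # length(Y_train) and ncol(X_train)

fit <- list(
  sigma = abs(rnorm(burn + nd * thin, mean = 1, sd = 0.1)),  # assumed length
  betahat.train = array(rnorm(nd * n_train * (p + 1)),
                        dim = c(nd, n_train, p + 1))
)

# Drop warmup draws of sigma (assumes draws are stored in iteration order)
keep <- burn + thin * seq_len(nd)
sigma_post <- fit$sigma[keep]

# Posterior mean of each coefficient function at each training observation:
# average over the first (MCMC) dimension -> n_train x (p + 1) matrix
beta_mean <- apply(fit$betahat.train, c(2, 3), mean)

# Pointwise 95% credible interval for beta_1 (third-dimension index 2,
# because index 1 is the intercept function)
beta1_ci <- apply(fit$betahat.train[, , 2], 2, quantile,
                  probs = c(0.025, 0.975))

length(sigma_post)   # nd
dim(beta_mean)       # n_train, p + 1
dim(beta1_ci)        # 2, n_train
```

On a real fit with save_samples == TRUE, beta_mean computed this way should agree with betahat.train.mean up to numerical error.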

Arguments

Y_train: Vector of continuous responses for training data.
ni_train: Vector containing the number of observations per subject in the training data.
subj_id_train: Vector of length length(Y_train) that records which subject contributed each observation. Subjects should be numbered sequentially from 1 to length(ni_train).
X_train: Matrix of covariates for training observations. Do not include intercept as the first column.
Z_cont_train: Matrix of continuous modifiers for training data. Note, modifiers must be rescaled to lie in the interval [-1,1]. Default is a 1x1 matrix, which signals that there are no continuous modifiers in the training data.
Z_cat_train: Integer matrix of categorical modifiers for training data. Note categorical levels should be 0-indexed. That is, if a categorical modifier has 10 levels, the values should run from 0 to 9. Default is a 1x1 matrix, which signals that there are no categorical modifiers in the training data.
X_test: Matrix of covariates for testing observations. Default is a 1x1 matrix, which signals that testing data is not provided.
Z_cont_test: Matrix of continuous modifiers for testing data. Default is a 1x1 matrix, which signals that testing data is not provided.
Z_cat_test: Integer matrix of categorical modifiers for testing data. Default is a 1x1 matrix, which signals that testing data is not provided.
unif_cuts: Vector of logical values indicating whether cutpoints for each continuous modifier should be drawn from a continuous uniform distribution (TRUE) or from a discrete set (FALSE) specified in cutpoints_list. Default is TRUE for each variable in Z_cont_train.
cutpoints_list: List of length ncol(Z_cont_train) containing a vector of cutpoints for each continuous modifier. By default, this is set to NULL so that cutpoints are drawn uniformly from a continuous distribution.
cat_levels_list: List of length ncol(Z_cat_train) containing a vector of levels for each categorical modifier. If the j-th categorical modifier contains L levels, cat_levels_list[[j]] should be the vector 0:(L-1). Default is NULL, which corresponds to the case that no categorical modifiers are available.
edge_mat_list: List of adjacency matrices if any of the categorical modifiers are network-structured. Default is NULL, which corresponds to the case that there are no network-structured categorical modifiers.
graph_split: Vector of logicals indicating whether each categorical modifier is network-structured. Default is rep(FALSE, times = ncol(Z_cat_train)).
sparse: Logical, indicating whether or not to perform variable selection in each tree ensemble based on a sparse Dirichlet prior rather than a uniform prior; see Linero 2018. Default is TRUE.
rho: Initial auto-correlation parameter for compound symmetry error structure. Must be between 0 and 1. Default is 0.9.
M: Number of trees in each ensemble. Default is 50.
mu0: Prior mean for jumps/leaf parameters. Default is 0 for each beta function. If supplied, must be a vector of length 1 + ncol(X_train).
tau: Prior standard deviation for jumps/leaf parameters. Default is 1/sqrt(M) for each beta function. If supplied, must be a vector of length 1 + ncol(X_train).
nu: Degrees of freedom for scaled-inverse chi-square prior on sigma^2. Default is 3.
lambda: Scale hyperparameter for scaled-inverse chi-square prior on sigma^2. Default places 90% prior probability that sigma is less than sd(Y_train).
nd: Number of posterior draws to return. Default is 1000.
burn: Number of MCMC iterations to be treated as "warmup" or "burn-in". Default is 1000.
thin: Number of post-warmup MCMC iterations by which to thin. Default is 1.
save_samples: Logical, indicating whether to return all posterior samples. Default is TRUE. If FALSE, only posterior mean is returned.
save_trees: Logical, indicating whether or not to save a text-based representation of the tree samples. This representation can be passed to predict_VCBART to make predictions at a later time. Default is TRUE.
verbose: Logical, indicating whether to print progress to R console. Default is TRUE.
print_every: As the MCMC runs, a message is printed every print_every iterations. Default is floor( (nd*thin + burn)/10) so that only 10 messages are printed.
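
The input conventions described above (sequential subject ids, per-subject counts, continuous modifiers rescaled to [-1,1], 0-indexed categorical modifiers) can be sketched in base R as follows. The toy data, the variable names `raw_id`, `age`, and `region`, and the helper `rescale` are hypothetical and serve only to illustrate the expected shapes.

```r
# Toy long-format panel data (hypothetical; replace with real data)
set.seed(42)
raw_id <- c("a", "a", "a", "b", "b", "c", "c", "c", "c")  # subject labels
n <- length(raw_id)
Y_train <- rnorm(n)
X_train <- matrix(rnorm(n * 2), nrow = n, ncol = 2)  # no intercept column
age     <- runif(n, 20, 80)                          # continuous modifier
region  <- factor(sample(c("north", "south", "west"), n, replace = TRUE),
                  levels = c("north", "south", "west"))

# Subjects numbered 1, ..., number of subjects, and observations per subject
subj_id_train <- match(raw_id, unique(raw_id))
ni_train <- as.integer(table(subj_id_train))

# Rescale each continuous modifier linearly to [-1, 1]
rescale <- function(z) 2 * (z - min(z)) / (max(z) - min(z)) - 1
Z_cont_train <- matrix(rescale(age), ncol = 1)

# 0-indexed integer codes for categorical modifiers
Z_cat_train <- matrix(as.integer(region) - 1L, ncol = 1)
cat_levels_list <- list(0:(nlevels(region) - 1L))

# With the VCBART package installed, these objects could then be passed as
# fit <- VCBART_cs(Y_train, subj_id_train, ni_train, X_train,
#                  Z_cont_train = Z_cont_train, Z_cat_train = Z_cat_train,
#                  cat_levels_list = cat_levels_list)
```

If testing data are supplied, applying the same linear map (with the training minimum and maximum) to Z_cont_test seems the natural choice, although the documentation does not spell this out.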

Details

Given \(p\) covariates \(X_{1}, \ldots, X_{p}\) and \(r\) effect modifiers \(Z_{1}, \ldots, Z_{r}\), the varying coefficient model asserts that

\(E[Y \vert X = x, Z = z] = \beta_0(z) + \beta_1(z) x_1 + \cdots + \beta_p(z) x_p.\)

That is, for any r-vector \(Z\), the relationship between \(X\) and \(Y\) is linear. However, the specific relationship is allowed to vary with respect to \(Z\). VCBART approximates the covariate effect functions \(\beta_0(Z), \ldots, \beta_p(Z)\) using ensembles of regression trees. This function assumes that the within-subject errors are equi-correlated (i.e. a compound symmetry error structure).
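
To make the model concrete, the following sketch simulates panel data from a varying coefficient model with equi-correlated within-subject errors. The coefficient functions `beta0`, `beta1`, and `beta2`, the sample sizes, and the values of sigma and rho are all hypothetical choices made for illustration. The errors are generated as a shared subject effect plus independent noise, which is one standard way to obtain a compound symmetry structure.

```r
set.seed(123)
n_subj <- 50                      # number of subjects (illustrative)
ni     <- rep(4, n_subj)          # observations per subject
subj   <- rep(seq_len(n_subj), times = ni)
N      <- sum(ni)

p <- 2; r <- 2
X <- matrix(rnorm(N * p), N, p)           # covariates
Z <- matrix(runif(N * r, -1, 1), N, r)    # effect modifiers in [-1, 1]

# Hypothetical coefficient functions beta_0(z), beta_1(z), beta_2(z)
beta0 <- function(z) sin(pi * z[, 1])
beta1 <- function(z) ifelse(z[, 2] > 0, 2, -1)
beta2 <- function(z) z[, 1] * z[, 2]

# Regression function: linear in x for fixed z, coefficients vary with z
mu <- beta0(Z) + beta1(Z) * X[, 1] + beta2(Z) * X[, 2]

# Compound symmetry errors: shared subject effect plus independent noise,
# giving total error variance sigma^2 and within-subject correlation rho
sigma <- 1; rho <- 0.5
a <- rnorm(n_subj, sd = sigma * sqrt(rho))
e <- rnorm(N, sd = sigma * sqrt(1 - rho))
Y <- mu + a[subj] + e
```

Here `Y`, `subj`, `ni`, `X`, and `Z` have the shapes expected for Y_train, subj_id_train, ni_train, X_train, and Z_cont_train, respectively.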