Learn R Programming

VCBART (version 1.2.4)

VCBART_cs: Fit a VCBART model with compound symmetry error structure

Description

Fit a varying coefficient model to panel data. Assumes a compound symmetry error structure in which the residual errors for a given subject are equally correlated. This is equivalent to assuming that there is a normally distributed random effect per subject.

Usage

VCBART_cs(Y_train,subj_id_train, ni_train,X_train,
          Z_cont_train = matrix(0, nrow = 1, ncol = 1),
          Z_cat_train = matrix(0L, nrow = 1, ncol = 1),
          X_test = matrix(0, nrow = 1, ncol = 1),
          Z_cont_test = matrix(0, nrow = 1, ncol = 1),
          Z_cat_test = matrix(0, nrow = 1, ncol = 1),
          unif_cuts = rep(TRUE, times = ncol(Z_cont_train)),
          cutpoints_list = NULL,
          cat_levels_list = NULL,
          edge_mat_list = NULL,
          graph_split = rep(FALSE, times = ncol(Z_cat_train)),
          sparse = TRUE,
          rho = 0.9,
          M = 50,
          mu0 = NULL, tau = NULL, nu = NULL, lambda = NULL,
          nd = 1000, burn = 1000, thin = 1,
          save_samples = TRUE, save_trees = TRUE,
          verbose = TRUE, print_every = floor( (nd*thin + burn)/10))

Value

A list containing

y_mean

Mean of the training observations (needed by predict_VCBART)

y_sd

Standard deviation of the training observations (needed by predict_VCBART)

x_mean

Vector of means of columns of X_train, including the intercept (needed by predict_VCBART).

x_sd

Vector of standard deviations of X_trian, including the intercept (needed by predict_VCBART).

yhat.train.mean

Vector containing posterior mean of evaluations of regression function E[y|x,z] on training data.

betahat.train.mean

Matrix with length(Y_train) rows and ncol(X_train)+1 columns containing the posterior mean of evaluations of each coefficient function evaluated on the training data. Each row corresponds to a training set observation and each colunn corresponds to a coefficient function. Note the first column is for the intercept function.

yhat.train

Matrix with nd rows and length(Y_train) columns. Each row corresponds to a posterior sample of the regression function E[y|x,z] and each column corresponds to a training set observation. Only returned if save_samples == TRUE.

betahat.train

Array of dimension with nd x length(Y_train) x ncol(X_train)+1 containing posterior samples of evaluations of the coefficient functions. The first dimension corresponds to posterior samples/MCMC iterations, the second dimension corresponds to individual training set observations, and the third dimension corresponds to coefficient functions. Only returned if save_samples == TRUE.

yhat.test.mean

Vector containing posterior mean of evaluations of regression function E[y|x,z] on testing data.

betahat.test.mean

Matrix with nrow(X_test) rows and ncol(X_testn)+1 columns containing the posterior mean of evaluations of each coefficient function evaluated on the training data. Each row corresponds to a training set observation and each colunn corresponds to a coefficient function. Note the first column is for the intercept function.

yhat.test

Matrix with nd rows and nrow(X_test) columns. Each row corresponds to a posterior sample of the regression function E[y|x,z] and each column corresponds to a testing set observation. Only returned if save_samples == TRUE.

betahat.test

Array of size nd x nrow(X_test) x ncol(X_test)+1 containing posterior samples of evaluations of the coefficient functions. The first dimension corresponds to posterior samples/MCMC iterations, the second dimension corresponds to individual training set observations, and the third dimension corresponds to coefficient functions. Only returned if save_samples == TRUE.

sigma

Vector containing ALL samples of the residual standard deviation, including warmup.

rho

Vector containing ALL samples of the auto-correlation parameter rho, including warmup.

varcounts

Array of size nd x R x ncol(X)+1 that counts the number of times a variable was used in a decision rule in each posterior sample of each ensemble. Here R is the total number of potential modifiers (i.e. R = ncol(Z_cont_train) + ncol(Z_cat_train)).

theta

If sparse=TRUE, an array of size nd x R ncol(X)+1 containing samples of the variable splitting probabilities.

trees

A list (of length nd) of lists (of length ncol(X_train)+1) of character vectors (of length M) containing textual representations of the regression trees. The string for the s-th sample of the m-th tree in the j-th ensemble is contaiend in trees[[s]][[j]][m]. These strings are parsed by predict_VCBART to reconstruct the C++ representations of the sampled trees.

Arguments

Y_train

Vector of continous responses for training data

ni_train

Vector containing the number of observations per subject in the training data.

subj_id_train

Vector of length length(Y_train) that records which subject contributed each observation. Subjects should be numbered sequentially from 1 to length(ni_train).

X_train

Matrix of covariates for training observations. Do not include intercept as the first column.

Z_cont_train

Matrix of continuous modifiers for training data. Note, modifiers must be rescaled to lie in the interval [-1,1]. Default is a 1x1 matrix, which signals that there are no continuous modifiers in the training data.

Z_cat_train

Integer matrix of categorical modifiers for training data. Note categorical levels should be 0-indexed. That is, if a categorical modifier has 10 levels, the values should run from 0 to 9. Default is a 1x1 matrix, which signals that there are no categorical modifiers in the training data.

X_test

Matrix of covariate for testing observations. Default is a 1x1 matrix, which signals that testing data is not provided.

Z_cont_test

Matrix of continuous modifiers for testing data. Default is a 1x1 matrix, which signals that testing data is not provided.

Z_cat_test

Integer matrix of categorical modifiers for testing data. Default is a 1x1 matrix, which signals that testing data is not provided.

unif_cuts

Vector of logical values indicating whether cutpoints for each continuous modifier should be drawn from a continuous uniform distribution (TRUE) or a discrete set (FALSE) specified in cutpoints_list. Default is TRUE for each variable in Z_cont_train

cutpoints_list

List of length ncol(Z_cont_train) containing a vector of cutpoints for each continuous modifier. By default, this is set to NULL so that cutpoints are drawn uniformly from a continuous distribution.

cat_levels_list

List of length ncol(Z_cat_train) containing a vector of levels for each categorical modifier. If the j-th categorical modifier contains L levels, cat_levels_list[[j]] should be the vector 0:(L-1). Default is NULL, which corresponds to the case that no categorical modifiers are available.

edge_mat_list

List of adjacency matrices if any of the categorical modifiers are network-structured. Default is NULL, which corresponds to the case that there are no network-structured categorical modifiers.

graph_split

Vector of logicals indicating whether each categorical modifier is network-structured. Default is rep(FALSE, times = ncol(Z_cat_train)).

sparse

Logical, indicating whether or not to perform variable selection in each tree ensemble based on a sparse Dirichlet prior rather than uniform prior; see Linero 2018. Default is TRUE

rho

Initial auto-correlation parameter for compound symmetry error structure. Must be between 0 and 1. Default is 0.9.

M

Number of trees in each ensemble. Default is 50.

mu0

Prior mean for jumps/leaf parameters. Default is 0 for each beta function. If supplied, must be a vector of length 1 + ncol(X_train).

tau

Prior standard deviation for jumps/leaf parameters. Default is 1/sqrt(M) for each beta function. If supplied, must be a vector of length 1 + ncol(X_train).

nu

Degrees of freedom for scaled-inverse chi-square prior on sigma^2. Default is 3.

lambda

Scale hyperparameter for scaled-inverse chi-square prior on sigma^2. Default places 90% prior probability that sigma is less than sd(Y_train).

nd

Number of posterior draws to return. Default is 1000.

burn

Number of MCMC iterations to be treated as "warmup" or "burn-in". Default is 1000.

thin

Number of post-warmup MCMC iteration by which to thin. Default is 1.

save_samples

Logical, indicating whether to return all posterior samples. Default is TRUE. If FALSE, only posterior mean is returned.

save_trees

Logical, indicating whether or not to save a text-based representation of the tree samples. This representation can be passed to predict_flexBART to make predictions at a later time. Default is FALSE.

verbose

Logical, inciating whether to print progress to R console. Default is TRUE.

print_every

As the MCMC runs, a message is printed every print_every iterations. Default is floor( (nd*thin + burn)/10) so that only 10 messages are printed.

Details

Given \(p\) covariates \(X_{1}, \ldots, X_{p}\) and \(r\) effect modifiers \(Z_{1}, \ldots, Z_{r}\), the varying coefficient model asserts that

\(E[Y \vert X = x, Z = ] = \beta_0(z) + \beta_1(z) * x_1 + ... \beta_p(z) * X_p.\)

That is, for any r-vector \(Z\), the relationships between \(X\) and \(Y\) is linear. However, the specific relationship is allowed to vary with respect tp \(Z\). VCBART approximates the covariate effect functions \(\beta_0(Z), \ldots, \beta_p(Z)\) using ensembles of regression trees. This function assumes that the within-subject errors are equi-correlated (i.e. a compound symmetry error structure).