beta_gen: Generate regression coefficients

Description

Uses the output from questionnaire_gen to generate linear regression coefficients.

Usage

beta_gen(
  data,
  MC = FALSE,
  MC_replications = 100,
  CI = c(0.005, 0.995),
  output_cov = FALSE,
  rename_to_q = FALSE,
  verbose = TRUE
)

Value

By default, this function outputs a vector of the regression coefficients, including the intercept. If MC = TRUE, the output is instead a matrix comparing the true regression coefficients obtained from the covariance matrix with expected values obtained from a Monte Carlo simulation, along with a confidence interval (99% under the default CI).

If output_cov = TRUE, the output will be a list with two elements: the first one, betas, will contain the same output described in the previous paragraph. The second one, called vcov_YXW, contains the covariance matrix of the regression coefficients.

Arguments

data: output from the questionnaire_gen function with full_output = TRUE and theta = TRUE
MC: if TRUE, performs Monte Carlo simulation to estimate regression coefficients
MC_replications: for MC = TRUE, this represents the number of Monte Carlo subsamples calculated
CI: lower and upper bounds of the confidence interval for the Monte Carlo simulations; the default, c(0.005, 0.995), corresponds to a 99% interval
output_cov: if TRUE, will also output the covariance matrix of YXW
rename_to_q: if TRUE, renames the variables from "x" and "w" to "q"
verbose: if FALSE, output messages will be suppressed (useful for simulations). Defaults to TRUE

Details

This function was primarily conceived as a subfunction of questionnaire_gen, when family = "gaussian", theta = TRUE, and full_output = TRUE. However, it can also be directly called by the user so they can perform further analysis.

This function calculates the true regression coefficients ($\beta$) for the linear influence of the background questionnaire variables on $\theta$. From a statistical perspective, this relationship can be modeled as follows, where $E(\theta | \boldsymbol{X}, \boldsymbol{W})$ is the expectation of $\theta$ given $\boldsymbol{X} = \{X_1, \ldots, X_P\}$ and $\boldsymbol{W} = \{W_1, \ldots, W_Q\}$:

$$E(\theta | \boldsymbol{X}, \boldsymbol{W}) = \beta_0 + \sum_{p = 1}^P \beta_p X_p + \sum_{q = 1}^Q \beta_{P + q} W_q$$
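
As a point of reference, and assuming the coefficients are the usual population least-squares solution implied by a covariance matrix (this help page does not spell out the formula), they can be written in closed form. The notation is introduced here for illustration only: $\boldsymbol{V} = (X_1, \ldots, X_P, W_1, \ldots, W_Q)^\top$ stacks the predictors, $\boldsymbol{\Sigma}_{VV}$ is their covariance matrix, $\boldsymbol{\sigma}_{V\theta}$ is the vector of their covariances with $\theta$, and $\mu$ denotes a mean.

$$(\beta_1, \ldots, \beta_{P + Q})^\top = \boldsymbol{\Sigma}_{VV}^{-1} \boldsymbol{\sigma}_{V\theta}, \qquad \beta_0 = \mu_\theta - \sum_{p = 1}^P \beta_p \mu_{X_p} - \sum_{q = 1}^Q \beta_{P + q} \mu_{W_q}$$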

The regression coefficients are calculated using the true covariance matrix, which is either provided by the user when calling questionnaire_gen or randomly generated by that function if none was provided. In either case, that matrix is not sample-dependent, though it should be similar to the one observed in the generated data (especially for larger samples). One convenient way to check this similarity is to run the function with MC = TRUE, which produces numerical estimates; MC_replications can then be increased to improve those estimates, at an often noticeable cost in processing time. If MC = FALSE, MC_replications has no effect on the results. In any case, each subsample always has the same size as the original sample.
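
The comparison between true and simulated coefficients can be sketched in base R, without the package. The covariance matrix, means, sample size and number of replications below are made-up values, and the simulation scheme (drawing fresh multivariate normal samples of the same size) is an assumption for illustration, not necessarily what beta_gen does internally.

```r
# Illustrative values only: covariance matrix and means of (theta, X1, X2)
vars <- c("theta", "X1", "X2")
Sigma <- matrix(c(1.0, 0.5, 0.3,
                  0.5, 1.0, 0.2,
                  0.3, 0.2, 1.0),
                nrow = 3, byrow = TRUE, dimnames = list(vars, vars))
mu <- c(theta = 0, X1 = 1, X2 = 2)

# True coefficients implied by the covariance matrix (not sample-dependent)
slopes <- solve(Sigma[-1, -1], Sigma[-1, 1])
true_betas <- c(mu[["theta"]] - sum(slopes * mu[-1]), slopes)
names(true_betas) <- c("(Intercept)", "X1", "X2")

# Draw one sample of size n and return the fitted OLS coefficients
one_replication <- function(n) {
  z <- matrix(rnorm(n * 3), ncol = 3)
  draws <- sweep(z %*% chol(Sigma), 2, mu, "+")
  coef(lm(theta ~ X1 + X2, data = as.data.frame(draws)))
}

set.seed(123)
n_obs <- 100          # every replication keeps the same sample size
replications <- 200   # plays the role of MC_replications
estimates <- replicate(replications, one_replication(n_obs))

# Compare the true values with the Monte Carlo mean and a 99% interval
comparison <- cbind(
  true  = true_betas,
  MC    = rowMeans(estimates),
  lower = apply(estimates, 1, quantile, probs = 0.005),
  upper = apply(estimates, 1, quantile, probs = 0.995)
)
print(round(comparison, 3))
```

Raising replications narrows the gap between the MC and true columns, at the cost of fitting proportionally more regressions.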

If the background questionnaire contains categorical variables ($W$), the original covariance matrix cannot be used because it contains the covariances involving $Z \sim N(0, 1)$, which is the random variable that gets categorized into $W$. The case where every $W$ is binary is trivial, but if at least one $W$ has more than two categories, the structure of the covariance matrix changes drastically. In this case, this function recalculates all covariances between $\theta$, $X$ and each category of $W$ using auxiliary internal functions that rely on the appropriate distribution (either multivariate normal or truncated normal). To avoid multicollinearity, the first category of each $W$ is dropped before the regression coefficients are calculated.
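
The effect of categorizing $Z$ can be illustrated with a small base R sketch. It assumes a standard bivariate normal pair ($\theta$, $Z$) with a made-up correlation and made-up cut points, and it uses the standard truncated-normal identity $Cov(\theta, 1\{a < Z \le b\}) = \rho (\phi(a) - \phi(b))$, where $\phi$ is the standard normal density. It is not the package's internal code.

```r
rho  <- 0.6                           # illustrative correlation between theta and Z
cuts <- c(-Inf, -0.5, 0.3, 1.2, Inf)  # illustrative thresholds: W has 4 categories

# Analytic covariance between theta and each category indicator of W
lower <- head(cuts, -1)
upper <- tail(cuts, -1)
cov_analytic <- rho * (dnorm(lower) - dnorm(upper))

# Simulation check with a large sample
set.seed(42)
n <- 2e5
z <- rnorm(n)
theta <- rho * z + sqrt(1 - rho^2) * rnorm(n)
w <- cut(z, breaks = cuts, labels = 1:4)
dummies <- model.matrix(~ w - 1)      # one indicator column per category
cov_simulated <- as.vector(cov(theta, dummies))

print(round(rbind(analytic = cov_analytic, simulated = cov_simulated), 3))

# The indicators sum to 1, so together with an intercept they are collinear;
# dropping the first category (R's default treatment contrasts) avoids this
fit <- lm(theta ~ w)
print(coef(fit))
```

The analytic covariances sum to zero across categories, which is another way of seeing that one indicator is redundant once an intercept is included.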

Examples

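library(lsasim)
# Generate 100 observations with 2 continuous (X) and 3 categorical (W) variables
data <- questionnaire_gen(100, family = "gaussian", theta = TRUE,
                          full_output = TRUE, n_X = 2, n_W = list(2, 2, 4))
# Compare the true coefficients with Monte Carlo estimates
beta_gen(data, MC = TRUE)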