create_marginal_data_cat: Create marginal categorical data for causal Shapley values

Description

This function is used when we generate marginal data for the categorical approach when we have several sampling steps. We need to treat this separately, as we here in the marginal step CANNOT make feature values such that the combination of those and the feature values we condition in S are NOT in categorical.joint_prob_dt. If we do this, then we cannot progress further in the chain of sampling steps. E.g., X1 in (1,2,3), X2 in (1,2,3), and X3 in (1,2,3). We know X2 = 2, and let causal structure be X1 -> X2 -> X3. Assume that P(X1 = 1, X2 = 2, X = 3) = P(X1 = 2, X2 = 2, X = 3) = 1/2. Then there is no point generating X1 = 3, as we then cannot generate X3. The solution is only to generate the values which can proceed through the whole chain of sampling steps. To do that, we have to ensure the the marginal sampling respects the valid feature coalitions for all sets of conditional features, i.e., the features in features_steps_cond_on. We sample from the valid coalitions using the MARGINAL probabilities.

Usage

create_marginal_data_cat(
  n_MC_samples,
  x_explain,
  Sbar_features,
  S_original,
  joint_prob_dt
)

Value

Data table of dimension \((`n_MC_samples` * `nrow(x_explain)`) \times `length(Sbar_features)`\) with the sampled observations.

Arguments

n_MC_samples: Positive integer. For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration of every conditional expectation. For approach="ctree", n_MC_samples corresponds to the number of samples from the leaf node (see an exception related to the ctree.sample argument setup_approach.ctree()). For approach="empirical", n_MC_samples is the \(K\) parameter in equations (14-15) of Aas et al. (2021), i.e. the maximum number of observations (with largest weights) that is used, see also the empirical.eta argument setup_approach.empirical().
x_explain: Matrix or data.frame/data.table. Contains the the features, whose predictions ought to be explained.
Sbar_features: Vector of integers containing the features indices to generate marginal observations for. That is, if Sbar_features is c(1,4), then we sample n_MC_samples observations from \(P(X_1, X_4)\). That is, we sample the first and fourth feature values from the same valid feature coalition using the marginal probability, so we do not break the dependence between them.
S_original: Vector of integers containing the features indices of the original coalition S. I.e., not the features in the current sampling step, but the features are known to us before starting the chain of sampling steps.

Author

Lars Henry Berge Olsen

Details

For undocumented arguments, see setup_approach.categorical().