policy_eval_online() is used to estimate
the value of a given fixed policy
or of a data-adaptive policy (e.g., a policy
learned from the data). policy_eval_online()
is also used to estimate the subgroup average
treatment effect as defined by the (learned) policy.
The evaluation is based on an online/sequential validation
estimation scheme, which makes the estimation approach valid even for a
non-converging policy, i.e., under no heterogeneous treatment effect
(the exceptional law), see details.
policy_eval_online(
policy_data,
policy = NULL,
policy_learn = NULL,
g_functions = NULL,
g_models = g_glm(),
g_full_history = FALSE,
save_g_functions = TRUE,
q_functions = NULL,
q_models = q_glm(),
q_full_history = FALSE,
save_q_functions = TRUE,
c_functions = NULL,
c_models = NULL,
c_full_history = FALSE,
save_c_functions = TRUE,
m_function = NULL,
m_model = NULL,
m_full_history = FALSE,
save_m_function = TRUE,
target = "value",
M = 4,
train_block_size = get_n(policy_data)/5,
name = NULL,
min_subgroup_size = 1
)

policy_eval_online() returns an object of inherited class "policy_eval_online", "policy_eval".
The object is a list containing the following elements:
coef: Numeric vector. The estimated target parameter: policy value or subgroup average treatment effect.
vcov: Numeric vector. The estimated variance associated with coef.
target: Character string. The target parameter ("value" or "subgroup").
id: Character vector. The IDs of the observations.
name: Character vector. Names for each element in coef.
train_sequential_index: List of index sets used for training at each step.
valid_sequential_index: List of index sets used for validation at each step.
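As a minimal sketch (in Python, for illustration only; the package itself is R), the sequential training/validation index sets could be constructed as follows, with an initial training block of size l followed by M validation blocks of size m = (n - l)/M. The helper name `sequential_indices` is hypothetical, not part of the package:

```python
def sequential_indices(n, l, M):
    """Sketch of the per-step train/validation index sets:
    an initial training block {1, ..., l} and M validation blocks
    of size m = (n - l) / M. At step s, training uses
    {1, ..., l + (s - 1) * m} and validation uses the next block."""
    m, rem = divmod(n - l, M)
    assert rem == 0, "n - l must be divisible by M"
    train, valid = [], []
    for s in range(1, M + 1):
        cut = l + (s - 1) * m
        train.append(list(range(1, cut + 1)))
        valid.append(list(range(cut + 1, cut + m + 1)))
    return train, valid

train, valid = sequential_indices(n=20, l=10, M=5)
# train[0] == [1, ..., 10]; valid[0] == [11, 12]; valid[4] == [19, 20]
```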
policy_data: Policy data object created by policy_data().
policy: Policy object created by policy_def().
policy_learn: Policy learner object created by policy_learn().
g_functions: Fitted g-model objects, see nuisance_functions. Preferably, use g_models.
g_models: List of action probability models/g-models for each stage created by g_empir(), g_glm(), g_rf(), g_sl() or similar functions. Only used for evaluation if g_functions is NULL. If a single model is provided and g_full_history is FALSE, a single g-model is fitted across all stages. If g_full_history is TRUE, the model is reused at every stage.
g_full_history: If TRUE, the full history is used to fit each g-model. If FALSE, the state/Markov-type history is used to fit each g-model.
save_g_functions: If TRUE, the fitted g-functions are saved.
q_functions: Fitted Q-model objects, see nuisance_functions. Only valid if the Q-functions are fitted using the same policy. Preferably, use q_models.
q_models: Outcome regression models/Q-models created by q_glm(), q_rf(), q_sl() or similar functions. Only used for evaluation if q_functions is NULL. If a single model is provided, the model is reused at every stage.
q_full_history: Similar to g_full_history.
save_q_functions: Similar to save_g_functions.
c_functions: Fitted c-model/censoring probability model objects. Preferably, use c_models.
c_models: List of right-censoring probability models, see c_model.
c_full_history: Similar to g_full_history.
save_c_functions: Similar to save_g_functions.
m_function: Fitted outcome model object for stage K+1. Preferably, use m_model.
m_model: Outcome model for the utility at stage K+1. Only used if the final utility contribution is missing/has been right-censored.
m_full_history: Similar to g_full_history.
save_m_function: Similar to save_g_functions.
target: Character string. Either "value" or "subgroup". If "value", the target parameter is the policy value. If "subgroup", the target parameter is the subgroup average treatment effect given by the policy, see details. "subgroup" is only implemented for type = "dr" in the single-stage case with a dichotomous action set.
M: Integer. Number of blocks for online estimation/sequential validation, excluding the initial training block, see details.
train_block_size: Integer. Size of the initial training block, used only for training the policy and nuisance models, see details.
name: Character string.
min_subgroup_size: Integer. Minimum number of observations in the evaluated subgroup (only used if target = "subgroup").
Each observation has the sequential form $$O = \{B, U_1, X_1, A_1, ..., U_K, X_K, A_K, U_{K+1}\},$$ for a possibly stochastic number of stages K.
\(B\) is a vector of baseline covariates.
\(U_k\) is the reward at stage k (not influenced by the action \(A_k\)).
\(X_k\) is a vector of state covariates summarizing the state at stage k.
\(A_k\) is the categorical action within the action set \(\mathcal{A}\) at stage k.
The utility is given by the sum of the rewards, i.e., \(U = \sum_{k = 1}^{K+1} U_k\).
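As a concrete illustration of the notation (a Python sketch; the field names and layout are hypothetical, not the package's internal representation):

```python
def utility(obs):
    """U = sum_{k=1}^{K+1} U_k: the total utility is the sum of the rewards."""
    return sum(obs["rewards"])

# Toy two-stage (K = 2) observation with baseline B, state covariates X,
# actions A, and rewards U_1, U_2, U_3.
o = {"B": [0.1], "rewards": [0.0, 1.5, 2.5], "X": [[0.3], [0.7]], "A": [1, 0]}
u = utility(o)  # 0.0 + 1.5 + 2.5 = 4.0
```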
A (subgroup) policy is a set of functions $$d = \{d_1, ..., d_K\},$$ where \(d_k\) for \(k\in \{1, ..., K\}\) maps a subset or function \(V_k\) of \(\{B, X_1, A_1, ..., A_{k-1}, X_k\}\) into the action set (or the set of subgroups).
Recursively define the Q-models (q_models):
$$Q^d_K(h_K, a_K) = \mathbb{E}[U|H_K = h_K, A_K = a_K]$$
$$Q^d_k(h_k, a_k) = \mathbb{E}[Q^d_{k+1}(H_{k+1},
d_{k+1}(V_{k+1}))|H_k = h_k, A_k = a_k].$$
If q_full_history = TRUE,
\(H_k = \{B, X_1, A_1, ..., A_{k-1}, X_k\}\), and if
q_full_history = FALSE, \(H_k = \{B, X_k\}\).
The g-models (g_models) are defined as
$$g_k(h_k, a_k) = \mathbb{P}(A_k = a_k|H_k = h_k).$$
If g_full_history = TRUE,
\(H_k = \{B, X_1, A_1, ..., A_{k-1}, X_k\}\), and if
g_full_history = FALSE, \(H_k = \{B, X_k\}\).
Furthermore, if g_full_history = FALSE and g_models is a
single model, it is assumed that \(g_1(h_1, a_1) = ... = g_K(h_K, a_K)\).
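The simplest g-model of this kind ignores the history entirely and estimates \(\mathbb{P}(A_k = a_k)\) by empirical action frequencies, analogous in spirit to g_empir() with no covariates. A minimal Python sketch (illustrative only; `fit_g_empirical` is a hypothetical helper, not a package function):

```python
from collections import Counter

def fit_g_empirical(actions):
    """History-free g-model: estimate P(A = a) by the empirical
    frequency of each action in the training data."""
    counts = Counter(actions)
    n = len(actions)
    return {a: c / n for a, c in counts.items()}

# Toy single-stage action data with a binary action set {0, 1}.
g_hat = fit_g_empirical([1, 0, 1, 1, 0, 1, 1, 0])
# g_hat[1] == 5/8, g_hat[0] == 3/8
```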
If target = "value", policy_eval_online()
returns the estimate of
the value, i.e., the expected potential utility under the policy (coef):
$$\mathbb{E}[U^{(d)}].$$
If target = "subgroup", K = 1, \(\mathcal{A} = \{0,1\}\),
and \(d_1(V_1) \in \{s_1, s_2\}\), policy_eval_online()
returns the estimates of the subgroup average
treatment effect (coef):
$$\mathbb{E}[U^{(1)} - U^{(0)}| d_1(\cdot) = s], \quad s\in \{s_1,s_2\}.$$
Estimation of the target parameter is based on online estimation/sequential
validation using the doubly robust score. The following figure illustrates
online estimation using M = 5 steps and an initial training block of
size train_block_size = \(l\).

Step 1:
The \(n\) observations are randomly ordered. In step 1,
the first \(l\) observations \(\{1,...,l\}\), highlighted in teal/blue, are used to fit the
Q-models, g-models, the policy (if using the policy_learn argument), and any other required models.
We denote the collection of these fitted models as \(P\).
The remaining observations are split into M blocks of size \(m = (n-l)/M\), which
we for simplicity assume to be a whole number. In step 1, the target
parameter is estimated using the associated doubly robust score \(Z(P)\)
evaluated on the first validation fold
highlighted in pink \(\{l+1,...,l+m\}\):
$$ \frac{\sum_{i = l+1}^{l+m} {\widehat \sigma_{i}^{-1}} Z(\widehat P_i)(O_i)} {\sum_{i = l+1}^{l+m} \widehat \sigma_{i}^{-1}}, $$ where \(\widehat P_i\) for \(i \in \{l+1,...,l+m\}\) refers to the fitted models trained on \(\{1,...,l\}\), and \(\widehat \sigma_i\) is the in-sample estimate of the standard deviation based on the training observations \(\{1,...,l\}\). We later give an exact expression for \(\widehat \sigma_i\) for each target parameter. Note that \(\widehat \sigma_i\) is constant for \(i \in \{l+1,...,l+m\}\), but it is convenient to keep the same index for \(\widehat \sigma\).
Step 2 to M:
In step 2, observations with index \(\{1,...,l+m\}\) are used to fit the model collection \(P\),
as well as the in-sample estimate of the standard deviation. For \(i \in \{l+m+1,...,l+2m\}\), these are
denoted as \(\widehat P_i, \widehat \sigma_i\).
This sequential model fitting is repeated for all M
steps and the updated online estimator is given by
$$ \frac{\sum_{i = l+1}^{n} {\widehat \sigma_{i}^{-1}} Z(\widehat P_i)(O_i)} {\sum_{i = l+1}^n \widehat \sigma_{i}^{-1}}, $$
with an associated standard error estimate given by
$$\frac{\left(\frac{1}{n-l}\sum_{i = l+1}^n \widehat \sigma_{i}^{-1}\right)^{-1}}{\sqrt{n-l}}.$$
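The two displays above can be computed directly; a minimal Python sketch with toy (assumed) validation scores \(Z(\widehat P_i)(O_i)\) and in-sample standard deviations, kept constant within each step as described in the text:

```python
import math

def online_estimate(z, sigma):
    """Inverse-sigma-weighted online estimator:
    sum_i sigma_i^{-1} z_i / sum_i sigma_i^{-1},
    where z_i = Z(P_hat_i)(O_i) for validation observation i."""
    w = [1.0 / s for s in sigma]
    return sum(wi * zi for wi, zi in zip(w, z)) / sum(w)

def online_se(sigma):
    """Standard error: (mean of sigma_i^{-1})^{-1} / sqrt(n - l)."""
    n_val = len(sigma)
    mean_inv = sum(1.0 / s for s in sigma) / n_val
    return (1.0 / mean_inv) / math.sqrt(n_val)

# Toy scores over two steps of size m = 2 (sigma constant per step).
z = [1.0, 2.0, 3.0, 4.0]
sigma = [2.0, 2.0, 1.0, 1.0]
est = online_estimate(z, sigma)  # (0.5 + 1.0 + 3.0 + 4.0) / 3 = 8.5/3
se = online_se(sigma)            # (1 / 0.75) / sqrt(4) = 2/3
```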
target = "value":
For a policy value target the doubly robust score is given by $$ Z(d, g, Q^d)(O) = Q^d_1(H_1 , d_1(V_1)) + \sum_{r = 1}^K \prod_{j = 1}^{r} \frac{I\{A_j = d_j(\cdot)\}}{g_{j}(H_j, A_j)} \{Q_{r+1}^d(H_{r+1} , d_{r+1}(V_{r+1})) - Q_{r}^d(H_r , d_r(V_r))\}, $$ with the convention \(Q^d_{K+1}(\cdot) = U\). The influence function (or curve) of the associated one-step estimator is $$Z(d, g, Q^d)(O) - \mathbb{E}[Z(d,g, Q^d)(O)],$$ which is used to estimate the in-sample standard deviation. For example, in step 2, i.e., for \(i \in \{l+m+1,...,l+2m\}\), $$ \widehat \sigma_i^2 = \frac{1}{l+m}\sum_{j=1}^{l+m} \left(Z(\widehat d_i, \widehat{g}_i, \widehat Q_i)(O_j) - \frac{1}{l+m}\sum_{r=1}^{l+m} Z(\widehat d_i, \widehat{g}_i, \widehat Q_i)(O_r) \right)^2 $$
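In the single-stage case (K = 1, where \(Q^d_2 = U\)) the score has a simple closed form; a Python sketch with toy numbers (all values assumed, for illustration only):

```python
def dr_score_value(u, a, d, g1, q1):
    """Single-stage (K = 1) doubly robust score for the policy value:
    Z = Q_1(H_1, d_1(V_1))
        + I{A_1 = d_1(V_1)} / g_1(H_1, A_1) * (U - Q_1(H_1, d_1(V_1))).
    u: observed utility; a: observed action; d: recommended action d_1(V_1);
    g1: g_1(H_1, a); q1: dict mapping action -> Q_1(H_1, action)."""
    z = q1[d]
    if a == d:  # augmentation term is active only when the policy was followed
        z += (u - q1[d]) / g1
    return z

# Followed policy (a == d): augmented score.
z_follow = dr_score_value(u=3.0, a=1, d=1, g1=0.5, q1={0: 1.0, 1: 2.0})
# 2.0 + (3.0 - 2.0) / 0.5 = 4.0
# Deviated (a != d): score reduces to the Q-model prediction.
z_dev = dr_score_value(u=3.0, a=0, d=1, g1=0.5, q1={0: 1.0, 1: 2.0})
# 2.0
```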
target = "subgroup":
For a subgroup average treatment effect target,
where K = 1 (single-stage),
\(\mathcal{A} = \{0,1\}\) (binary treatment), and
\(d_1(V_1) \in \{s_1, s_2\}\) (dichotomous subgroup policy) the
doubly robust score is given by
$$ Z(d,g,Q,D)(O) = \frac{I\{d_1(\cdot) = s\}}{D} \Big\{Z_1(1,g,Q)(O) - Z_1(0,g,Q)(O) \Big\},$$ $$ Z_1(a, g, Q)(O) = Q_1(H_1 , a) + \frac{I\{A = a\}}{g_1(H_1, a)} \{U - Q_{1}(H_1 , a)\}, $$ where \(D = \mathbb{P}(d_1(V_1) = s)\).
The associated one-step/estimating-equation estimator has influence function $$\frac{ I\{d_1(\cdot) = s\}}{D} \Big\{Z_1(1,g,Q)(O) - Z_1(0,g,Q)(O) - E[Z_1(1,g,Q)(O) - Z_1(0,g,Q)(O) | d_1(\cdot) = s]\Big\},$$ which is used to estimate the standard deviation \(\widehat \sigma\).
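A Python sketch of the subgroup estimator for K = 1 with binary actions, where \(D\) is replaced by the empirical subgroup proportion (toy data and helper names are assumed, for illustration only):

```python
def z1(a, u, a_obs, g1, q1):
    """Z_1(a, g, Q)(O) = Q_1(H_1, a) + I{A = a}/g_1(H_1, a) * (U - Q_1(H_1, a))."""
    z = q1[a]
    if a_obs == a:
        z += (u - q1[a]) / g1[a]
    return z

def subgroup_ate(data, s):
    """Estimate E[U^(1) - U^(0) | d_1(.) = s] by averaging
    Z_1(1, g, Q) - Z_1(0, g, Q) over the observations assigned to
    subgroup s; dividing by the subgroup count corresponds to using
    the empirical proportion for D."""
    in_s = [o for o in data if o["d"] == s]
    diffs = [z1(1, o["u"], o["a"], o["g1"], o["q1"])
             - z1(0, o["u"], o["a"], o["g1"], o["q1"])
             for o in in_s]
    return sum(diffs) / len(diffs)

# Toy data: both observations fall in subgroup s = 1.
data = [
    {"d": 1, "u": 2.0, "a": 1, "g1": {0: 0.5, 1: 0.5}, "q1": {0: 0.0, 1: 1.0}},
    {"d": 1, "u": 0.0, "a": 0, "g1": {0: 0.5, 1: 0.5}, "q1": {0: 1.0, 1: 1.0}},
]
ate = subgroup_ate(data, 1)  # (3.0 + 2.0) / 2 = 2.5
```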
Luedtke, Alexander R., and Mark J. van der Laan. "Statistical Inference for the Mean Outcome Under a Possibly Non-unique Optimal Treatment Strategy." Annals of Statistics 44(2) (2016): 713-742. doi:10.1214/15-AOS1384