policy_eval_online() is used to estimate
the value of a given fixed policy
or of a data-adaptive policy (e.g., a policy
learned from the data). policy_eval_online()
is also used to estimate the subgroup average
treatment effect as defined by the (learned) policy.
The evaluation is based on an online/sequential validation
estimation scheme, which makes the estimation approach valid even for a
non-converging policy, i.e., under no heterogeneous treatment effect
(the exceptional law), see details.
policy_eval_online(
policy_data,
policy = NULL,
policy_learn = NULL,
g_functions = NULL,
g_models = g_glm(),
g_full_history = FALSE,
save_g_functions = TRUE,
q_functions = NULL,
q_models = q_glm(),
q_full_history = FALSE,
save_q_functions = TRUE,
c_functions = NULL,
c_models = NULL,
c_full_history = FALSE,
save_c_functions = TRUE,
m_function = NULL,
m_model = NULL,
m_full_history = FALSE,
save_m_function = TRUE,
target = "value",
M = 4,
train_block_size = get_n(policy_data)/5,
name = NULL,
min_subgroup_size = 1
)

policy_eval_online() returns an object of inherited class "policy_eval_online", "policy_eval".
The object is a list containing the following elements:
coef: Numeric vector. The estimated target parameter: policy value or subgroup average treatment effect.
vcov: Numeric vector. The estimated variance associated with coef.
target: Character string. The target parameter ("value" or "subgroup").
id: Character vector. The IDs of the observations.
name: Character vector. Names for each element in coef.
train_sequential_index: List of index sets used for training at each step.
valid_sequential_index: List of index sets used for validation at each step.
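As a minimal sketch (in Python, for illustration only; the package itself is R), the sequential training/validation index sets could be constructed as follows, with an initial training block of size l followed by M validation blocks of size m = (n - l)/M. The helper name `sequential_indices` is hypothetical, not part of the package:

```python
def sequential_indices(n, l, M):
    """Sketch of the per-step train/validation index sets:
    an initial training block {1, ..., l} and M validation blocks
    of size m = (n - l) / M. At step s, training uses
    {1, ..., l + (s - 1) * m} and validation uses the next block."""
    m, rem = divmod(n - l, M)
    assert rem == 0, "n - l must be divisible by M"
    train, valid = [], []
    for s in range(1, M + 1):
        cut = l + (s - 1) * m
        train.append(list(range(1, cut + 1)))
        valid.append(list(range(cut + 1, cut + m + 1)))
    return train, valid

train, valid = sequential_indices(n=20, l=10, M=5)
# train[0] == [1, ..., 10]; valid[0] == [11, 12]; valid[4] == [19, 20]
```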
policy_data: Policy data object created by policy_data().
policy: Policy object created by policy_def().
policy_learn: Policy learner object created by policy_learn().
g_functions: Fitted g-model objects, see nuisance_functions. Preferably, use g_models.
g_models: List of action probability models/g-models for each stage created by g_empir(), g_glm(), g_rf(), g_sl() or similar functions. Only used for evaluation if g_functions is NULL. If a single model is provided and g_full_history is FALSE, a single g-model is fitted across all stages. If g_full_history is TRUE, the model is reused at every stage.
g_full_history: If TRUE, the full history is used to fit each g-model. If FALSE, the state/Markov-type history is used to fit each g-model.
save_g_functions: If TRUE, the fitted g-functions are saved.
q_functions: Fitted Q-model objects, see nuisance_functions. Only valid if the Q-functions are fitted using the same policy. Preferably, use q_models.
q_models: Outcome regression models/Q-models created by q_glm(), q_rf(), q_sl() or similar functions. Only used for evaluation if q_functions is NULL. If a single model is provided, the model is reused at every stage.
q_full_history: Similar to g_full_history.
save_q_functions: Similar to save_g_functions.
c_functions: Fitted c-model/censoring probability model objects. Preferably, use c_models.
c_models: List of right-censoring probability models, see c_model.
c_full_history: Similar to g_full_history.
save_c_functions: Similar to save_g_functions.
m_function: Fitted outcome model object for stage K+1. Preferably, use m_model.
m_model: Outcome model for the utility at stage K+1. Only used if the final utility contribution is missing/has been right-censored.
m_full_history: Similar to g_full_history.
save_m_function: Similar to save_g_functions.
target: Character string. Either "value" or "subgroup". If "value", the target parameter is the policy value. If "subgroup", the target parameter is the subgroup average treatment effect given by the policy, see details. "subgroup" is only implemented for type = "dr" in the single-stage case with a dichotomous action set.
M: Integer. Number of blocks for online estimation/sequential validation, excluding the initial training block, see details.
train_block_size: Integer. Size of the initial training block, used only for training the policy and nuisance models, see details.
name: Character string.
min_subgroup_size: Integer. Minimum number of observations in the evaluated subgroup (only used if target = "subgroup").
Each observation has the sequential form $$O = \{B, U_1, X_1, A_1, ..., U_K, X_K, A_K, U_{K+1}\},$$ for a possibly stochastic number of stages K.
\(B\) is a vector of baseline covariates.
\(U_k\) is the reward at stage k (not influenced by the action \(A_k\)).
\(X_k\) is a vector of state covariates summarizing the state at stage k.
\(A_k\) is the categorical action within the action set \(\mathcal{A}\) at stage k.
The utility is given by the sum of the rewards, i.e., \(U = \sum_{k = 1}^{K+1} U_k\).
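As a concrete illustration of the notation (a Python sketch; the field names and layout are hypothetical, not the package's internal representation):

```python
def utility(obs):
    """U = sum_{k=1}^{K+1} U_k: the total utility is the sum of the rewards."""
    return sum(obs["rewards"])

# Toy two-stage (K = 2) observation with baseline B, state covariates X,
# actions A, and rewards U_1, U_2, U_3.
o = {"B": [0.1], "rewards": [0.0, 1.5, 2.5], "X": [[0.3], [0.7]], "A": [1, 0]}
u = utility(o)  # 0.0 + 1.5 + 2.5 = 4.0
```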
A (subgroup) policy is a set of functions $$d = \{d_1, ..., d_K\},$$ where \(d_k\) for \(k\in \{1, ..., K\}\) maps a subset or function \(V_k\) of \(\{B, X_1, A_1, ..., A_{k-1}, X_k\}\) into the action set (or the set of subgroups).
Recursively define the Q-models (q_models):
$$Q^d_K(h_K, a_K) = \mathbb{E}[U|H_K = h_K, A_K = a_K]$$
$$Q^d_k(h_k, a_k) = \mathbb{E}[Q^d_{k+1}(H_{k+1},
d_{k+1}(V_{k+1}))|H_k = h_k, A_k = a_k].$$
If q_full_history = TRUE,
\(H_k = \{B, X_1, A_1, ..., A_{k-1}, X_k\}\), and if
q_full_history = FALSE, \(H_k = \{B, X_k\}\).
The g-models (g_models) are defined as
$$g_k(h_k, a_k) = \mathbb{P}(A_k = a_k|H_k = h_k).$$
If g_full_history = TRUE,
\(H_k = \{B, X_1, A_1, ..., A_{k-1}, X_k\}\), and if
g_full_history = FALSE, \(H_k = \{B, X_k\}\).
Furthermore, if g_full_history = FALSE and g_models is a
single model, it is assumed that \(g_1(h_1, a_1) = ... = g_K(h_K, a_K)\).
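The simplest g-model of this kind ignores the history entirely and estimates \(\mathbb{P}(A_k = a_k)\) by empirical action frequencies, analogous in spirit to g_empir() with no covariates. A minimal Python sketch (illustrative only; `fit_g_empirical` is a hypothetical helper, not a package function):

```python
from collections import Counter

def fit_g_empirical(actions):
    """History-free g-model: estimate P(A = a) by the empirical
    frequency of each action in the training data."""
    counts = Counter(actions)
    n = len(actions)
    return {a: c / n for a, c in counts.items()}

# Toy single-stage action data with a binary action set {0, 1}.
g_hat = fit_g_empirical([1, 0, 1, 1, 0, 1, 1, 0])
# g_hat[1] == 5/8, g_hat[0] == 3/8
```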
If target = "value", policy_eval_online()
returns the estimate of
the value, i.e., the expected potential utility under the policy (coef):
$$\mathbb{E}[U^{(d)}].$$
If target = "subgroup", K = 1, \(\mathcal{A} = \{0,1\}\),
and \(d_1(V_1) \in \{s_1, s_2\}\), policy_eval_online()
returns the estimates of the subgroup average
treatment effect (coef):
$$\mathbb{E}[U^{(1)} - U^{(0)}| d_1(\cdot) = s], \quad s\in \{s_1,s_2\}.$$
Estimation of the target parameter is based on online estimation/sequential
validation using the doubly robust score. The following figure illustrates
online estimation using M = 5 steps and an initial training block of
size train_block_size = \(l\).

Step 1:
The \(n\) observations are randomly ordered. In step 1,
the first \(l\) observations \(\{1,...,l\}\), highlighted in teal/blue, are used to fit the
Q-models, g-models, the policy (if using the policy_learn argument), and any other required models.
We denote the collection of these fitted models as \(P\).
The remaining observations are split into M blocks of size \(m = (n-l)/M\), which
we for simplicity assume to be a whole number. In step 1, the target
parameter is estimated using the associated doubly robust score \(Z(P)\)
evaluated on the first validation fold
highlighted in pink \(\{l+1,...,l+m\}\):
$$ \frac{\sum_{i = l+1}^{l+m} {\widehat \sigma_{i}^{-1}} Z(\widehat P_i)(O_i)} {\sum_{i = l+1}^{l+m} \widehat \sigma_{i}^{-1}}, $$ where \(\widehat P_i\) for \(i \in \{l+1,...,l+m\}\) refers to the fitted models trained on \(\{1,...,l\}\), and \(\widehat \sigma_i\) is the in-sample estimate of the standard deviation based on the training observations \(\{1,...,l\}\). We later give an exact expression for \(\widehat \sigma_i\) for each target parameter. Note that \(\widehat \sigma_i\) is constant for \(i \in \{l+1,...,l+m\}\), but it is convenient to keep the same index for \(\widehat \sigma\).
Step 2 to M:
In step 2, observations with index \(\{1,...,l+m\}\) are used to fit the model collection \(P\),
as well as the in-sample estimate of the standard deviation. For \(i \in \{l+m+1,...,l+2m\}\), these are
denoted as \(\widehat P_i, \widehat \sigma_i\).
This sequential model fitting is repeated for all M
steps and the updated online estimator is given by
$$ \frac{\sum_{i = l+1}^{n} {\widehat \sigma_{i}^{-1}} Z(\widehat P_i)(O_i)} {\sum_{i = l+1}^n \widehat \sigma_{i}^{-1}}, $$
with an associated standard error estimate given by
$$\frac{\left(\frac{1}{n-l}\sum_{i = l+1}^n \widehat \sigma_{i}^{-1}\right)^{-1}}{\sqrt{n-l}}.$$
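The two displays above can be computed directly; a minimal Python sketch with toy (assumed) validation scores \(Z(\widehat P_i)(O_i)\) and in-sample standard deviations, kept constant within each step as described in the text:

```python
import math

def online_estimate(z, sigma):
    """Inverse-sigma-weighted online estimator:
    sum_i sigma_i^{-1} z_i / sum_i sigma_i^{-1},
    where z_i = Z(P_hat_i)(O_i) for validation observation i."""
    w = [1.0 / s for s in sigma]
    return sum(wi * zi for wi, zi in zip(w, z)) / sum(w)

def online_se(sigma):
    """Standard error: (mean of sigma_i^{-1})^{-1} / sqrt(n - l)."""
    n_val = len(sigma)
    mean_inv = sum(1.0 / s for s in sigma) / n_val
    return (1.0 / mean_inv) / math.sqrt(n_val)

# Toy scores over two steps of size m = 2 (sigma constant per step).
z = [1.0, 2.0, 3.0, 4.0]
sigma = [2.0, 2.0, 1.0, 1.0]
est = online_estimate(z, sigma)  # (0.5 + 1.0 + 3.0 + 4.0) / 3 = 8.5/3
se = online_se(sigma)            # (1 / 0.75) / sqrt(4) = 2/3
```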
target = "value":
For a policy value target the doubly robust score is given by $$ Z(d, g, Q^d)(O) = Q^d_1(H_1 , d_1(V_1)) + \sum_{r = 1}^K \prod_{j = 1}^{r} \frac{I\{A_j = d_j(\cdot)\}}{g_{j}(H_j, A_j)} \{Q_{r+1}^d(H_{r+1} , d_{r+1}(V_{r+1})) - Q_{r}^d(H_r , d_r(V_r))\}, $$ with the convention \(Q^d_{K+1}(\cdot) = U\). The influence function (or curve) of the associated one-step estimator is $$Z(d, g, Q^d)(O) - \mathbb{E}[Z(d,g, Q^d)(O)],$$ which is used to estimate the in-sample standard deviation. For example, in step 2, i.e., for \(i \in \{l+m+1,...,l+2m\}\), $$ \widehat \sigma_i^2 = \frac{1}{l+m}\sum_{j=1}^{l+m} \left(Z(\widehat d_i, \widehat{g}_i, \widehat Q_i)(O_j) - \frac{1}{l+m}\sum_{r=1}^{l+m} Z(\widehat d_i, \widehat{g}_i, \widehat Q_i)(O_r) \right)^2 $$
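In the single-stage case (K = 1, where \(Q^d_2 = U\)) the score has a simple closed form; a Python sketch with toy numbers (all values assumed, for illustration only):

```python
def dr_score_value(u, a, d, g1, q1):
    """Single-stage (K = 1) doubly robust score for the policy value:
    Z = Q_1(H_1, d_1(V_1))
        + I{A_1 = d_1(V_1)} / g_1(H_1, A_1) * (U - Q_1(H_1, d_1(V_1))).
    u: observed utility; a: observed action; d: recommended action d_1(V_1);
    g1: g_1(H_1, a); q1: dict mapping action -> Q_1(H_1, action)."""
    z = q1[d]
    if a == d:  # augmentation term is active only when the policy was followed
        z += (u - q1[d]) / g1
    return z

# Followed policy (a == d): augmented score.
z_follow = dr_score_value(u=3.0, a=1, d=1, g1=0.5, q1={0: 1.0, 1: 2.0})
# 2.0 + (3.0 - 2.0) / 0.5 = 4.0
# Deviated (a != d): score reduces to the Q-model prediction.
z_dev = dr_score_value(u=3.0, a=0, d=1, g1=0.5, q1={0: 1.0, 1: 2.0})
# 2.0
```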
target = "subgroup":
For a subgroup average treatment effect target,
where K = 1 (single-stage),
\(\mathcal{A} = \{0,1\}\) (binary treatment), and
\(d_1(V_1) \in \{s_1, s_2\}\) (dichotomous subgroup policy) the
doubly robust score is given by
$$ Z(d,g,Q,D)(O) = \frac{I\{d_1(\cdot) = s\}}{D} \Big\{Z_1(1,g,Q)(O) - Z_1(0,g,Q)(O) \Big\},$$ $$ Z_1(a, g, Q)(O) = Q_1(H_1 , a) + \frac{I\{A = a\}}{g_1(H_1, a)} \{U - Q_{1}(H_1 , a)\}, $$ where \(D = \mathbb{P}(d_1(V_1) = s)\).
The associated one-step/estimating-equation estimator has influence function $$\frac{ I\{d_1(\cdot) = s\}}{D} \Big\{Z_1(1,g,Q)(O) - Z_1(0,g,Q)(O) - E[Z_1(1,g,Q)(O) - Z_1(0,g,Q)(O) | d_1(\cdot) = s]\Big\},$$ which is used to estimate the standard deviation \(\widehat \sigma\).
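A Python sketch of the subgroup estimator for K = 1 with binary actions, where \(D\) is replaced by the empirical subgroup proportion (toy data and helper names are assumed, for illustration only):

```python
def z1(a, u, a_obs, g1, q1):
    """Z_1(a, g, Q)(O) = Q_1(H_1, a) + I{A = a}/g_1(H_1, a) * (U - Q_1(H_1, a))."""
    z = q1[a]
    if a_obs == a:
        z += (u - q1[a]) / g1[a]
    return z

def subgroup_ate(data, s):
    """Estimate E[U^(1) - U^(0) | d_1(.) = s] by averaging
    Z_1(1, g, Q) - Z_1(0, g, Q) over the observations assigned to
    subgroup s; dividing by the subgroup count corresponds to using
    the empirical proportion for D."""
    in_s = [o for o in data if o["d"] == s]
    diffs = [z1(1, o["u"], o["a"], o["g1"], o["q1"])
             - z1(0, o["u"], o["a"], o["g1"], o["q1"])
             for o in in_s]
    return sum(diffs) / len(diffs)

# Toy data: both observations fall in subgroup s = 1.
data = [
    {"d": 1, "u": 2.0, "a": 1, "g1": {0: 0.5, 1: 0.5}, "q1": {0: 0.0, 1: 1.0}},
    {"d": 1, "u": 0.0, "a": 0, "g1": {0: 0.5, 1: 0.5}, "q1": {0: 1.0, 1: 1.0}},
]
ate = subgroup_ate(data, 1)  # (3.0 + 2.0) / 2 = 2.5
```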
Luedtke, Alexander R., and Mark J. van der Laan. "Statistical Inference for the Mean Outcome Under a Possibly Non-unique Optimal Treatment Strategy." Annals of Statistics 44(2) (2016): 713-742. doi:10.1214/15-AOS1384