ipd: Inference on Predicted Data (ipd)

Description

The main wrapper function for conducting inference on predicted data (IPD) with a choice of methods and models. Returns a list of fitted model components.

Usage

ipd(
  formula,
  method,
  model,
  data,
  label = NULL,
  unlabeled_data = NULL,
  seed = NULL,
  intercept = TRUE,
  alpha = 0.05,
  alternative = "two-sided",
  n_t = Inf,
  na_action = "na.fail",
  ...
)

Value

A summary of the model output, returned as a list containing the fitted model components:

coefficients: Estimated coefficients of the model

Standard errors of the estimated coefficients

Confidence intervals for the estimated coefficients

formula

The formula used to fit the ipd model.

data

The data frame used for model fitting.

method

The method used for model fitting.

model

The type of model fitted.

intercept

Logical. Indicates if an intercept was included in the model.

fit

Fitted model object containing estimated coefficients, standard errors, confidence intervals, and additional method-specific output.

...

Additional output specific to the method used.

Arguments

formula: An object of class formula: a symbolic description of the model to be fitted. Must be of the form Y - f ~ X, where Y is the name of the column corresponding to the observed outcome in the labeled data, f is the name of the column corresponding to the predicted outcome in both labeled and unlabeled data, and X corresponds to the features of interest (i.e., X = X1 + ... + Xp). See 1. Formula in the Details below for more information.
method: The IPD method to be used for fitting the model. Must be one of "postpi_analytic", "postpi_boot", "ppi", "ppi_plusplus", or "pspa". See 3. Method in the Details below for more information.
model: The type of downstream inferential model to be fitted, or the parameter being estimated. Must be one of "mean", "quantile", "ols", "logistic", or "poisson". See 4. Model in the Details below for more information.
data: A data.frame containing the variables in the model, either a stacked data frame with a specific column identifying the labeled versus unlabeled observations (label), or only the labeled data set. Must contain columns for the observed outcomes (Y), the predicted outcomes (f), and the features (X) needed to specify the formula. See 2. Data in the Details below for more information.
label: A string, int, or logical specifying the column in the data that distinguishes between the labeled and unlabeled observations. If NULL, unlabeled_data must be specified. See 2. Data in the Details below for more information.
unlabeled_data: (optional) A data.frame of unlabeled data. If NULL, label must be specified. Specifying both the label and unlabeled_data arguments will result in an error message. If specified, must contain columns for the predicted outcomes (f), and the features (X) needed to specify the formula. See 2. Data in the Details below for more information.
seed: (optional) An integer seed for random number generation.
intercept: Logical. Should an intercept be included in the model? Default is TRUE.
alpha: The significance level for confidence intervals. Default is 0.05.
alternative: A string specifying the alternative hypothesis. Must be one of "two-sided", "less", or "greater".
n_t: (integer, optional) Size of the dataset used to train the prediction function (necessary for the "postpi_analytic" and "postpi_boot" methods if n_t < nrow(X_l)). Defaults to Inf.
na_action: (string, optional) How missing covariate data should be handled. Currently "na.fail" and "na.omit" are accommodated. Defaults to "na.fail".
...: Additional arguments to be passed to the fitting function. See 5. Auxiliary Arguments and 6. Other Arguments in the Details below for more information.

Details

1. Formula:

The ipd function uses one formula argument that specifies both the calibrating model (e.g., PostPI "relationship model", PPI "rectifier" model) and the inferential model. These separate models will be created internally based on the specific method called.
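As a rough illustration of how a single formula of the form Y - f ~ X can carry both models, the base R sketch below pulls the outcome, prediction, and feature names out of the formula and rebuilds two separate formulas from them. This is not the package's internal code; the names fml, inferential_fml, and calibrating_fml are hypothetical, and regressing Y on f is only one possible calibrating model.

```r
# Illustrative only: not the package's internal code.
fml <- Y - f ~ X1 + X2

lhs_vars <- all.vars(fml[[2]])  # variables on the left-hand side: "Y" and "f"
rhs_vars <- all.vars(fml[[3]])  # features on the right-hand side: "X1" and "X2"

outcome    <- lhs_vars[1]       # observed outcome column
prediction <- lhs_vars[2]       # predicted outcome column

# Inferential model: observed outcome regressed on the features
inferential_fml <- reformulate(rhs_vars, response = outcome)

# One possible calibrating model: observed outcome regressed on the prediction
calibrating_fml <- reformulate(prediction, response = outcome)

print(inferential_fml)
print(calibrating_fml)
```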

2. Data:

The data can be specified in two ways:

(1) Single data argument (data) containing a stacked data.frame and a label identifier (label).
(2) Two data arguments, one for the labeled data (data) and one for the unlabeled data (unlabeled_data).

For option (1), provide one data argument (data) which contains a stacked data.frame with both the unlabeled and labeled data and a label argument that specifies the column identifying the labeled versus the unlabeled observations in the stacked data.frame (e.g., label = "set_label" if the column "set_label" in the stacked data denotes which set an observation belongs to).

NOTE: Labeled data identifiers can be:

String: "l", "lab", "label", "labeled", "labelled", "tst", "test", "true"
Logical: TRUE
Factor: Non-reference category (i.e., binary 1)

Unlabeled data identifiers can be:

String: "u", "unlab", "unlabeled", "unlabelled", "val", "validation", "false"
Logical: FALSE
Factor: Reference category (i.e., binary 0)

For option (2), provide separate data arguments for the labeled data set (data) and the unlabeled data set (unlabeled_data). If the second argument is provided, the function ignores the label identifier and assumes the data provided are not stacked.

NOTE: Columns in data or unlabeled_data are used only if they are explicitly referenced in the formula argument or in the label argument (if the data are passed as one stacked data frame).
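The two layouts can be sketched with base R alone. The simulated columns, the sample sizes, and the object names (labeled, unlabeled, stacked) below are assumptions made for illustration; the unlabeled Y is set to NA purely as a placeholder so the two data frames can be stacked.

```r
# Illustrative data only: simulated columns, not output from the package.
set.seed(1)
n_lab   <- 50
n_unlab <- 100

# Labeled set: observed outcome Y, prediction f, and feature X1
labeled   <- data.frame(X1 = rnorm(n_lab))
labeled$Y <- 1 + labeled$X1 + rnorm(n_lab)
labeled$f <- 1 + labeled$X1 + rnorm(n_lab, sd = 0.5)  # stand-in predictions

# Unlabeled set: prediction f and feature X1 only (Y is a placeholder)
unlabeled   <- data.frame(X1 = rnorm(n_unlab))
unlabeled$Y <- NA_real_
unlabeled$f <- 1 + unlabeled$X1 + rnorm(n_unlab, sd = 0.5)

# Option (1): one stacked data frame plus a label column
labeled$set_label   <- "labeled"
unlabeled$set_label <- "unlabeled"
stacked <- rbind(labeled, unlabeled)

# Option (2): two separate data frames, recovered here from the stacked one
labeled_only   <- stacked[stacked$set_label == "labeled",   c("X1", "Y", "f")]
unlabeled_only <- stacked[stacked$set_label == "unlabeled", c("X1", "f")]

table(stacked$set_label)
```

Following the Usage section, option (1) would then correspond to passing data = stacked together with label = "set_label", and option (2) to passing data = labeled_only together with unlabeled_data = unlabeled_only.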

3. Method:

Use the method argument to specify the fitting method:

"postpi_analytic": Wang et al. (2020) Post-Prediction Inference (PostPI) Analytic Correction
"postpi_boot": Wang et al. (2020) Post-Prediction Inference (PostPI) Bootstrap Correction
"ppi": Angelopoulos et al. (2023) Prediction-Powered Inference (PPI)
"ppi_plusplus": Angelopoulos et al. (2023) PPI++
"pspa": Miao et al. (2023) Assumption-Lean and Data-Adaptive Post-Prediction Inference (PSPA)
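To give a feel for what these corrections do, here is a simplified base R sketch of the prediction-powered estimate of a mean in the spirit of Angelopoulos et al. (2023): the average prediction on the unlabeled data is shifted by the average of Y - f on the labeled data. The simulated data and all variable names are assumptions, and this is not the code the package runs.

```r
# Simplified sketch of a PPI-style mean estimate; not the package's code.
set.seed(2)
n <- 200    # labeled sample size
N <- 2000   # unlabeled sample size

Y_l <- rnorm(n, mean = 3)                                   # observed outcomes
f_l <- Y_l + rnorm(n, mean = 0.3, sd = 0.5)                 # biased predictions (labeled)
f_u <- rnorm(N, mean = 3) + rnorm(N, mean = 0.3, sd = 0.5)  # biased predictions (unlabeled)

# Correction term estimated on the labeled data, where Y and f are both seen
correction <- mean(Y_l - f_l)

# Point estimate: unlabeled prediction mean plus the labeled-data correction
theta_ppi <- mean(f_u) + correction

# Normal-approximation standard error and 95% confidence interval
se_ppi <- sqrt(var(f_u) / N + var(Y_l - f_l) / n)
ci_ppi <- theta_ppi + c(-1, 1) * qnorm(0.975) * se_ppi

c(naive = mean(f_u), ppi = theta_ppi)
ci_ppi
```

In this simulation the naive average of the predictions inherits their bias, while the corrected estimate is centered near the true mean of 3.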

4. Model:

Use the model argument to specify the type of downstream inferential model or parameter to be estimated:

"mean": Mean value of a continuous outcome
"quantile": qth quantile of a continuous outcome
"ols": Linear regression coefficients for a continuous outcome
"logistic": Logistic regression coefficients for a binary outcome
"poisson": Poisson regression coefficients for a count outcome

The ipd wrapper function will concatenate the method and model arguments to identify the required helper function, following the naming convention "method_model".
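The "method_model" naming convention can be illustrated with a small dispatch sketch. The dispatch function and the stand-in helper ppi_ols defined here are hypothetical and exist only to show the string concatenation and lookup; the package's real helpers take data arguments and return fitted components.

```r
# Hypothetical stand-in helper, defined only so the sketch runs
ppi_ols <- function(...) "ppi_ols was called"

# Hypothetical dispatcher: validate the inputs, build the helper name, call it
dispatch <- function(method, model, ...) {
  method <- match.arg(method, c("postpi_analytic", "postpi_boot",
                                "ppi", "ppi_plusplus", "pspa"))
  model  <- match.arg(model, c("mean", "quantile", "ols",
                               "logistic", "poisson"))

  helper_name <- paste(method, model, sep = "_")  # e.g. "ppi_ols"

  if (!exists(helper_name, mode = "function")) {
    stop("No helper named '", helper_name, "' is available.")
  }
  do.call(helper_name, list(...))
}

result <- dispatch("ppi", "ols")
print(result)
```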

5. Auxiliary Arguments:

The wrapper function will take method-specific auxiliary arguments (e.g., q for the quantile estimation models) and pass them to the helper function through the "..." with specified defaults for simplicity.

6. Other Arguments:

Arguments that apply to all methods (e.g., alpha, ci.type), as well as other method-specific arguments, have default values.
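As an illustration of how alpha and alternative typically interact, the sketch below builds a generic normal-approximation (Wald-type) interval from an estimate and its standard error. The function wald_ci is hypothetical and is not claimed to match how the package computes its intervals.

```r
# Generic Wald-type interval; a hypothetical helper, not part of the package.
wald_ci <- function(est, se, alpha = 0.05, alternative = "two-sided") {
  alternative <- match.arg(alternative, c("two-sided", "less", "greater"))
  switch(alternative,
    "two-sided" = est + c(-1, 1) * qnorm(1 - alpha / 2) * se,
    "less"      = c(-Inf, est + qnorm(1 - alpha) * se),
    "greater"   = c(est - qnorm(1 - alpha) * se, Inf)
  )
}

ci_two_sided <- wald_ci(est = 1, se = 0.5)
ci_less      <- wald_ci(est = 1, se = 0.5, alternative = "less")
ci_greater   <- wald_ci(est = 1, se = 0.5, alternative = "greater")

ci_two_sided
ci_less
ci_greater
```

A smaller alpha widens the interval, and a one-sided alternative leaves one end of the interval unbounded.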

Examples

library(ipd)

#-- Generate Example Data

set.seed(12345)

dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)

head(dat)

formula <- Y - f ~ X1

#-- PostPI Analytic Correction (Wang et al., 2020)

ipd(formula, method = "postpi_analytic", model = "ols",
    data = dat, label = "set_label")

#-- PostPI Bootstrap Correction (Wang et al., 2020)

nboot <- 200

ipd(formula, method = "postpi_boot", model = "ols",
    data = dat, label = "set_label", nboot = nboot)

#-- PPI (Angelopoulos et al., 2023)

ipd(formula, method = "ppi", model = "ols",
    data = dat, label = "set_label")

#-- PPI++ (Angelopoulos et al., 2023)

ipd(formula, method = "ppi_plusplus", model = "ols",
    data = dat, label = "set_label")

#-- PSPA (Miao et al., 2023)

ipd(formula, method = "pspa", model = "ols",
    data = dat, label = "set_label")
