
OptHoldoutSize: an R package for estimating the optimal holdout set size for a predictive risk score to be deployed in a population.

This R package implements procedures for estimating an 'optimal holdout size' for a predictive score in order for it to be safely updated. Procedures are detailed in the manuscript 'Optimal sizing of a holdout set for safe predictive model updating' by Sami Haidar-Wehbe, Samuel R. Emerson, Louis J. M. Aslett, and James Liley.

When a predictive risk score for binary outcome $Y$ given covariates $X$ is deployed in a population, it may be used to guide interventions so as to avoid $Y$. This makes it difficult to update the predictive score safely, since $X$ can influence incidence of $Y$ in two ways: through the system being modelled, or through the predictive score itself.

A simple way to safely update a predictive score is to withhold calculation of the risk score for a proportion of the population, maintained as a 'holdout' set. The predictive score can then be updated using data $X$, $Y$ from this holdout set. A question naturally arises over how large this holdout set should be: too small, and a new predictive score cannot be trained sufficiently accurately; too large, and too many members of the population miss out on the potential benefits of the risk score.
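The trade-off described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's implementation: it assumes a total cost of the form k1 * n + k2(n) * (N - n), where n is the holdout size, and a power-law learning curve for k2. The names N, k1, a, b, c_inf, k2 and total_cost are hypothetical, and the numeric values are invented for the example. For real estimates, the package's own function optimal_holdout_size (listed below) is the place to look.

```r
# Illustrative values only; none are taken from the package or the manuscript.
N     <- 10000  # total population size
k1    <- 0.4    # assumed average cost per individual in the holdout set (no risk score)
a     <- 3      # assumed power-law scale
b     <- 0.5    # assumed power-law decay rate
c_inf <- 0.2    # assumed asymptotic cost with unlimited training data

# Assumed learning curve: average cost per individual outside the holdout set
# when the score has been trained on n holdout samples.
k2 <- function(n) a * n^(-b) + c_inf

# Total cost: holdout individuals incur k1 each, everyone else incurs k2(n) each.
total_cost <- function(n) k1 * n + k2(n) * (N - n)

# One-dimensional numerical minimisation over the admissible holdout sizes.
opt    <- optimize(total_cost, interval = c(1, N - 1))
n_star <- round(opt$minimum)

n_star              # holdout size minimising the assumed total cost
total_cost(n_star)  # corresponding total cost
```

With a small n the k2(n) term dominates (the score is poorly trained); with a large n the k1 * n term dominates (many individuals go without the score). The minimum lies between the two.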

To download and install this package, use

install.packages("OptHoldoutSize")
library(OptHoldoutSize)

For examples demonstrating use of this package, see vignettes simulated_example and ASPRE_example. For a comparison of the two major algorithms implemented in this package, see vignette comparison_of_algorithms.
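A minimal sketch of how these vignettes might be listed and opened from an R session, using the base R functions requireNamespace and vignette. The variable names vignette_names and available are introduced here for illustration, and whether the vignettes are present depends on how the package was installed.

```r
# Vignette names as given in the text above.
vignette_names <- c("simulated_example", "ASPRE_example", "comparison_of_algorithms")

# Only query vignettes if the package is actually installed.
if (requireNamespace("OptHoldoutSize", quietly = TRUE)) {
  # List the vignettes shipped with the installed version.
  available <- vignette(package = "OptHoldoutSize")$results[, "Item"]
  print(available)

  # Open one by name (uncomment in an interactive session):
  # vignette("simulated_example", package = "OptHoldoutSize")
}
```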

Version: 0.1.0.1
License: GPL (>= 3)
Maintainer: James Liley
Last Published: April 9th, 2025
Monthly Downloads: 215

Functions in OptHoldoutSize (0.1.0.1)

exp_imp_fn: Expected improvement
gen_base_coefs: Coefficients for imperfect risk score
gen_resp: Generate response
logit: Logit
logistic: Logistic
data_nextpoint_em: Data for 'next point' demonstration vignette on algorithm comparison using emulation algorithm
powerlaw: Power law function
grad_mincost_powerlaw: Gradient of minimum cost (power law)
ohs_array: Data for vignette on algorithm comparison
plot.optholdoutsize: Plot estimated cost function
ohs_resample: Data for vignette on algorithm comparison
plot.optholdoutsize_emul: Plot estimated cost function using emulation (semiparametric)
data_example_simulation: Data for vignette showing general example
cov_fn: Covariance function for Gaussian process
oracle_pred: Generate responses
grad_nstar_powerlaw: Gradient of optimal holdout size (power law)
powersolve: Fit power law curve
error_ohs_emulation: Measure of error for emulation-based OHS emulation
sim_random_aspre: Simulate random dataset similar to ASPRE training data
optimal_holdout_size: Estimate optimal holdout size under parametric assumptions
optimal_holdout_size_emulation: Estimate optimal holdout size under semi-parametric assumptions
model_predict: Make predictions
mu_fn: Updating function for mean
split_data: Split data
model_train: Train model (wrapper)
psi_fn: Updating function for variance
params_aspre: Parameters of reported ASPRE dataset
next_n: Finds best value of n to sample next
powersolve_general: General solver for power law curve
powersolve_se: Standard error matrix for learning curve parameters (power law)
sens10: Sensitivity at threshold quantile 10%
ci_mincost: Confidence interval for minimum total cost, when estimated using parametric method
aspre_k2: Cost estimating function in ASPRE simulation
aspre_parametric: Parametric-based OHS estimation for ASPRE
aspre_emulation: Emulation-based OHS estimation for ASPRE
ci_ohs: Confidence interval for optimal holdout size, when estimated using parametric method
aspre: Computes ASPRE score
ci_cover_e_yn: Data for example on empirical confidence interval for OHS
ci_cover_cost_a_yn: Data for example on asymptotic confidence interval for min cost
ci_cover_cost_e_yn: Data for example on empirical confidence interval for min cost
add_aspre_interactions: Add interaction terms corresponding to ASPRE model
gen_preds: Generate matrix of random observations
ci_cover_a_yn: Data for example on asymptotic confidence interval for OHS
data_nextpoint_par: Data for 'next point' demonstration vignette on algorithm comparison using parametric algorithm