estimate_accuracy: Estimate sample complexity bounds for a binary classification algorithm using either simulated or user-supplied data.

Description

Estimate sample complexity bounds for a binary classification algorithm using either simulated or user-supplied data.

Usage

estimate_accuracy(
  formula,
  model,
  data = NULL,
  dim = NULL,
  maxn = NULL,
  upperlimit = NULL,
  nsample = 30,
  steps = 50,
  eta = 0.05,
  delta = 0.05,
  epsilon = 0.05,
  predictfn = NULL,
  power = FALSE,
  effect_size = NULL,
  powersims = NULL,
  alpha = 0.05,
  parallel = TRUE,
  coreoffset = 0,
  packages = list(),
  method = c("Uniform", "Class Imbalance"),
  p = NULL,
  minn = ifelse(is.null(data), (dim + 1), (ncol(data) + 1)),
  x = NULL,
  y = NULL,
  ...
)

Value

A list containing two named elements: Raw gives the exact output of the simulations, and Summary gives a table of accuracy metrics, including the achieved levels of $\epsilon$ and $\delta$ given the specified values. Alternative values can be calculated using getpac().
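
To make the achieved levels of $\epsilon$ and $\delta$ concrete, the following base R sketch computes both from a vector of out-of-sample error rates at a single sample size. The error rates are simulated placeholders, and the names oos_error, achieved_delta, and achieved_epsilon are illustrative. This is not necessarily how estimate_accuracy() or getpac() build the Summary table.

```r
# Hypothetical out-of-sample error rates from 30 resamples at one sample size n.
# These are simulated placeholder values, not output from estimate_accuracy().
set.seed(1)
oos_error <- rbeta(30, shape1 = 2, shape2 = 38)

epsilon <- 0.05  # targeted maximum OOS error rate
delta   <- 0.05  # targeted maximum probability of exceeding epsilon

# Achieved delta: share of resamples whose error rate exceeds epsilon.
achieved_delta <- mean(oos_error > epsilon)

# Achieved epsilon: smallest observed error level that is exceeded by at most
# a delta share of resamples (type = 1 is the inverse empirical CDF).
achieved_epsilon <- unname(quantile(oos_error, probs = 1 - delta, type = 1))

print(c(achieved_delta = achieved_delta, achieved_epsilon = achieved_epsilon))
```

Under this reading, a sample size meets the target when achieved_delta is no larger than delta, or equivalently when achieved_epsilon is no larger than epsilon.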

Arguments

formula: A formula that can be passed to the model argument to define the classification algorithm.
model: A binary classification model supplied by the user. Must take arguments formula and data.
data: Optional. A rectangular data.frame object giving the full data from which samples are to be drawn. If left unspecified, gendata() is called to produce synthetic data with an appropriate structure.
dim: Required if data is unspecified. Gives the horizontal dimension of the data (number of predictor variables) to be generated.
maxn: Required if data is unspecified. Gives the vertical dimension of the data (number of observations) to be generated.
upperlimit: Optional. A positive integer giving the maximum sample size to be simulated. Used only when data is supplied.
nsample: A positive integer giving the number of samples to be generated for each value of $n$. Larger values give more accurate results.
steps: A positive integer giving the interval between successive values of $n$ at which simulations are conducted. Smaller values give a finer grid of sample sizes, and so more accurate results, at greater computational cost.
eta: A real number between 0 and 1 giving the probability of misclassification error in the training data.
delta: A real number between 0 and 1 giving the targeted maximum probability of observing an out-of-sample (OOS) error rate higher than epsilon.
epsilon: A real number between 0 and 1 giving the targeted maximum OOS error rate.
predictfn: An optional user-defined function giving a custom predict method. If also using a user-defined model, the model should output an object of class "svrclass" to avoid errors.
power: A logical indicating whether experimental power based on the predictions should also be reported
effect_size: If power is TRUE, a real number indicating the scaled effect size the user would like to be able to detect.
powersims: If power is TRUE, an integer indicating the number of simulations to be conducted at each step to calculate power.
alpha: If power is TRUE, a real number between 0 and 1 indicating the probability of Type I error to be used for hypothesis testing. Default is 0.05.
parallel: A logical indicating whether or not to use parallel processing.
coreoffset: If parallel is TRUE, a non-negative integer indicating the number of threads to be left unused. Should not be larger than the number of CPU cores.
packages: A list of packages that need to be loaded in order to run model.
method: An optional string stating the distribution from which data is to be generated. Default is i.i.d. uniform sampling. Can also take a function outputting a vector of probabilities if the user wishes to specify a custom distribution.
p: If method is 'Class Imbalance', gives the degree of weight placed on the positive class.
minn: Optional argument to set a different minimum n than the dimension of the algorithm. Useful with e.g. regularized regression models such as elastic net.
x: Optional argument for methods that take separate predictor and outcome data. Specifies a matrix-like object containing predictors. Note that if used, the x and y objects are bound together columnwise; this must be handled in the user-supplied helper function.
y: Optional argument for methods that take separate predictor and outcome data. Specifies a vector-like object containing outcome values. Note that if used, the x and y objects are bound together columnwise; this must be handled in the user-supplied helper function.
...: Additional arguments that need to be passed to model
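
The notes on x and y say that a fitter expecting separate predictor and outcome objects must be wrapped in a helper that splits the combined data itself. The following is a minimal sketch of such a helper in base R, using stats::glm.fit() as a stand-in for any x/y-style fitter. The names xy_logit, xy_pred, and toy are hypothetical, and the sketch assumes the column names in the combined data match those used in formula; how estimate_accuracy() names the bound columns is not stated above.

```r
# Hypothetical wrapper: accepts (formula, data) as the model argument requires,
# then splits the data into the separate x / y objects the fitter expects.
xy_logit <- function(formula, data, ...) {
  mf  <- model.frame(formula, data)
  y   <- model.response(mf)            # outcome vector
  x   <- model.matrix(formula, mf)     # predictor matrix (with intercept column)
  fit <- glm.fit(x = x, y = y, family = binomial())
  # Return an object of class "svrclass", as advised for user-defined models.
  structure(list(coef = fit$coefficients, formula = formula),
            class = "svrclass")
}

# Matching custom predict function: rebuilds the predictor matrix from newdata
# without requiring the outcome column to be present.
xy_pred <- function(m, newdata) {
  tt  <- delete.response(terms(m$formula))
  x   <- model.matrix(tt, model.frame(tt, newdata))
  lin <- drop(x %*% m$coef)            # linear predictor
  factor(ifelse(plogis(lin) > 0.5, 1, 0), levels = c("0", "1"))
}

# Toy data for illustration only.
set.seed(42)
toy   <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- rbinom(200, 1, plogis(1.5 * toy$x1 - toy$x2))

m    <- xy_logit(y ~ x1 + x2, toy)
pred <- xy_pred(m, toy)
print(mean(as.character(pred) == as.character(toy$y)))  # in-sample accuracy
```

A pair like this could then be supplied through the model and predictfn arguments in the same way as mylogit and mypred in the Examples section.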

Examples

# Assumes the package providing estimate_accuracy() and the br data is attached.

# User-defined model: logistic regression returned with class "svrclass".
mylogit <- function(formula, data) {
  m <- structure(
    glm(formula = formula, data = data, family = binomial(link = "logit")),
    class = c("svrclass", "glm")  # IMPORTANT: must use the class svrclass to work correctly
  )
  return(m)
}

# Custom predict method: classify as 1 when the fitted probability exceeds 0.5.
mypred <- function(m, newdata) {
  out <- predict.glm(m, newdata, type = "response")
  # IMPORTANT: specify levels to account for the possibility of all
  # observations being classified into the same class in smaller samples.
  out <- factor(ifelse(out > 0.5, 1, 0), levels = c("0", "1"))
  return(out)
}

# \donttest{
library(parallel)  # provides detectCores()
results <- estimate_accuracy(
  two_year_recid ~ race + sex + age + juv_fel_count + juv_misd_count +
    priors_count + charge_degree..misd.fel.,
  mylogit,
  br,
  predictfn = mypred,
  nsample = 10,
  steps = 1000,
  coreoffset = (detectCores() - 2)  # leave all but two threads unused
)
# }
