
scR (version 0.4.0)

estimate_accuracy: Estimate sample complexity bounds for a binary classification algorithm using either simulated or user-supplied data.

Description

Estimate sample complexity bounds for a binary classification algorithm using either simulated or user-supplied data.

Usage

estimate_accuracy(
  formula,
  model,
  data = NULL,
  dim = NULL,
  maxn = NULL,
  upperlimit = NULL,
  nsample = 30,
  steps = 50,
  eta = 0.05,
  delta = 0.05,
  epsilon = 0.05,
  predictfn = NULL,
  power = FALSE,
  effect_size = NULL,
  powersims = NULL,
  alpha = 0.05,
  parallel = TRUE,
  coreoffset = 0,
  packages = list(),
  method = c("Uniform", "Class Imbalance"),
  p = NULL,
  minn = ifelse(is.null(data), (dim + 1), (ncol(data) + 1)),
  x = NULL,
  y = NULL,
  ...
)

Value

A list containing two named elements. Raw gives the exact output of the simulations, while Summary gives a table of accuracy metrics, including the achieved levels of \(\epsilon\) and \(\delta\) given the specified values. Summaries for alternative values of \(\epsilon\) and \(\delta\) can be calculated using getpac().

Arguments

formula

A formula that can be passed to the model argument to define the classification algorithm

model

A binary classification model supplied by the user. Must take arguments formula and data

data

Optional. A rectangular data.frame object giving the full data from which samples are to be drawn. If left unspecified, gendata() is called to produce synthetic data with an appropriate structure.

dim

Required if data is unspecified. Gives the horizontal dimension of the data (number of predictor variables) to be generated.

maxn

Required if data is unspecified. Gives the vertical dimension of the data (number of observations) to be generated.

upperlimit

Optional. A positive integer giving the maximum sample size to be simulated, if data was supplied.

nsample

A positive integer giving the number of samples to be generated for each value of \(n\). Larger values give more accurate results.

steps

A positive integer giving the interval of values of \(n\) for which simulations should be conducted. Larger values give more accurate results.

eta

A real number between 0 and 1 giving the probability of misclassification error in the training data.
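
To make the role of eta concrete, the sketch below flips each label of a hypothetical binary outcome with probability eta. It uses base R only; flip_labels is a hypothetical helper, and how scR applies this noise internally is not stated on this page, so treat it as an illustration of the idea rather than the package's implementation.

```r
# Hypothetical helper: flip each 0/1 label independently with probability eta.
flip_labels <- function(y, eta) {
  stopifnot(all(y %in% c(0, 1)), eta >= 0, eta <= 1)
  flip <- runif(length(y)) < eta   # TRUE where a label gets corrupted
  ifelse(flip, 1 - y, y)
}

set.seed(1)
y_clean <- rbinom(1000, size = 1, prob = 0.5)
y_noisy <- flip_labels(y_clean, eta = 0.05)

# Observed share of flipped labels; it should be close to eta.
mean(y_noisy != y_clean)
```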

delta

A real number between 0 and 1 giving the targeted maximum probability of observing an out-of-sample (OOS) error rate higher than epsilon.

epsilon

A real number between 0 and 1 giving the targeted maximum out-of-sample (OOS) error rate.
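
The way epsilon and delta combine can be sketched in base R: for a given sample size, the share of repeated samples whose OOS error exceeds epsilon is compared with delta. The error rates below are simulated placeholders, not output from scR, and the exact calculation behind the package's Summary table is not shown on this page.

```r
# Placeholder OOS error rates from 30 repeated samples at one sample size n.
set.seed(42)
oos_error <- rbeta(30, shape1 = 2, shape2 = 48)  # values in (0, 1), mean 0.04

epsilon <- 0.05
delta   <- 0.05

# Empirical probability that the OOS error exceeds epsilon.
achieved_delta <- mean(oos_error > epsilon)

# The (epsilon, delta) target is met at this n if that share is at most delta.
target_met <- achieved_delta <= delta
```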

predictfn

An optional user-defined function giving a custom predict method. If also using a user-defined model, the model should output an object of class "svrclass" to avoid errors.

power

A logical indicating whether experimental power based on the predictions should also be reported

effect_size

If power is TRUE, a real number indicating the scaled effect size the user would like to be able to detect.

powersims

If power is TRUE, an integer indicating the number of simulations to be conducted at each step to calculate power.

alpha

If power is TRUE, a real number between 0 and 1 indicating the probability of Type I error to be used for hypothesis testing. Default is 0.05.

parallel

Boolean indicating whether or not to use parallel processing.

coreoffset

If parallel is TRUE, a non-negative integer indicating the number of threads to be kept free. Should not be larger than the number of CPU cores.
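
One plausible reading of coreoffset is that the number of workers equals the detected core count minus the offset. The sketch below encodes that reading; workers_available is a hypothetical helper, and the floor of one worker is an assumption rather than documented scR behaviour.

```r
library(parallel)

# Hypothetical helper: workers left after reserving `coreoffset` threads.
workers_available <- function(coreoffset = 0) {
  total <- detectCores()
  if (is.na(total)) total <- 1L   # detectCores() can return NA on some platforms
  max(1L, total - coreoffset)     # assumed floor: never fewer than one worker
}

workers_available(coreoffset = 1)
```

Under this reading, coreoffset = detectCores() - 2 leaves two workers on a machine with at least two cores.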

packages

A list of packages that need to be loaded in order to run model.

method

An optional string stating the distribution from which data is to be generated. Default is i.i.d. uniform sampling. Can also take a function outputting a vector of probabilities if the user wishes to specify a custom distribution.

p

If method is 'Class Imbalance', gives the degree of weight placed on the positive class.
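
As an illustration of class imbalance, the sketch below draws binary labels in which the positive class occurs with probability p. Interpreting p as the share of the positive class is an assumption; this page does not spell out how gendata() uses the value.

```r
set.seed(7)
p <- 0.2                                # assumed share of the positive class
y <- rbinom(5000, size = 1, prob = p)   # 1 = positive class

# Class shares in the simulated labels.
prop.table(table(y))
```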

minn

Optional argument to set a different minimum n than the dimension of the algorithm. Useful with e.g. regularized regression models such as elastic net.
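
The default for minn can be read directly from the Usage signature. The hypothetical helper below mirrors that expression so the resulting value can be checked before running a simulation; note that ncol(data) counts every column of data, including the outcome.

```r
# Mirrors the default expression for minn shown under Usage.
default_minn <- function(data = NULL, dim = NULL) {
  if (is.null(data)) dim + 1 else ncol(data) + 1
}

default_minn(dim = 5)       # 6
default_minn(data = iris)   # 6, since iris has 5 columns
```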

x

Optional argument for methods that take separate predictor and outcome data. Specifies a matrix-like object containing predictors. Note that if used, the x and y objects are bound together columnwise; this must be handled in the user-supplied helper function.

y

Optional argument for methods that take separate predictor and outcome data. Specifies a vector-like object containing outcome values. Note that if used, the x and y objects are bound together columnwise; this must be handled in the user-supplied helper function.
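
A helper for a fitting routine that expects separate x and y might look like the sketch below. It recovers both pieces from the combined data through the formula, so it does not depend on the column order of the bound object. The names myfit_xy and mypred_xy are hypothetical, glm.fit() stands in for whatever x/y routine is actually used, and the sketch has not been verified against estimate_accuracy() itself.

```r
# Hypothetical wrapper: split the combined data back into x and y, then fit.
myfit_xy <- function(formula, data) {
  data <- as.data.frame(data)
  mf <- model.frame(formula, data)
  y  <- model.response(mf)
  x  <- model.matrix(attr(mf, "terms"), mf)   # includes an intercept column
  fit <- glm.fit(x, y, family = binomial())
  structure(
    list(coefficients = fit$coefficients, terms = attr(mf, "terms")),
    class = "svrclass"
  )
}

# Hypothetical predict method returning a two-level factor.
mypred_xy <- function(m, newdata) {
  # Drop the response so the prediction data need not contain it.
  x <- model.matrix(delete.response(m$terms), as.data.frame(newdata))
  prob <- plogis(drop(x %*% m$coefficients))
  factor(ifelse(prob > 0.5, 1, 0), levels = c("0", "1"))
}

# Toy data bound together columnwise, as described above.
set.seed(3)
x <- matrix(runif(200), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
y <- rbinom(100, 1, plogis(2 * x[, "x1"] - 1))
d <- cbind(y = y, x)

m <- myfit_xy(y ~ x1 + x2, d)
table(mypred_xy(m, d))
```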

...

Additional arguments that need to be passed to model

See Also

plot_accuracy(), to represent simulations visually, getpac(), to calculate summaries for alternate values of \(\epsilon\) and \(\delta\) without conducting a new simulation, and gendata(), to generate synthetic datasets.

Examples

library(scR)       # provides estimate_accuracy(); assumed to supply the br data
library(parallel)  # provides detectCores()

# Logistic regression wrapper taking formula and data, as model requires.
mylogit <- function(formula, data) {
  m <- structure(
    glm(formula = formula, data = data, family = binomial(link = "logit")),
    class = c("svrclass", "glm")  # IMPORTANT - must use the class svrclass to work correctly
  )
  return(m)
}

# Custom predict method returning a two-level factor of predicted classes.
mypred <- function(m, newdata) {
  out <- predict.glm(m, newdata, type = "response")
  out <- factor(ifelse(out > 0.5, 1, 0), levels = c("0", "1"))
  # Important - must specify levels to account for possibility of all
  # observations being classified into the same class in smaller samples
  return(out)
}

# \donttest{
results <- estimate_accuracy(
  two_year_recid ~ race + sex + age + juv_fel_count + juv_misd_count +
    priors_count + charge_degree..misd.fel.,
  mylogit,
  br,
  predictfn = mypred,
  nsample = 10,
  steps = 1000,
  coreoffset = (detectCores() - 2)
)
# }
