If direction and/or pos_class and neg_class are not given, the function
will assume that higher values indicate the positive class and use the
class with the higher median as the positive class.
Different methods can be selected for determining the optimal cutpoint via
the method argument. The package includes the following method functions:
maximize_metric: Maximize the metric function
minimize_metric: Minimize the metric function
maximize_loess_metric: Maximize the metric function after LOESS smoothing
minimize_loess_metric: Minimize the metric function after LOESS smoothing
maximize_spline_metric: Maximize the metric function after spline smoothing
minimize_spline_metric: Minimize the metric function after spline smoothing
maximize_boot_metric: Maximize the metric function as a summary of the
optimal cutpoints in bootstrapped samples
minimize_boot_metric: Minimize the metric function as a summary of the
optimal cutpoints in bootstrapped samples
oc_youden_kernel: Maximize the Youden index after kernel smoothing the
distributions of the two classes
oc_youden_normal: Maximize the Youden index parametrically, assuming
normally distributed data in both classes
oc_manual: Specify the cutpoint manually
User-defined functions can be supplied to method, too. As a reference,
the code of all included method functions can be accessed by simply typing
their name. To define a new method function, create a function that may
take the following inputs:
data: A data.frame or tbl_df
x: (character) The name of the predictor or independent variable
class: (character) The name of the class or dependent variable
metric_func: A function for calculating a metric, e.g. accuracy
pos_class: The positive class
neg_class: The negative class
direction: ">=" if the positive class has higher x values, "<=" otherwise
tol_metric: (numeric) In the built-in methods, a tolerance around the
optimal metric value
use_midpoints: (logical) In the built-in methods, whether to use midpoints
instead of exact optimal cutpoints
...: Further arguments
The ... argument can be used to avoid an error if not all of the above
arguments are needed, or to pass additional arguments to method. The
function should return a data.frame or tbl_df with one row, the column
"optimal_cutpoint", and an optional column with an arbitrary name
containing the metric value at the optimal cutpoint.
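As a minimal sketch of this interface (the function name oc_median is
hypothetical and uses only base R), a custom method function could simply
return the median of the predictor as the cutpoint:

```r
# Hypothetical custom method function: returns the median of the
# predictor as the "optimal" cutpoint. It accepts the documented
# arguments and absorbs any it does not use via ...
oc_median <- function(data, x, ...) {
  data.frame(optimal_cutpoint = median(data[[x]]))
}

# It could then be supplied to cutpointr via the method argument, e.g.
# cutpointr(my_data, my_predictor, my_outcome, method = oc_median)
# (my_data, my_predictor, and my_outcome are placeholder names)
```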
Built-in metric functions include:
accuracy: Fraction correctly classified
youden: Youden or J-index = sensitivity + specificity - 1
sum_sens_spec: sensitivity + specificity
sum_ppv_npv: The sum of positive predictive value (PPV) and negative
predictive value (NPV)
prod_sens_spec: sensitivity * specificity
prod_ppv_npv: The product of positive predictive value (PPV) and
negative predictive value (NPV)
cohens_kappa: Cohen's kappa
abs_d_sens_spec: The absolute difference between sensitivity and
specificity
abs_d_ppv_npv: The absolute difference between positive predictive
value (PPV) and negative predictive value (NPV)
p_chisquared: The p-value of a chi-squared test on the confusion matrix
of predictions and observations
odds_ratio: The odds ratio, calculated as (TP / FP) / (FN / TN)
risk_ratio: The risk ratio (relative risk), calculated as
(TP / (TP + FN)) / (FP / (FP + TN))
plr: The positive likelihood ratio, calculated as
true positive rate / false positive rate
nlr: The negative likelihood ratio, calculated as
false negative rate / true negative rate
misclassification_cost: The sum of the misclassification costs of false
positives and false negatives, fp * cost_fp + fn * cost_fn.
Additional arguments to cutpointr: cost_fp, cost_fn
total_utility: The total utility of true / false positives / negatives,
calculated as utility_tp * TP + utility_tn * TN - cost_fp * FP - cost_fn * FN.
Additional arguments to cutpointr: utility_tp, utility_tn, cost_fp, cost_fn
F1_score: The F1-score, (2 * TP) / (2 * TP + FP + FN)
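To make the formulas above concrete, here is a small base-R sketch (an
illustration with made-up counts, not package code) computing a few of
these metrics from raw confusion-matrix counts:

```r
# Confusion-matrix counts for one candidate cutpoint (made-up numbers)
tp <- 40; fp <- 10; tn <- 45; fn <- 5

sensitivity <- tp / (tp + fn)                    # true positive rate
specificity <- tn / (tn + fp)                    # true negative rate
accuracy    <- (tp + tn) / (tp + fp + tn + fn)   # fraction correct
youden      <- sensitivity + specificity - 1     # Youden / J-index
odds_ratio  <- (tp / fp) / (fn / tn)
f1          <- (2 * tp) / (2 * tp + fp + fn)
```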
Furthermore, the following functions are included which can be used as
metric functions but are more useful for plotting purposes, for example
in plot_cutpointr, or for defining new metric functions: tp, fp, tn, fn,
tpr, fpr, tnr, fnr, false_omission_rate, false_discovery_rate, ppv, npv,
precision, recall, sensitivity, and specificity.
User-defined metric functions can be created as well, which can accept
the following inputs as vectors:
tp: Vector of true positives
fp: Vector of false positives
tn: Vector of true negatives
fn: Vector of false negatives
...: Further arguments
If the metric function is used in conjunction with any of the maximize /
minimize methods, further arguments can be passed via ... . The function
should return a numeric vector, or a matrix or data.frame with one
column. If the column is named, the name will be included in the output
and plots. Avoid using names that are identical to the column names that
are returned by cutpointr by default.
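As a minimal sketch of such a metric function (the name
cost_weighted_misclass is hypothetical), returning a named one-column
matrix so that the name appears in the output and plots:

```r
# Hypothetical user-defined metric: cost-weighted misclassification
# count. tp, fp, tn, fn arrive as vectors (one element per candidate
# cutpoint); unused arguments are absorbed by ...
cost_weighted_misclass <- function(tp, fp, tn, fn,
                                   cost_fp = 1, cost_fn = 1, ...) {
  res <- matrix(fp * cost_fp + fn * cost_fn, ncol = 1)
  colnames(res) <- "cost_weighted_misclass"  # column name reaches output
  res
}

# Could be supplied to cutpointr via the metric argument, e.g.
# cutpointr(my_data, my_predictor, my_outcome,
#           method = minimize_metric, metric = cost_weighted_misclass)
# (my_data, my_predictor, and my_outcome are placeholder names)
```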
If boot_runs is positive, that number of bootstrap samples will be drawn
and the optimal cutpoint using method will be determined. Additionally,
as a way of internal validation, the function in metric will be used to
score the out-of-bag predictions using the cutpoints determined by
method. Various default metrics are always included in the bootstrap
results. If multiple optimal cutpoints are found, the column
optimal_cutpoint becomes a list that contains the vector(s) of the
optimal cutpoints.
If use_midpoints = TRUE, the mean of the optimal cutpoint and the next
highest or lowest possible cutpoint is returned, depending on direction.
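A quick base-R illustration of the midpoint idea (not package code, and
assuming, as one plausible reading, that direction ">=" pairs the optimal
cutpoint with the next lowest observed value):

```r
# Observed, sorted predictor values and a made-up optimal cutpoint
x_sorted <- c(1.0, 2.0, 3.5, 4.0)
optimal  <- 3.5

# Assumption: with direction ">=", the midpoint lies between the
# optimal cutpoint and the next lowest observed value (here 2.0),
# i.e. (3.5 + 2.0) / 2
next_lowest <- max(x_sorted[x_sorted < optimal])
midpoint    <- (optimal + next_lowest) / 2
```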
The tol_metric argument can be used to avoid floating-point problems
that may lead to the exclusion of cutpoints that achieve the best
attainable metric value. Additionally, by selecting a large tolerance,
multiple cutpoints can be returned that lead to decent metric values in
the vicinity of the optimal metric value. tol_metric is passed to metric
and is only supported by the maximization and minimization functions,
i.e. maximize_metric, minimize_metric, maximize_loess_metric,
minimize_loess_metric, maximize_spline_metric, and
minimize_spline_metric. In maximize_boot_metric and
minimize_boot_metric, multiple optimal cutpoints will be passed to the
summary_func of these two functions.
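The selection logic behind tol_metric can be sketched in a few lines of
base R (an illustration of the idea, not the package's internal code):

```r
# Candidate cutpoints and made-up metric values for each of them
cutpoints <- c(1, 2, 3, 4, 5)
metric    <- c(0.70, 0.849999999, 0.85, 0.84, 0.60)
tol       <- 0.01

# Keep every cutpoint whose metric lies within tol of the maximum.
# With tol = 0, a value like 0.849999999 (optimal up to floating-point
# noise) would be dropped; with tol = 0.01 it is retained, along with
# the nearby 0.84.
keep <- cutpoints[metric >= max(metric) - tol]
```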