roc: ROC curve analysis

Description

Fits Receiver Operating Characteristic (ROC) curves to training set data. Used to determine the critical value of a dissimilarity coefficient that best discriminates between assemblage-types in palaeoecological data sets, whilst minimising the false positive error rate (FPF).

Usage

roc(object, groups, k = 1, ...)
## S3 method for class 'default':
roc(object, groups, k = 1, thin = FALSE,
    max.len = 10000, ...)
## S3 method for class 'mat':
roc(object, groups, k = 1, ...)
## S3 method for class 'analog':
roc(object, groups, k = 1, ...)

Arguments

object

an R object.

groups

a vector of group memberships, one entry per sample in the training set data. Can be a factor, and will be coerced to one if the supplied vector is not a factor.

k

numeric; the k closest analogues to use to calculate ROC curves.

thin

logical; should the points on the ROC curve be thinned? See Details, below.

max.len

numeric; length of analogue and non-analogue vectors. Used as the limit to which points on the ROC curve are thinned.

...

arguments passed to/from other methods.

Value

A list with two components: (i) statistics, a summary of ROC statistics for each level of groups and for a combined ROC analysis; and (ii) roc, a list of ROC objects, one per level of groups. Each ROC object is itself a list with the following components:

TPF

The true positive fraction.

FPE

The false positive error.

optimal

The optimal dissimilarity value, assessed where the slope of the ROC curve is maximal.

AUC

The area under the ROC curve.

se.fit

Standard error of the AUC estimate.

n.in

numeric; the number of samples within the current group.

n.out

numeric; the number of samples not in the current group.

p.value

The p-value of a Wilcoxon rank sum test on the two sets of dissimilarities. This is also known as a Mann-Whitney test.

roc.points

The unique dissimilarities at which the ROC curve was evaluated.

max.roc

numeric; the position along the ROC curve at which the slope of the ROC curve is maximal. This is the index of this point on the curve.

prior

numeric, length 2. Vector of observed prior probabilities of true analogues and true non-analogues in the group.

analogue

a list with components yes and no containing the dissimilarities for true analogues and true non-analogues in the group.
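The link between the AUC and the Wilcoxon (Mann-Whitney) test can be illustrated in base R. The sketch below uses made-up dissimilarities arranged like the analogue component (yes and no); the values and the variable names n_yes, n_no and auc_direct are assumptions for illustration, and this is not necessarily how roc() computes its results internally.

```r
## Hypothetical dissimilarities for true analogues (yes) and
## true non-analogues (no); the values are invented for illustration
analogue <- list(yes = c(0.10, 0.15, 0.20, 0.30, 0.45),
                 no  = c(0.25, 0.40, 0.55, 0.60, 0.70, 0.90))

n_yes <- length(analogue$yes)
n_no  <- length(analogue$no)

## Wilcoxon rank sum (Mann-Whitney) test on the two sets of dissimilarities
wt <- wilcox.test(analogue$no, analogue$yes)

## W counts the (no, yes) pairs in which the non-analogue dissimilarity
## is the larger one; dividing by the number of pairs gives the AUC
W   <- unname(wt$statistic)
auc <- W / (n_yes * n_no)

## the same quantity computed directly from all pairs (ties count 0.5)
gt  <- outer(analogue$no, analogue$yes, ">")
eq  <- outer(analogue$no, analogue$yes, "==")
auc_direct <- mean(gt + 0.5 * eq)

auc
wt$p.value
```

An AUC close to 1 means that non-analogue dissimilarities are almost always larger than analogue dissimilarities, which is the situation in which a critical dissimilarity value separates the two well.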

Details

A ROC curve is generated from the within-group and between-group dissimilarities.

For each level of the grouping vector (groups), the dissimilarities between each group member and its k closest analogues within that group are compared with the k closest dissimilarities between the non-group member and group member samples.

If one is able to discriminate between members of different groups on the basis of assemblage dissimilarity, then the dissimilarities between samples within a group will be small compared to the dissimilarities between group members and non-group members.
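The idea can be sketched in base R with two made-up vectors of dissimilarities. The vectors within and between, and the choice of cutoff by maximising TPF minus FPF (Youden's J), are assumptions made for this illustration; the optimal value returned by roc() is instead assessed where the slope of the ROC curve is maximal.

```r
## Hypothetical within-group (analogue) and between-group (non-analogue)
## dissimilarities; the values are invented for illustration
within  <- c(0.10, 0.15, 0.20, 0.30, 0.45)
between <- c(0.25, 0.40, 0.55, 0.60, 0.70, 0.90)

## candidate critical values: every unique observed dissimilarity
cutoffs <- sort(unique(c(within, between)))

## a pair is declared an analogue when its dissimilarity is <= the cutoff
tpf <- sapply(cutoffs, function(d) mean(within  <= d))  # true positive fraction
fpf <- sapply(cutoffs, function(d) mean(between <= d))  # false positive fraction

## a simple stand-in criterion: the cutoff maximising TPF - FPF
best <- cutoffs[which.max(tpf - fpf)]

roc_sketch <- data.frame(cutoff = cutoffs, TPF = tpf, FPF = fpf)
roc_sketch
best
```

When the within-group dissimilarities are mostly smaller than the between-group ones, TPF rises towards 1 while FPF is still low, so the curve bends towards the top-left corner.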

thin is useful for large problems, where the number of analogue and non-analogue distances can conceivably be large and thus overflow the largest number R can work with. This option is also useful to speed up computations for large problems. If thin == TRUE, then the larger of the analogue or non-analogue distances is thinned to a maximum length of max.len. The smaller set of distances is scaled proportionally. In thinning, we approximate the distribution of distances by taking max.len (or a fraction of max.len for the smaller set of distances) equally-spaced probability quantiles of the distribution as a new set of distances.
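The thinning described above might look roughly like the following base R sketch. The helper thin_to(), the variable names and the simulated distances are all hypothetical, and the code in the package may differ in detail.

```r
set.seed(1)

## Hypothetical distance vectors; the sizes are chosen only for illustration
analogue_d     <- runif(2000)             # smaller set
non_analogue_d <- runif(50000, 0.2, 1.5)  # larger set

max_len <- 10000  # plays the role of max.len

## replace a vector by n equally spaced probability quantiles
thin_to <- function(x, n) {
  unname(quantile(x, probs = seq(0, 1, length.out = n)))
}

n_large <- max(length(analogue_d), length(non_analogue_d))

if (n_large > max_len) {
  ## both sets shrink by the same factor, so the larger set ends up with
  ## max_len values and the smaller set keeps its relative size
  shrink <- max_len / n_large
  analogue_thin     <- thin_to(analogue_d,
                               max(2, round(length(analogue_d) * shrink)))
  non_analogue_thin <- thin_to(non_analogue_d,
                               max(2, round(length(non_analogue_d) * shrink)))
} else {
  ## nothing to thin
  analogue_thin     <- analogue_d
  non_analogue_thin <- non_analogue_d
}

length(analogue_thin)
length(non_analogue_thin)
```

Because the quantiles run from probability 0 to 1, the thinned vectors keep the minimum and maximum of the original distances while approximating the shape of the distribution in between.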

References

Brown, C.D., and Davis, H.T. (2006) Receiver operating characteristics curves and related decision measures: A tutorial. Chemometrics and Intelligent Laboratory Systems 80, 24--38.

Gavin, D.G., Oswald, W.W., Wahl, E.R. and Williams, J.W. (2003) A statistical approach to evaluating distance metrics and analog assignments for pollen records. Quaternary Research 60, 356--367.

Henderson, A.R. (1993) Assessing test accuracy and its clinical consequences: a primer for receiver operating characteristic curve analysis. Annals of Clinical Biochemistry 30, 834--846.

Examples

## load the example data
data(swapdiat, swappH, rlgh)

## merge training and test set on columns
dat <- join(swapdiat, rlgh, verbose = TRUE)

## extract the merged data sets and convert to proportions
swapdiat <- dat[[1]] / 100
rlgh <- dat[[2]] / 100

## fit an analogue matching (AM) model using the squared chord distance
## measure - need to keep the training set dissimilarities
swap.ana <- analog(swapdiat, rlgh, method = "SQchord",
                   keep.train = TRUE)

## fit the ROC curve to the SWAP diatom data using the AM results
## Generate a grouping for the SWAP lakes
## ("ward" was renamed "ward.D" in R >= 3.1.0; the clustering is unchanged)
clust <- hclust(as.dist(swap.ana$train), method = "ward.D")
grps <- cutree(clust, 12)

## fit the ROC curve
swap.roc <- roc(swap.ana, groups = grps)
swap.roc

## draw the ROC curve
plot(swap.roc, 1)

## draw the four default diagnostic plots
layout(matrix(1:4, ncol = 2))
plot(swap.roc)
layout(1)