cordillera: The OPTICS Cordillera

Description

Calculates the OPTICS Cordillera as described in Rusch et al. (2017). Based on optics in dbscan package.

Usage

cordillera(X, q = 2, minpts = 2, epsilon, distmeth = "euclidean",
  dmax = NULL, rang, digits = 10, scale = 4, ...)

Value

A list with the elements

$raw... The raw cordillera
$norm... The normalization constant
$normfac... The normalization factor (the number of times that dmax is taken)
$dmaxe... The effective maximum distance used for maximum structure (either dmax or epsilon or rang[2]-rang[1]).
$normed... The normed cordillera (raw/norm)
$optics... The optics object

Arguments

X: numeric matrix or data frame representing coordinates of points, or a symmetric matrix of distance of points or an object of class dist. Passed to optics, see also there.
q: The norm used for the Cordillera. Defaults to 2.
minpts: The minimum number of points that must make up a cluster in OPTICS (corresponds to k in the paper). It is passed to optics where it is called minPts. Defaults to 2.
epsilon: The epsilon parameter for OPTICS (called epsilon_max in the paper). Defaults to 2 times the maximum distance between any two points.
distmeth: The distance to be computed if X is not a symmetric matrix or a dist object (otherwise ignored). Defaults to Euclidean distance.
dmax: The winsorization value for the highest allowed reachability. If used for comparisons this should be supplied. If no value is supplied, it is NULL (default), then dmax is taken from the data as minimum of epsilon or the largest reachability.
rang: A range of values for making up dmax. If supplied it overrules the dmax parameter and rang[2]-rang[1] is returned as dmax in the object. If no value is supplied rang is taken to be (0, dmax) taken from the data. Only use this when you know what you're doing, which would mean you're me (and even then we should be cautious).
digits: The precision to round the raw Cordillera and the norm factor. Defaults to 10.
scale: Should X be scaled if it is an asymmetric matrix or data frame? Can take values TRUE or FALSE or a numeric value. If TRUE or 1, standardisation is to mean=0 and sd=1. If 2, no centering is applied and scaling of each column is done with the root mean square of each column. If 3, no centering is applied and scaling of all columns is done as X/max(standard deviation(allcolumns)). If 4, no centering is applied and scaling of all columns is done as X/max(rmsq(allcolumns)). If FALSE, 0 or any other numeric value, no standardisation is applied. Defaults to 4.
...: Additional arguments to be passed to optics

Warning

It may happen that the (normed) cordillera cannot be calculated properly (e.g. division by zero, infinite raw cordillera, q value to high etc.). A warning will be printed and the normed Cordillera is either 0, 1 (if infinity is involved) or NA. In that case one needs to check one or more of the following: reachability values returned from optics, minpts, eps, the raw cordillera, dmax and the normalization factor normfac.

Examples

Run this code

data(iris)
res<-princomp(iris[,1:4])
#2 dim goodness-of-clusteredness with clusters of at least 2 points
#With a matrix of points
cres2<-cordillera(res$scores[,1:2])
cres2
summary(cres2)
plot(cres2)

#with a dist object 
dl0 <- dist(res$scores[,1:2],"maximum") #maximum distance
cres0<-cordillera(dl0)
cres0
summary(cres0)
plot(cres0)

#with any symmetric distance/dissimilarity matrix 
dl1 <- cluster::daisy(res$scores[,1:2],"manhattan") 
cres1<-cordillera(dl1)
cres1
summary(cres1)
plot(cres1)

#4 dim goodness-of-clusteredness with clusters of at least 20
#points for PCA
cres4<-cordillera(res$scores[,1:4],minpts=20,epsilon=13,scale=3) 
#4 dim goodness-of-clusteredness with clusters of at least 20 points for original
#data
cres<-cordillera(iris[,1:4],minpts=20,epsilon=13,dmax=cres4$dmaxe,scale=3)
#There is more clusteredness for the original result
summary(cres4) 
summary(cres) 
plot(cres4) #cluster structure only a bit intelligible
plot(cres) #clearly two well separated clusters

###############################################################################
# Example from Rusch et al. (2017) with original data, PCA and Sammon mapping #
###############################################################################

#data preparation
data(CAClimateIndicatorsCountyMedian)
sovisel <- CAClimateIndicatorsCountyMedian[,-c(1,2,4,9)]
#normalize to [0,1]
sovisel <- apply(sovisel,2,function(x) (x-min(x))/(max(x)-min(x))) 
rownames(sovisel)  <- CAClimateIndicatorsCountyMedian[,1]
dis <- dist(sovisel)

#hyper parameters
dmax=1.22
q=2
minpts=3

#original data directly
cdat <- cordillera(sovisel,distmeth="euclidean",minpts=minpts,epsilon=10,q=q,
                   scale=0)
#equivalently
#dis2=dist(sovisel)
#cdat2 <- cordillera(dis2,minpts=minpts,epsilon=10,q=q,scale=FALSE) 

#PCA in 2-dim
pca1 <- princomp(sovisel)
pcas <- scale(pca1$scores[,1:2])
cpca <- cordillera(pcas,minpts=minpts,epsilon=10,q=q,dmax=dmax,scale=FALSE)

#Sammon mapping in 2-dim
sam <- MASS::sammon(dis)
samp <- scale(sam$points)
csam <- cordillera(samp,epsilon=10,minpts=minpts,q=q,dmax=dmax,scale=FALSE)

#results
cdat
cpca
csam

par(mfrow=c(3,1))
plot(cdat)
plot(cpca)
plot(csam)
par(mfrow=c(1,1))

Run the code above in your browser using DataLab