clusterer: Cluster Analysis Verification

Description

Perform Cluster Analysis (CA) verification per Marzban and Sandgathe (2006).

Usage

clusterer(X, Y, xloc = NULL, xyp = TRUE, threshold = 1e-08,
          linkage.method = "complete", stand = TRUE, trans = "identity",
          verbose = FALSE, ...)
## S3 method for class 'clusterer':
summary(object, ...)
## S3 method for class 'clusterer':
plot(x, ...)
## S3 method for class 'summary.clusterer':
plot(x, ...)

Arguments

X,Y

m by n matrices giving the verification and forecast fields, resp.

object,x

list object of class clusterer as returned by clusterer (or summary.clusterer in the case of plot.summary.clusterer).

xloc

(optional) numeric mn by 2 matrix giving the gridpoint locations. If NULL, this will be created using 1:m and 1:n.

xyp

logical, should the cluster analysis be performed on the locations and intensities (TRUE) or only the locations (FALSE)?

threshold

numeric of length one or two giving the threshold to apply to each field; grid points with values >= the threshold are retained. If the length is two, the first value is the threshold for the verification field and the second is the threshold for the forecast field.

linkage.method

character naming a valid linkage method accepted by hclust.

stand

logical, should the data matrices consisting of xloc and each field first be standardized before performing cluster analysis?

trans

character naming a function to be applied to the field intensities before performing the CA. Only used if xyp is TRUE. Default applies no transformation.

verbose

logical, should progress information be printed to the screen?

...

optional arguments to the hclust function. In the case of the summary method function, z and/or sigma giving a numeric value used to find the cut-off given by median + z*sigma for determining matched objects.
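
The cut-off rule median + z*sigma can be illustrated with a few lines of base R. This is only a sketch on made-up distances, not the package's internal code: min_dists is a hypothetical stand-in for min.intercluster.dists, sigma is assumed here to be the sample standard deviation, and pairs at or below the cut-off are assumed to count as matches.

```r
# Hypothetical minimum inter-cluster distances (made-up values).
min_dists <- c(0.8, 1.1, 1.4, 2.0, 2.3, 3.9, 5.2)

z <- 1                  # multiplier on sigma (assumed value)
sigma <- sd(min_dists)  # assumed spread estimate; supply your own if preferred

# Cut-off as described for the summary method: median + z*sigma.
cutoff <- median(min_dists) + z * sigma

# Assumption: a pair is "matched" when its distance is at or below the cut-off.
matched <- min_dists <= cutoff
```

Raising z loosens the rule so that more pairs are treated as matches; lowering it tightens the rule.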

Value

A list object of class clusterer is returned with components:

linkage.method

character vector of length one or two giving the linkage method as passed into the function. The length is two only if the McQuitty method is chosen, in which case this method is used for the CA, but not for the inter-cluster differences across fields (average is used for that instead).

trans

character naming the transformation function applied to the intensities.

data.name

character vector giving the name of each field.

N

numeric giving the size of the fields.

xloc

mn by 2 matrix giving the location values.

threshold

numeric of length two giving the threshold applied to each field.

NCo,NCf

numeric vectors giving the number of clusters at each iteration of the CA for the verification and forecast fields, resp.

cluster.identifiers

a list with components X and Y giving lists of lists identifying specific CA components at each level of the CA for both fields.

idX,idY

logical vectors describing which grid points were included in the CA for each field (i.e., which grid points were >= threshold and had non-missing values).

cluster.objects

a list with components X and Y giving the objects returned by hclust for each field.

inter.cluster.dist

a list of list objects with NCf by NCo matrix components giving the inter-cluster distances (between verification and forecast fields) for each iteration of CA for each field.

min.intercluster.dists

numeric vector giving the minimum values of inter.cluster.dist at each iteration. Used to determine the cut-off for matched objects.

The summary method function returns a list with the same components as above, but also the components:

cutoff

The cut-off value used for determining matches.

csi,AvgErr

NCo by NCf numeric matrices giving the critical success index (CSI) and average inter-cluster error (distance), resp., based on matched/un-matched objects.

HMF

NCo by NCf by 3 array giving the hits, misses and false alarms based on matched/un-matched objects.
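
For orientation, the standard definition of the critical success index is CSI = hits / (hits + misses + false alarms). The sketch below applies that definition to a made-up array laid out like HMF. The counts are hypothetical, the ordering of the third dimension (hits, misses, false alarms) is assumed from the description above, and the package may treat edge cases such as zero denominators differently.

```r
# Hypothetical 2 x 2 x 3 array in the layout described for HMF.
hmf <- array(0, dim = c(2, 2, 3))
hmf[, , 1] <- matrix(c(1, 2, 1, 3), 2, 2)  # hits (filled column by column)
hmf[, , 2] <- matrix(c(0, 0, 1, 0), 2, 2)  # misses
hmf[, , 3] <- matrix(c(1, 0, 0, 1), 2, 2)  # false alarms

hits <- hmf[, , 1]
misses <- hmf[, , 2]
false_alarms <- hmf[, , 3]

# Standard CSI, computed element-wise for each combination of cluster counts.
csi <- hits / (hits + misses + false_alarms)
```

Each entry of csi corresponds to one combination of numbers of clusters in the two fields.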

Warning

Although some effort has been put into making the functions in this package computationally efficient, this approach involves a lot of bookkeeping, and the current functions are probably not as efficient as they could be. They will likely be slow for large data sets. The function can run quickly on large fields only if an adequately high threshold is used; for example, if threshold=10 is substituted for 16 in the example below, the function is VERY slow. The cluster analysis itself is fast on each field because it uses the hclust function from the fastcluster package. The bookkeeping after the CA, however, relies on many loops within loops, which could possibly be made more efficient in the future.

To simply look at the CA for the two fields, use the hclust function from fastcluster directly. It is a faster replacement for the hclust function from the stats package, returns the same kind of object, and works with the same method functions.
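
A minimal sketch of running a cluster analysis on a single field by hand is given below. It uses a small synthetic field and the hclust function from the stats package so that it runs without extra packages; loading fastcluster first should make the same call use the faster version. The field, the threshold value and the choice of two clusters are all made up for illustration, and the preprocessing only approximates what clusterer does.

```r
# Synthetic 10 x 12 field with two well-separated blobs (made-up data).
field <- matrix(0, 10, 12)
field[2:3, 2:4] <- 5
field[8:9, 9:11] <- 7

threshold <- 1
keep <- field >= threshold            # grid points at or above the threshold
loc <- which(keep, arr.ind = TRUE)    # row/column locations of retained points

# Locations plus intensities, standardized column by column.
dat <- scale(cbind(loc, intensity = field[keep]))

# Hierarchical clustering with complete linkage, then cut into two groups.
hc <- hclust(dist(dat), method = "complete")
groups <- cutree(hc, k = 2)
```

Calling plot(hc) on the result draws the dendrogram.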

Details

This function performs cluster analysis (CA) on positive values from each of two fields in a verification set using the hclust function from package fastcluster. The function clusterer performs CA on both fields, and finds the inter-cluster distances across fields for every possible combination of objects at each iteration of each CA. The summary method function finishes the analysis by determining hits, misses and false alarms as well as the numbers of clusters. It also computes CSI for each combination of numbers of clusters. This is the verification approach described in Marzban and Sandgathe (2006).
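
To make the idea of inter-cluster distances across fields concrete, the sketch below computes an average-linkage style distance (the mean of all pairwise Euclidean distances) between every forecast cluster and every verification cluster. The clusters, the helper avg_dist and the use of locations only are assumptions made for illustration; the distance actually used by clusterer depends on linkage.method and may include intensities.

```r
# Hypothetical clusters: each is a matrix of (x, y) grid-point locations.
obs_clusters <- list(cbind(x = c(1, 2), y = c(1, 1)),
                     cbind(x = c(10, 11), y = c(10, 10)))
fcst_clusters <- list(cbind(x = c(1, 2), y = c(2, 2)),
                      cbind(x = c(10, 11), y = c(11, 11)))

# Hypothetical helper: mean Euclidean distance over all point pairs of a and b.
avg_dist <- function(a, b) {
  dx <- outer(a[, 1], b[, 1], "-")
  dy <- outer(a[, 2], b[, 2], "-")
  mean(sqrt(dx^2 + dy^2))
}

# Forecast clusters in rows, verification clusters in columns (NCf by NCo).
inter <- matrix(NA_real_, nrow = length(fcst_clusters), ncol = length(obs_clusters))
for (i in seq_along(fcst_clusters)) {
  for (j in seq_along(obs_clusters)) {
    inter[i, j] <- avg_dist(fcst_clusters[[i]], obs_clusters[[j]])
  }
}

# Smallest distance from each forecast cluster to any verification cluster.
min_per_fcst <- apply(inter, 1, min)
```

Small entries of inter indicate forecast and verification clusters that lie close together, which is the kind of information a cut-off can then turn into matches.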

The plot method function creates a 4 by 2 panel of plots. The top two plots give image plots of the verification and forecast fields with grid points below the threshold(s) showing zero. The next two plots are dendrograms as produced by the plot method function for hclust (dendrogram) objects. The next row gives a histogram of the minimum inter-cluster distances, then box plots showing the hits, misses and false alarms for every possible combination of levels of each CA. Finally, the bottom two plots show, for each combination of CA level (i.e., numbers of clusters), the CSI and average error (inter-cluster distance) for all matched objects. These last three plots are the ones made by the plot method for values returned from the summary method function.

References

Marzban, C. and Sandgathe, S. (2006) Cluster analysis for verification of precipitation fields. Wea. Forecasting, 21, 824-838.

Examples

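library(SpatialVx)  # provides clusterer and the UKobs6/UKfcst6 data sets

data(UKobs6)
data(UKfcst6)
# A lower threshold makes this much slower; see the Warning section.
look <- clusterer(X=UKobs6, Y=UKfcst6, threshold=16, trans="log", verbose=TRUE)
plot(look)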