featClust: R function for features clustering on the basis of distances/area

Description

The function clusters the features of the input dataset on the basis of either their (projected) coordinates (for points; SpatialPointsDataFrame class) or their area (for polygons; SpatialPolygonsDataFrame class). If a target feature dataset (to.feat) is provided, the clustering is based on the distance of each x feature to the nearest to.feat feature. When to.feat is specified, the x feature (i.e., the feature that the user wants to cluster) can be a point (SpatialPointsDataFrame class), a polyline (SpatialLinesDataFrame class), or a polygon (SpatialPolygonsDataFrame class) feature. Notice that if all the x features overlap with the to.feat features, all the minimum distances will be 0, and the function will throw an error.
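The distance-to-nearest-feature idea can be illustrated with a minimal base-R sketch. It works on plain coordinate matrices rather than Spatial* objects, and it is not the function's internal code: the object names (x_xy, to_xy, min_dist) and the coordinates are hypothetical, chosen only for illustration.

```r
# Hypothetical projected coordinates (not taken from the package's datasets)
x_xy  <- cbind(x = c(0, 1, 5, 6), y = c(0, 1, 5, 6))   # features to cluster
to_xy <- cbind(x = c(0, 10),      y = c(0, 10))        # target features

# Euclidean distance from every x point (rows) to every target point (columns)
d_all <- sqrt(outer(x_xy[, "x"], to_xy[, "x"], "-")^2 +
              outer(x_xy[, "y"], to_xy[, "y"], "-")^2)

# Distance of each x point to its nearest target
min_dist <- apply(d_all, 1, min)

# Guard mirroring the caveat above: if every minimum distance is 0,
# there is no variation left to cluster on
if (all(min_dist == 0)) stop("all minimum distances are 0")

# Distance matrix built on the one-dimensional minimum distances
dist_matrix <- dist(min_dist)
```

Here the first x point coincides with a target, so its minimum distance is 0, but the other points are farther away and the guard does not trigger.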

Usage

featClust(x, to.feat = NULL, aggl.meth = "ward.D2", part = NULL,
  showID = TRUE, oneplot = TRUE, cex.dndr.lab = 0.85,
  cex.sil.lab = 0.75, cex.feat.lab = 0.65, col.feat.lab = "black",
  export = FALSE)

Arguments

x

Dataset whose features are to be clustered; either points (SpatialPointsDataFrame class) or polygons (SpatialPolygonsDataFrame class); if to.feat is specified, x can also be a polyline feature (SpatialLinesDataFrame class).

to.feat

Dataset (NULL by default) representing the features whose distance from x is used as the basis for clustering x; either points (SpatialPointsDataFrame class), polygons (SpatialPolygonsDataFrame class), or polylines (SpatialLinesDataFrame class).

aggl.meth

Agglomeration method ("ward.D2" by default).

part

Desired number of clusters; if NULL (default), an optimal partition is calculated (see Details).

showID

TRUE (default) or FALSE if the user wants or does not want the IDs of the clustered features to be displayed in the plot where the features are colored by cluster membership.

oneplot

TRUE (default) or FALSE if the user wants or does not want the plots to be visualized in a single window.

cex.dndr.lab

Set the size of the labels used in the dendrogram.

cex.sil.lab

Set the size of the labels used in the silhouette plot.

cex.feat.lab

Set the size of the labels used (if 'showID' is set to TRUE) to show the clustered features' IDs.

col.feat.lab

Set the color of the clustered features' IDs ('black' by default).

export

TRUE or FALSE (default) if the user wants or does not want the clustered input dataset to be exported; if TRUE, the input dataset with a new variable indicating the cluster membership will be exported as a shapefile.

Value

The function returns a list storing the following components:

$dist.matrix: distance matrix
$avr.silh.width.by.n.of.clusters: average silhouette width by number of clusters
$partition.silh.data: silhouette data for the selected partition
$coord.or.area.or.min.dist.by.clust: coordinates, area, or distance to the nearest to.feat coupled with cluster membership
$dist.stats.by.cluster: by-cluster summary statistics of the x feature distance to the nearest to.feat
$dataset: the input dataset with two variables added ($feat_ID and $clust, the latter storing the cluster membership)

Details

If to.feat is not provided, the function internally calculates a distance matrix (based on the Euclidean distance) from the points' coordinates or the polygons' areas. If to.feat is provided, the distance matrix is based on the distance of each x feature to the nearest to.feat feature. A dendrogram is produced that depicts the hierarchical clustering based (by default) on Ward's agglomeration method; rectangles identify the selected cluster partition. Besides the dendrogram, a silhouette plot is produced, which allows the user to assess how 'good' the selected cluster solution is.
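The coordinate-based workflow described above can be approximated with base R and the 'cluster' package. This is a minimal sketch under stated assumptions, not the function's actual source: the simulated coordinates and the object names (xy, fit, clust, sil) are hypothetical, and a 3-cluster partition is fixed by hand.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Hypothetical projected coordinates forming three loose groups of 10 points
xy <- rbind(
  cbind(rnorm(10, 0, 0.5),  rnorm(10, 0, 0.5)),
  cbind(rnorm(10, 5, 0.5),  rnorm(10, 5, 0.5)),
  cbind(rnorm(10, 10, 0.5), rnorm(10, 0, 0.5))
)

# Euclidean distance matrix from the coordinates
d <- dist(xy, method = "euclidean")

# Hierarchical clustering with Ward's method (the documented default)
fit <- hclust(d, method = "ward.D2")

# Cut the tree into the desired number of clusters
part  <- 3
clust <- cutree(fit, k = part)

# Silhouette data for the selected partition
sil <- silhouette(clust, d)
avg_width <- mean(sil[, "sil_width"])
```

Calling plot(fit) followed by rect.hclust(fit, k = part) would draw a dendrogram with rectangles around the selected partition, and plot(sil) would draw the corresponding silhouette plot.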

As for the latter, if the parameter 'part' is left NULL (default), an optimal cluster solution is obtained. The optimal partition is selected via an iterative procedure that locates the cluster solution at which the highest average silhouette width is achieved. If a user-defined partition is needed, the user can input the desired number of clusters using the parameter 'part'. In either case, an additional plot is returned besides the cluster dendrogram and the silhouette plot; it displays a scatterplot in which the cluster solution (x-axis) is plotted against the average silhouette width (y-axis). A black dot represents the partition selected either by the iterative procedure or by the user.
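The selection of the partition with the highest average silhouette width could be sketched as follows. The range of candidate solutions (2 to max_k) is an assumption made for this example; the documentation does not state which range the function actually scans, and the data and object names are hypothetical.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Hypothetical projected coordinates forming three loose groups of 10 points
xy <- rbind(
  cbind(rnorm(10, 0, 0.5),  rnorm(10, 0, 0.5)),
  cbind(rnorm(10, 5, 0.5),  rnorm(10, 5, 0.5)),
  cbind(rnorm(10, 10, 0.5), rnorm(10, 0, 0.5))
)
d   <- dist(xy)
fit <- hclust(d, method = "ward.D2")

# Assumed upper bound for the candidate cluster solutions
max_k <- 8

# Average silhouette width for each candidate number of clusters
avg_sil <- sapply(2:max_k, function(k) {
  mean(silhouette(cutree(fit, k = k), d)[, "sil_width"])
})
names(avg_sil) <- 2:max_k

# The solution with the highest average silhouette width
best_k <- as.integer(names(which.max(avg_sil)))
```

Plotting 2:max_k against avg_sil and marking best_k would give a chart comparable to the scatterplot described above.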

Notice that in the silhouette plot, the labels on the left-hand side of the chart show the point ID number and the cluster to which each point is closer.

Also, the function returns a plot showing the input dataset, with features colored by cluster membership. Two new variables are added to the shapefile's dataframe, storing a point ID number and the corresponding cluster membership.

The silhouette plot is obtained from the 'silhouette()' function of the 'cluster' package (https://cran.r-project.org/web/packages/cluster/index.html). For a detailed description of the silhouette plot, its rationale, and its interpretation, see: Rousseeuw P J. 1987. "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis", Journal of Computational and Applied Mathematics 20, 53-65 (http://www.sciencedirect.com/science/article/pii/0377042787901257)

For the hierarchical clustering of features, see: Conolly, J., & Lake, M. (2006). Geographic Information Systems in Archaeology. Cambridge: Cambridge University Press, 168-173.

Examples

# NOT RUN {
data(springs)

#perform the analysis and automatically select an optimal partition
res <- featClust(springs)

#as above, but selecting a 3-cluster partition
res <- featClust(springs, part=3)

#cluster springs on the basis of their distance to the nearest geological fault
res <- featClust(springs, faults)

#cluster polygonal areas on the basis of their distance to the nearest spring
res <- featClust(polygons, springs)

#cluster points on the basis of their distance to the nearest polygon
res <- featClust(points, polygons)


# }