Isopam classification is performed either as a hierarchical, divisive method or as non-hierarchical partitioning. Isopam is designed for matrices representing species abundances in plots and with a diagnostic species approach in mind. It optimises clusters and cluster numbers for concentration of indicative species in groups. Predefined indicative species and cluster medoids can optionally be added for fully or semi-supervised classification.
isopam(dat, c.fix = NULL, c.max = NULL, l.max = FALSE, stopat = c(1,7),
sieve = TRUE, Gs = 3.5, ind = NULL, centers = NULL,
distance = 'bray', k.max = 100, d.max = 7, juice = FALSE,
polishing = c('strict', 'relaxed'), ...)
# S3 method for isopam
identify(x, ...)
# S3 method for isopam
plot(x, ...)
# S3 method for isopam
summary(object, ...)
# S3 method for isopam
print(x, ...)
generating call
distance measure used by Isomap
observations (plots) with group affiliation. Running group numbers for each level of the hierarchy.
observations (plots) with group affiliation. Group identifiers reflect the cluster hierarchy. Not present with only one level of partitioning.
observations (plots) representing the medoids of the resulting groups.
table summarizing parameter settings for
the partitioning steps. Name: name of the
respective parent cluster (0 in case of the first partition);
Subgroups: number of subgroups; Isomap.dim:
Isomap dimensions used; Isomap.k.min: minimum
possible Isomap k; Isomap.k: Isomap
k used; Isomap.k.max: maximum possible
Isomap k; Ind.N: number of indicators
reaching or exceeding Gs; Ind.Gs: the average
standardized G value of these indicators; and
Global.Gs: the average standardized G value of all
descriptors (species).
Cluster centers suggested by user.
Indicators suggested by user.
character string describing the type of
supervision used, or NULL if unsupervised.
Indicators used.
an object of class hclust representing
the clustering (as used by plot). Not present
with only one level of partitioning.
data used
data matrix: each row corresponds to an object (typically a plot), each column corresponds to a descriptor (typically a species). All variables must be numeric. Missing values (NAs) are not allowed. At least 3 rows (plots) are required.
number of clusters (defaults to NULL).
If a number is given, non-hierarchical partitioning is
performed, c.max is ignored and l.max is
set to one. In combination with centers, if
c.fix exceeds the number of provided centers,
semi-supervised mode is activated: the provided centers
remain fixed while additional clusters are formed
to reach the specified total (see Details).
maximum number of clusters per partition.
Applies to all splits. Defaults to 6 unless centers
is provided, in which case it defaults to the number of
centers. If explicitly set to a value greater than the
number of provided centers (with c.fix = NULL),
semi-supervised search mode is activated: the algorithm
searches for the optimal number of additional clusters
(see Details).
maximum number of hierarchy levels. Defaults
to FALSE (no maximum number). Note that divisions
may stop well before this number is reached (see
stopat). Use l.max = 1 for non-hierarchical
partitioning (or use c.fix).
vector with stopping rules for hierarchical
clustering. Two values define if a partition should be
retained in hierarchical clustering: the first determines
how many indicator species must be present per cluster,
the second defines the standardized G-value that must be
reached by these indicators. stopat is not effective
at the first hierarchy level or in non-hierarchical
partitioning.
logical. If TRUE (the default), only
species exceeding a threshold defined by Gs are
used in the search for a good clustering solution. Their
number is multiplied by their mean standardized G-value.
The product is used as optimality criterion. If FALSE
all species are used for optimization.
threshold (standardized G value) for species
to be considered in the search for a good clustering solution.
Effective with sieve = TRUE.
optional vector of column names from dat
defining species used as indicators. This turns Isopam
into an expert system. Replaces the automated selection of
indicators with sieve = TRUE (ind overrules
sieve).
optional vector with indices (numeric) or
names (character) of observations used as cluster cores.
With centers alone (fully supervised mode), exactly
as many clusters as centers are created. Combine with
c.fix or c.max for semi-supervised modes
(see Details).
name of a dissimilarity index for the distance matrix used as a starting point for Isomap. Any distance measure implemented in the packages vegan (predefined or using a designdist equation) or proxy can be used (see Details).
maximum Isomap k.
maximum number of Isomap dimensions.
logical. If TRUE, input files for Juice are
generated.
treatment of rare or invariant species and
plots with few species. In the case of polishing = "strict"
(default), species with only one occurrence or no variance and plots
with only one species are omitted during clustering. If
"relaxed" is used, only missing and invariant species and
empty plots are removed.
other arguments used by juice or passed to S3
functions plot and identify (see
dendrogram and hclust).
isopam result object in methods
plot, print, and identify.
isopam result object in method
summary.
Sebastian Schmidtlein with contributions from Jason Collison, Robin Pfannendoerfer and Lubomir Tichý
Isopam is described in Schmidtlein et al. (2010). It consists of dimensionality reduction (Isomap: Tenenbaum et al. 2000; isomap in vegan) and partitioning of the resulting ordination space (PAM: Kaufman & Rousseeuw 1990; pam in cluster). The classification is performed either as a hierarchical, divisive method, or as non-hierarchical partitioning. It has the following features: partitions are optimized for the occurrence of species with high fidelity to groups; it optionally selects the number of clusters per division; the shapes of groups in feature space are not restricted to spherical or other regular geometric shapes (thanks to the underlying Isomap algorithm); the distance measure used for the initial distance matrix can be freely defined.
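The optimality criterion described for sieve = TRUE (number of indicator species multiplied by their mean standardized G value) can be illustrated with a small base R sketch. The G values and the helper name sieve_criterion are hypothetical and only serve the illustration; how Isopam computes standardized G values internally is not shown here, and the use of >= follows the "reaching or exceeding Gs" wording used for Ind.N.

```r
# Hypothetical standardized G values for six species under one candidate
# partition (illustrative numbers, not taken from real data).
g <- c(sp1 = 5.2, sp2 = 3.5, sp3 = 1.1, sp4 = 4.0, sp5 = 0.3, sp6 = 2.9)

# Hypothetical helper (not part of the package): criterion value of one
# candidate partition, assuming sieve = TRUE.
sieve_criterion <- function(g, Gs = 3.5) {
  ind <- g[g >= Gs]                 # species passing the sieve
  if (length(ind) == 0) return(0)   # no indicators: assume a criterion of 0
  length(ind) * mean(ind)           # number of indicators x their mean G
}

crit <- sieve_criterion(g)
crit
```

With these numbers three species pass the sieve, so the criterion equals the sum of their G values. Among competing candidate partitions, the one with the highest value would be preferred.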
Three supervised modes are available when centers are
provided:
Fully supervised mode (default with centers):
Only the specified centers are used, creating exactly as many
clusters as centers provided. No optimization of cluster number.
Semi-supervised mode with fixed number of clusters
(centers + c.fix): When c.fix exceeds the
number of centers, additional clusters are formed
to reach the total specified by c.fix. For example, with
2 centers and c.fix = 4, you get 2 fixed + 2 free clusters.
The initial medoids for swapping are placed using a greedy
maximin approach.
Semi-supervised mode with optimization of cluster number
(centers + c.max): When c.max is explicitly set
greater than the number of centers (and c.fix = NULL),
the algorithm searches for the optimal total number of clusters
up to c.max. The provided centers remain fixed while the number
and placement of additional clusters is optimized. If no additional
clusters improve the result, only the provided centers are used.
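The greedy maximin placement mentioned for the semi-supervised mode with c.fix can be pictured with the following base R sketch. It is a generic farthest-point selection written for illustration, not the code used inside isopam; the helper name maximin_init and the toy data are assumptions.

```r
# Hypothetical helper (not part of the package): extend a set of fixed
# medoids to k medoids by greedy maximin (farthest-point) selection.
maximin_init <- function(d, fixed, k) {
  d <- as.matrix(d)
  chosen <- fixed
  while (length(chosen) < k) {
    candidates <- setdiff(seq_len(nrow(d)), chosen)
    # distance of each candidate to its nearest already chosen medoid
    nearest <- apply(d[candidates, chosen, drop = FALSE], 1, min)
    # add the candidate that is farthest from all chosen medoids
    chosen <- c(chosen, candidates[which.max(nearest)])
  }
  chosen
}

# Toy data: five observations on a line, observation 1 is the fixed center.
x <- c(0, 1, 2, 9, 10)
d <- dist(x)
init <- maximin_init(d, fixed = 1, k = 3)
init
```

Here the first free medoid is the observation farthest from the fixed center (observation 5), and the second is the one farthest from both (observation 3). In the method described above, such positions only serve as starting points for the subsequent swapping, while the provided centers stay fixed.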
Pre-defined indicator species are not as constraining as centers, even if preference is given to cluster solutions in which their fidelity is maximized. It depends on the data how much they affect the result.
Using polishing = "strict" reduces noise introduced by rare
species and random outcomes due to species-poor plots, which
consequently are not allocated. If you have the feeling that species
with only one occurrence and plots with only one species should also
contribute to the clustering, work with polishing = "relaxed",
where only empty plots and missing species are excluded. This comes at
the risk of noise and unstable results caused by coincidental species
occurrences.
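The two polishing rules can be sketched as a simple filter in base R. This is a minimal illustration under assumptions: the helper name polish, the toy matrix, and the order of the steps (species first, then plots, without iteration) are not taken from the package.

```r
# Hypothetical helper (not part of the package) mimicking the described rules.
polish <- function(dat, polishing = c("strict", "relaxed")) {
  polishing <- match.arg(polishing)
  dat <- as.matrix(dat)
  strict <- polishing == "strict"

  # Species: drop invariant ones, plus missing ones ("relaxed")
  # or those with fewer than two occurrences ("strict").
  occ <- colSums(dat > 0)
  variance <- apply(dat, 2, var)
  dat <- dat[, occ >= (if (strict) 2 else 1) & variance > 0, drop = FALSE]

  # Plots: drop empty ones ("relaxed") or those with fewer than
  # two remaining species ("strict").
  rich <- rowSums(dat > 0)
  dat[rich >= (if (strict) 2 else 1), , drop = FALSE]
}

# Toy data: spB occurs once, spD is missing, p2 holds one species, p5 is empty.
toy <- matrix(c(3, 1, 2, 1, 0,
                0, 0, 4, 0, 0,
                1, 0, 1, 5, 0,
                0, 0, 0, 0, 0),
              nrow = 5,
              dimnames = list(paste0("p", 1:5), paste0("sp", LETTERS[1:4])))

strict_dat  <- polish(toy, "strict")
relaxed_dat <- polish(toy, "relaxed")
dim(strict_dat)
dim(relaxed_dat)
```

In this toy case the strict setting keeps three plots and two species, while the relaxed setting additionally keeps the single-occurrence species spB and the one-species plot p2.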
The preset distance measure is Bray-Curtis (Odum 1950). Distance measures are passed to vegdist or to designdist in vegan. If this does not work, the measure is passed to dist in proxy. Measures available in vegan are listed in vegdist. Isopam does not accept distance matrices as a replacement for the original data matrix because it operates on individual descriptors (species).
Isopam is slow with large data sets. It switches to a slow mode when an internally used lookup array does not fit into RAM. This array holds the results of the search for an optimal parameterisation (selection of Isomap dimensions and k, optionally selection of cluster numbers).
plot creates (and silently returns) an object of class
dendrogram and calls the S3 plot method for that class.
identify works just like identify.hclust.
Odum, E.P. (1950): Bird populations in the Highlands (North Carolina) plateau in relation to plant succession and avian invasion. Ecology 31: 587--605.
Kaufman, L., Rousseeuw, P.J. (1990): Finding groups in data. Wiley.
Schmidtlein, S., Tichý, L., Feilhauer, H., Faude, U. (2010): A brute force approach to vegetation classification. Journal of Vegetation Science 21: 1162--1171.
Tenenbaum, J.B., de Silva, V., Langford, J.C. (2000): A global geometric framework for nonlinear dimensionality reduction. Science 290: 2319--2323.
clusters for extracting cluster assignments from
the result object.
isotab for a table of descriptor (species)
frequencies in clusters and fidelity measures. There is a plot
method associated with isotab objects that visualizes
species fidelities to clusters.
## load data to the current environment
data(andechs)
## call isopam with the standard options
ip <- isopam(andechs)
## print function
ip
## examine cluster hierarchy
plot(ip)
## retrieve cluster vector (second hierarchy level)
clusters(ip, 2)
## frequency table
it <- isotab(ip)
it
## plot with species fidelities (equalized phi)
plot(it)
## non-hierarchical partitioning with (forced) three clusters
ip <- isopam(andechs, c.fix = 3)
ip
## limiting the set of species used in cluster search
ip <- isopam(andechs, ind = c("Car_pan", "Sch_fer"))
ip
## supervised mode with fixed cluster medoids
ip <- isopam(andechs, centers = c("p3", "p19"))
ip
## semi-supervised: one fixed medoid + two free clusters
ip <- isopam(andechs, centers = "p3", c.fix = 3)
ip
## semi-supervised search: one fixed medoid and optimized number of additional clusters
ip <- isopam(andechs, centers = "p3", c.max = 4)
ip