
Performs clustering analysis with selection of variables.
clustering(
  .data,
  ...,
  by = NULL,
  scale = FALSE,
  selvar = FALSE,
  verbose = TRUE,
  distmethod = "euclidean",
  clustmethod = "average",
  nclust = NULL
)
.data The data to be analyzed. It can be a data frame, possibly with grouped data passed from group_by().
... The variables in .data used to compute the distances. If no variables are supplied, all the numeric variables in .data are used.
by One variable (factor) to compute the function by. It is a shortcut to group_by(). To compute the statistics by more than one grouping variable, use that function.
scale Should the data be scaled before computing the distances? Defaults to FALSE. If TRUE, each observation is divided by the standard deviation of its variable: Z_{ij} = X_{ij} / sd(j).
selvar Logical argument, defaults to FALSE. If TRUE, an algorithm for selecting variables is run. See the section Details for additional information.
verbose Logical argument. If TRUE (default), the results for variable selection are shown in the console.
distmethod The distance measure to be used. This must be one of 'euclidean', 'maximum', 'manhattan', 'canberra', 'binary', 'minkowski', 'pearson', 'spearman', or 'kendall'. The last three are correlation-based distances.
clustmethod The agglomeration method to be used. This should be one of 'ward.D', 'ward.D2', 'single', 'complete', 'average' (= UPGMA), 'mcquitty' (= WPGMA), 'median' (= WPGMC), or 'centroid' (= UPGMC).
nclust The number of clusters to be formed. Defaults to NULL.
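To make the roles of scale, distmethod, clustmethod, and nclust more concrete, the base R sketch below walks through the same sequence of steps using only the stats package. It is an illustration under assumptions, not the internals of clustering(): the built-in mtcars data, the object names, and the use of 1 - r as the correlation-based distance are all choices made for this example.

```r
# Illustrative base R sketch; not the implementation used by clustering().
X <- as.matrix(mtcars[, c("mpg", "disp", "hp", "wt")])  # example data (assumption)

# scale = TRUE: divide each column by its standard deviation, Z_ij = X_ij / sd(j)
Z <- sweep(X, 2, apply(X, 2, sd), "/")

# distmethod = "euclidean": distances between rows
d_euc <- dist(Z, method = "euclidean")

# A correlation-based distance between rows, assumed here to be 1 - r
d_cor <- as.dist(1 - cor(t(Z), method = "pearson"))

# clustmethod = "average": UPGMA agglomeration
hc <- hclust(d_euc, method = "average")

# nclust = 4: cut the dendrogram into four groups
groups <- cutree(hc, k = 4)
table(groups)
```

Swapping the method argument of dist() or hclust() for another of the values listed above changes the distance or the agglomeration rule without altering the rest of the sketch.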
data The data that was used to compute the distances.
cutpoint The cutpoint of the dendrogram according to Mojena (1977).
distance The matrix with the distances.
de The distances in an object of class dist.
hc The hierarchical clustering.
Sqt The total sum of squares.
tab A table with the clusters and similarity.
clusters The sum of squares and the mean of the clusters for each variable.
cofgrap If selvar = TRUE, then cofgrap is a ggplot2-based graphic showing the cophenetic correlation for each model (with different numbers of variables). Otherwise, it is a NULL object.
statistics If selvar = TRUE, then statistics shows the summary of the models fitted with different numbers of variables, including the cophenetic correlation, Mantel's correlation with the original distances (all variables), and the p-value associated with Mantel's test. Otherwise, it is a NULL object.
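The cutpoint entry above refers to Mojena's stopping rule, which places the cut at a height equal to the mean of the fusion heights plus a multiple k of their standard deviation. The base R sketch below shows that calculation. The constant k = 1.25 is a commonly used value and is an assumption here, as are the mtcars data and the object names; the value used internally by clustering() is not stated in this page.

```r
# Illustrative sketch of a Mojena-type cutpoint; k = 1.25 is an assumed constant.
X  <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])
hc <- hclust(dist(X), method = "average")

k <- 1.25
cutpoint <- mean(hc$height) + k * sd(hc$height)

# Number of groups implied by cutting the dendrogram at that height
n_groups <- length(unique(cutree(hc, h = cutpoint)))
cutpoint
n_groups
```

A larger k moves the cut higher up the dendrogram and therefore yields fewer, broader groups.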
When selvar = TRUE, a variable selection algorithm is executed. The
objective is to select the group of variables that contributes most to
explaining the variability of the original data. The selection is based on an
eigenvalue/eigenvector solution with the following steps. 1: compute the
distance matrix and the cophenetic correlation with the original variables
(all numeric variables in the dataset); 2: compute the eigenvalues and
eigenvectors of the correlation matrix between the variables; 3: delete the
variable with the largest weight (the largest element of the eigenvector
associated with the smallest eigenvalue); 4: compute the distance matrix and
the cophenetic correlation with the remaining variables; 5: compute Mantel's
correlation between the resulting distance matrix and the original distance
matrix; 6: repeat steps 2 to 5 p - 2 times, where p is the number of original
variables. At the end of the p - 2 iterations, a summary of the models is
returned. The distance is calculated with the variables that generated the
model with the largest cophenetic correlation. I suggest a careful evaluation
aimed at choosing a parsimonious model, i.e., the one with the fewest
variables that still presents an acceptable cophenetic correlation and high
similarity with the original distances.
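The six steps above can be sketched in base R as follows. This is a minimal illustration, not the code used by clustering(): the mtcars variables, the helper coph_cor(), and the choice of the largest absolute eigenvector element as the "largest weight" are assumptions. The Mantel statistic is computed here as the plain Pearson correlation between the two distance vectors; the permutation-based p-value is omitted.

```r
# Illustrative sketch of the selection loop; not the code used by clustering().
X <- as.matrix(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
p <- ncol(X)

# Hypothetical helper: cophenetic correlation of an average-linkage tree on d
coph_cor <- function(d) {
  hc <- hclust(d, method = "average")
  cor(as.vector(d), as.vector(cophenetic(hc)))
}

# Step 1: distances and cophenetic correlation with all variables
d_full  <- dist(X)
results <- data.frame(n_vars = p, removed = NA_character_,
                      cophenetic = coph_cor(d_full), mantel_r = 1)

vars <- colnames(X)
for (i in seq_len(p - 2)) {
  # Step 2: eigen decomposition of the correlation matrix
  e <- eigen(cor(X[, vars, drop = FALSE]))
  # Step 3: eigen() sorts eigenvalues in decreasing order, so the last column
  # belongs to the smallest one; drop the variable with the largest |element|
  last_vec <- e$vectors[, length(vars)]
  drop_var <- vars[which.max(abs(last_vec))]
  vars     <- setdiff(vars, drop_var)
  # Step 4: distances with the remaining variables
  d_red <- dist(X[, vars, drop = FALSE])
  # Step 5: Mantel statistic against the original distances
  results <- rbind(results,
                   data.frame(n_vars = length(vars), removed = drop_var,
                              cophenetic = coph_cor(d_red),
                              mantel_r = cor(as.vector(d_full), as.vector(d_red))))
}

results                                          # summary of the p - 1 models
best <- results[which.max(results$cophenetic), ] # largest cophenetic correlation
```

Inspecting results row by row, rather than only taking best, mirrors the advice above: a model with fewer variables may be preferable if its cophenetic and Mantel correlations remain acceptable.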
Mojena, R. 1977. Hierarchical grouping methods and stopping rules: an evaluation. Comput. J. 20:359-363. doi:10.1093/comjnl/20.4.359
# NOT RUN {
library(metan)

# All rows and all numeric variables from data
d1 <- clustering(data_ge2)

# Based on the mean for each genotype
mean_gen <-
  data_ge2 %>%
  means_by(GEN) %>%
  column_to_rownames("GEN")
d2 <- clustering(mean_gen)

# Select variables to compute the distances
d3 <- clustering(mean_gen, selvar = TRUE)

# Compute the distances with standardized data,
# separately for each environment, and define 4 clusters
d4 <- clustering(data_ge,
                 by = ENV,
                 scale = TRUE,
                 nclust = 4)
# }