clustering: Clustering analysis

Description

Performs clustering analysis with optional selection of variables.

Usage

clustering(
  .data,
  ...,
  by = NULL,
  means_by = NULL,
  scale = FALSE,
  selvar = FALSE,
  verbose = TRUE,
  distmethod = "euclidean",
  clustmethod = "average",
  nclust = NULL
)

Arguments

.data

The data to be analyzed. It may be a data frame containing the means of each observation in each variable or, alternatively, replicates for each factor. In the latter case, a grouping variable is required in the argument means_by to compute the means. In addition, .data may be an object passed from the function split_factors, in which case the distances are computed for each level of the grouping variable.

...

The variables in .data used to compute the distances. If no variables are supplied (the default), all the numeric variables in .data are used.

by

One variable (factor) to split the data into subsets. The function is then applied to each subset and returns a list where each element contains the results for one level of the variable in by. To split the data by more than one factor variable, use the function split_factors to pass subsetted data to .data.

means_by

If .data doesn't contain the mean for each observation, then means_by is a grouping variable to compute the means. For example, if means_by = GEN, then the means of the numerical variables will be computed for each level of the grouping variable GEN.
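The averaging step that means_by describes can be pictured with base R alone. This is only a sketch of the idea, not the code used inside clustering(); the data frame df and its values are made up for illustration.

```r
# Hypothetical replicated data: 2 genotypes, 3 replicates each
df <- data.frame(
  GEN = rep(c("G1", "G2"), each = 3),
  V1  = c(1, 2, 3, 4, 5, 6),
  V2  = c(10, 20, 30, 40, 50, 60)
)

# Mean of each numeric variable within each level of GEN
means <- aggregate(cbind(V1, V2) ~ GEN, data = df, FUN = mean)
print(means)
```

The resulting table has one row per level of GEN, which is the shape of data the distances are then computed on.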

scale

Should the data be scaled before computing the distances? Defaults to FALSE. If TRUE, each observation is divided by the standard deviation of its variable: Z_{ij} = X_{ij} / sd(j).
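A minimal base R sketch of the scaling formula above, Z_{ij} = X_{ij} / sd(j). The matrix x and its values are hypothetical, and the snippet illustrates the formula rather than the function's internal code.

```r
# Toy data: 4 observations (rows) and 2 variables (columns)
x <- matrix(c(10, 12, 14, 16,
              1.0, 1.5, 2.5, 3.0),
            ncol = 2,
            dimnames = list(paste0("G", 1:4), c("V1", "V2")))

# Divide every column by its own standard deviation (no centering)
z <- sweep(x, 2, apply(x, 2, sd), "/")

# After scaling, each variable has unit standard deviation
print(apply(z, 2, sd))
```

Scaling this way keeps variables measured on large scales from dominating the distances.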

selvar

Logical argument. Defaults to FALSE. If TRUE, an algorithm for selecting variables is run. See the section Details for additional information.

verbose

Logical argument. If TRUE (default), the results of the variable selection are shown in the console.

distmethod

The distance measure to be used. This must be one of 'euclidean', 'maximum', 'manhattan', 'canberra', 'binary', 'minkowski', 'pearson', 'spearman', or 'kendall'. The last three are correlation-based distances.
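The difference between the two families of measures can be sketched in base R. The first six names are methods of stats::dist. For the correlation-based ones, the snippet assumes a dissimilarity of 1 - r between observations; that is a common convention, and whether clustering() uses exactly this transformation is an assumption here.

```r
set.seed(1)
# Hypothetical data: 5 observations (rows), 4 variables (columns)
x <- matrix(rnorm(20), nrow = 5,
            dimnames = list(paste0("G", 1:5), paste0("V", 1:4)))

# Geometric distance between observations
d_euc <- dist(x, method = "euclidean")

# Correlation-based dissimilarity between observations
# (assumed form: 1 - r, so highly correlated rows are "close")
d_cor <- as.dist(1 - cor(t(x), method = "pearson"))

print(round(d_euc, 2))
print(round(d_cor, 2))
```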

clustmethod

The agglomeration method to be used. This should be one of 'ward.D', 'ward.D2', 'single', 'complete', 'average' (= UPGMA), 'mcquitty' (= WPGMA), 'median' (= WPGMC) or 'centroid' (= UPGMC).

nclust

The number of clusters to be formed. Defaults to NULL.
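The roles of clustmethod and nclust can be illustrated with base R, since the method names listed above are the ones accepted by stats::hclust. The sketch below assumes that correspondence and uses the built-in USArrests data purely as a stand-in; it is not the function's own code.

```r
# First 10 rows of a built-in dataset as stand-in observations
x <- as.matrix(USArrests[1:10, ])
d <- dist(x, method = "euclidean")

# Agglomerate with average linkage (UPGMA)
hc <- hclust(d, method = "average")

# Cophenetic correlation: agreement between the dendrogram and the distances
coph <- cor(as.vector(d), as.vector(cophenetic(hc)))

# Cut the tree into a fixed number of clusters (the role played by nclust)
groups <- cutree(hc, k = 3)

print(coph)
print(table(groups))
```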

Value

data The data that was used to compute the distances.
cutpoint The cutpoint of the dendrogram according to Mojena (1977).
distance The matrix with the distances.
de The distances in an object of class dist.
hc The hierarchical clustering.
Sqt The total sum of squares.
tab A table with the clusters and similarity.
clusters The sum of squares and the mean of the clusters for each variable.
cofgrap If selvar = TRUE, then cofgrap is a ggplot2-based graphic showing the cophenetic correlation for each model (with different numbers of variables). Otherwise, it is a NULL object.
statistics If selvar = TRUE, then statistics shows the summary of the models fitted with different numbers of variables, including the cophenetic correlation, Mantel's correlation with the original distances (all variables), and the p-value associated with Mantel's test. Otherwise, it is a NULL object.
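For readers unfamiliar with the cutpoint value, Mojena's stopping rule is usually written as the mean of the fusion heights plus k times their standard deviation. The sketch below assumes k = 1.25, a value often used in the literature; the constant actually used by clustering() is not stated on this page, so treat it as an assumption.

```r
# Hierarchical clustering on a built-in dataset (stand-in data)
hc <- hclust(dist(USArrests[1:15, ]), method = "average")

# Mojena-style cutpoint: mean + k * sd of the fusion heights
k <- 1.25  # assumed constant, not confirmed by this page
cutpoint <- mean(hc$height) + k * sd(hc$height)

# Each fusion above the cutpoint separates one more group
n_groups <- sum(hc$height > cutpoint) + 1
groups <- cutree(hc, h = cutpoint)

print(cutpoint)
print(n_groups)
```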

Details

When selvar = TRUE, a variable selection algorithm is executed. The objective is to select the group of variables that contributes most to explaining the variability of the original data. The selection is based on an eigenvalue/eigenvector solution with the following steps:

1: compute the distance matrix and the cophenetic correlation with the original variables (all numeric variables in the dataset);
2: compute the eigenvalues and eigenvectors of the correlation matrix between the variables;
3: delete the variable with the largest weight (the highest eigenvector element associated with the lowest eigenvalue);
4: compute the distance matrix and the cophenetic correlation with the remaining variables;
5: compute Mantel's correlation between the obtained distance matrix and the original distance matrix;
6: iterate steps 2 to 5 p - 2 times, where p is the number of original variables.

At the end of the p - 2 iterations, a summary of the models is returned. The distance is calculated with the variables that generated the model with the largest cophenetic correlation. I suggest a careful evaluation aimed at choosing a parsimonious model, i.e., the one with the fewest variables that still presents an acceptable cophenetic correlation and high similarity with the original distances.
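The steps above can be sketched in base R as follows. This is an illustrative reading of the algorithm, not the package's implementation: select_vars is a hypothetical helper, "largest weight" is taken to mean the largest absolute loading, and only the Mantel statistic (the correlation between the two distance vectors) is computed, without the permutation test that yields the p-value.

```r
# Hypothetical helper sketching steps 1 to 6 (not the package's code)
select_vars <- function(x, distmethod = "euclidean", clustmethod = "average") {
  x <- as.matrix(x)
  # Cophenetic correlation of a clustering built from distances d
  coph <- function(d) {
    hc <- hclust(d, method = clustmethod)
    cor(as.vector(d), as.vector(cophenetic(hc)))
  }
  # Step 1: distances and cophenetic correlation with all variables
  d_full <- dist(x, method = distmethod)
  out <- data.frame(n_vars = ncol(x), removed = NA_character_,
                    cophenetic = coph(d_full), mantel_r = 1,
                    stringsAsFactors = FALSE)
  vars <- colnames(x)
  while (length(vars) > 2) {                      # step 6: p - 2 iterations
    # Step 2: eigen decomposition of the correlation matrix
    e <- eigen(cor(x[, vars, drop = FALSE]))
    # Step 3: variable with the largest absolute loading on the
    # eigenvector of the smallest eigenvalue (last column)
    worst <- vars[which.max(abs(e$vectors[, length(vars)]))]
    vars <- setdiff(vars, worst)
    # Step 4: distances with the remaining variables
    d <- dist(x[, vars, drop = FALSE], method = distmethod)
    # Step 5: Mantel statistic against the original distances
    out <- rbind(out, data.frame(n_vars = length(vars), removed = worst,
                                 cophenetic = coph(d),
                                 mantel_r = cor(as.vector(d), as.vector(d_full)),
                                 stringsAsFactors = FALSE))
  }
  out
}

res <- select_vars(USArrests)   # built-in data with p = 4 variables
print(res)
```

Each row of res summarizes one model, so the trade-off between the number of variables, the cophenetic correlation, and the agreement with the original distances can be inspected side by side.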

References

Mojena, R. 1977. Hierarchical grouping methods and stopping rules: an evaluation. Comput. J. 20:359-363. doi:10.1093/comjnl/20.4.359

Examples

# NOT RUN {
library(metan)

# All rows and all numeric variables from data
d1 <- clustering(data_ge2)

# Based on the mean for each genotype
d2 <- clustering(data_ge2, means_by = GEN)

# Based on the mean of each genotype
# Variables NKR, TKW, and NKE
d3 <- clustering(data_ge2, NKR, TKW, NKE, means_by = GEN)

# Select variables to compute the distances
d4 <- clustering(data_ge2, means_by = GEN, selvar = TRUE)

# Compute the distances with standardized data
# Define 4 clusters
d5 <- clustering(data_ge2,
                 means_by = GEN,
                 scale = TRUE,
                 nclust = 4)

# Compute the distances for each environment
# Select the variables NKR, TKW, and NKE
# Use the mean for each genotype
d6 <- clustering(data_ge2,
                 NKR, TKW, NKE,
                 by = ENV,
                 means_by = GEN)

# Check the correlation between distance matrices
pairs_mantel(d6)

# }
