cluster.diagnostic: Function for Plotting Summary Cluster Diagnostic Plots

Description

Plot similarity statistic profiles and the optimal joint clustering configuration for the means and the variances by group. Plot quantile profiles of means and standard deviations by group and for each clustering configuration, to check that the distributions of first and second moments of the MVR-transformed data approch their respective null distributions under the optimal configuration found, assuming independence and normality of all the variables.

Usage

cluster.diagnostic(obj, 
                       title = "", 
                       span = 0.75, 
                       degree = 2, 
                       family = "gaussian", 
                       device = NULL, 
                       file = "Cluster Diagnostic Plots")

Arguments

obj

Object of class "mvr" returned by mvr.

title

Title of the plot. Defaults to the empty string.

span

Span parameter of the loess() function (R package stats), which controls the degree of smoothing. Defaults to 0.75.

degree

Degree parameter of the loess() function (R package stats), which controls the degree of the polynomials to be used. Defaults to 2. (Normally 1 or 2. Degree 0 is also allowed, but see the "Note" in loess {stats} pa

family

Family distribution in {"gaussian", "symmetric"} of the loess() function (R package stats), used for local fitting . If "gaussian" fitting is by least-squares, and if "symmetric" a re-descending M estimator is used

device

Graphic display device in {NULL, "PS", "PDF"}. Defaults to NULL (screen). Currently implemented graphic display devices are "PS" (Postscript) or "PDF" (Portable Document Format).

file

File name for output graphic. Defaults to "Cluster Diagnostic Plots".

Value

None. Displays the plots on the chosen device.

Details

In a plot of a similarity statistic profile, one checks the goodness of fit of the transformed data relative to the hypothesized underlying reference distribution with mean-0 and standard deviation-1 (e.g. $N(0, 1)$). The red dashed line depicts the LOESS scatterplot smoother estimator. The subroutine internally generates reference null distributions for computing the similarity statistic under each cluster configuration. The optimal cluster configuration (indicated by the vertical red arrow) is found where the similarity statistic reaches its minimum plus/minus one standard deviation (applying the conventional one-standard deviation rule). A smaller cluster number configuration indicates under-regularization, while over-regularization starts to occur at larger numbers. This over/under-regularization must be viewed as a form of over/under-fitting (see Dazard, J-E. and J. S. Rao (2012) for more details). The quantile diagnostic plots uses empirical quantiles of the transformed means and standard deviations to check how closely they are approximated by theoretical quantiles derived from a standard normal equal-mean/homoscedastic model (solid green lines) under a given cluster configuration. To assess this goodness of fit of the transformed data, theoretical null distributions of the mean and variance are derived from a standard normal equal-mean/homoscedastic model with independence of the first two moments, i.e. assuming i.i.d. normality of the raw data. However, we do not require i.i.d. normality of the data in general: these theoretical null distributions are just used here as convenient ones to draw from. Note that under the assumptions that the raw data is i.i.d. standard normal ($N(0, 1)$) with independence of first two moments, the theoretical null distributions of means and standard deviations for each variable are respectively: $N(0, \frac{1}{n})$ and $\sqrt{\frac{\chi_{n - G}^{2}}{n - G}}$, where $G$ denotes the number of sample groups. The optimal cluster configuration found is indicated by the most horizontal red curve. The single cluster configuration, corresponding to no transformation, is the most vertical curve, while the largest cluster number configuration reaches horizontality. Notice how empirical quantiles of transformed pooled means and standard deviations converge (from red to black) to the theoretical null distributions (solid green lines) for the optimal configuration. One should see a convergence towards the target null, after which overfitting starts to occur (see Dazard, J-E. and J. S. Rao (2012) for more details). Both cluster diagnostic plots help determine (i) whether the minimum of the Similarity Statistic is observed within the range of clusters (i.e. a large enough number of clusters has been accommodated), and (ii) whether the corresponding cluster configuration is a good fit. If necessary, run the procedure again with larger value of the nc.max parameter in the mvr as well as in mvrt.test functions until the minimum of the similarity statistic profile is reached. Option file is used only if device is specified (i.e. non NULL).

References

Dazard J-E., Hua Xu and J. S. Rao (2011). "R package MVR for Joint Adaptive Mean-Variance Regularization and Variance Stabilization." In JSM Proceedings, Section for Statistical Programmers and Analysts. Miami Beach, FL, USA: American Statistical Association IMS - JSM, 3849-3863.
Dazard J-E. and J. S. Rao (2012). "Joint Adaptive Mean-Variance Regularization and Variance Stabilization of High Dimensional Data." Comput. Statist. Data Anal. 56(7):2317-2333.

Examples

Run this code

#===================================================
# Loading the library and its dependencies
#===================================================
library("MVR")

#===================================================
# Loading of the Synthetic and Real datasets 
# (see description of datasets)
#===================================================
data("Synthetic", "Real", package="MVR")

#===================================================
# Mean-Variance Regularization (Real dataset)
# Multi-Group Assumption
# Assuming unequal variance between groups
# Without cluster usage
#===================================================
nc.min <- 1
nc.max <- 30
probs <- seq(0, 1, 0.01)
n <- 6
GF <- factor(gl(n = 2, k = n/2, len = n), 
             ordered = FALSE, 
             labels = c("M", "S"))
mvr.obj <- mvr(data = Real, 
               block = GF, 
               log = FALSE, 
               nc.min = nc.min, 
               nc.max = nc.max, 
               probs = probs,
               B = 100, 
               parallel = FALSE, 
               conf = NULL,
               verbose = TRUE)

#===================================================
# Summary Cluster Diagnostic Plots (Real dataset)
# Multi-Group Assumption
# Assuming unequal variance between groups
#===================================================
cluster.diagnostic(obj = mvr.obj, 
                   title = "Cluster Diagnostic Plots 
                   (Real - Multi-Group Assumption)",
                   span = 0.75, 
                   degree = 2, 
                   family = "gaussian",
                   device = "PS")

#===================================================
# Mean-Variance Regularization (Real dataset)
# Single-Group Assumption
# Assuming equal variance between groups
# Without cluster usage
#===================================================
nc.min <- 1
nc.max <- 30
probs <- seq(0, 1, 0.01)
n <- 6
mvr.obj <- mvr(data = Real, 
               block = rep(1,n), 
               log = FALSE, 
               nc.min = nc.min, 
               nc.max = nc.max, 
               probs = probs, 
               B = 100, 
               parallel = FALSE, 
               conf = NULL, 
               verbose = TRUE)

#===================================================
# Summary Cluster Diagnostic Plots (Real dataset)
# Single-Group Assumption
# Assuming equal variance between groups
#===================================================
cluster.diagnostic(obj = mvr.obj, 
                   title = "Cluster Diagnostic Plots 
                   (Real - Single-Group Assumption)",
                   span = 0.75, 
                   degree = 2, 
                   family = "gaussian",
                   device = NULL)

Run the code above in your browser using DataLab