cluster.diagnostic: Function for Plotting Summary Cluster Diagnostic Plots

Description

Plots the similarity statistic profiles and the optimal joint clustering configuration for the means and the variances by group. Also plots quantile profiles of means and standard deviations by group and for each clustering configuration, to check that the distributions of the first and second moments of the MVR-transformed data approach their respective null distributions under the optimal configuration found, assuming independence and normality of all variables.

Usage

cluster.diagnostic(obj, 
                       title = "", 
                       span = 0.75, 
                       degree = 2, 
                       family = "gaussian", 
                       device = NULL, 
                       file = "Summary Cluster Diagnostic Plots")

Arguments

obj

Object of class "mvr" returned by mvr.

title

Title of the plot. Defaults to the empty string.

span

Span parameter of the loess() function (R package stats), which controls the degree of smoothing. Defaults to 0.75.

degree

Degree parameter of the loess() function (R package stats), which controls the degree of the polynomials to be used. Defaults to 2. (Normally 1 or 2. Degree 0 is also allowed, but see the "Note" in loess {stats} package.)

family

Family distribution in {"gaussian", "symmetric"} of the loess() function (R package stats), used for local fitting. If "gaussian", fitting is by least-squares; if "symmetric", a re-descending M estimator is used with Tukey's biweight function. Defaults to "gaussian".

device

Graphic display device in {NULL, "PS", "PDF"}. Defaults to NULL (screen). Currently implemented graphic display devices are "PS" (PostScript) and "PDF" (Portable Document Format).

file

File name for output graphic. Defaults to "Summary Cluster Diagnostic Plots".

Value

None. Displays the plots on the chosen device.

Details

In the plot of similarity statistic profiles, the red dashed line depicts the LOESS scatterplot smoother estimator. The subroutine internally generates null distributions of the data with target mean 0 and standard deviation 1 (e.g. $N(0, 1)$) to compute the similarity statistic for each cluster configuration. The optimal configuration found is indicated by the red arrow, where the similarity statistic reaches its minimum plus/minus one standard deviation (applying the conventional one-standard-deviation rule). Configurations with fewer clusters than the optimum indicate underfitting, while overfitting starts to occur at larger cluster numbers.
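
The selection rule described above can be illustrated with a minimal sketch on simulated values. Everything in it is an assumption made for illustration: the profile `stat`, its standard deviations `sde`, and the particular variant of the one-standard-deviation rule (smallest cluster number whose statistic lies within one standard deviation of the minimum). It does not reproduce the internals of cluster.diagnostic; only the loess() call uses the same tuning parameters that the function exposes.

```r
# Hypothetical similarity-statistic profile over candidate cluster numbers.
# These values are simulated for illustration; they are not MVR output.
set.seed(1)
nc   <- 1:30
stat <- 0.8 * exp(-nc / 4) + 0.01 * nc + rnorm(length(nc), sd = 0.01)
sde  <- rep(0.02, length(nc))   # assumed standard deviation of each estimate

# LOESS smoother with the span, degree and family arguments of stats::loess()
fit    <- loess(stat ~ nc, span = 0.75, degree = 2, family = "gaussian")
smooth <- predict(fit, newdata = data.frame(nc = nc))

# One common variant of the one-standard-deviation rule (an assumption here):
# the smallest cluster number whose statistic is within one standard
# deviation of the overall minimum
i.min  <- which.min(stat)
nc.opt <- min(nc[stat <= stat[i.min] + sde[i.min]])

# Profile, smoother (red dashed line) and selected configuration (red arrow)
plot(nc, stat, type = "b",
     xlab = "Number of clusters", ylab = "Similarity statistic")
lines(nc, smooth, col = "red", lty = 2)
arrows(nc.opt, max(stat), nc.opt, stat[nc.opt], col = "red", length = 0.1)
```

Lowering `span` makes the red dashed curve follow the points more closely, and `family = "symmetric"` makes it less sensitive to isolated outlying values of the statistic.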

The comparative quantile mean plot and quantile standard deviation plot check how close the empirical quantiles of means and standard deviations of the MVR-transformed data are to those of their respective theoretical null distributions (solid green lines) for each cluster configuration. The single cluster configuration, corresponding to no transformation, is the most vertical curve, while the largest cluster number configuration reaches horizontality. Under the assumption of standard normality and independence for the data under the null, the theoretical null distributions of the means and the standard deviations are respectively $N(0, 1)$ and $\sqrt{\frac{\chi_{n - G}^{2}}{n - G}}$, where $G$ denotes the number of sample groups (see Dazard, J-E. and J. S. Rao (2011) for more details). The optimal cluster configuration found is indicated by the most horizontal red curve. One should see a convergence towards the target null, after which overfitting starts to occur.
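
As a rough check of the null distribution quoted above for the standard deviations, the following sketch simulates independent standard normal data and compares empirical quantiles with $\sqrt{\chi_{n - G}^{2} / (n - G)}$. The number of simulated variables `p`, the balanced group layout, and the use of a pooled within-group standard deviation are assumptions chosen for illustration; this is not the code that cluster.diagnostic runs.

```r
set.seed(42)
n <- 6       # number of samples, as in the Real dataset example
G <- 2       # number of sample groups
p <- 5000    # number of simulated variables (hypothetical)
grp <- gl(G, n / G)

# Data under the null: independent standard normal entries
X <- matrix(rnorm(p * n), nrow = p, ncol = n)

# Within-group sums of squares per variable (one column per group)
ss <- sapply(levels(grp), function(g) {
  Xg <- X[, grp == g, drop = FALSE]
  rowSums((Xg - rowMeans(Xg))^2)
})

# Pooled within-group standard deviation, with n - G degrees of freedom
sd.pooled <- sqrt(rowSums(ss) / (n - G))

# Empirical versus theoretical quantiles
probs <- seq(0.01, 0.99, 0.01)
q.emp <- quantile(sd.pooled, probs = probs)
q.the <- sqrt(qchisq(probs, df = n - G) / (n - G))

plot(q.the, q.emp,
     xlab = "Theoretical quantiles", ylab = "Empirical quantiles")
abline(0, 1, col = "green")   # points near this line agree with the null
```

Data that have not been variance-stabilized would typically depart from the green identity line, which is the kind of departure the quantile plots are meant to reveal.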

Both cluster diagnostic plots help determine whether appropriate values of the nc.min and nc.max parameters have been set in the mvr and mvrt.test functions. The minimum of the similarity statistic profile must be reached within the range nc.min:nc.max; otherwise, run the procedure again with a wider range until this is the case. The file option is used only if device is specified (i.e. non-NULL).
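
The range check can be sketched as follows. The helper `min.below.upper` and the toy profile function are hypothetical and work on a plain numeric vector; they do not read anything from an "mvr" object. The sketch only flags a minimum that sits on the upper bound nc.max, since that is the case where a wider range would be needed.

```r
# Hypothetical helper: TRUE if the minimum of a similarity-statistic profile
# is reached before the upper bound of the searched range nc.min:nc.max
min.below.upper <- function(stat, nc.min, nc.max) {
  nc <- nc.min:nc.max
  stopifnot(length(stat) == length(nc))
  nc[which.min(stat)] < nc.max
}

# Toy profile with its minimum at 15 clusters (illustrative only)
toy.profile <- function(nc) (nc - 15)^2

# Range too narrow: the profile is still decreasing at nc.max = 10
narrow <- min.below.upper(toy.profile(1:10), nc.min = 1, nc.max = 10)

# Wider range: the minimum is now reached inside 1:30
wide <- min.below.upper(toy.profile(1:30), nc.min = 1, nc.max = 30)

print(c(narrow = narrow, wide = wide))
```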

References

Dazard, J-E. and J. S. Rao (2010). "Regularized Variance Estimation and Variance Stabilization of High-Dimensional Data." JSM Proceedings, High-Dimensional Data Analysis and Variable Selection Section. Vancouver, BC, Canada: American Statistical Association.
Dazard, J-E. and J. S. Rao (2011). "Joint Adaptive Mean-Variance Regularization and Variance Stabilization of High Dimensional Data." Comput. Statist. Data Anal. (submitted).

Examples

#===================================================
# Loading the library and its dependencies
#===================================================
library("MVR")
require("statmod", quietly = TRUE)
require("snow", quietly = TRUE)
require("RColorBrewer", quietly = TRUE)

#===================================================
# Loading of the Synthetic and Real datasets 
# (see description of datasets)
#===================================================
data("Synthetic", "Real", package="MVR")

#===================================================
# Mean-Variance Regularization (Real dataset)
# Multi-Group Assumption
# Assuming unequal variance between groups
# Without Rocks cluster usage
#===================================================
nc.min <- 1
nc.max <- 30
probs <- seq(0, 1, 0.01)
n <- 6
GF <- factor(gl(n = 2, k = n/2, len = n), 
             ordered = FALSE, 
             labels = c("M", "S"))
mvr.obj <- mvr(data = Real, 
               block = GF, 
               log = FALSE, 
               nc.min = nc.min, 
               nc.max = nc.max, 
               probs = probs,
               B = 100, 
               parallel = FALSE, 
               conf = NULL,
               verbose = TRUE)

#===================================================
# Summary Cluster Diagnostic Plots (Real dataset)
# Multi-Group Assumption
# Assuming unequal variance between groups
#===================================================
cluster.diagnostic(obj = mvr.obj, 
                   title = "Summary Cluster Diagnostic Plots\n(Real - Multi-Group Assumption)",
                   span = 0.75, 
                   degree = 2, 
                   family = "gaussian",
                   device = "PS")

#===================================================
# Mean-Variance Regularization (Real dataset)
# Single-Group Assumption
# Assuming equal variance between groups
# Without Rocks cluster usage
#===================================================
nc.min <- 1
nc.max <- 30
probs <- seq(0, 1, 0.01)
n <- 6
mvr.obj <- mvr(data = Real, 
               block = rep(1,n), 
               log = FALSE, 
               nc.min = nc.min, 
               nc.max = nc.max, 
               probs = probs, 
               B = 100, 
               parallel = FALSE, 
               conf = NULL, 
               verbose = TRUE)

#===================================================
# Summary Cluster Diagnostic Plots (Real dataset)
# Single-Group Assumption
# Assuming equal variance between groups
#===================================================
cluster.diagnostic(obj = mvr.obj, 
                   title = "Summary Cluster Diagnostic Plots\n(Real - Single-Group Assumption)",
                   span = 0.75, 
                   degree = 2, 
                   family = "gaussian",
                   device = NULL)
