
MVR (version 1.30.3)

mvr: Function for Mean-Variance Regularization and Variance Stabilization

Description

End-user function for Mean-Variance Regularization (MVR) and Variance Stabilization by similarity statistic under sample group homoscedasticity or heteroscedasticity assumptions.

Returns an object of class "mvr". Offers the option of parallel computation for improved efficiency.

Usage

mvr(data, block = rep(1,nrow(data)), tolog = FALSE, nc.min = 1, nc.max = 30, probs = seq(0, 1, 0.01), B = 100, parallel = FALSE, conf = NULL, verbose = TRUE)

Arguments

data
numeric matrix of untransformed (raw) data, where samples are by rows and variables (to be clustered) are by columns, or an object that can be coerced to such a matrix (such as a numeric vector or a data.frame with all numeric columns). Missing values (NA), Not-a-Number values (NaN) and infinite values (Inf) are not allowed.
block
character or numeric vector, or factor, used as the grouping/blocking variable. Its length must equal the sample size. Defaults to the single-group situation (see details).
tolog
logical scalar. Is the data to be log2-transformed first? Optional, defaults to FALSE. Note that zero or negative values are set to 1 before the log2 transformation is applied.
nc.min
positive integer scalar of the minimum number of clusters. Defaults to 1.
nc.max
positive integer scalar of the maximum number of clusters. Defaults to 30.
probs
numeric vector of probabilities for quantile diagnostic plots. Defaults to seq(0, 1, 0.01).
B
positive integer scalar of the number of Monte Carlo replicates of the inner loop of the sim statistic function (see details). Defaults to 100.
parallel
logical scalar. Is parallel computing to be performed? Optional, defaults to FALSE.
conf
list of parameters for cluster configuration, passed as inputs to function makeCluster (R package parallel) for cluster setup. Optional, defaults to NULL. See details for usage.
verbose
logical scalar. Is the output to be verbose? Optional, defaults to TRUE.
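
The pre-processing described for argument tolog can be illustrated in base R. This is only a sketch of the documented behavior, not the package's internal code; the small matrix x is made up for the illustration, and "null values" is read here as zeros.

```r
# Hypothetical raw data: 2 samples (rows) x 3 variables (columns)
x <- matrix(c(8, 0, -3,
              4, 16, 2), nrow = 2, byrow = TRUE)

# Zero or negative entries are set to 1 so that log2 is defined
x[x <= 0] <- 1

# log2 transformation; the replaced entries become log2(1) = 0
x.log <- log2(x)
print(x.log)
```

In practice, passing tolog = TRUE to mvr is meant to take care of this step, so the raw matrix can be supplied as is.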

Value

Xraw
numeric matrix of original data.
Xmvr
numeric matrix of MVR-transformed data.
centering
numeric vector of centering values for standardization (cluster mean of pooled sample mean).
scaling
numeric vector of scaling values for standardization (cluster mean of pooled sample std dev).
MVR
list (of size the number of groups) containing for each group:
  • membership numeric vector of cluster membership of each variable
  • nc Positive integer scalar of number of clusters found in optimal cluster configuration
  • gap numeric vector of the similarity statistic values
  • sde numeric vector of the standard errors of the similarity statistic values
  • mu.std numeric matrix (K x p) of the vector of standardized means by groups (rows), where K = #groups and p = #variables
  • sd.std numeric matrix (K x p) of the vector of standardized standard deviations by groups (rows), where K = #groups and p = #variables
  • mu.quant numeric matrix (nc.max - nc.min + 1) x (length(probs)) of quantiles of means
  • sd.quant numeric matrix (nc.max - nc.min + 1) x (length(probs)) of quantiles of standard deviations
block
Value of argument block.
tolog
Value of argument tolog.
nc.min
Value of argument nc.min.
nc.max
Value of argument nc.max.
probs
Value of argument probs.

Details

Argument block is a vector or a factor grouping/blocking variable. Its length must equal the sample size, and it must contain as many distinct character or numeric values as there are levels or sample groups. It defaults to the single-group situation, i.e. the assumption of equal variance between sample groups. All group sample sizes must be greater than 1, otherwise the program will stop.
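
As an illustration of these requirements, the following base R sketch builds a two-group blocking factor and checks the group sizes before any call to mvr. The group labels and the sample size are hypothetical.

```r
# Hypothetical design: 6 samples in two groups labelled "M" and "S"
block <- factor(rep(c("M", "S"), each = 3))

# Number of samples in each group
group.sizes <- table(block)
print(group.sizes)

# Length must match the sample size, and every group needs more than 1 sample
n <- 6
stopifnot(length(block) == n, all(group.sizes > 1))
```

A check of this kind catches a singleton group early, before the regularization is started.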

Note that when parallelization is used (i.e. conf is non-NULL), argument B is internally reset to conf$cpus*ceiling(B/conf$cpus), where conf$cpus denotes the total number of CPUs to be used (see below).
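
A worked instance of this reset rule, using hypothetical values (B = 100 replicates on 8 CPUs), shows that the effective number of replicates is rounded up to the next multiple of the CPU count:

```r
B <- 100                   # requested number of Monte Carlo replicates
conf <- list("cpus" = 8)   # hypothetical: only the field needed for this rule

# Reset rule quoted above: round B up to a multiple of conf$cpus
B.eff <- conf$cpus * ceiling(B / conf$cpus)
print(B.eff)   # 104
```

Choosing B as a multiple of the number of CPUs leaves it unchanged.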

Argument nc.max currently defaults to 30. Empirically, we found that this is enough for most datasets tested. Whether it suffices depends on (i) the dimensionality/sample size ratio p/n, (ii) the signal/noise ratio, and (iii) whether a pre-transformation has been applied (see Dazard, J-E. and J. S. Rao (2012) for more details). See the cluster diagnostic function cluster.diagnostic to assess whether larger values of nc.max may be required.

To run a parallel session (and parallel RNG) of the MVR procedures (parallel=TRUE), argument conf must be specified (i.e. non-NULL). It must list the following parameters for cluster configuration: "names", "cpus", "type", "homo", "verbose", "outfile". These match the arguments described in function makeCluster of the R package parallel. All fields are required to properly configure the cluster, except for "names" and "cpus", which are used alternatively: "names" in the case of a cluster of type "SOCK" (socket), and "cpus" in the case of a cluster of any other type.

  • "names": names : character vector specifying the host names on which to run the job. Could default to a unique local machine, in which case, one may use the unique host name "localhost". Each host name can potentially be repeated to the number of CPU cores available on the corresponding machine.
  • "cpus": spec : integer scalar specifying the total number of CPU cores to be used across the network of available nodes, counting the workernodes and masternode.
  • "type": type : character vector specifying the cluster type ("SOCK", "PVM", "MPI").
  • "homo": homogeneous : logical scalar to be set to FALSE for inhomogeneous clusters.
  • "verbose": verbose : logical scalar to be set to FALSE for quiet mode.
  • "outfile": outfile : character vector of the output log file name for the workernodes.

The actual creation of the cluster, its initialization, and closing are all done internally. In addition, when random number generation is needed, the creation of separate streams of parallel RNG per node is done internally by distributing the stream states to the nodes. For more details, see function makeCluster (R package parallel) and/or http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.

References

  • Dazard J-E., Hua Xu and J. S. Rao (2011). "R package MVR for Joint Adaptive Mean-Variance Regularization and Variance Stabilization." In JSM Proceedings, Section for Statistical Programmers and Analysts. Miami Beach, FL, USA: American Statistical Association IMS - JSM, 3849-3863.
  • Dazard J-E. and J. S. Rao (2012). "Joint Adaptive Mean-Variance Regularization and Variance Stabilization of High Dimensional Data." Comput. Statist. Data Anal. 56(7):2317-2333.

See Also

  • makeCluster (R package parallel).
  • justvsn (R package vsn): variance stabilization and calibration for microarray data (Huber, 2002).

Examples

#===================================================
# Loading the library and its dependencies
#===================================================
library("MVR")

## Not run: 
#     #===================================================
#     # MVR package news
#     #===================================================
#     MVR.news()
# 
#     #================================================
#     # MVR package citation
#     #================================================
#     citation("MVR")
# 
#     #===================================================
#     # Loading of the Synthetic and Real datasets
#     # (see description of datasets)
#     #===================================================
#     data("Synthetic", "Real", package="MVR")
#     ?Synthetic
#     ?Real
# ## End(Not run)

#===================================================
# Mean-Variance Regularization (Synthetic dataset)
# Single-Group Assumption
# Assuming equal variance between groups
# Without cluster usage
#===================================================
nc.min <- 1
nc.max <- 10
probs <- seq(0, 1, 0.01)
n <- 10
mvr.obj <- mvr(data = Synthetic,
               block = rep(1,n),
               tolog = FALSE,
               nc.min = nc.min,
               nc.max = nc.max,
               probs = probs,
               B = 100,
               parallel = FALSE,
               conf = NULL,
               verbose = TRUE)

## Not run: 
#     #===================================================
#     # Examples of parallelization below with 
#     # a SOCKET or MPI cluster configuration
#     #===================================================
#     # 1- WINDOWS multicores PC with SOCKET communication
#     #    With a 2-Quad (8-CPUs) PC
#     #===================================================
#     if (.Platform$OS.type == "windows") {
#         cpus <- detectCores()
#         conf <- list("names" = rep("localhost", cpus),
#                      "cpus" = cpus,
#                      "type" = "SOCK",
#                      "homo" = TRUE,
#                      "verbose" = TRUE,
#                      "outfile" = "")
#     }
#     #===================================================
#     # 2- LINUX multinodes cluster with SOCKET communication
#     #    with 4-nodes (32-CPUs) cluster
#     #    with 1 masternode and 3 workernodes
#     #    All hosts run identical setups
#     #    Same number of core CPUs (8) per node
#     #===================================================
#     if (.Platform$OS.type == "unix") {
#         masterhost <- Sys.getenv("HOSTNAME")
#         slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
#         nodes <- length(slavehosts) + 1
#         cpus <- 8
#         conf <- list("names" = c(rep(masterhost, cpus),
#                                  rep(slavehosts, cpus)),
#                      "cpus" = nodes * cpus,
#                      "type" = "SOCK",
#                      "homo" = TRUE,
#                      "verbose" = TRUE,
#                      "outfile" = "")
#     }
#     #===================================================
#     # 3- LINUX multinodes cluster with MPI communication
#     #    Here, a file named ".nodes" (e.g. in the home directory)
#     #    must contain the list of nodes of the cluster
#     #===================================================
#     if (.Platform$OS.type == "unix") {
#         hosts <- scan(file=paste(Sys.getenv("HOME"), "/.nodes", sep=""), 
#                       what="", 
#                       sep="\n")
#         hostnames <- unique(hosts)
#         nodes <- length(hostnames)
#         cpus <-  length(hosts)/length(hostnames)
#         conf <- list("cpus" = nodes * cpus,
#                      "type" = "MPI",
#                      "homo" = TRUE,
#                      "verbose" = TRUE,
#                      "outfile" = "")
#     }
#     #===================================================
#     # Run:
#     # Mean-Variance Regularization (Real dataset)
#     # Multi-Group Assumption
#     # Assuming unequal variance between groups
#     #===================================================
#     nc.min <- 1
#     nc.max <- 30
#     probs <- seq(0, 1, 0.01)
#     n <- 6
#     GF <- factor(gl(n = 2, k = n/2, len = n),
#                  ordered = FALSE,
#                  labels = c("M", "S"))
#     mvr.obj <- mvr(data = Real,
#                    block = GF,
#                    tolog = FALSE,
#                    nc.min = nc.min,
#                    nc.max = nc.max,
#                    probs = probs,
#                    B = 100,
#                    parallel = TRUE,
#                    conf = conf,
#                    verbose = TRUE)
#     ## End(Not run)
