meammd: MEA-MMD: Multivariate Efficient Approximate Maximum Mean Discrepancy

Description

Computes maximum mean discrepancy statistics with Laplacian or Gaussian kernel. Suitable for multivariate data. Naive approach, quadratic in number of observations.
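To make the quadratic cost mentioned above concrete, here is a minimal base-R sketch of a generic (biased) MMD^2 estimate with a Laplacian kernel. It is not the package's implementation: the helper name `laplacian_mmd2`, the use of the one-norm inside the kernel, and the choice of the biased estimator are all assumptions made for illustration.

```r
# Hypothetical helper, for illustration only; not part of the package.
# Biased MMD^2 estimate with Laplacian kernel k(x, y) = exp(-beta * ||x - y||_1).
laplacian_mmd2 <- function(X, Y, beta) {
  X <- as.matrix(X)   # a vector becomes a one-column matrix
  Y <- as.matrix(Y)
  stopifnot(ncol(X) == ncol(Y), beta > 0)
  n <- nrow(X)
  m <- nrow(Y)
  # All pairwise one-norm distances in the pooled sample: O((n + m)^2) work.
  D <- as.matrix(dist(rbind(X, Y), method = "manhattan"))
  K <- exp(-beta * D)
  ix <- seq_len(n)        # rows belonging to X
  iy <- n + seq_len(m)    # rows belonging to Y
  mean(K[ix, ix]) + mean(K[iy, iy]) - 2 * mean(K[ix, iy])
}

X <- matrix(1:12, ncol = 2, byrow = TRUE)
Y <- matrix(13:20, ncol = 2, byrow = TRUE)
laplacian_mmd2(X, Y, beta = 0.1)
```

Every pair of observations is visited once, which is where the quadratic cost comes from.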

Usage

meammd(
  X,
  Y,
  beta = -0.1,
  pval = TRUE,
  type = c("proj", "dist"),
  numproj = 20,
  nmethod = c(2, 1),
  distpval = c("Hommel", "Fisher"),
  numperm = 200,
  seednum = 0,
  alternative = c("greater", "two.sided"),
  allowzeropval = FALSE,
  faster = TRUE
)

Value

A list with the following elements:

pval: The p-value of the test, if it is computed (pval=TRUE). Otherwise, it is set to NA.
stat: The statistic of the test. It is returned only when type="proj"; otherwise it is set to NA.

Arguments

X: Matrix (or vector) of observations in first sample.
Y: Matrix (or vector) of observations in second sample.
beta: Kernel parameter. Must be positive; if it is not, the median heuristic is computed in quadratic time for each projection. The default value of -0.1 forces the median heuristic to be used.
pval: Boolean for whether to compute p-value or not.
type: The type of projection used. Either "proj" for random projections (default) or "dist" for interpoint distances.
numproj: Number of projections (only used if type="proj"). Default is 20.
nmethod: Norm used for interpoint distances, if type="dist". Needs to be either 2 (for two-norm, default) or 1 (for one-norm).
distpval: The p-value combination procedure if type="dist". Options are "Hommel" (default) or "Fisher". The Hommel method is preferred since the Type I error does not seem to be controlled if the Fisher method is used.
numperm: Number of permutations. Default is 200.
seednum: Seed number for generating permutations. Default is 0, which means seed is set randomly. For values larger than 0, results will be reproducible.
alternative: A character string specifying the alternative hypothesis, which must be either "greater" (default) or "two.sided". In Gretton et al., the MMD test statistic is specified so that if it is significantly larger than zero, then the null hypothesis that the two samples come from the same distribution should be rejected. For this reason, "greater" is recommended. The test will still work in many cases with "two.sided" specified, but this could lead to problems in certain cases.
allowzeropval: A boolean specifying whether zero p-values are allowed. Default is FALSE, in which case a threshold of 0.5 / (numperm+1) is used: if the computed p-value is less than this threshold, it is set to the threshold. This avoids the possibility of zero p-values.
faster: A boolean specifying whether to use the faster algorithm when computing the p-value. Default is TRUE.
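The p-value handling described for allowzeropval and distpval can be sketched in base R. The floor follows the threshold stated above. The two combination functions are the textbook Hommel and Fisher procedures and are only an assumption about what the "Hommel" and "Fisher" options compute; all three helper names are hypothetical.

```r
# Hypothetical helpers, for illustration only; not part of the package.

# Floor applied when allowzeropval = FALSE: p-values below
# 0.5 / (numperm + 1) are raised to that threshold.
floor_pval <- function(p, numperm) {
  max(p, 0.5 / (numperm + 1))
}

floor_pval(0, numperm = 200)     # raised to 0.5 / 201
floor_pval(0.03, numperm = 200)  # already above the threshold, left at 0.03

# Textbook combinations of several p-values into one global p-value
# (assumed forms; the package internals may differ).
hommel_global <- function(p) {
  min(p.adjust(p, method = "hommel"))
}
fisher_global <- function(p) {
  # Fisher's statistic -2 * sum(log(p)) is chi-squared with 2k degrees of
  # freedom when the k p-values are independent.
  pchisq(-2 * sum(log(p)), df = 2 * length(p), lower.tail = FALSE)
}

p <- c(0.01, 0.20, 0.50)
hommel_global(p)
fisher_global(p)
```

Fisher's method assumes the p-values being combined are independent, which is worth keeping in mind when reading the note on Type I error under distpval.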

References

Bodenham, D. A., and Kawahara, Y. (2023) "euMMD: efficiently computing the MMD two-sample test statistic for univariate data." Statistics and Computing 33(5): 110.

Examples

X <- matrix(c(1:12), ncol=2, byrow=TRUE)
Y <- matrix(c(13:20), ncol=2, byrow=TRUE)
# using the random projections method
mmdList <- meammd(X=X, Y=Y, pval=TRUE, type="proj", numproj=50)

# using the interpoint distances method
mmdList <- meammd(X=X, Y=Y, pval=TRUE, type="dist")
