hhg.test: Heller-Heller-Gorfine Tests of Independence

Description

This function implements the Heller-Heller-Gorfine (HHG) test of independence between two random vectors (x and y) having arbitrary dimensions.

Usage

hhg.test(Dx, Dy, ties = T, w.sum = 0, w.max = 2, nr.perm = 10000, 
         is.sequential = F, seq.total.nr.tests = 1, 
         seq.alpha.hyp = NULL, seq.alpha0 = NULL, seq.beta0 = NULL, seq.eps = NULL, 
         nr.threads = 0, tables.wanted = F, perm.stats.wanted = F)

Arguments

a symmetric matrix of doubles, where element [i, j] is a norm-based distance between the i'th and j'th x samples.

same as Dx, but for distances between y's (the user may choose any norm when computing Dx, Dy).

ties

a boolean specifying whether ties in Dx and/or Dy exist and are to be properly handled (requires more computation).

w.sum

minimum expected frequency taken into account when computing the sum.chisq statistic (must be non-negative, contribution of tables having cells with smaller values will be truncated to zero).

w.max

minimum expected frequency taken into account when computing the max.chisq statistic (must be non-negative, contribution of tables having cells with smaller values will be truncated to zero).

nr.perm

number of permutations from which a p-value is to be estimated (must be non-negative). Can be specified as zero if only the observed statistics are wanted, without p-values. The actual number of permutations used may be slightly larger when using multiple

is.sequential

boolean flag specifying whether Wald's sequential test is desired (see Details), otherwise a simple Monte-Carlo computation of nr.perm permutations is performed. When this argument is TRUE, either seq.total.nr.tests

seq.total.nr.tests

the total number of hypotheses in the family of hypotheses simultaneously tested. When this optional argument is supplied, it is used to derive default values for the parameters of the Wald sequential test. The default derivation is done assuming a nomina

seq.alpha.hyp

the nominal test size for this single test within the multiple testing procedure.

seq.alpha0

the nominal test size for testing the side null hypothesis of p-value > seq.alpha.hyp.

seq.beta0

one minus the power for testing the side null hypothesis of p-value > seq.alpha.hyp.

seq.eps

approximation margin around seq.alpha.hyp that defines the p-value regions for the side null p > seq.alpha.hyp * (1 + seq.eps) and side alternative p < seq.alpha.hyp * (1 - seq.eps).

nr.threads

number of processing cores to use for p-value permutation. If left as zero, will try to use all available cores.

tables.wanted

boolean flag determining whether to output detailed local 2x2 contingency tables.

perm.stats.wanted

boolean flag determining whether to output statistics values computed for all permutations (representing null distributions).

Value

Four statistics described in the original HHG paper are returned: sum.chisq - sum of Pearson chi-squared statistics from the 2x2 contingency tables considered sum.lr - sum of liklihood ratio ("G statistic") values from the 2x2 tables max.chisq - maximum Pearson chi-squared statistic from any of the 2x2 tables max.lr - maximum G statistic from any of the 2x2 tables If nr.perm > 0, then estimated permutation p-values are returned as: perm.pval.hhg.sc perm.pval.hhg.sl perm.pval.hhg.mc perm.pval.hhg.ml In order to give information that may help localize where in the support of the distributions of x and y there is departure from independence, if tables.wanted is true, the 2x2 tables themselves are provided in: extras.hhg.tbls This is a n^2 by 4 matrix, whose columns are A11, A12, A21, A22 as denoted in the original HHG paper. Row r of the matrix corresponds to S_ij in the same paper, where i = 1 + floor((r - 1) / n), and j = 1 + ((r - 1) %% n). Since S_ij is never computed for i == j, rows (0:(n - 1)) * n + (1:n) contain NAs on purpose. Finally, as a means of estimating the null distributions of computed statistics, if perm.stats.wanted is true, the statistics computed for every permutation of the data performed during testing is outputted as: extras.perm.stats A data.frame with one variable per statistic and one sample per permutation.

Details

The HHG test (Heller et al., 2013) is a powerful nonparametric test for association (or, alternatively, independence) between two random vectors (say, x and y) of arbitrary dimensions. It is consistent against virtually any form of dependence, and has been shown to offer significantly more power than alternative approaches in the face of simple, and, more importantly, complex high-dimensional dependence. The test relies on norm-based distance metrics in x and (separately) in y. The choice of metric for either variable is up to the user (e.g. Euclidean, Mahalanobis, Manhattan, or whatever is appropriate for the data). The general implementation in hhg.test takes the distance matrices computed on an observed sample, Dx and Dy, and starts form there. When enabled by is.sequential, Wald's sequential test is implemented as suggested by Fay et al. (2007) in order to reduce the O(nr.perm * n^2 * log(n)) computational compelxity of the permutation test to a more managable size. Especially when faced with high multiplicity, say M simultaneous tests, the necessary number of iterations may be quite large. For example, if it is desired to control the family-wise error rate (FWER) at level alpha using the Bonferroni correction, one needs a p-value of alpha / M to establish significance. This seems to suggest that the minimum number of permutations required is nr.perm = M / alpha. However, if it becomes clear after a smaller number of permutations that the null cannot be rejected, no additional permutations are needed, and the p-value can be conservatively estimated as 1. Often, only a handful of hypotheses in a family are expected to be non-null. In this case the number of permutations for testing all hypotheses using Wald's procedure is expected to be much lower than the full M^2 / alpha. The target significance level of the sequential test is specified in the argument seq.alpha.hyp. It depends on the number of hypotheses M, and the type of multiplicity correction wanted. For the Bonferroni correction, the threshold is alpha / M. For the less conservative procedure of Benjamini & Hochberg (1995), it is M1 * q / M, where q is the desired false discovery rate (FDR), and M1 is the (unknwon) number of true non-null hypotheses. Although M1 is unknown, the investigator can sometimes estimate it conservatively (e.g., if at most 0.02 of the hypotheses are expected to be non-null, set M1 = 0.02 * M).

References

Heller, R., Heller, Y., & Gorfine, M. (2013). A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2), 503-510. Fay, M., Kim., H., & Hachey, M. (2007). On using truncated sequential probability ratio test boundaries for Monte Carlo implementation of hypothesis tests. Journal of Computational and Graphical Statistics, 16(4), 946-967. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289-300. Hall, P. & Tajvidi, N. (2002). Permutation tests for equality of distributions in high-dimensional settings. Biometrika, 89(2), 359-374. Szekely, G., Rizzo, M., & Bakirov, N. (2007). Measuring and testing independence by correlation of distances. The Annals of Statistics, 35, 2769-2794.

Examples

Run this code

## 1. The test of independence

## 1.1. A non-null univariate example

## Generate some data from the Circle example
n = 50
X = hhg.example.datagen(n, 'Circle')
plot(X[1,], X[2,])

## Compute distance matrices, on which the HHG test will be based
Dx = as.matrix(dist((X[1,]), diag = TRUE, upper = TRUE))
Dy = as.matrix(dist((X[2,]), diag = TRUE, upper = TRUE))

## Compute HHG statistics, and p-values using 1000 random permutations
hhg = hhg.test(Dx, Dy, nr.perm = 1000)

## Print the sum-chisq statistic and its permutation p-value
hhg$sum.chisq
hhg$perm.pval.hhg.sc

## 1.2. A null univariate example

n = 50
X = hhg.example.datagen(n, '4indclouds') 

Dx = as.matrix(dist((X[1,]), diag = TRUE, upper = TRUE))
Dy = as.matrix(dist((X[2,]), diag = TRUE, upper = TRUE))

hhg = hhg.test(Dx, Dy, nr.perm = 1000)

hhg$sum.chisq
hhg$perm.pval.hhg.sc

hhg$sum.lr
hhg$perm.pval.hhg.sl

## 1.3. A multivariate example
library(MASS)

n = 50
p = 5
x = t(mvrnorm(n, rep(0, p), diag(1, p)))
y = log(x ^ 2)
Dx = as.matrix(dist((t(x)), diag = TRUE, upper = TRUE))
Dy = as.matrix(dist((t(y)), diag = TRUE, upper = TRUE))

hhg = hhg.test(Dx, Dy, nr.perm = 1000)

hhg$sum.chisq
hhg$perm.pval.hhg.sc

Run the code above in your browser using DataLab