apval_Chen2014: Asymptotics-Based p-value of the Test Proposed by Chen et al (2014)

Description

Calculates the p-value of the test for equality of two-sample high-dimensional mean vectors proposed by Chen et al (2014), based on the asymptotic distribution of the test statistic.

Usage

apval_Chen2014(sam1, sam2, eq.cov = TRUE)

Arguments

sam1

an n1 by p matrix from sample population 1. Each row represents a $p$-dimensional sample.

sam2

an n2 by p matrix from sample population 2. Each row represents a $p$-dimensional sample.

eq.cov

a logical value. The default is TRUE, indicating that the two sample populations have the same covariance; otherwise, the covariances are assumed to be different.

Value

A list including the following elements:

sam.info

the basic information about the two groups of samples, including the sample sizes and dimension.

cov.assumption

the equality assumption on the covariances of the two sample populations; this was specified by the argument eq.cov.

method

this output reminds users that the p-values are obtained using the asymptotic distributions of test statistics.

pval

the p-value of the test proposed by Chen et al (2014).

Details

Suppose that the two groups of $p$-dimensional independent and identically distributed samples $\{X_{1i}\}_{i=1}^{n_1}$ and $\{X_{2j}\}_{j=1}^{n_2}$ are observed; we consider high-dimensional data with $p \gg n := n_1 + n_2 - 2$. Assume that the covariances of the two sample populations are $\Sigma_1 = (\sigma_{1, ij})$ and $\Sigma_2 = (\sigma_{2, ij})$. The primary objective is to test $H_{0}: \mu_1 = \mu_2$ versus $H_{A}: \mu_1 \neq \mu_2$. Let $\bar{X}_{k}$ be the sample mean for group $k = 1, 2$. For a vector $v$, we denote by $v^{(i)}$ its $i$th element.

Chen et al (2014) proposed removing estimated zero components in the mean difference through thresholding; they considered

$$T_{CLZ}(s) = \sum_{i = 1}^{p} \left\{ \frac{(\bar{X}_1^{(i)} - \bar{X}_2^{(i)})^2}{\sigma_{1,ii}/n_1 + \sigma_{2,ii}/n_2} - 1 \right\} I \left\{ \frac{(\bar{X}_1^{(i)} - \bar{X}_2^{(i)})^2}{\sigma_{1,ii}/n_1 + \sigma_{2,ii}/n_2} > \lambda_{p}(s) \right\},$$

where the threshold level is $\lambda_p(s) := 2 s \log p$ and $I(\cdot)$ is the indicator function.

Since an optimal choice of the threshold is unknown, they proposed trying all possible threshold values and then choosing the most significant one as their final test statistic:

$$T_{CLZ} = \max_{s \in (0, 1 - \eta)} \left\{ T_{CLZ}(s) - \hat{\mu}_{T_{CLZ}(s), 0} \right\} / \hat{\sigma}_{T_{CLZ}(s), 0},$$

where $\hat{\mu}_{T_{CLZ}(s), 0}$ and $\hat{\sigma}_{T_{CLZ}(s), 0}$ are estimates of the mean and standard deviation of $T_{CLZ}(s)$ under the null hypothesis. They derived its asymptotic null distribution as an extreme value distribution.
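
To make the thresholding step concrete, the following is a minimal sketch of $T_{CLZ}(s)$ for a single threshold level $s$, written in base R. It is an illustration only and not the code used by apval_Chen2014: the helper name t_clz_s is hypothetical, and the unknown variances $\sigma_{1,ii}$ and $\sigma_{2,ii}$ are assumed to be replaced by per-column sample variances. The standardization by the null mean and standard deviation and the maximization over $s$ are not shown.

```r
# Sketch of the thresholded statistic T_CLZ(s) for one threshold level s.
# Assumption: sigma_{k,ii} is estimated by the sample variance of column i
# in group k. This is an illustrative plug-in, not the package's own code.
t_clz_s <- function(sam1, sam2, s) {
  n1 <- nrow(sam1)
  n2 <- nrow(sam2)
  p <- ncol(sam1)
  stopifnot(ncol(sam2) == p)

  # Squared mean difference for each of the p components
  diff2 <- (colMeans(sam1) - colMeans(sam2))^2
  # Estimated variance of each mean difference
  scale <- apply(sam1, 2, var) / n1 + apply(sam2, 2, var) / n2
  ratio <- diff2 / scale

  # Threshold level lambda_p(s) = 2 * s * log(p)
  lambda <- 2 * s * log(p)

  # Keep only the components whose standardized difference exceeds lambda
  sum((ratio - 1) * (ratio > lambda))
}

# Toy data generated under the null hypothesis (equal means)
set.seed(1)
x <- matrix(rnorm(20 * 50), nrow = 20)
y <- matrix(rnorm(25 * 50), nrow = 25)

# Evaluate the statistic on a small, arbitrary grid of threshold levels
s_grid <- c(0.1, 0.3, 0.5)
t_vals <- sapply(s_grid, function(s) t_clz_s(x, y, s))
print(t_vals)
```

Larger values of s keep fewer components in the sum, so a very large s yields a statistic of 0 because no component passes the threshold.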

References

Chen SX, Li J, and Zhong PS (2014). "Two-Sample Tests for High Dimensional Means with Thresholding and Data Transformation." arXiv preprint arXiv:1410.2848.

Examples


library(MASS)
set.seed(1234)
n1 <- n2 <- 50
p <- 200
mu1 <- rep(0, p)
mu2 <- mu1
mu2[1:10] <- 0.2
true.cov <- 0.4^(abs(outer(1:p, 1:p, "-"))) # AR1 covariance
sam1 <- mvrnorm(n = n1, mu = mu1, Sigma = true.cov)
sam2 <- mvrnorm(n = n2, mu = mu2, Sigma = true.cov)
apval_Chen2014(sam1, sam2)

# the two sample populations have different covariances
true.cov1 <- 0.2^(abs(outer(1:p, 1:p, "-")))
true.cov2 <- 0.6^(abs(outer(1:p, 1:p, "-")))
sam1 <- mvrnorm(n = n1, mu = mu1, Sigma = true.cov1)
sam2 <- mvrnorm(n = n2, mu = mu2, Sigma = true.cov2)
apval_Chen2014(sam1, sam2, eq.cov = FALSE)
