qn.test: Rank Score k-Sample Tests

Description

This function uses the $QN$ criterion (Kruskal-Wallis, van der Waerden scores, normal scores) to test the hypothesis that $k$ independent samples arise from a common unspecified distribution.

Usage

qn.test(..., data = NULL, test = c("KW", "vdW", "NS"),
	method = c("asymptotic", "simulated", "exact"),
	dist = FALSE, Nsim = 10000)

Arguments

...

Either several sample vectors, say $x_1, \ldots, x_k$, with $x_i$ containing $n_i$ sample values. $n_i > 4$ is recommended for reasonable asymptotic $P$-value calculation. The pooled sample size is denoted by $N=n_1+\ldots+n_k$,

or a list of such sample vectors,

or a formula y ~ g, where y contains the pooled sample values and g (same length as y) is a factor with levels identifying the samples to which the elements of y belong.

data

an optional data frame providing the variables in formula y ~ g.

test

= c("KW", "vdW", "NS"), where

"KW" uses scores 1:N (Kruskal-Wallis test)

"vdW" uses van der Waerden scores, qnorm( (1:N) / (N+1) )

"NS" uses normal scores, i.e., expected standard normal order statistics, invoking function normOrder of package SuppDists
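As a rough illustration of the first two score types, the following base R sketch builds them for a pooled sample of size N = 15 (an arbitrary value chosen here). The variable names are illustrative only. The "NS" scores are left out because they rely on normOrder from SuppDists.

```r
# Pooled sample size (illustrative value, not taken from the package)
N <- 15

# "KW": the scores are simply the ranks 1, ..., N
kw_scores <- 1:N

# "vdW": standard normal quantiles at the levels i / (N + 1)
vdw_scores <- qnorm((1:N) / (N + 1))

# The van der Waerden scores are increasing and symmetric about zero
round(vdw_scores, 3)
```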

method

= c("asymptotic","simulated","exact"), where

"asymptotic" uses only an asymptotic chi-square approximation with k-1 degrees of freedom to approximate the $P$-value. This calculation is always done.

"simulated" uses Nsim simulated $QN$ statistics based on random splits of the pooled samples into samples of sizes $n_1, \ldots, n_k$, to estimate the $P$-value.

"exact" uses full enumeration of all sample splits with resulting $QN$ statistics to obtain the exact $P$-value. It is used only when Nsim is at least as large as the number $$ncomb = \frac{N!}{n_1!\ldots n_k!}$$ of full enumerations. Otherwise, method reverts to "simulated" using the given Nsim. It also reverts to "simulated" when $ncomb > 1e8$ and dist = TRUE.
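The enumeration count can be checked before requesting method = "exact". The sketch below evaluates $ncomb$ for three samples of size 5 and compares it with the default Nsim. It mirrors the rule described above and is not the package's internal code; ns, ncomb and use_exact are names chosen for this illustration.

```r
# Sample sizes n_1, ..., n_k (illustrative) and the default Nsim
ns <- c(5, 5, 5)
Nsim <- 10000

# N! / (n_1! ... n_k!) written as a product of binomial coefficients,
# which avoids forming large factorials directly
ncomb <- prod(choose(cumsum(ns), ns))
ncomb

# Full enumeration is only attempted when Nsim >= ncomb
use_exact <- Nsim >= ncomb
use_exact
```

Here ncomb is 756756, which exceeds the default Nsim = 10000, so a call with method = "exact" would revert to "simulated" unless Nsim is raised to at least that value.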

dist

FALSE (default) or TRUE. If TRUE, the simulated or fully enumerated null distribution vector null.dist is returned for the $QN$ test statistic. Otherwise, NULL is returned. When dist = TRUE then Nsim <- min(Nsim, 1e8), to limit object size.

Nsim

= 10000 (default), number of simulation sample splits to use. It is only used when method = "simulated", or when method = "exact" reverts to method = "simulated", as previously explained.

Value

A list of class kSamples with components

test.name

"Kruskal-Wallis", "van der Waerden scores", or "normal scores"

k

number of samples being compared

ns

vector $(n_1,\ldots,n_k)$ of the $k$ sample sizes

N

size of the pooled samples $= n_1+\ldots+n_k$

n.ties

number of ties in the pooled sample

qn

2 (or 3) vector containing the observed $QN$, its asymptotic $P$-value, (its simulated or exact $P$-value)

warning

logical indicator, warning = TRUE when at least one $n_i < 5$

null.dist

simulated or enumerated null distribution of the test statistic. It is NULL when dist = FALSE or when method = "asymptotic".

method

the method used.

Nsim

the number of simulations used.

warning

method = "exact" should be used with caution. Computation time is proportional to the number of enumerations. Experiment with system.time and trial values for Nsim to get a sense of the required computing time. dist = TRUE should be avoided in most cases, in particular when the returned distribution object would become too large for R's workspace.
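One possible way to run such a timing experiment is sketched below. It assumes the kSamples package is installed and skips the timing otherwise. The trial values in nsim_try and the data are illustrative only.

```r
# Trial values for Nsim (illustrative)
nsim_try <- c(100, 1000)
timings <- NULL

# Only run when kSamples is available
if (requireNamespace("kSamples", quietly = TRUE)) {
  u1 <- c(1.0066, -0.9587, 0.3462, -0.2653, -1.3872)
  u2 <- c(0.1005, 0.2252, 0.4810, 0.6992, 1.9289)
  u3 <- c(-0.7019, -0.4083, -0.9936, -0.5439, -0.3921)

  # Elapsed seconds for each trial value of Nsim
  timings <- sapply(nsim_try, function(nsim) {
    system.time(
      kSamples::qn.test(u1, u2, u3, test = "KW", method = "simulated",
                        dist = FALSE, Nsim = nsim)
    )[["elapsed"]]
  })
  names(timings) <- nsim_try
}
timings
```

Extrapolating from a few such timings gives a rough idea of the cost of a larger Nsim or of a full enumeration.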

Details

The $QN$ criterion based on rank scores $v_1,\ldots,v_N$ is $$QN=\frac{1}{s_v^2}\left(\sum_{i=1}^k \frac{(S_{iN}-n_i \bar{v}_{N})^2}{n_i}\right)$$ where $S_{iN}$ is the sum of rank scores for the $i$-th sample and $\bar{v}_N$ and $s_v^2$ are sample mean and sample variance (denominator $N-1$) of all scores.
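To make the formula concrete, the following sketch evaluates $QN$ by hand with Kruskal-Wallis scores, using base R only. With these scores $QN$ should agree with the statistic reported by stats::kruskal.test, which gives an independent check. The variable names are illustrative and this is not the package's implementation.

```r
u1 <- c(1.0066, -0.9587, 0.3462, -0.2653, -1.3872)
u2 <- c(0.1005, 0.2252, 0.4810, 0.6992, 1.9289)
u3 <- c(-0.7019, -0.4083, -0.9936, -0.5439, -0.3921)

x <- c(u1, u2, u3)          # pooled sample
g <- rep(1:3, each = 5)     # sample membership

v <- rank(x)                # KW scores; rank() averages tied ranks
S <- tapply(v, g, sum)      # S_iN: score sum per sample
n <- tapply(v, g, length)   # n_i: sample sizes

# QN = (1 / s_v^2) * sum_i (S_iN - n_i * vbar_N)^2 / n_i,
# where var() uses the denominator N - 1
QN <- sum((S - n * mean(v))^2 / n) / var(v)

# Asymptotic P-value: chi-square with k - 1 degrees of freedom
p_asym <- pchisq(QN, df = length(n) - 1, lower.tail = FALSE)

# Cross-check against the Kruskal-Wallis statistic from stats
kw <- unname(kruskal.test(list(u1, u2, u3))$statistic)
c(QN = QN, kruskal = kw, p_asym = p_asym)
```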

The statistic $QN$ is used to test the hypothesis that the samples all come from the same but unspecified continuous distribution function $F(x)$. $QN$ is always adjusted for ties by averaging the scores of tied observations.
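One plausible reading of this tie adjustment, shown here for van der Waerden scores, is to assign the scores by sorted position and then replace the scores of tied observations by their average. The sketch below is an assumption about the mechanics, not code taken from the package, and the data are made up.

```r
# Small pooled sample with one tie (illustrative data)
x <- c(2, 1, 2, 3)
N <- length(x)

# Untied van der Waerden scores for positions 1, ..., N
base_scores <- qnorm((1:N) / (N + 1))

# Position of each observation in the sorted pooled sample,
# breaking ties by order of appearance
pos <- rank(x, ties.method = "first")

# Average the scores within each group of tied observations
v <- ave(base_scores[pos], x)
v
```

The two observations equal to 2 occupy positions 2 and 3, so both receive the mean of the second and third scores. The total of the scores is unchanged by the averaging.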

Conditions for the asymptotic approximation (chi-square with $k-1$ degrees of freedom) can be found in Lehmann, E.L. (2006), Appendix Corollary 10, or in Hajek, Sidak, and Sen (1999), Ch. 6, problems 13 and 14.

For small sample sizes exact null distribution calculations are possible (with or without ties), based on a recursively extended version of Algorithm C (Chase's sequence) in Knuth (2011), which allows the enumeration of all possible splits of the pooled data into samples of sizes $n_1, \ldots, n_k$, as appropriate under treatment randomization. This is done in C, as is the simulation.
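The simulation idea can be mimicked in plain R, at much lower speed than the C code. The sketch below draws random splits by permuting the pooled scores and estimates the $P$-value as the proportion of simulated $QN$ values at least as large as the observed one. That estimator, the helper qn_stat, and the small tolerance are assumptions made for this illustration.

```r
# Hypothetical helper: QN for scores v and sample labels g
qn_stat <- function(v, g) {
  S <- tapply(v, g, sum)
  n <- tapply(v, g, length)
  sum((S - n * mean(v))^2 / n) / var(v)
}

x <- c(1.0066, -0.9587, 0.3462, -0.2653, -1.3872,
       0.1005, 0.2252, 0.4810, 0.6992, 1.9289,
       -0.7019, -0.4083, -0.9936, -0.5439, -0.3921)
g <- rep(1:3, each = 5)

v <- rank(x)              # Kruskal-Wallis scores
obs <- qn_stat(v, g)      # observed QN

set.seed(1)
Nsim <- 1000
# Each permutation of the scores is one random split into
# samples of sizes n_1, ..., n_k
sim <- replicate(Nsim, qn_stat(sample(v), g))

# Estimated P-value; the tolerance guards against rounding error
p_sim <- mean(sim >= obs - 1e-8)
p_sim
```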

NA values are removed and the user is alerted with the total NA count. It is up to the user to judge whether the removal of NA's is appropriate.

The continuity assumption can be dispensed with, if we deal with independent random samples from any common distribution, or if randomization was used in allocating subjects to samples or treatments, and if the asymptotic, simulated or exact $P$-values are viewed conditionally, given the tie pattern in the pooled sample. Under such randomization any conclusions are valid only with respect to the subjects that were randomly allocated to their respective treatment samples.

References

Hajek, J., Sidak, Z., and Sen, P.K. (1999), Theory of Rank Tests (Second Edition), Academic Press.

Knuth, D.E. (2011), The Art of Computer Programming, Volume 4A: Combinatorial Algorithms, Part 1, Addison-Wesley.

Kruskal, W.H. (1952), A Nonparametric Test for the Several Sample Problem, The Annals of Mathematical Statistics, Vol 23, No. 4, 525-540.

Kruskal, W.H. and Wallis, W.A. (1952), Use of Ranks in One-Criterion Variance Analysis, Journal of the American Statistical Association, Vol 47, No. 260, 583-621.

Lehmann, E.L. (2006), Nonparametrics, Statistical Methods Based on Ranks, Revised First Edition, Springer Verlag.

Examples

# NOT RUN {
u1 <- c(1.0066, -0.9587,  0.3462, -0.2653, -1.3872)
u2 <- c(0.1005, 0.2252, 0.4810, 0.6992, 1.9289)
u3 <- c(-0.7019, -0.4083, -0.9936, -0.5439, -0.3921)
yy <- c(u1, u2, u3)
gy <- as.factor(c(rep(1,5), rep(2,5), rep(3,5)))
set.seed(2627)
qn.test(u1, u2, u3, test="KW", method = "simulated", 
  dist = FALSE, Nsim = 1000)
# or with same seed
# qn.test(list(u1, u2, u3),test = "KW", method = "simulated", 
#  dist = FALSE, Nsim = 1000)
# or with same seed
# qn.test(yy ~ gy, test = "KW", method = "simulated", 
#  dist = FALSE, Nsim = 1000)
# }