This function uses the Anderson-Darling criterion to test the hypothesis that \(k\) independent samples with sample sizes \(n_1,\ldots, n_k\) arose from a common unspecified distribution function \(F(x)\). Testing is done conditionally given the observed tie pattern, so this is a permutation test. Both versions of the \(AD\) statistic are computed.
ad.test(..., data = NULL, method = c("asymptotic", "simulated", "exact"),
        dist = FALSE, Nsim = 10000)
Either several sample vectors, say \(x_1, \ldots, x_k\), with \(x_i\) containing \(n_i\) sample values (\(n_i > 4\) is recommended for reasonable asymptotic \(P\)-value calculation; the pooled sample size is denoted by \(N=n_1+\ldots+n_k\)),
or a list of such sample vectors,
or a formula y ~ g, where y contains the pooled sample values and g is a factor (of same length as y) with levels identifying the samples to which the elements of y belong.
data = an optional data frame providing the variables in formula y ~ g.
method = c("asymptotic","simulated","exact"), where
"asymptotic" uses only an asymptotic \(P\)-value approximation, reasonable for \(P\) in [.00001, .99999] when all \(n_i > 4\). Linear extrapolation via \(\log(P/(1-P))\) is used outside [.00001, .99999]. This calculation is always done. See ad.pval for details. The adequacy of the asymptotic \(P\)-value calculation may be checked using pp.kSamples.
"simulated" uses Nsim simulated \(AD\) statistics, based on random splits of the pooled samples into samples of sizes \(n_1, \ldots, n_k\), to estimate the exact conditional \(P\)-value.
"exact" uses full enumeration of all sample splits with resulting \(AD\) statistics to obtain the exact conditional \(P\)-values. It is used only when Nsim is at least as large as the number
$$ncomb = \frac{N!}{n_1!\ldots n_k!}$$
of full enumerations. Otherwise, method reverts to "simulated" using the given Nsim. It also reverts to "simulated" when \(ncomb > 1e8\) and dist = TRUE.
dist = FALSE (default) or TRUE. If TRUE, the simulated or fully enumerated distribution vectors null.dist1 and null.dist2 are returned for the respective test statistic versions. Otherwise, NULL is returned. When dist = TRUE then Nsim <- min(Nsim, 1e8), to limit object size.
Nsim = 10000 (default), number of simulation sample splits to use. It is only used when method = "simulated", or when method = "exact" reverts to method = "simulated", as previously explained.
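The fallback rule for method = "exact" can be checked before calling the function. The following is a minimal base-R sketch, not code from the package: the sample sizes are made up for illustration, and `choose_method` is a hypothetical helper that simply restates the rule described above.

```r
# Hypothetical sample sizes: three samples of five values each.
ns <- c(5, 5, 5)
N <- sum(ns)

# ncomb = N! / (n_1! ... n_k!), computed on the log scale to avoid overflow.
# For large N the rounded value is only approximate, which is enough for
# comparing against Nsim.
ncomb <- round(exp(lfactorial(N) - sum(lfactorial(ns))))
ncomb

# Hypothetical helper restating the documented rule (not package code):
# "exact" is kept only if Nsim >= ncomb, and never when dist = TRUE
# and ncomb > 1e8.
choose_method <- function(ncomb, Nsim, dist = FALSE) {
  if (Nsim < ncomb) return("simulated")
  if (dist && ncomb > 1e8) return("simulated")
  "exact"
}

choose_method(ncomb, Nsim = 1000)  # Nsim is far below ncomb
choose_method(ncomb, Nsim = 1e6)   # Nsim covers all enumerations
```

With these sizes, a request for method = "exact" with Nsim = 1000 would fall back to simulation, since the number of splits is much larger than Nsim.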
A list of class kSamples with components
"Anderson-Darling"
number of samples being compared
vector of the \(k\) sample sizes \((n_1,\ldots,n_k)\)
size of the pooled sample \(= n_1+\ldots+n_k\)
number of ties in the pooled samples
standard deviations \(\sigma\) of version 1 of \(AD\) under the continuity assumption
2 x 3 (2 x 4) matrix containing \(AD, T.AD\), asymptotic \(P\)-value, (simulated or exact \(P\)-value), for each version of the standardized test statistic \(T\), version 1 in row 1, version 2 in row 2.
logical indicator, warning = TRUE when at least one \(n_i < 5\)
simulated or enumerated null distribution of version 1 of the test statistic, given as vector of all generated \(AD\) statistics.
simulated or enumerated null distribution of version 2 of the test statistic, given as vector of all generated \(AD\) statistics.
The method used.
The number of simulations.
method = "exact" should only be used with caution. Computation time is proportional to the number of enumerations. In most cases dist = TRUE should not be used, in particular when the returned distribution vectors null.dist1 and null.dist2 would become too large for the R workspace. These vectors are limited in length to 1e8.
If \(AD\) is the Anderson-Darling criterion for the \(k\) samples, its standardized test statistic is \(T.AD = (AD - \mu)/\sigma\), with \(\mu = k-1\) and \(\sigma\) representing mean and standard deviation of \(AD\). This statistic is used to test the hypothesis that the samples all come from the same but unspecified continuous distribution function \(F(x)\).
According to the reference article, two versions of the \(AD\) test statistic are provided. The above mean and standard deviation are strictly valid only for version 1 in the continuous distribution case.
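For orientation, the no-ties form of the criterion can be written out explicitly. The expression below is recalled from Scholz and Stephens (1987) rather than taken from this page, so treat it as a reminder and not as the definition the function implements; the two versions computed here additionally adjust for ties. With \(Z_1 < \ldots < Z_N\) the ordered pooled sample and \(M_{ij}\) the number of observations in the \(i\)-th sample that are not greater than \(Z_j\),

$$AD = \frac{1}{N}\sum_{i=1}^{k}\frac{1}{n_i}\sum_{j=1}^{N-1}\frac{(N M_{ij} - j\,n_i)^2}{j\,(N-j)}$$

and the standardized statistic is then \(T.AD = (AD - (k-1))/\sigma\) as stated above.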
NA values are removed and the user is alerted with the total NA count. It is up to the user to judge whether the removal of NAs is appropriate.
The continuity assumption can be dispensed with, if we deal with
independent random samples, or if randomization was used in allocating
subjects to samples or treatments, and if we view
the simulated or exact \(P\)-values conditionally, given the tie pattern
in the pooled samples. Of course, under such randomization any conclusions
are valid only with respect to the group of subjects that were randomly allocated
to their respective samples.
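The conditional (permutation) view can be illustrated with a small base-R sketch. This is not the package's implementation. `ad_stat` is a hypothetical helper that evaluates the no-ties form of the criterion as given in Scholz and Stephens (1987), and estimating the \(P\)-value as the proportion of permuted statistics at least as large as the observed one is an assumption about how such an estimate is usually formed. The standardization to \(T.AD\) is omitted because the formula for \(\sigma\) is not given here.

```r
# Hypothetical helper: k-sample AD criterion for pooled data WITHOUT ties.
# y: pooled sample values, g: group labels of the same length.
ad_stat <- function(y, g) {
  g <- as.factor(g)
  N <- length(y)
  z <- sort(y)[-N]               # Z_1 < ... < Z_{N-1}
  j <- seq_len(N - 1)
  total <- 0
  for (lev in levels(g)) {
    xi <- y[g == lev]
    ni <- length(xi)
    # M_ij: number of values in sample i that are <= Z_j
    Mij <- colSums(outer(xi, z, "<="))
    total <- total + sum((N * Mij - j * ni)^2 / (j * (N - j))) / ni
  }
  total / N
}

# Illustrative data: three samples of size 5 with no ties.
u1 <- c(1.0066, -0.9587, 0.3462, -0.2653, -1.3872)
u2 <- c(0.1005, 0.2252, 0.4810, 0.6992, 1.9289)
u3 <- c(-0.7019, -0.4083, -0.9936, -0.5439, -0.3921)
y <- c(u1, u2, u3)
g <- rep(1:3, each = 5)

set.seed(2627)
Nsim <- 1000
obs <- ad_stat(y, g)

# Randomly permuting the labels is one way to draw a random split of the
# pooled sample into samples of sizes n_1, ..., n_k.
sim <- replicate(Nsim, ad_stat(y, sample(g)))

# Estimated conditional P-value (assumed convention: proportion >= observed).
p_sim <- mean(sim >= obs)
p_sim
```

Because only the group labels are permuted, the pooled values and hence any tie pattern stay fixed across simulated splits, which is what makes the resulting \(P\)-value conditional.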
The asymptotic \(P\)-value calculation assumes distribution continuity. No adjustment
for lack thereof is known at this point. For details on the asymptotic
\(P\)-value calculation see ad.pval.
Knuth, D. E. (2011), The Art of Computer Programming, Volume 4A: Combinatorial Algorithms, Part 1, Addison-Wesley.
Scholz, F. W. and Stephens, M. A. (1987), K-sample Anderson-Darling Tests, Journal of the American Statistical Association, Vol. 82, No. 399, 918-924.
library(kSamples)

# Three samples of size 5 each
u1 <- c(1.0066, -0.9587, 0.3462, -0.2653, -1.3872)
u2 <- c(0.1005, 0.2252, 0.4810, 0.6992, 1.9289)
u3 <- c(-0.7019, -0.4083, -0.9936, -0.5439, -0.3921)
# Pooled values and grouping factor for the formula interface
y <- c(u1, u2, u3)
g <- as.factor(c(rep(1, 5), rep(2, 5), rep(3, 5)))
# Nsim = 1000 is smaller than ncomb = 15!/(5! 5! 5!) = 756756, so
# method = "exact" reverts to "simulated" here; hence the seed.
set.seed(2627)
ad.test(u1, u2, u3, method = "exact", dist = FALSE, Nsim = 1000)
# or with same seed
# ad.test(list(u1, u2, u3), method = "exact", dist = FALSE, Nsim = 1000)
# or with same seed
# ad.test(y ~ g, method = "exact", dist = FALSE, Nsim = 1000)