Given a random sample \(\{X_{1}, ..., X_{n}\}\) of size \(n\)
with an empirical cdf \(F_{n}(x)\), the two-sided Kolmogorov-Smirnov goodness-of-fit statistic is defined as \(D_{n} = \sup_{x} | F_{n}(x) - F(x) | \), where \(F(x)\) is the cdf of a prespecified theoretical distribution under the null hypothesis \(H_{0}\) that \(\{X_{1}, ..., X_{n}\}\) comes from \(F(x)\).
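For a concrete sample, \(D_{n}\) can be computed directly from this definition by evaluating \(F_{n}(x) - F(x)\) at the jump points of \(F_{n}\), where the supremum is attained. A minimal base-R sketch (the standard normal \(F(x)\) is chosen purely for illustration):

```r
# Illustration only: compute D_n from its definition for a sample
# tested against a standard normal F(x).
set.seed(1)
n  <- 10
x  <- sort(rnorm(n))
Fx <- pnorm(x)  # theoretical cdf evaluated at the order statistics
# F_n jumps at the order statistics, so the supremum is attained
# either at a jump or just before it:
D_n <- max(seq_len(n) / n - Fx, Fx - (seq_len(n) - 1) / n)
# This agrees with the statistic reported by stats::ks.test:
stopifnot(isTRUE(all.equal(D_n, unname(ks.test(x, "pnorm")$statistic))))
```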
The function cont_ks_c_cdf
implements the FFT-based algorithm proposed by Moscovich and Nadler (2017) to compute the complementary cdf, \(P(D_{n} \ge q)\), at a given value \(q\), when \(F(x)\) is continuous.
This algorithm has a worst-case run time of order \(O(n^{2}\log(n))\), which makes it more efficient and numerically stable than the algorithm proposed by Marsaglia et al. (2003).
The latter is used by many existing packages that compute the cdf of \(D_{n}\), e.g., the function ks.test
in the package stats and the function ks.test
in the package dgof.
More precisely, in these packages the exact p-value, \(P(D_{n} \ge q)\), is computed only when \(q = d_{n}\), where \(d_{n}\) is the value of the KS test statistic computed from a user-provided sample \( \{x_{1}, ..., x_{n} \} \).
Another limitation of the functions ks.test
is that the sample size must be less than 100 for the exact p-value to be computed, and the computation time is \(O(n^{3})\).
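To illustrate this restriction, the following sketch (sample and distribution are arbitrary choices for illustration) shows that stats::ks.test returns the exact probability \(P(D_{n} \ge q)\) only at the single point \(q = d_{n}\) determined by the supplied data:

```r
# Illustration: stats::ks.test reports the exact p-value only at
# q = d_n, the statistic of the supplied sample.
set.seed(1)
x   <- rnorm(20)
res <- ks.test(x, "pnorm", exact = TRUE)  # exact is the default only for n < 100
d_n <- unname(res$statistic)  # the only q at which this p-value applies
p   <- res$p.value            # exact P(D_n >= d_n) under H0
```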
In contrast, the function cont_ks_c_cdf
provides results with at least 10 correct digits after the decimal point for sample sizes \(n\) up to 100000, and a computation time of 16 seconds on a machine with a 2.5 GHz Intel Core i5 processor and 4 GB of RAM, running OS X Yosemite.
For \(n > 100000\), results of similar accuracy can still be computed, but at a higher computation time.
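A minimal usage sketch follows; the argument order cont_ks_c_cdf(q, n), with the quantile first, is an assumption here, and the values of \(q\) and \(n\) are arbitrary illustrations:

```r
# Hedged sketch: the call signature cont_ks_c_cdf(q, n) is assumed.
library(KSgeneral)
n <- 100000
q <- 0.005
p <- cont_ks_c_cdf(q, n)  # complementary cdf P(D_n >= q), continuous F(x)
```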
See Dimitrova, Kaishev, Tan (2020), Appendix C for further details and examples.