ks_c_cdf_Rcpp: R function calling directly the C++ routines that compute the complementary cumulative distribution function of the two-sided (or one-sided, as a special case) Kolmogorov-Smirnov statistic, when the cdf under the null hypothesis is arbitrary (i.e., purely discrete, mixed or continuous)

Description

Function calling directly the C++ routines that compute the complementary cdf for the one-sample two-sided Kolmogorov-Smirnov statistic, given the sample size n and the file "Boundary_Crossing_Time.txt" in the working directory. The latter file contains \(A_{i}\) and \(B_{i}\), \(i = 1, ..., n\), specified in Steps 1 and 2 of the Exact-KS-FFT method (see Equation (5) in Section 2 of Dimitrova, Kaishev, Tan (2020)). The latter values form the n-dimensional rectangular region for the uniform order statistics (see Equations (3), (5) and (6) in Dimitrova, Kaishev, Tan (2020)), namely \(P(D_{n}\ge q) = 1 - P(A_{i} \le U_{(i)} \le B_{i}, 1 \le i \le n) = 1 - P(g(t) \le nU_{n}(t) \le h(t), 0 \le t \le 1)\), where the upper and lower boundary functions \(h(t)\), \(g(t)\) are defined as \(h(t) = \sum_{i=1}^{n}1_{(A_{i} < t)}\), \(g(t) = \sum_{i=1}^{n}1_{(B_{i} \le t)}\), or equivalently, noting that \(h(t)\) and \(g(t)\) are correspondingly left and right continuous functions, we have \(\sup\{t\in[0,1]: h(t) < i \} = A_{i}\) and \(\inf\{t\in[0,1]: g(t) > i-1 \} = B_{i}\).

Note that on can also compute the (complementary) cdf for the one-sided KS statistics \(D_{n}^{-}\) or \(D_{n}^{+}\) (cf., Dimitrova, Kaishev, Tan (2020)) by appropriately specifying correspondingly \(A_{i} = 0\) for all \(i\) or \(B_{i} = 1\) for all \(i\), in the function ks_c_cdf_Rcpp.

Usage

ks_c_cdf_Rcpp(n)

Arguments

the sample size

Value

Numeric value corresponding to \(P(D_{n}\ge q) = 1 - P(A_{i} \le U_{(i)} \le B_{i}, 1 \le i \le n) = 1 - P(g(t) \le \eta_{n}(t) \le h(t), 0 \le t \le 1)\) (or, as a special case, to \(P(D_{n}^{+} \ge q)\) or \(P(D_{n}^{-} \ge q)\)), given a sample size n and the file "Boundary_Crossing_Time.txt" containing \(A_{i}\) and \(B_{i}\), \(i = 1, ..., n\), specified in Steps 1 and 2 of the Exact-KS-FFT method (see Dimitrova, Kaishev, Tan (2020), Section 2).

Details

Note that all calculations here are done directly in C++ and output in R. That leads to faster computation time, as well as in some cases, possibly higher accuracy (depending on the accuracy of the pre-computed values \(A_{i}\) and \(B_{i}\), \(i = 1, ..., n\), provided in the file "Boundary_Crossing_Time.txt") compared to the functions cont_ks_c_cdf, disc_ks_c_cdf, mixed_ks_c_cdf.

Given a random sample \(\{X_{1}, ..., X_{n}\}\) of size n with an empirical cdf \(F_{n}(x)\), the two-sided Kolmogorov-Smirnov goodness-of-fit statistic is defined as \(D_{n} = \sup | F_{n}(x) - F(x) | \), where \(F(x)\) is the cdf of a prespecified theoretical distribution under the null hypothesis \(H_{0}\), that \(\{X_{1}, ..., X_{n}\}\) comes from \(F(x)\). The one-sided KS test statistics are correspondingly defined as \(D_{n}^{-} = \sup_{x} (F(x) - F_{n}(x))\) and \(D_{n}^{+} = \sup_{x} (F_{n}(x) - F(x))\).

The function ks_c_cdf_Rcpp implements the Exact-KS-FFT method, proposed by Dimitrova, Kaishev, Tan (2020), to compute the complementary cdf, \(P(D_{n} \ge q)\) at a value \(q\), when \(F(x)\) is arbitrary (i.e. purely discrete, mixed or continuous). It is based on expressing the complementary cdf as \(P(D_{n} \ge q) = 1 - P(A_{i} \le U_{(i)} \le B_{i}, 1 \le i \le n)\), where \(A_{i}\) and \(B_{i}\) are defined as in Step 1 of Dimitrova, Kaishev, Tan (2020).

The complementary cdf is then re-expressed in terms of the conditional probability that a homogeneous Poisson process, \(\xi_{n}(t)\) with intensity \(n\) will not cross an upper boundary \(h(t)\) and a lower boundary \(g(t)\), given that \(\xi_{n}(1) = n\) (see Steps 2 and 3 in Section 2.1 of Dimitrova, Kaishev, Tan (2020)). This conditional probability is evaluated using FFT in Step 4 of the method in order to obtain the value of the complementary cdf \(P(D_{n} \ge q)\). This algorithm ensures a total worst-case run-time of order \(O(n^{2}log(n))\) which makes it highly computationally efficient compared to other known algorithms developed for the special cases of continuous or purely discrete \(F(x)\).

The values \(A_{i}\) and \(B_{i}\), \(i = 1, ..., n\), specified in Steps 1 and 2 of the Exact-KS-FFT method (see Dimitrova, Kaishev, Tan (2020), Section 2) must be pre-computed (in R or, if needed, using alternative softwares offering high accuracy, e.g. Mathematica) and saved in a file with the name "Boundary_Crossing_Time.txt" (in the current working directory).

The function ks_c_cdf_Rcpp is called in R and it first reads the file "Boundary_Crossing_Time.txt" and then computes the value for the complementaty cdf \(P(D_{n}\ge q) = 1 - P(A_{i} \le U_{(i)} \le B_{i}, 1 \le i \le n) = 1 - P(g(t) \le nU_{n}(t) \le h(t), 0 \le t \le 1)\) in C++ and output in R (or as noted above, as a special case, computes the value of the complementary cdf \(P(D_{n}^{+} \ge q) = 1 - P(A_{i} \le U_{(i)} \le 1, 1 \le i \le n)\) or \(P(D_{n}^{-} \ge q) = 1 - P(0 \le U_{(i)} \le B_{i}, 1 \le i \le n)\)).

References

Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.

Moscovich A., Nadler B. (2017). "Fast Calculation of Boundary Crossing Probabilities for Poisson Processes". Statistics and Probability Letters, 123, 177-182.

Examples

Run this code

# NOT RUN {
## Computing the complementary cdf P(D_{n} >= q)
## for n = 10 and q = 0.1, when F(x) is continuous,
## In this case,
## B_i = (i-1)/n + q
## A_i =  i/n - q


n <- 10
q <- 0.1
up_rec <- ((1:n)-1)/n + q
low_rec <- (1:n)/n - q
df <- data.frame(rbind(up_rec, low_rec))
write.table(df,"Boundary_Crossing_Time.txt", sep = ", ",
                  row.names = FALSE, col.names = FALSE)
ks_c_cdf_Rcpp(n)

# }

Run the code above in your browser using DataLab