indeptest: Robust independence test for two continuous variables of Kolmogorov-Smirnov's type

Description

Test the independence between two continuous variables based on the maximum distance between the joint empirical cumulative distribution function and the product of the marginal empirical cumulative distribution functions.

Usage

indeptest(
  x,
  y,
  N = 50000,
  simu = FALSE,
  ties.break = "none",
  nb_tiebreak = 100
)

Value

Returns the result of the test with its corresponding p-value and the value of the test statistic.

Arguments

x, y: the two continuous variables. Must be of same length.
N: the number of Monte-Carlo replications if simu=TRUE.
simu: if TRUE, a Monte-Carlo simulation with N replications is used to determine the distribution of the test statistic under the null hypothesis. If FALSE, precomputed tables are used (see Details for more information).
ties.break: the method used to break ties in the x or y vectors. Can be "none", "random" or "rep_random".
nb_tiebreak: the number of repetitions used to break the ties when ties.break="rep_random".
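
The help page does not say how ties are broken internally. As an illustration only, the sketch below shows one common way to break ties at random: tied values are jittered by less than a quarter of the smallest gap between distinct values, so the ordering of distinct values is preserved. The helper name `break_ties` is hypothetical and is not part of the package.

```r
# Hypothetical helper (an assumption, not the package's implementation):
# randomly perturb tied values without changing the order of distinct values.
break_ties <- function(v) {
  u <- sort(unique(v))
  if (length(u) == length(v)) return(v)            # no ties, nothing to do
  gap <- if (length(u) > 1) min(diff(u)) else 1     # smallest gap between distinct values
  tied <- v %in% v[duplicated(v)]                   # flag every member of a tied group
  v[tied] <- v[tied] + runif(sum(tied), -gap / 4, gap / 4)
  v
}

set.seed(1)
v <- c(1, 1, 2, 3, 3, 3)
b <- break_ties(v)
anyDuplicated(b)   # 0 when all ties have been broken
```

Under this reading, "rep_random" would repeat such a perturbation nb_tiebreak times and average the resulting test statistics and p-values.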

Author

See J. R. Blum, J. Kiefer and M. Rosenblatt (1961), Distribution Free Tests of Independence Based on the Sample Distribution Function.

Details

For two continuous variables X and Y, indeptest tests H0: X and Y are independent against H1: X and Y are not independent.

For observations $(x_1,y_1), \ldots, (x_n,y_n)$, the bivariate e.c.d.f. (empirical cumulative distribution function) $F_n$ is defined as: $$F_n(t_1,t_2) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{x_i \le t_1,\ y_i \le t_2\}.$$

Let $F_n(t_1)$ and $F_n(t_2)$ be the marginal e.c.d.f.s. The test statistic is defined as: $$\sqrt{n}\,\sup_{t_1,t_2} \left|F_n(t_1,t_2)-F_n(t_1)F_n(t_2)\right|.$$
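
As a minimal sketch of the definition above (not the package's own code), the statistic can be computed by brute force in base R. Because all three e.c.d.f.s are step functions that only jump at observed values, the supremum is attained on the grid of observed x and y values. The helper name `ks_indep_stat` is hypothetical.

```r
# Hypothetical helper (an assumption, not part of the package):
# brute-force evaluation of sqrt(n) * sup |Fn(t1,t2) - Fn(t1) * Fn(t2)|.
ks_indep_stat <- function(x, y) {
  stopifnot(length(x) == length(y))
  n <- length(x)
  ix <- outer(x, x, "<=") + 0      # ix[i, k] = 1 if x[i] <= x[k]
  iy <- outer(y, y, "<=") + 0      # iy[i, l] = 1 if y[i] <= y[l]
  joint <- crossprod(ix, iy) / n   # joint e.c.d.f. at every grid point (x[k], y[l])
  fx <- colMeans(ix)               # marginal e.c.d.f. of x at x[k]
  fy <- colMeans(iy)               # marginal e.c.d.f. of y at y[l]
  sqrt(n) * max(abs(joint - outer(fx, fy)))
}

ks_indep_stat(1:4, 1:4)   # perfectly dependent toy data, gives 0.5
```

This uses O(n^2) memory, which is fine for illustration but not for large samples.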

Under H0 the test statistic is distribution free: it has the same distribution as the statistic computed for two independent continuous uniform variables on $[0,1]$, where the supremum is taken over $t_1,t_2$ in $[0,1]$. Using this result, the distribution of the test statistic is obtained by Monte-Carlo simulation. The user can either set simu=TRUE to run the Monte-Carlo simulation (with N the number of replications) or use the available tables by choosing simu=FALSE.

With simu=FALSE, the exact distribution is estimated for n=1, ..., 150. For $151 \le n \le 175$, the distribution with n=150 is used. For $176 \le n \le 250$, the distribution with n=200 is used. For $251 \le n \le 400$, the distribution with n=300 is used. For $401 \le n \le 750$, the distribution with n=500 is used. For $n \ge 751$, the distribution with n=1000 is used. Those tables were computed using $2\times 10^5$ Monte-Carlo replications.
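
The two routes described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: `ks_indep_stat`, `mc_pvalue` and `table_n` are hypothetical names, the p-value is taken as the plain proportion of simulated statistics at least as large as the observed one, and N is kept small so the sketch runs quickly.

```r
# Hypothetical helper: statistic from its definition, evaluated on the sample grid.
ks_indep_stat <- function(x, y) {
  n <- length(x)
  ix <- outer(x, x, "<=") + 0
  iy <- outer(y, y, "<=") + 0
  joint <- crossprod(ix, iy) / n
  sqrt(n) * max(abs(joint - outer(colMeans(ix), colMeans(iy))))
}

# Monte-Carlo route: simulate the statistic for independent uniforms
# of the same sample size and compare with the observed value.
mc_pvalue <- function(x, y, N = 2000) {
  n <- length(x)
  observed <- ks_indep_stat(x, y)
  null_stats <- replicate(N, ks_indep_stat(runif(n), runif(n)))
  mean(null_stats >= observed)
}

# Table route: sample size whose tabulated distribution is used for a given n,
# following the ranges listed in the text.
table_n <- function(n) {
  if (n <= 150) n
  else if (n <= 175) 150
  else if (n <= 250) 200
  else if (n <= 400) 300
  else if (n <= 750) 500
  else 1000
}

set.seed(42)
x <- rnorm(40)
y <- x^2 + 0.3 * rnorm(40)
mc_pvalue(x, y)   # Monte-Carlo p-value for a non-linear dependence
table_n(160)      # 150
```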

Examples


#Simulated data 1
x<-c(0.2, 0.3, 0.1, 0.4)
y<-c(0.5, 0.4, 0.05, 0.2)
indeptest(x,y)

#Simulated data 2
n<-40 #sample size
x<-rnorm(n)
y<-x^2+0.3*rnorm(n)
plot(x,y)
indeptest(x,y)

#Application on the Evans dataset
#Description of this dataset is available in the lbreg package
data(Evans)
with(Evans,plot(CHL[CDH==1],DBP[CDH==1]))
with(Evans,cor.test(CHL[CDH==1],DBP[CDH==1])) #the standard Pearson test
with(Evans,cortest(CHL[CDH==1],DBP[CDH==1])) #the robust Pearson test
with(Evans,indeptest(CHL[CDH==1],DBP[CDH==1])) #the robust independence test
#The robust tests give very different p-values from the standard Pearson test!

#Breaking the ties
#The ties are broken once
with(Evans,indeptest(CHL[CDH==1],DBP[CDH==1],ties.break="random"))
#The ties are broken repeatedly and the averages of the test statistics and p-values
#are computed
with(Evans,indeptest(CHL[CDH==1],DBP[CDH==1],ties.break="rep_random",nb_tiebreak=100))
