distance correlation: Distance Correlation and Covariance Statistics

Description

Computes distance covariance and distance correlation statistics, which are multivariate measures of dependence.

Usage

dcov(x, y, index = 1.0)
dcor(x, y, index = 1.0)

Value

dcov returns the sample distance covariance and dcor returns the sample distance correlation.

Arguments

x: data or distances of first sample
y: data or distances of second sample
index: exponent on Euclidean distance, in (0,2]

Author

Maria L. Rizzo mrizzo@bgsu.edu and Gabor J. Szekely

Details

dcov and dcor compute distance covariance and distance correlation statistics.

The sample sizes (number of rows) of the two samples must agree, and samples must not contain missing values.

The index is an optional exponent on Euclidean distance. Valid exponents for energy are in the open interval (0, 2), which excludes 2.
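As a rough illustration of what an exponent on Euclidean distance means, the base R sketch below raises each pairwise distance to the power `index`. The variable names `a` and `a_index` are chosen here for illustration only, and the sketch does not call the energy package itself.

```r
# Illustrative only: pairwise Euclidean distances raised to an exponent.
x <- iris[1:50, 1:4]
index <- 0.5                   # an example value inside (0, 2)

a <- as.matrix(dist(x))        # Euclidean distances |X_k - X_l|
a_index <- a^index             # each distance raised to the power `index`

a_index[1:3, 1:3]              # top-left corner of the transformed matrix
```

With `index = 1` the transformed matrix equals the plain Euclidean distance matrix.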

Supported argument types are: a numeric data matrix, data.frame, or tibble, with observations in rows; a numeric vector; and ordered or unordered factors. For unordered factors, a 0-1 distance matrix is computed.
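To make the 0-1 distance for unordered factors concrete, here is a minimal base R sketch, assuming the distance is 0 when two observations share a level and 1 otherwise. The names `f`, `codes`, and `d01` are hypothetical, and this is not the package's internal code.

```r
# Illustrative only: a 0-1 distance matrix for an unordered factor.
f <- factor(c("a", "b", "a", "c"))
codes <- as.integer(f)                      # integer level codes

# Entry (k, l) is 1 when observations k and l have different levels.
d01 <- outer(codes, codes, FUN = "!=") * 1

d01
```

Observations 1 and 3 share level "a", so their distance is 0; every other off-diagonal entry is 1.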

Optionally, pre-computed distances can be supplied as class "dist" objects or as distance matrices. When the arguments are data rather than distances, the distance matrices are computed internally.

Distance correlation is a new measure of dependence between random vectors introduced by Szekely, Rizzo, and Bakirov (2007). For all distributions with finite first moments, distance correlation $\mathcal R$ generalizes the idea of correlation in two fundamental ways: (1) $\mathcal R(X,Y)$ is defined for $X$ and $Y$ in arbitrary dimension. (2) $\mathcal R(X,Y)=0$ characterizes independence of $X$ and $Y$.

Distance correlation satisfies $0 \le \mathcal R \le 1$, and $\mathcal R = 0$ only if $X$ and $Y$ are independent. Distance covariance $\mathcal V$ provides a new approach to the problem of testing the joint independence of random vectors. The formal definitions of the population coefficients $\mathcal V$ and $\mathcal R$ are given in Szekely, Rizzo, and Bakirov (2007). The definitions of the empirical coefficients are as follows.

The empirical distance covariance $\mathcal{V}_n(\mathbf{X,Y})$ with index 1 is the nonnegative number defined by
$$ \mathcal{V}^2_n (\mathbf{X,Y}) = \frac{1}{n^2} \sum_{k,\,l=1}^n A_{kl}B_{kl} $$
where $A_{kl}$ and $B_{kl}$ are
$$ A_{kl} = a_{kl}-\bar a_{k.}- \bar a_{.l} + \bar a_{..} $$
$$ B_{kl} = b_{kl}-\bar b_{k.}- \bar b_{.l} + \bar b_{..}. $$
Here
$$ a_{kl} = \|X_k - X_l\|_p, \quad b_{kl} = \|Y_k - Y_l\|_q, \quad k,l=1,\dots,n, $$
and the subscript . denotes that the mean is computed for the index that it replaces. Similarly, $\mathcal{V}_n(\mathbf{X})$ is the nonnegative number defined by
$$ \mathcal{V}^2_n (\mathbf{X}) = \mathcal{V}^2_n (\mathbf{X,X}) = \frac{1}{n^2} \sum_{k,\,l=1}^n A_{kl}^2. $$
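The definition above can be sketched directly in base R. The helpers `double_center` and `dcov_sketch` below are hypothetical names introduced for illustration; they are not part of the energy package, and the sketch assumes index 1 and numeric data with observations in rows.

```r
# Hypothetical helper: double-center a distance matrix, i.e.
# A_kl = a_kl - (row mean k) - (column mean l) + (grand mean).
double_center <- function(d) {
  d <- as.matrix(d)
  sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
}

# Hypothetical helper: sample distance covariance V_n with index 1.
dcov_sketch <- function(x, y) {
  A <- double_center(dist(x))
  B <- double_center(dist(y))
  sqrt(mean(A * B))   # mean(A * B) is (1/n^2) * sum over k, l of A_kl * B_kl
}

x <- iris[1:50, 1:4]
y <- iris[51:100, 1:4]
dcov_sketch(x, y)
```

If the energy package is installed, the value can be compared against `dcov(x, y)` as a sanity check.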

The empirical distance correlation $\mathcal{R}_n(\mathbf{X,Y})$ is the square root of
$$ \mathcal{R}^2_n(\mathbf{X,Y})= \frac {\mathcal{V}^2_n(\mathbf{X,Y})} {\sqrt{ \mathcal{V}^2_n (\mathbf{X}) \mathcal{V}^2_n(\mathbf{Y})}}. $$

See dcov.test for a test of multivariate independence based on the distance covariance statistic.
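The distance correlation ratio can likewise be sketched in base R. The helpers `dcov2_sketch` and `dcor_sketch` are hypothetical names, not functions of the energy package. Returning 0 when the denominator is 0 is a convention assumed here, and the sketch covers index 1 only.

```r
# Hypothetical helper: squared sample distance covariance V_n^2,
# computed from double-centered Euclidean distance matrices.
dcov2_sketch <- function(x, y) {
  center <- function(d) {
    d <- as.matrix(d)
    sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
  }
  mean(center(dist(x)) * center(dist(y)))
}

# Hypothetical helper: sample distance correlation R_n.
dcor_sketch <- function(x, y) {
  denom <- sqrt(dcov2_sketch(x, x) * dcov2_sketch(y, y))
  # Assumed convention: return 0 when either sample has zero distance variance.
  if (denom > 0) sqrt(dcov2_sketch(x, y) / denom) else 0
}

x <- iris[1:50, 1:4]
y <- iris[51:100, 1:4]
dcor_sketch(x, y)   # lies between 0 and 1
dcor_sketch(x, x)   # a sample paired with itself gives 1, up to rounding
```

If the energy package is installed, `dcor(x, y)` offers a reference value for comparison.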

References

Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007), Measuring and Testing Dependence by Correlation of Distances, Annals of Statistics, Vol. 35, No. 6, pp. 2769-2794.
doi:10.1214/009053607000000505

Szekely, G.J. and Rizzo, M.L. (2009), Brownian Distance Covariance, Annals of Applied Statistics, Vol. 3, No. 4, pp. 1236-1265.
doi:10.1214/09-AOAS312

Szekely, G.J. and Rizzo, M.L. (2009), Rejoinder: Brownian Distance Covariance, Annals of Applied Statistics, Vol. 3, No. 4, pp. 1303-1308.

Examples

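 library(energy)

 # Two samples of 50 observations each, with four measurements per row
 x <- iris[1:50, 1:4]
 y <- iris[51:100, 1:4]
 dcov(x, y)
 dcov(dist(x), dist(y))  # same result from pre-computed distances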
