energy: Energy distance computation

Description

energy() computes the energy distance (Székely and Rizzo, 2013) between a given dataset and a set of points of the same dimension.

Usage

energy(data, points)

Arguments

data

The dataset including both the predictors and response(s). A numeric matrix is expected. If the dataset has factor columns, the user is expected to convert them to numeric using a coding method.

points

The set of points for which the energy distance with respect to data is to be computed. A numeric matrix is expected.
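Since data must be numeric, factor columns have to be coded before calling energy(). The following is a minimal sketch of one possible coding method, indicator (dummy) coding with base R's model.matrix(). The data frame df and its column names are hypothetical, and other coding schemes may suit a given problem better.

```r
# Hypothetical data frame with one factor column (names are illustrative only).
df <- data.frame(
  x = c(0.5, 1.2, -0.3, 2.0),
  group = factor(c("a", "b", "a", "c")),
  y = c(1.0, 2.5, 0.1, 4.2)
)

# Indicator-code the factor; "- 1" drops the intercept column.
# The result is a numeric matrix that can be passed as the data argument.
data_numeric <- model.matrix(~ . - 1, data = df)
```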

Value

The energy distance between data and points.

Details

The smaller the energy distance, the more statistically similar the set of points is to the given dataset. The set of points that minimizes the energy distance is known as support points (Mak and Joseph, 2018), which form the basis of the twinning method. Computing the energy distance between data and points involves Euclidean distance calculations among the rows of data, among the rows of points, and between the rows of data and the rows of points. Since data serves as the reference, the distance calculations among the rows of data are ignored for efficiency. Before the energy distance is computed, the columns of data are scaled to zero mean and unit standard deviation. The means and standard deviations of the columns of data are also used to scale the respective columns of points.
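The computation described above can be sketched in plain R. This is an illustration only, not the package's implementation: energy_sketch is a hypothetical helper, and it assumes the usual two-sample energy distance with the data-to-data term dropped, that is, 2/(nN) times the sum of distances between rows of points and rows of data, minus 1/n^2 times the sum of distances among rows of points, where n and N are the numbers of rows in points and data. It also assumes the sample standard deviation (as returned by sd()) for scaling. The value returned by energy() may differ in normalization.

```r
# Hypothetical helper, not part of the twinning package.
energy_sketch <- function(data, points) {
  data <- as.matrix(data)
  points <- as.matrix(points)
  stopifnot(ncol(data) == ncol(points))

  # Scale both sets using the column means and standard deviations of data.
  mu <- colMeans(data)
  sigma <- apply(data, 2, sd)
  data_s <- sweep(sweep(data, 2, mu, "-"), 2, sigma, "/")
  points_s <- sweep(sweep(points, 2, mu, "-"), 2, sigma, "/")

  n <- nrow(points_s)
  N <- nrow(data_s)

  # Cross term: Euclidean distances between rows of points and rows of data.
  cross <- 0
  for (i in seq_len(n)) {
    diffs <- sweep(data_s, 2, points_s[i, ], "-")
    cross <- cross + sum(sqrt(rowSums(diffs^2)))
  }

  # Within term: distances among rows of points.
  # dist() returns each unordered pair once, hence the factor of 2.
  within <- 2 * sum(dist(points_s))

  # The data-to-data term is omitted, since data is the fixed reference.
  2 * cross / (n * N) - within / n^2
}
```

Because the data-to-data term is a constant for a fixed dataset, omitting it does not change which of two candidate point sets is closer to data.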

References

Vakayil, A., & Joseph, V. R. (2022). Data Twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, to appear. arXiv preprint arXiv:2110.02927.

Székely, G. J., & Rizzo, M. L. (2013). Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8), 1249-1272.

Mak, S. & Joseph, V. R. (2018). Support Points. Annals of Statistics, 46, 2562-2592.

Examples

# NOT RUN {
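## Energy distance between a dataset and a random sample
library(twinning)
set.seed(1)  # for reproducibility
X <- rnorm(n = 100, mean = 0, sd = 1)
Y <- rnorm(n = 100, mean = X^2, sd = 1)
data <- cbind(X, Y)
## Compare data with a random subsample of 20 of its rows
energy(data, data[sample(100, 20), ])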

# }
