energy() computes the energy distance (Székely and Rizzo, 2013) between a given dataset and a set of points of the same dimension.
Usage
energy(data, points)
Arguments
data
The dataset including both the predictors and response(s). A numeric matrix is expected. If the dataset has factor columns, the user is expected to convert them to numeric using a coding method.
points
The set of points for which the energy distance with respect to data is to be computed. A numeric matrix is expected.
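As an illustration of the factor conversion mentioned for data, the sketch below uses base R's model.matrix() with its default treatment (dummy) coding. This is only one possible coding method and is an assumption here; the data frame df and its columns are made up for the example.

```r
# Hypothetical data frame with one factor column (illustrative values only)
df <- data.frame(
  x     = c(0.5, 1.2, -0.3, 2.1),
  group = factor(c("a", "b", "a", "c")),
  y     = c(1.0, 0.4, 2.2, 1.7)
)

# Treatment (dummy) coding: the 3-level factor becomes 2 indicator columns.
# The intercept column added by model.matrix() is dropped.
data_num <- model.matrix(~ ., data = df)[, -1, drop = FALSE]

# data_num is now a numeric matrix with columns x, groupb, groupc, y
colnames(data_num)
```

The resulting numeric matrix can then be passed as data, with points coded the same way so that the columns match.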
Value
Energy distance.
Details
The smaller the energy distance, the more statistically similar the set of points is to the given dataset. The set of points that minimizes the energy distance is known as support points (Mak and Joseph, 2018), which form the basis of the twinning method. Computing the energy distance between data and points involves Euclidean distance calculations among the rows of data, among the rows of points, and between the rows of data and the rows of points. Since data serves as the reference, the distance calculations among the rows of data are ignored for efficiency. Before the energy distance is computed, the columns of data are scaled to zero mean and unit standard deviation. The same column means and standard deviations of data are used to scale the respective columns of points.
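The computation described above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the function name energy_sketch is hypothetical, and the exact normalization (twice the mean cross distance minus the mean distance among points, with the data-to-data term dropped) is an assumption based on the standard energy distance formula, so its output need not match energy() exactly.

```r
# Hypothetical helper illustrating the steps in Details; not part of the package.
energy_sketch <- function(data, points) {
  data   <- as.matrix(data)
  points <- as.matrix(points)

  # Scale both matrices using the column means and standard deviations of data
  mu    <- colMeans(data)
  sigma <- apply(data, 2, sd)
  D <- scale(data,   center = mu, scale = sigma)
  P <- scale(points, center = mu, scale = sigma)

  # Euclidean distances between every row of points and every row of data
  sq    <- outer(rowSums(P^2), rowSums(D^2), "+") - 2 * P %*% t(D)
  cross <- sqrt(pmax(sq, 0))  # clamp tiny negative values from rounding

  # Euclidean distances among the rows of points (zero diagonal included)
  within <- as.matrix(dist(P))

  # The data-to-data term does not depend on points and is omitted
  2 * mean(cross) - mean(within)
}

set.seed(1)
X <- rnorm(100)
Y <- rnorm(100, mean = X^2)
data <- cbind(X, Y)

e_sub   <- energy_sketch(data, data[sample(100, 20), ])  # subsample of data
e_shift <- energy_sketch(data, data + 10)                # points far from data
```

Under these assumptions, a subsample drawn from data should give a smaller value than a copy of data shifted far away, in line with the interpretation that a smaller energy distance means greater statistical similarity.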
References
Vakayil, A., & Joseph, V. R. (2022). Data Twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, to appear. arXiv preprint arXiv:2110.02927.
Székely, G. J., & Rizzo, M. L. (2013). Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8), 1249-1272.
Mak, S., & Joseph, V. R. (2018). Support Points. Annals of Statistics, 46, 2562-2592.
Examples
## Energy distance between a dataset and a random sample
library(twinning)
X = rnorm(n=100, mean=0, sd=1)
Y = rnorm(n=100, mean=X^2, sd=1)
data = cbind(X, Y)
energy(data, data[sample(100, 20), ])