shapleysobol.knn implements the estimation of the Shapley Sobol' effect directly from scattered data.
Parallel computation is available to accelerate the estimation.
Categorical inputs must be converted to factors before calling this function.
For large datasets, subsampling is supported to reduce the computational cost.
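As a minimal sketch of the factor conversion mentioned above, the snippet below converts every character column of a data frame to a factor using base R only. The data frame `df` and its column names are made up for illustration.

```r
# Hypothetical toy data: one numeric input and one categorical input stored as text
df <- data.frame(
  x1 = c(0.1, 0.7, 0.3, 0.9),
  region = c("north", "south", "south", "east"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; numeric columns are left untouched
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

str(df)
```

After this step, `df` could be passed as the `X` argument.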
shapleysobol.knn(
X,
y,
noise,
n.knn = NULL,
n.mc = nrow(X),
twin.mc = FALSE,
rescale = TRUE,
parl = NULL
)
A numeric vector of the estimated Shapley Sobol' effects.
X: a matrix or data frame for the factors / predictors.
y: a vector for the responses.
noise: a logical indicating whether the responses are noisy.
n.knn: number of nearest-neighbors for the inner loop conditional variance estimation. n.knn=2 is recommended for regression, and n.knn=3 for binary classification.
n.mc: number of Monte Carlo samples for the outer loop expectation estimation.
twin.mc: a logical indicating whether to use twinning subsamples; otherwise random subsamples are used. Twinning is supported when the reduction ratio is at least 2.
rescale: a logical indicating whether to standardize the factors / predictors.
parl: number of cores on which to parallelize the computation. If NULL, then no parallelization is done.
Chaofan Huang chaofan.huang@gatech.edu and V. Roshan Joseph roshan@gatech.edu
shapleysobol.knn provides consistent estimation of the Shapley Sobol' effect (Owen, 2014; Song et al., 2016) from scattered data.
When the output is clean/noiseless (noise=FALSE), shapleysobol.knn implements the Nearest-Neighbor estimator proposed in Broto et al. (2020).
When the output is noisy (noise=TRUE), shapleysobol.knn implements the Noise-Adjusted Nearest-Neighbor (NANNE) estimator (Huang and Joseph, 2025).
The NANNE estimator corrects the bias that random noise introduces into the nearest-neighbor estimator.
Please see Huang and Joseph (2025) for a more detailed discussion and comparison.
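To give a feel for the nearest-neighbor idea described above, here is a rough brute-force sketch that estimates E[Var(y | X_u)] for a subset `u` of the inputs by averaging the sample variance of `y` over each point's nearest neighbors. The helper `nn_cond_var` is hypothetical and is not the package's code; it omits the Shapley weighting over subsets, the noise adjustment, and the k-d tree search.

```r
# Hypothetical helper: brute-force nearest-neighbor estimate of E[Var(y | X[, u])]
nn_cond_var <- function(X, y, u, k = 3) {
  Xu <- X[, u, drop = FALSE]
  D <- as.matrix(dist(Xu))                 # pairwise Euclidean distances in the u-coordinates
  v <- vapply(seq_len(nrow(Xu)), function(i) {
    idx <- order(D[i, ])[seq_len(k)]       # k closest points, including point i itself
    var(y[idx])                            # inner-loop conditional variance estimate
  }, numeric(1))
  mean(v)                                  # outer-loop average
}

set.seed(1)
n <- 500
X <- matrix(runif(2 * n), ncol = 2)
y <- sin(2 * pi * X[, 1])                  # noiseless output that depends on the first input only

ev1 <- nn_cond_var(X, y, u = 1)            # conditioning on x1 leaves almost no variance
ev2 <- nn_cond_var(X, y, u = 2)            # conditioning on x2 leaves roughly var(y)
print(c(ev1 = ev1, ev2 = ev2, var_y = var(y)))
```

In this toy setting the first quantity should be close to zero and the second close to the total variance, which is what makes the first input the only important one.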
For integer/numeric output, n.knn=2 nearest neighbors are recommended for noisy data (Huang and Joseph, 2025), and n.knn=3 nearest neighbors are suggested for clean/noiseless data (Broto et al., 2020).
For numeric inputs, it is recommended to standardize them by setting the argument rescale=TRUE.
Categorical inputs are transformed via one-hot encoding for the nearest-neighbor search.
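The preprocessing described above can be illustrated in base R as follows. This is only a sketch of what standardization and one-hot encoding look like; the exact transformation applied internally is not specified here, and the data frame `df` is made up for illustration.

```r
# Hypothetical toy data with one numeric and one categorical input
df <- data.frame(
  x1 = c(0.2, 0.5, 0.9, 0.4),
  color = factor(c("red", "green", "red", "blue"))
)

# Standardize the numeric column to mean 0 and standard deviation 1
x1_std <- as.numeric(scale(df$x1))

# One-hot encode the factor: one 0/1 indicator column per level (no intercept)
onehot <- model.matrix(~ color - 1, data = df)

# Numeric matrix on which distances for a nearest-neighbor search could be computed
X_num <- cbind(x1 = x1_std, onehot)
print(X_num)
```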
To speed up the nearest-neighbor search, the k-d tree implementation from the FNN package is used.
Parallel computation is also supported via the parallel package.
Finally, for large datasets, subsampling is supported for further acceleration.
Use the argument n.mc to specify the number of subsamples.
Two options are available for finding the subsamples: random and twinning (Vakayil and Joseph, 2022).
Twinning finds subsamples that better represent the full dataset and therefore gives a more accurate estimate, but at a slightly higher computational cost.
For more details, please see the twinning package.
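The subsampling choice above can be sketched in base R. The snippet assumes that "reduction ratio" means the full sample size divided by the subsample size, which is an interpretation rather than something stated here, and it only shows plain random selection of rows; how the function uses the subsample internally is not shown.

```r
set.seed(42)
X <- matrix(runif(20000 * 3), ncol = 3)   # toy inputs, 20000 rows
n.mc <- 5000                              # desired number of subsamples

# Assumed definition: reduction ratio = full size / subsample size
ratio <- nrow(X) / n.mc
use_twin <- ratio >= 2                    # twinning is documented as supported only when the ratio is at least 2

# Random subsample of row indices, drawn without replacement
idx <- sample(nrow(X), n.mc)
X_sub <- X[idx, , drop = FALSE]
print(c(ratio = ratio, rows = nrow(X_sub)))
```

With these sizes, setting n.mc=5000 together with twin.mc=TRUE would satisfy the documented ratio requirement.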
Huang, C., & Joseph, V. R. (2025). Factor Importance Ranking and Selection using Total Indices. Technometrics.
Owen, A. B. (2014). Sobol' indices and Shapley value. SIAM/ASA Journal on Uncertainty Quantification, 2, 245-251.
Song, E., Nelson, B. L., & Staum, J. (2016). Shapley effects for global sensitivity analysis: Theory and computation. SIAM/ASA Journal on Uncertainty Quantification, 4, 1060-1083.
Broto, B., Bachoc, F., & Depecker, M. (2020). Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2), 693-716.
Vakayil, A., & Joseph, V. R. (2022). Data twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, 15(5), 598-610.
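# Ishigami test function; inputs in [0, 1] are mapped to [-pi, pi]
ishigami <- function(x) {
  x <- -pi + 2*pi*x
  y <- sin(x[1]) + 7*sin(x[2])^2 + 0.1*x[3]^4*sin(x[1])
  return(y)
}
set.seed(123)
n <- 10000
p <- 3
X <- matrix(runif(n*p), ncol=p)
# noisy responses: standard normal noise is added to the function values
y <- apply(X, 1, ishigami) + rnorm(n)
# all inputs already share the [0, 1] scale; rescale=FALSE skips standardization
ssi <- shapleysobol.knn(X, y, noise=TRUE, n.knn=2, rescale=FALSE)
print(round(ssi, 3))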
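As a sanity check for the example above, the noiseless Ishigami function with independent uniform inputs has closed-form Sobol' variance components, and with independent inputs the Shapley effect of each input is its main effect plus an equal share of every interaction it takes part in. The sketch below computes these reference values. Whether shapleysobol.knn reports effects on the variance scale or as proportions is not stated here, so both are printed; note also that the added noise contributes to the variance of y but not to these components.

```r
a <- 7; b <- 0.1
V1  <- 0.5 * (1 + b * pi^4 / 5)^2   # main effect of x1
V2  <- a^2 / 8                      # main effect of x2
V13 <- 8 * b^2 * pi^8 / 225         # x1:x3 interaction; x3 has no main effect
V   <- V1 + V2 + V13                # total variance of the noiseless output

# Each interaction term is split equally among the inputs involved in it
sh <- c(x1 = V1 + V13 / 2, x2 = V2, x3 = V13 / 2)
print(round(sh, 3))                 # variance scale
print(round(sh / V, 3))             # proportions of the noiseless variance
```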