This function computes a confidence interval for Kullback-Leibler (KL) divergence based on the subsampling bootstrap introduced by Politis and Romano (1994). See Details for the theoretical properties of this method.
kld_ci_subsampling(
X,
Y = NULL,
q = NULL,
estimator = kld_est_nn,
B = 500L,
alpha = 0.05,
subsample.size = function(x) x^(2/3),
convergence.rate = sqrt,
method = c("quantile", "se"),
include.boot = FALSE,
n.cores = 1L,
...
)
A list with the following fields:

"est"
The estimated KL divergence.

"ci"
A length-2 vector containing the lower and upper limits of the estimated confidence interval.

"boot"
A length-B numeric vector with the KL divergence estimates on the bootstrap subsamples; only included if include.boot = TRUE.
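As an illustration (not part of the package documentation), the fields can be accessed as follows, with X and Y as in the Examples section below:

res <- kld_ci_subsampling(X, Y)
res$est   # point estimate of the KL divergence
res$ci    # lower and upper confidence limits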
X, Y
n-by-d and m-by-d data frames or matrices (multivariate samples), or numeric/character vectors (univariate samples, i.e. d = 1), representing n samples from the true distribution \(P\) and m samples from the approximate distribution \(Q\) in d dimensions. Y can be left blank if q is specified (see below).
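For example, a hypothetical two-sample call in d = 2 dimensions (the data-generating choices here are illustrative only):

set.seed(0)
X2 <- matrix(rnorm(200), ncol = 2)             # n = 100 samples from P
Y2 <- matrix(rnorm(300, mean = 1), ncol = 2)   # m = 150 samples from Q
kld_ci_subsampling(X2, Y2)$ci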
q
The density function of the approximate distribution \(Q\). Either Y or q must be specified. If the distributions are all continuous or all discrete, q can be specified directly as the probability density/mass function. For mixed continuous/discrete distributions, however, q must be given in decomposed form, \(q(y_c,y_d)=q_{c|d}(y_c|y_d)q_d(y_d)\), specified as a named list with field cond for the conditional density \(q_{c|d}(y_c|y_d)\) (a function expecting two arguments y_c and y_d) and field disc for the discrete marginal density \(q_d(y_d)\) (a function expecting one argument y_d); see the sketch below. If such a decomposition is not available, it may be preferable to simulate a large sample from \(Q\) and use the two-sample syntax instead.
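A minimal sketch of the decomposed form, assuming (purely for illustration) that \(Q\) has a Bernoulli(0.5) discrete coordinate y_d and a conditionally Gaussian continuous coordinate y_c with mean y_d:

q_mixed <- list(
  cond = function(y_c, y_d) dnorm(y_c, mean = y_d, sd = 1),  # q_{c|d}(y_c | y_d)
  disc = function(y_d) dbinom(y_d, size = 1, prob = 0.5)     # q_d(y_d)
)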
estimator
The Kullback-Leibler divergence estimation method: a function expecting two inputs (X and Y, or X and q, depending on the arguments provided). Defaults to kld_est_nn.
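Any estimator with a compatible signature can be supplied; for instance, assuming the bias-reduced nearest-neighbour estimator kld_est_brnn from the same package, and X and Y as in the Examples below:

kld_ci_subsampling(X, Y, estimator = kld_est_brnn)$ci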
B
Number of bootstrap replicates (default: 500); the larger, the more accurate, but also the more computationally expensive.
alpha
Error level; defaults to 0.05.
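For example, a 99% confidence interval from 1000 replicates (slower, but more accurate), with X and Y as in the Examples below:

kld_ci_subsampling(X, Y, B = 1000L, alpha = 0.01)$ci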
subsample.size
A function specifying the size of the subsamples; defaults to \(f(x) = x^{2/3}\).
convergence.rate
A function computing the convergence rate of the estimator as a function of sample size; defaults to \(f(x) = x^{1/2}\). If convergence.rate is NULL, it is estimated empirically from the sample(s) using kldest::convergence_rate(). See the sketch below for custom choices of subsample.size and convergence.rate.
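A sketch with a custom subsample size \(b_n = n^{1/2}\) and an empirically estimated convergence rate (both choices are illustrative, not recommendations):

kld_ci_subsampling(X, Y,
                   subsample.size = function(n) n^(1/2),
                   convergence.rate = NULL)$ci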
Either "quantile"
(the default), also known as the reverse
percentile method, or "se"
for a normal approximation of the KL
divergence estimator using the standard error of the subsamples.
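For instance, to use the normal approximation instead of the default reverse percentile method:

kld_ci_subsampling(X, Y, method = "se")$ci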
include.boot
Boolean; TRUE means KL divergence estimates on subsamples are included in the returned list. Defaults to FALSE.
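A typical use is inspecting the distribution of the subsample estimates, e.g.:

res <- kld_ci_subsampling(X, Y, include.boot = TRUE)
hist(res$boot)   # KL divergence estimates across the B subsamples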
n.cores
Number of cores to use in parallel computing (defaults to 1, which means that no parallel computing is used). To use this option, the parallel package must be installed and the OS must be of UNIX type (i.e., not Windows). Otherwise, n.cores will be reset to 1, with a message.
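For example, on a UNIX-type system with the parallel package installed:

kld_ci_subsampling(X, Y, n.cores = 2L)$ci   # elsewhere, resets to 1 core with a message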
...
Arguments passed on to estimator, i.e. via the call estimator(X, Y = Y, ...) or estimator(X, q = q, ...).
In general terms, letting \(b_n\) be the subsample size for a sample of size \(n\), and \(\tau_n\) the convergence rate of the estimator, a confidence interval calculated by subsampling has asymptotic coverage \(1 - \alpha\) as long as \(b_n/n\rightarrow 0\), \(b_n\rightarrow\infty\) and \(\frac{\tau_{b_n}}{\tau_n}\rightarrow 0\).
In many cases, the convergence rate of the nearest-neighbour based KL divergence estimator is \(\tau_n = \sqrt{n}\) and the condition on the subsample size reduces to \(b_n/n\rightarrow 0\) and \(b_n\rightarrow\infty\). By default, \(b_n = n^{2/3}\). In a two-sample problem, \(n\) and \(b_n\) are replaced by effective sample sizes \(n_\text{eff} = \min(n,m)\) and \(b_{n,\text{eff}} = \min(b_n,b_m)\).
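As a quick numeric check that the default \(b_n = n^{2/3}\) satisfies these conditions:

n <- c(100, 1000, 10000)
b_n <- n^(2/3)   # approx. 21.5, 100, 464.2: grows without bound
b_n / n          # approx. 0.215, 0.100, 0.046: shrinks towards 0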
Reference:
Politis, D. N. and Romano, J. P. (1994). "Large sample confidence regions based on subsamples under minimal assumptions." The Annals of Statistics, 22(4).
# 1D Gaussian (one- and two-sample problems)
set.seed(0)
X <- rnorm(100)                     # sample from the true distribution P = N(0, 1)
Y <- rnorm(100, mean = 1, sd = 2)   # sample from the approximate distribution Q = N(1, 4)
q <- function(x) dnorm(x, mean = 1, sd = 2)   # density of Q

# analytical KL divergence (sigma1, sigma2 are variances)
kld_gaussian(mu1 = 0, sigma1 = 1, mu2 = 1, sigma2 = 2^2)

# point estimates (two-sample and one-sample)
kld_est_nn(X, Y = Y)
kld_est_nn(X, q = q)

# subsampling-based confidence intervals (two-sample and one-sample)
kld_ci_subsampling(X, Y)$ci
kld_ci_subsampling(X, q = q)$ci