Subsampling-based confidence intervals computed by kld_ci_subsampling()
require the convergence rate of the KL divergence estimator as an input. The
default rate of 0.5 assumes that the variance term dominates the bias term.
For high-dimensional problems, depending on the data, the convergence rate
might be lower. This function estimates the convergence rate empirically.
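The idea of estimating a rate from subsamples can be sketched in base R. The sketch below is a simplified illustration, not necessarily what convergence_rate() does internally: it uses the sample mean as a stand-in estimator (known rate 0.5), measures the spread of subsample estimates by their interquartile range, and regresses the log spread on the log subsample size. All variable names and the choice of spread measure are assumptions made for this example.

```r
set.seed(1)
x <- rnorm(10000)            # toy data (assumption for illustration)
estimator <- mean            # stand-in estimator with known rate 0.5
n <- length(x)

# Subsample sizes spaced geometrically around sqrt(n)
n.sizes <- 4
spacing.factor <- 1.5
b <- round(sqrt(n) * spacing.factor^(seq_len(n.sizes) - (n.sizes + 1) / 2))

# For each size, draw B subsamples without replacement and record the
# interquartile range of the resulting estimates
B <- 1000
spread <- vapply(b, function(size) {
  est <- replicate(B, estimator(sample(x, size)))
  unname(diff(quantile(est, c(0.25, 0.75))))
}, numeric(1))

# If the spread scales like b^(-beta), then log(spread) is linear in log(b)
# with slope -beta
fit <- lm(log(spread) ~ log(b))
beta <- -unname(coef(fit)[2])
beta
```

For the sample mean, the resulting beta should land near 0.5, up to Monte Carlo noise.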
convergence_rate(
  estimator,
  X,
  Y = NULL,
  q = NULL,
  n.sizes = 4,
  spacing.factor = 1.5,
  typical.subsample = function(n) sqrt(n),
  B = 500L,
  plot = FALSE
)
A scalar, the parameter \(\beta\) in the empirical convergence
rate \(n^{-\beta}\) of the estimator to the true KL divergence.
It can be used in the convergence.rate argument of kld_ci_subsampling()
as convergence.rate = function(n) n^beta.
estimator: A KL divergence estimator.
X, Y: n-by-d and m-by-d data frames or matrices (multivariate
samples), or numeric/character vectors (univariate samples, i.e. d = 1),
representing n samples from the true distribution \(P\) and m
samples from the approximate distribution \(Q\) in d dimensions.
Y can be left blank if q is specified (see below).
q: The density function of the approximate distribution \(Q\). Either
Y or q must be specified. If the distributions are all continuous or
all discrete, q can be directly specified as the probability density/mass
function. However, for mixed continuous/discrete distributions, q must
be given in decomposed form, \(q(y_c,y_d)=q_{c|d}(y_c|y_d)q_d(y_d)\),
specified as a named list with field cond for the conditional density
\(q_{c|d}(y_c|y_d)\) (a function that expects two arguments y_c and y_d)
and disc for the discrete marginal density \(q_d(y_d)\) (a
function that expects one argument y_d). If such a decomposition is not
available, it may be preferable to instead simulate a large sample from
\(Q\) and use the two-sample syntax.
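As a rough illustration of the decomposed form, a mixed distribution with one binary and one continuous component might be written as follows. The particular distribution (a Bernoulli(0.5) component and a normal component whose mean depends on it) is an assumption made for this sketch; only the list fields cond and disc come from the description above.

```r
# Hypothetical mixed distribution Q (an assumption for illustration):
#   y_d ~ Bernoulli(0.5),  y_c | y_d ~ Normal(mean = y_d, sd = 1)
q <- list(
  cond = function(y_c, y_d) dnorm(y_c, mean = y_d, sd = 1),  # q_{c|d}(y_c | y_d)
  disc = function(y_d) dbinom(y_d, size = 1, prob = 0.5)     # q_d(y_d)
)

# Joint density q(y_c, y_d) = q_{c|d}(y_c | y_d) * q_d(y_d) at one point
joint <- q$cond(0.3, 1) * q$disc(1)
joint
```

Keeping the two factors as separate functions means the continuous and discrete parts can each be evaluated with the appropriate density or mass function.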
n.sizes: Number of different subsample sizes to use (default: 4).
spacing.factor: Multiplicative factor controlling the spacing of sample
sizes (default: 1.5).
typical.subsample: A function that produces a typical subsample size,
used as the geometric mean of subsample sizes (default: sqrt(n)).
B: Number of subsamples to draw per subsample size (default: 500).
plot: A boolean (default: FALSE) controlling whether to produce a
diagnostic plot visualizing the fit.
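To make the roles of n.sizes, spacing.factor and typical.subsample concrete, the following base-R sketch builds a geometric grid of subsample sizes whose geometric mean equals the typical size. The exact formula and the rounding step are assumptions; the package's internal computation may differ.

```r
n <- 1000
n.sizes <- 4
spacing.factor <- 1.5
typical.subsample <- function(n) sqrt(n)

# Exponents centred at zero, so the geometric mean of the grid
# equals typical.subsample(n)
exponents <- seq_len(n.sizes) - (n.sizes + 1) / 2
sizes <- typical.subsample(n) * spacing.factor^exponents

round(sizes)             # integer subsample sizes for actual use
exp(mean(log(sizes)))    # geometric mean of the unrounded grid
```

Consecutive sizes differ by the factor spacing.factor, so a larger factor spreads the grid over a wider range of subsample sizes around the same centre.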
References:
Politis, Romano and Wolf, "Subsampling", Chapter 8 (1999), for theory.
The implementation has been adapted from lecture notes by C. J. Geyer, https://www.stat.umn.edu/geyer/5601/notes/sub.pdf
# NN method usually has a convergence rate around 0.5:
set.seed(0)
convergence_rate(kld_est_nn, X = rnorm(1000), Y = rnorm(1000, mean = 1, sd = 2))