Estimation of the complexity of a discrete mixture, along with its component weights and parameters, by minimizing the squared Hellinger distance to the empirical probability mass function.
hellinger.disc(obj, j.max = 10, threshold = "SBC", control = c(trace = 0))
hellinger.boot.disc(obj, j.max = 10, B = 100, ql = 0.025, qu = 0.975,
control = c(trace = 0), ...)
object of class datMix.
integer, stating the maximal number of components to be considered.
function or character string in c("AIC", "SBC"), specifying which threshold should be used to compare two mixture estimates of complexities \(j\) and \(j+1\). If the difference in minimized squared distances is smaller than the relevant threshold, \(j\) is returned as the complexity estimate.
control list of optimization parameters, see solnp.
integer, specifying the number of bootstrap replicates.
numeric between \(0\) and \(1\) specifying the lower quantile to which the observed difference in minimized squared distances will be compared.
numeric between \(0\) and \(1\) specifying the upper quantile to which the observed difference in minimized squared distances will be compared.
further arguments passed to the boot function.
Object of class paramEst with the following attributes:
data based on which the complexity is estimated.
character string stating the (abbreviated) name of the component distribution, such that the function ddist evaluates its density/mass function and rdist generates random variates.
integer specifying the number of parameters identifying the component distribution, i.e. if \(\theta \in R^d\) then ndistparams \(= d\).
string vector specifying the names of the formal arguments identifying the distribution dist and used in ddist and rdist, e.g. for a Gaussian mixture (dist = norm) this amounts to mean and sd, as these are the formal arguments used by dnorm and rnorm.
logical flag indicating whether the underlying mixture distribution is discrete. Will always be TRUE in this case.
attribute MLE.function of obj.
Say the complexity estimate is equal to some \(j\). Then pars is a numeric vector of size \((d+1)j-1\) specifying the component weight and parameter estimates, given as $$(w_1, \dots, w_{j-1}, \theta_{1,1}, \dots, \theta_{1,j}, \theta_{2,1}, \dots, \theta_{d,j}).$$
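To make this layout concrete, here is a small illustration (in Python rather than R, with made-up values, not package output) for a Gaussian mixture with complexity estimate \(j = 2\) and \(d = 2\) component parameters (mean and sd), so the vector has length \((d+1) \cdot 2 - 1 = 5\):

```python
# Hypothetical example of the `pars` layout described above:
# first the free weights, then the first parameter of every component,
# then the second parameter of every component, and so on.
j, d = 2, 2
pars = [0.3,        # w_1 (w_2 = 1 - w_1 is implicit)
        -1.0, 2.0,  # mean of component 1, mean of component 2
        0.5, 1.5]   # sd of component 1, sd of component 2
assert len(pars) == (d + 1) * j - 1
```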
numeric vector containing the function values attained during optimization at iteration \(j\), the last entry being the value at the optimum.
integer indicating whether the solver has converged (0) or not (1 or 2) at iteration \(j\).
Define the \(complexity\) of a finite discrete mixture \(F\) as the smallest integer \(p\) such that its probability mass function (pmf) \(f\) can be written as
$$f(x) = w_1 g(x;\theta_1) + \dots + w_p g(x;\theta_p),$$
where the \(w_i\) are positive component weights summing to 1 and \(g(\cdot;\theta)\) is the component pmf.
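As a minimal sketch of this definition (written in Python for illustration, not package code), the pmf of a three-component Poisson mixture, matching the setup used in the example at the end of this page, can be evaluated and checked to sum to 1 over its support:

```python
from math import exp, factorial

def pois_pmf(x, lam):
    # Poisson pmf g(x; lambda)
    return exp(-lam) * lam**x / factorial(x)

def mixture_pmf(x, w, theta):
    # f(x) = w_1*g(x; theta_1) + ... + w_p*g(x; theta_p)
    return sum(wi * pois_pmf(x, ti) for wi, ti in zip(w, theta))

w, lam = [0.45, 0.45, 0.10], [1, 5, 10]
total = sum(mixture_pmf(x, w, lam) for x in range(100))
# total is numerically 1, as a pmf must be
```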
Let \(g, f\) be two probability mass functions. The squared Hellinger distance between \(g\) and \(f\) is given by
$$H^2(g,f) = \sum_x (\sqrt{g(x)} - \sqrt{f(x)})^2,$$
where \(\sqrt{g(x)}\) and \(\sqrt{f(x)}\) denote the square roots of the respective probability mass functions at the point \(x\).
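A minimal sketch of this distance (in Python, with pmfs represented as plain dictionaries; this is an illustration, not the package's implementation):

```python
from math import sqrt

def hellinger_sq(g, f, support):
    """Squared Hellinger distance between two pmfs on a common support."""
    return sum((sqrt(g.get(x, 0.0)) - sqrt(f.get(x, 0.0))) ** 2
               for x in support)

g = {0: 0.5, 1: 0.5}
print(hellinger_sq(g, g, [0, 1]))                # 0.0 for identical pmfs
print(hellinger_sq({0: 1.0}, {1: 1.0}, [0, 1]))  # 2.0, the maximum
```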
To estimate \(p\), hellinger.disc iteratively increases the assumed complexity \(j\) and finds the "best" estimates for the pmf of a mixture with \(j\) components and the pmf of a mixture with \(j+1\) components, by calculating the parameters that minimize the squared Hellinger distance to the empirical probability mass function at the given points. Once these parameters have been obtained, the difference in minimized squared distances is compared to a predefined threshold. If this difference is smaller than the threshold, the algorithm terminates and the true complexity is estimated as \(j\); otherwise \(j\) is increased by 1. The predefined thresholds are the "AIC"
given by
$$(d+1)/n$$
and the "SBC" given by
$$(d+1)\log(n)/(2n),$$
\(n\) being the sample size and \(d\) the number of component parameters, i.e. \(\theta \in R^d\). Note that, if a customized threshold function is to be used, it may only take the arguments j and n, so if the user wants to include the number of component parameters \(d\), it has to be entered explicitly.
hellinger.boot.disc works similarly to hellinger.disc, with the exception that the difference in squared distances is compared to a value generated via a bootstrap procedure instead of a predefined threshold. At every iteration, the function sequentially tests \(p = j\) versus \(p = j+1\) for \(j = 1, 2, \dots\), using a parametric bootstrap to generate B samples of size \(n\) from a \(j\)-component mixture given the previously calculated "best" parameter values. For each of the bootstrap samples, the "best" estimates corresponding to pmfs with \(j\) and \(j+1\) components are again computed, as well as their difference in squared Hellinger distances to the empirical probability mass function. The null hypothesis \(H_0: p = j\) is rejected and \(j\) increased by 1 if the original difference in squared distances lies outside the interval bounded by the ql and qu empirical quantiles of the bootstrapped differences. Otherwise, \(j\) is returned as the complexity estimate.
To calculate the minimum of the Hellinger distance (and the corresponding parameter values), the solver solnp is used. The initial values supplied to the solver are calculated as follows: the data are clustered into \(j\) groups by the function clara, and the data corresponding to each group are given to MLE.function (if supplied to the datMix object obj; otherwise numerical optimization is used here as well). The relative group sizes are taken as initial component weights and the MLEs as initial component parameter estimates.
M.-J. Woo and T. Sriram, "Robust estimation of mixture complexity for count data", Computational Statistics and Data Analysis 51, 4379-4392, 2007.
L2.disc for the same estimation method using the L2 distance, hellinger.cont for the same estimation method for continuous mixtures, solnp for the solver, datMix for the creation of the datMix object.
## create 'Mix' object
poisMix <- Mix("pois", discrete = TRUE, w = c(0.45, 0.45, 0.1), lambda = c(1, 5, 10))
## create random data based on 'Mix' object (gives back 'rMix' object)
set.seed(0)
poisRMix <- rMix(1000, obj = poisMix)
## create 'datMix' object for estimation
# generate list of parameter bounds
poisList <- list("lambda" = c(0,Inf))
# generate MLE function
MLE.pois <- function(dat){
  mean(dat)  # the Poisson MLE of lambda is the sample mean
}
# generating 'datMix' object
pois.dM <- RtoDat(poisRMix, theta.bound.list = poisList, MLE.function = MLE.pois)
## complexity and parameter estimation
set.seed(0)
res <- hellinger.disc(pois.dM)
plot(res)