Define the complexity of a finite continuous mixture \(F\) as the smallest integer \(p\) such that its probability density function (pdf) \(f\) can be written as
$$f(x) = w_1 g(x;\theta_1) + \dots + w_p g(x;\theta_p).$$
Further, let \(g, f\) be two probability density functions. The squared Hellinger distance between \(g\) and \(f\) is given by
$$H^2(g,f) = \int \left(\sqrt{g(x)}-\sqrt{f(x)}\right)^2 dx = 2 - 2\int\sqrt{f(x)}\sqrt{g(x)}\, dx,$$
where \(\sqrt{g(x)}\) and \(\sqrt{f(x)}\) denote the square roots of the respective probability density functions at the point \(x\).
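As a small numeric illustration (with two arbitrarily chosen normal densities, not taken from the package), the squared Hellinger distance can be evaluated by numerical integration:

```r
# Squared Hellinger distance H^2(g, f) = 2 - 2 * integral of sqrt(f(x) * g(x)) dx
# for two illustrative normal densities.
f <- function(x) dnorm(x, mean = 0, sd = 1)
g <- function(x) dnorm(x, mean = 1, sd = 1)
H2 <- 2 - 2 * integrate(function(x) sqrt(f(x) * g(x)), -Inf, Inf)$value
H2  # approximately 0.235 for these two densities
```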
To estimate \(p\), hellinger.cont iteratively increases the assumed complexity \(j\) and finds the "best" estimates of the pdfs of mixtures with \(j\) and with \(j+1\) components, ideally by calculating the parameters that minimize the sum of squared Hellinger distances to a kernel density estimate evaluated at each point. Since the computational burden of optimizing over an integral to find the "best" component weights and parameters is immense, the algorithm approximates the objective function by sampling sample.n observations \(Y_i\) from the kernel density estimate and using
$$2 - \frac{2}{\text{sample.n}} \sum_{i=1}^{\text{sample.n}} \sqrt{\frac{f(Y_i)}{g(Y_i)}},$$
instead, with \(f\) being the mixture density and \(g\) being the kernel density estimate.
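The sketch below illustrates this approximation outside of the package (all objects and values are illustrative, not mixComp internals): observations are drawn from a Gaussian kernel density estimate and the approximated squared Hellinger distance to a candidate two-component mixture is computed.

```r
set.seed(1)
x  <- c(rnorm(100, 0), rnorm(100, 4))   # illustrative data
bw <- density(x)$bw                     # bandwidth of the Gaussian KDE

# Draw sample.n observations Y_i from the KDE: resample a data point and add
# Gaussian noise with standard deviation equal to the bandwidth.
sample.n <- 1000
Y <- sample(x, sample.n, replace = TRUE) + rnorm(sample.n, sd = bw)

# KDE g evaluated pointwise, and a candidate 2-component mixture density f.
g <- function(y) sapply(y, function(yi) mean(dnorm(yi, mean = x, sd = bw)))
f <- function(y) 0.5 * dnorm(y, 0, 1) + 0.5 * dnorm(y, 4, 1)

# Approximated squared Hellinger distance between f and g.
H2.approx <- 2 - 2 / sample.n * sum(sqrt(f(Y) / g(Y)))
```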
Once the "best" parameters have been obtained, the difference in squared distances is compared to a predefined threshold. If this difference is smaller than the threshold, the algorithm terminates and the true complexity is estimated as \(j\); otherwise \(j\) is increased by 1 and the procedure is started over. The predefined thresholds are the "AIC" given by
$$\frac{d+1}{n}$$
and the "SBC" given by
$$\frac{(d+1)\log(n)}{2n},$$
\(n\) being the sample size and \(d\) the number of component parameters, i.e. \(\theta \in \mathbb{R}^d\). Note that, if a customized function is to be used, it may only take the arguments j and n, so if the user wants to include the number of component parameters \(d\), it has to be entered explicitly.
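For instance, a user-supplied threshold mimicking the "SBC" could look as follows (a minimal sketch; the value d = 2, e.g. for normal components with a mean and a standard deviation, is an assumption the user has to fill in):

```r
# Custom threshold: may only depend on j and n, so the number of component
# parameters d is entered explicitly rather than passed as an argument.
my.threshold <- function(j, n) {
  d <- 2                        # assumed: normal components with mean and sd
  (d + 1) * log(n) / (2 * n)    # SBC-type penalty
}

my.threshold(j = 2, n = 500)
```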
hellinger.boot.cont works similarly to hellinger.cont, with the exception that the difference in squared distances is not compared to a predefined threshold but to a value generated by a bootstrap procedure. At every iteration, the function sequentially tests \(p = j\) versus \(p = j+1\) for \(j = 1, 2, \dots\), using a parametric bootstrap to generate B samples of size \(n\) from a \(j\)-component mixture given the previously calculated "best" parameter values. For each of the bootstrap samples, the "best" estimates corresponding to pdfs with \(j\) and \(j+1\) components are calculated again, as well as their difference in approximated squared Hellinger distances from the kernel density estimate. The null hypothesis \(H_0: p = j\) is rejected and \(j\) is increased by 1 if the original difference in squared distances lies outside of the interval \([ql, qu]\), specified by the ql and qu empirical quantiles of the bootstrapped differences. Otherwise, \(j\) is returned as the complexity estimate.
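The decision rule at a single iteration can be summarised as in the following sketch (diff.orig and boot.diffs are placeholders for quantities the procedure computes, not objects returned by the package):

```r
# diff.orig:  observed difference in approximated squared Hellinger distances
#             between the best j- and (j+1)-component fits.
# boot.diffs: the B analogous differences computed on the parametric bootstrap
#             samples generated under H0: p = j.
decide <- function(diff.orig, boot.diffs, ql = 0.025, qu = 0.975) {
  q <- quantile(boot.diffs, probs = c(ql, qu))
  if (diff.orig < q[1] || diff.orig > q[2]) {
    "reject H0: increase j and iterate"          # outside [ql, qu] quantiles
  } else {
    "retain H0: return j as complexity estimate"
  }
}
```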
To calculate the minimum of the Hellinger distance (and the corresponding parameter values), the solver solnp is used. The initial values supplied to the solver are calculated as follows: the data is clustered into \(j\) groups by the function clara, and the data corresponding to each group is given to MLE.function (if supplied to the datMix object obj; otherwise numerical optimization is used here as well). The relative sizes of the groups are taken as initial component weights and the MLEs are taken as initial component parameter estimates.
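As an illustration of this initialisation scheme (a sketch under the assumption of normal components; it is not the package's internal code), starting values could be constructed as follows:

```r
library(cluster)

# Cluster the data into j groups with clara and derive starting values: the
# relative group sizes serve as initial weights, per-group estimates of mean
# and standard deviation serve as initial component parameters.
initial.values <- function(x, j) {
  cl <- clara(matrix(x, ncol = 1), k = j)$clustering
  list(
    w    = as.numeric(table(cl)) / length(x),
    mean = as.numeric(tapply(x, cl, mean)),
    sd   = as.numeric(tapply(x, cl, sd))
  )
}

initial.values(c(rnorm(100, 0), rnorm(100, 4)), j = 2)
```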