paramHankel: Estimation of Mixture Complexity (and Component Weights/Parameters) Based on Hankel Matrix Approach

Description

Estimation method of mixture complexity as well as component weights and parameters based on estimating the determinant of the Hankel matrix of the moments of the mixing distribution and comparing it to determinant values generated by a parametric bootstrap.

Usage

paramHankel(obj, j.max = 10, B = 1000, ql = 0.025, qu = 0.975, 
            control = c(trace = 0), …)
paramHankel.scaled(obj, j.max = 10, B = 100, ql = 0.025, qu = 0.975, 
                   control = c(trace = 0), …)
# S3 method for paramEst
plot(x, mixture = TRUE, components = TRUE, ylim = NULL, cex.main = 0.9, ...)
# S3 method for paramEst
print(x, ...)

Arguments

obj

object of class datMix.

j.max

integer stating the maximal number of components to be considered.

integer specifying the number of bootstrap replicates.

numeric between $0$ and $1$ specifying the lower bootstrap quantile to which the observed determinant value will be compared.

numeric between $0$ and $1$ specifying the upper bootstrap quantile to which the observed determinant value will be compared.

control

control list of optimization parameters, see solnp.

…

in paramHankel() and paramHankel.scaled():: further arguments passed to the boot function.
in plot.hankDet():: further arguments passed to the hist function plotting the data.
in print.hankDet():: further arguments passed to the printCoefmat function.

object of class paramEst.

mixture

logical flag, indicating whether the estimated mixture density should be plotted, set to TRUE by default.

components

logical flag, indicating whether the individual mixture components should be plotted, set to TRUE by default.

ylim

range of y values to use; if not specified (or containing NA), the function tries to construct reasonable default values.

cex.main

magnification to be used for main titles relative to the current setting of cex, see par.

Value

Object of class paramEst with the following attributes:

dat

data based on which the complexity is estimated.

dist

character string giving the abbreviated name of the component distribution, such that the function ddist evaluates its density/mass and rdist generates random variates.

ndistparams

integer specifying the number of parameters identifying the component distribution, i.e. if $θ$ is in $R^{d}$ then ndistparams $= d$ .

formals.dist

string vector specifying the names of the formal arguments identifying the distribution dist and used in ddist and rdist, e.g. for a gaussian mixture (dist = norm) amounts to mean and sd, as these are the formal arguments used by dnorm and rnorm.

discrete

logicalflag, indicating whether the underlying mixture distribution is discrete.

mle.fct

attribute MLE.function of obj.

pars

Say the complexity estimate is equal to some $j$ . Then pars is a numeric vector of size $(d + 1) * j - 1$ specifying the component weight and parameter estimates, given as $(w_{1}, . . . w_{j - 1}, θ 1_{1}, . . . θ 1_{j}, θ 2_{1}, . . . θ d_{j}) .$

values

numeric vector of function values gone through during optimization at iteration $j$ , the last entry being the value at the optimum.

convergence

indicates whether the solver has converged (0) or not (1 or 2) at iteration $j$ .

Details

Define $c o m p l e x i t y$ of a finite mixture $F$ as the smallest integer $p$ , such that its pdf/pmf $f$ can be written as $f (x) = w_{1} * g (x; θ_{1}) + \dots + w_{p} * g (x; θ_{p}) .$ The paramHankel procedure initially assumes that the mixture only contains one component, setting $j = 1$ , then sequentially tests $p = j$ versus $p = j + 1$ for $j = 1, 2, \dots$ . It determines the MLE for a $j$ -component mixture, generates B parametric bootstrap samples of size $n$ from the distribution the MLE corresponds to and calculates B determinants of the corresponding $(j + 1) x (j + 1)$ Hankel matrices of the first $2 j$ raw moments of the mixing distribution (for details see nonparamHankel). The null hypothesis $H_{0} : p = j$ is rejected and $j$ increased by 1 if the determinant value based on the original data lies outside of the interval $[q l, q u]$ , a range specified by ql and qu, empirical quantiles of the bootstrapped determinants. Otherwise, $j$ is returned as the complexity estimate. paramHankel.scaled functions similarly to paramHankel with the exception that the bootstrapped determinants are scaled by the empirical standard deviation of the bootstrap sample. To scale the original determinant, B nonparametric bootstrap samples of size $n$ are generated from the data, the corresponding determinants are calculated and their empirical standard deviation is used. The MLEs are calculated via the MLE.function attribute of the datMix object obj for $j = 1$ , if it is supplied. For all other $j$ (and also for $j = 1$ in case MLE.function = NULL) the solver solnp is used to calculate the minimum of the negative log-likelihood. The initial values supplied to the solver are calculated as follows: the data is clustered into $j$ groups by the function clara and the data corresponding to each group is supplied to MLE.function (if supplied to the datMix object, otherwise numerical optimization is used). The size of the groups is taken as initial component weights and the MLE's are taken as initial component parameter estimates.

Examples

Run this code

# NOT RUN {
## create 'Mix' object
poisMix <- Mix("pois", discrete = TRUE, w = c(0.45, 0.45, 0.1), lambda = c(1, 5, 10))

## create random data based on 'Mix' object (gives back 'rMix' object)
set.seed(1)
poisRMix <- rMix(1000, obj = poisMix)

## create 'datMix' object for estimation
# generate list of parameter bounds
poisList <- list("lambda" = c(0, Inf))

# generate MLE function
MLE.pois <- function(dat){
  mean(dat)
}

# generate function needed for estimating the j^th moment of the
# mixing distribution via Hankel.method "explicit"

explicit.pois <- function(dat, j){
  res <- 1
  for (i in 0:(j-1)){
    res <- res*(dat-i)
  }
  return(mean(res))
}

# generating 'datMix' object
pois.dM <- RtoDat(poisRMix, theta.bound.list = poisList, MLE.function = MLE.pois,
                  Hankel.method = "explicit", Hankel.function = explicit.pois)


## complexity and parameter estimation
# }
# NOT RUN {
set.seed(1)
res <- paramHankel(pois.dM)
plot(res)
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab

Data Engineering and BI courses are free this week!