Usage

cpglm(formula, link = "log", data, weights, offset, subset,
  na.action, betastart = NULL, phistart = NULL, pstart = NULL,
  contrasts = NULL, control = list(), method = "MCEM", ...)

cpglm_em(X, Y, weights = NULL, offset = NULL, link.power = 0,
  betastart, phistart, pstart, intercept = TRUE, control = list())

cpglm_profile(X, Y, weights = NULL, offset = NULL,
  link.power = 0, intercept = TRUE, contrasts, control = list())
Arguments

formula: an object of class "formula": a symbolic description of the model to be fitted. See also glm.

link: a specification of the model link function; this can be a character string or a numeric value supplied as link.power.

data: an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model.

weights: NULL or a numeric vector. When it is numeric, it must be positive. Zero weights are not allowed in cpglm.

subset: an optional vector specifying a subset of observations to be used in the fitting process.

na.action: a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. Another possible value is NULL, no action.

betastart: starting values for the vector of mean parameters.

phistart: starting value for the dispersion parameter phi.

pstart: starting value for the index parameter p. If bound.p is specified, pstart must be between the bounds set there.

offset: NULL or a numeric vector of length equal to the number of cases. One or more offset terms can be included in the formula.

contrasts: an optional list. See the contrasts.arg of model.matrix.default.

method: "MCEM" or "profile", indicating whether the Monte Carlo EM algorithm or the profiled log-likelihood method should be used. See 'Details' below.

X, Y: used in cpglm_em and cpglm_profile: X is a design matrix and Y is a vector of observations.

intercept: logical; should an intercept be included in the null model?

link.power: the index of the power link function; see tweedie.

control: a list of control parameters. See 'Details' below.

Value

cpglm returns an object of class cpglm. See cpglm-class for details of the return values as well as the various methods available for this class.

Details

The cpglm function implements two methods to fit generalized linear models with the compound Poisson distribution: the Monte Carlo EM (MCEM) algorithm and the profile likelihood approach. Both yield maximum likelihood estimates for all parameters in the model, in particular for the dispersion and the index parameter.
The MCEM approach exploits the fact that the compound Poisson distribution involves an unobserved Poisson process. The observed data are augmented by the latent Poisson variable so that one can work with the joint likelihood of the complete data and invoke the MCEM algorithm to locate the maximum likelihood estimates. Specifically, a Monte Carlo method is implemented in the E-step, where samples of the unobserved Poisson process are simulated from its posterior via rejection sampling (using a zero-truncated Poisson as a proposal), and the required expectation in the E-step is approximated by a Monte Carlo estimate. The M-step involves conditional component-wise maximization: a scoring algorithm is used for the mean parameters, constrained optimization for the index parameter, and a closed-form solution exists for the scale parameter. The E-step and the M-step are iterated until the following stopping rule is reached:
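The zero-truncated Poisson proposal used in the rejection sampling above can be drawn by inversion. The helper below is an illustrative sketch (the function name rztpois is made up here, not part of the package's internals):

```r
# Illustrative sketch, not cplm's internal code: draw from a zero-truncated
# Poisson by inverting the CDF on the truncated support.
rztpois <- function(n, lambda) {
  # P(X = 0) = exp(-lambda); map U ~ Unif(0, 1) onto (P(X = 0), 1] so that
  # qpois() never returns 0
  u <- runif(n)
  qpois(exp(-lambda) + u * (1 - exp(-lambda)), lambda)
}
min(rztpois(1000, 0.5))  # all draws are >= 1
```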
$$\max_i \frac{|\theta_i^{(r+1)}-\theta_i^{(r)}|}{\theta_i^{(r)}+\epsilon_1}<\epsilon_2,$$

where $\theta_i$ is the $i$th element of the vector of parameters, $r$ represents the $r$th iteration, and $\epsilon_1$ and $\epsilon_2$ are predefined constants. See the description of the control parameters below.

Alternatively, one can use the profile likelihood approach, where the index parameter is estimated from the profiled likelihood first and a traditional GLM is then fitted. This works because the mean parameters and the index parameter ($p$) are orthogonal, meaning that the mean parameters vary slowly as $p$ changes. Specifically, to obtain a profiled maximum likelihood estimate of $p$, one specifies a grid of candidate values of $p$, estimates the corresponding mean and dispersion parameters for each, and computes the associated likelihood. These likelihoods are then compared to locate the value of $p$ with the maximum likelihood. To compute the likelihood, one has to resort to the numerical methods provided in the tweedie package that approximate the density of the compound Poisson distribution. Indeed, the function tweedie.profile in that package makes the profile likelihood approach available. What is included here is simply a wrapper that provides automatic estimation of $p$ to any pre-specified number of decimal places (see decimal in the following). Only the MLE of $\phi$ is computed here, while tweedie.profile provides several other possibilities.
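The grid search just described can be sketched with the statmod and tweedie packages. This is an illustrative outline of the idea, not the code cpglm actually runs, and it assumes both packages and the fineroot data are available:

```r
# Illustrative sketch of the profile likelihood idea, not cpglm's internals.
library(statmod)   # tweedie() GLM family
library(tweedie)   # dtweedie() density approximation
data(fineroot, package = "cplm")

p.grid <- seq(1.2, 1.8, by = 0.1)
ll <- sapply(p.grid, function(p) {
  # fit the mean parameters for a fixed index p (log link)
  fit <- glm(RLD ~ factor(Zone) * factor(Stock), data = fineroot,
             family = tweedie(var.power = p, link.power = 0))
  # moment estimate of the dispersion phi given p
  phi <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
  # approximate log-likelihood via the tweedie density
  sum(log(dtweedie(fineroot$RLD, power = p, mu = fitted(fit), phi = phi)))
})
p.grid[which.max(ll)]  # grid value of p with the largest profiled likelihood
```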
These two methods are selected via the method argument of cpglm. The MCEM algorithm is invoked by specifying method = "MCEM", while the profile likelihood approach is used when method = "profile". The two methods can also be invoked by calling the functions cpglm_em and cpglm_profile directly; however, users are encouraged to use cpglm, which calls cpglm_em and cpglm_profile internally.
Note that the MCEM algorithm enables one to work with the more tractable joint likelihood, reducing the estimation problem to posterior simulation and block/univariate optimizations. However, it also has some overheads:
- The simulation of the latent variable can be computationally demanding when a large number of samples is needed;
- The inherent Monte Carlo error must be accounted for;
- The EM algorithm can have a slow convergence rate.
The computational demand can be largely reduced by supplementing the algorithm with importance sampling after it converges to the neighborhood of the MLEs. The importance sampling retains the latest simulated samples for use in all later iterations, thus eliminating the need to simulate new samples. To account for the Monte Carlo error, many authors have suggested increasing the sample size as the algorithm proceeds. Booth and Hobert (1999) derive a Normal approximation for the Monte Carlo error, which is used to construct a $(1-\alpha)$ confidence ellipsoid at each iteration. If the old parameter values lie in this ellipsoid, the sample size is increased to $m + m/k$, where $m$ is the current sample size and $k$ is a predefined constant. This method is implemented when fixed.size in the control argument is FALSE. See the description of the control parameters below.
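The two bookkeeping rules just described can be written down in a few lines. These are hedged sketches with made-up default constants, not the package's internal implementation:

```r
# Stopping rule: maximum relative parameter change below epsilon2
# (epsilon1 guards the denominator against values near zero; abs() is
# used here for robustness, a slight variation on the displayed formula)
converged <- function(theta.new, theta.old,
                      epsilon1 = 1e-3, epsilon2 = 1e-4) {
  max(abs(theta.new - theta.old) / (abs(theta.old) + epsilon1)) < epsilon2
}

# Booth-Hobert sample-size update: grow m to m + m/k when the previous
# parameter value falls inside the Monte Carlo confidence ellipsoid
next.size <- function(m, k, inside.ellipsoid) {
  if (inside.ellipsoid) ceiling(m + m / k) else m
}
next.size(100, 2, TRUE)  # 150
```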
The control argument is a list that can supply various elements used in the fitting process. Most of these elements are used in the MCEM algorithm; those used in the profile likelihood approach are explicitly identified. Control elements referenced elsewhere on this page include bound.p (the allowed bounds for the index parameter $p$); init.size, sample.iter, max.size, fixed.size and $k$ (governing the Monte Carlo sample size in the MCEM algorithm); the stopping-rule constants $\epsilon_1$ and $\epsilon_2$; and decimal (the number of decimal places of $p$ in the profile likelihood approach).
References

Booth, J.G. and Hobert, J.P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society, Series B, 61, 265-285.

Dunn, P.K. and Smyth, G.K. (2005). Series evaluation of Tweedie exponential dispersion model densities. Statistics and Computing, 15, 267-280.

McCulloch, C.E. (1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association, 92, 162-170.

Zhang, Y. (2011). A Latent Variable Approach for Statistical Inference in Tweedie Compound Poisson Linear Models, preprint.
See Also

See cpglm-class, glm, tweedie, and tweedie.profile for related information.

Examples

# MCEM fit
set.seed(10)
fit1 <- cpglm(RLD ~ factor(Zone) * factor(Stock),
              data = fineroot,
              control = list(init.size = 5, sample.iter = 50,
                             max.size = 2000, fixed.size = FALSE),
              pstart = 1.6)
# show the iteration history of p
plot(fit1$theta.all[,ncol(fit1$theta.all)],
type="o", cex=0.5)
# residual and qq plot
parold <- par(mfrow=c(2,2), mar=c(5,5,2,1))
# 1. regular plot
r1 <- resid(fit1)/sqrt(fit1$phi)
plot(r1~fitted(fit1), cex=0.5)
qqnorm(r1, cex=0.5)
# 2. quantile residual plot to avoid overlapping
u <- ptweedie(fit1$y, fit1$p, fitted(fit1), fit1$phi)
u[fit1$y == 0] <- runif(sum(fit1$y == 0), 0, u[fit1$y == 0])
r2 <- qnorm(u)
plot(r2~fitted(fit1), cex=0.5)
qqnorm(r2, cex=0.5)
par(parold)
# profile likelihood
fit2 <- cpglm(RLD ~ factor(Zone) * factor(Stock),
              data = fineroot, method = "profile",
              control = list(decimal = 3))
# compare the two
summary(fit1)
summary(fit2)