cpglm: Compound Poisson Generalized Linear Model

Description

This function fits compound Poisson generalized linear models.

Usage

cpglm(formula, link = "log", data, weights, offset, subset, 
  na.action, inits = NULL, contrasts = NULL, control = list(),
  method="profile", ...)
  
cpglm_em(X, Y, weights = NULL, offset = NULL, link.power = 0,
  inits=NULL, intercept = TRUE, control = list())  
  
cpglm_profile(X, Y, weights = NULL, offset = NULL, 
  link.power = 0, intercept = TRUE, contrasts, control = list())

Arguments

formula

an object of class formula. See also in glm.

link

a specification for the model link function. This can be either a literal character string or a numeric number. If it is a character string, it must be one of "log", "identity", "sqrt" or "inverse". If it is numeric, it is the same as the link.power

data

an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model.

weights

an optional vector of weights. Should be NULL or a numeric vector. When it is numeric, it must be positive. Zero weights are not allowed in cpglm.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

na.action

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. Another possible value is NULL, no actio

inits

a named list that provides starting values for the vector of mean parameters ('beta'), the dispersion (scale) parameter ('phi') and the index parameter ('p').

offset

this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be NULL or a numeric vector of length equal to the number of cases. One or more offset terms can be included in the for

contrasts

an optional list. See the contrasts.arg.

control

a list of parameters for controling the fitting process. See 'Details' below.

method

a character string, either "profile" or "MCEM", indicating whether the profiled log-likelihood method or the Monte Carlo EM algorithm should be used. See 'Details' below. The default is "profile".

X, Y

used in cpglm_em and cpglm_profile: X is a design matrix and Y is a vector of observations.

intercept

logical. Should an intercept be included in the null model?

link.power

used in cpglm_em and cpglm_profile and must be numeric. See tweedie for details.

...

not used now.

Value

cpglm returns an object of class "cpglm". See cpglm-class for details of the return values as well as various methods available for this class.

Details

The Tweedie compound Poisson distribution is a subclass of the exponential dispersion family with a power variance function, where the value of the power index lies in the interval (1,2). The distribution is composed of a probability mass at the origin accompanied by a skewed continuous distribution on the positive values. Despite its ability to handle continuous zero-inflated data, the density of the compound Poisson distribution is not analytically tractable. As a result, full maximum likelihood inference remains challenging when the power index is unknown. The cpglm function implements two methods to fit generalized linear models with the compound Poisson distribution - the profile likelihood approach and the Monte Carlo EM [MCEM] algorithm. Both methods yield maximum likelihood estimates for all parameters in the model, in particular, for the dispersion and the index parameter. In the profiled likelihood approach, the index and the dispersion parameters are estimated based on the profiled likelihood first and then the mean parameters are estimated using a GLM with the above-estimated index parameter. To compute the profiled likelihood, one has to resort to numerical methods provided in the tweedie package for approximating the density of the compound Poisson distribution. Indeed, the function tweedie.profile in that package makes available the profiled likelihood approach. The function here differs from tweedie.profile in that the user does not need to specify the grid of possible values the index parameter can take. Instead, the optimization of the profiled likelihood is automated. It is also to be noted that only MLE estimate for $\phi$ is included here, while tweedie.profile provides several other possibilities.

The second approach uses the MCEM algorithm, in light of the fact that the compound Poisson distribution involves an unobserved Poisson process. The observed data is thus augmented by the latent Poisson variable so that one could work on the joint likelihood of the complete data and invoke the MCEM algorithm to locate the maximum likelihood estimations. Different from the profiled likelihood approach, this method does not require direct numerical approximation to the density, and can be generalized to mixed-effect models and other compounded distribution. However, it could be slow due to the Monte Carlo simulations and the slow convergence of the EM algorithm. So users should use the profiled likelihood approach as a first choice and consider the MCEM algorithm when numerical approximations break down (see tweedie.profile).

For the MCEM algorithm, samples of the unobserved Poisson process is simulated and the required expectation in the E-step is approximated by a Monte Carlo estimate. The M-step involves conditional component-wise maximization: scoring algorithm for the mean parameters, constrained optimization for the index parameter and close-form computation for the scale parameter. The E-step and the M-step are iterated until the following stopping rule is reached: $$\max_i \frac{|\theta_i^{(r+1)}-\theta_i^{(r)}|}{\theta_i^{(r)}+\epsilon_1}<\epsilon_2,$$ where="" $\theta_i$="" is="" the="" $i_{th}$="" element="" of="" vector="" parameters,="" $r$="" represents="" $r_{th}$="" iteration,="" and="" $\epsilon_1$="" $\epsilon_2$="" are="" predefined="" constants.="" to="" account="" for="" monte="" carlo="" error,="" we="" also="" implement="" a="" simple="" approach="" that="" allows="" probability="" increasing="" sample="" size="" at="" each="" iteration="" be="" proportional="" relative="" change="" parameter="" values.="" so="" if="" increased,="" then="" new="" set="" $m="" +="" m="" k$,="" $m$="" current="" size,="" $k$="" constant.="" this="" method="" implemented="" when="" fixed.size in the control argument is FALSE. See the description for the control parameters below.

The control argument is a list that can supply various controling elements used in the fitting process. Most of these elements are used in the MCEM algorithm, and those used in the profiled likelihood approach are explicitly identified in the following:

[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

References

Booth, J. G., and Hobert, J. P. (1999). Maximizing Generalized Linear Mixed Model Likelihoods with an Automated Monte Carlo EM Algorithm. Journal of the Royal Statistical Society Series B, 61, 265-285.

Dunn, P.K. and Smyth, G.K. (2005). Series evaluation of Tweedie exponential dispersion models densities. Statistics and Computing, 15, 267-280.

McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association, 92, 162-170.

Examples

Run this code

# profile likelihood         
fit1 <- cpglm(RLD~ factor(Zone)*factor(Stock),
  data=fineroot)
     
# residual and qq plot
parold <- par(mfrow=c(2,2), mar=c(5,5,2,1))
# 1. regular plot
r1 <- resid(fit1)/sqrt(fit1$phi)
plot(r1~fitted(fit1), cex=0.5)
qqnorm(r1, cex=0.5)
# 2. quantile residual plot to avoid overlappling
u <- ptweedie(fit1$y, fit1$p, fitted(fit1), fit1$phi)
u[fit1$y == 0] <- runif(sum(fit1$y == 0), 0, u[fit1$y == 0])
r2 <- qnorm(u)
plot(r2~fitted(fit1), cex=0.5)
qqnorm(r2, cex=0.5)

par(parold)

# MCEM fit
set.seed(12)
fit2 <- cpglm(RLD~ factor(Zone)*factor(Stock),
  data=fineroot, method="MCEM",
  control=list(init.size=5,sample.iter=50,
              max.size=200,fixed.size=FALSE)) 

# show the iteration history of p
plot(fit2$theta.all[,ncol(fit2$theta.all)],
     type="o", cex=0.5)

# compare the two
summary(fit1)
summary(fit2)

# provide initial values
inits <- list(beta=rnorm(length(coef(fit1))), phi=0.3, p=1.6)
fit3 <- cpglm(RLD~ factor(Zone)*factor(Stock),
  data=fineroot, method="MCEM",
  control=list(init.size=5,sample.iter=50,
              max.size=200,fixed.size=FALSE), 
  inits = inits)

Run the code above in your browser using DataLab