cpglm: Compound Poisson Generalized Linear Model

Description

This function fits generalized linear models with the Tweedie compound Poisson distribution.

Usage

cpglm(formula, link = "log", data, weights, offset, subset, 
  na.action, betastart = NULL,phistart = NULL, pstart = NULL, 
  contrasts = NULL, control = list(), method="MCEM", ...)
  
cpglm_em(X, Y, weights = NULL, offset = NULL, link.power = 0,
  betastart, phistart, pstart, intercept = TRUE, control = list())  
  
cpglm_profile(X, Y, weights = NULL, offset = NULL, 
  link.power = 0, intercept = TRUE, contrasts, control = list())

Arguments

formula

an object of class formula. See also in glm.

link

a specification for the model link function. This can be either a literal character string or a numeric number. If it is a character string, it must be one of "log", "identity", "sqrt" or "inverse". If it is numeric, it is the same as the link.power

data

an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model.

weights

an optional vector of weights. Should be NULL or a numeric vector. When it is numeric, it must be positive. Zero weights are not allowed in cpglm.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

na.action

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. Another possible value is NULL, no actio

betastart

starting values for the vector of mean parameters.

phistart

starting value for the dispersion parameter.

pstart

starting value for the index parameter. Must be between 1 and 2. When bound.p is specified, pstart must be between the bounds set there.

offset

this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be NULL or a numeric vector of length equal to the number of cases. One or more offset terms can be included in the for

contrasts

an optional list. See the contrasts.arg.

control

a list of parameters for controlling the fitting process. See 'Details' below.

method

a character string, either "MCEM" or "profile", indicating whether the Monte Carlo EM algorithm or the profiled log-likelihood method should be used. See 'Details' below.

X, Y

Used in cpglm_em and cpglm_profile: X is a design matrix and Y is a vector of observations.

intercept

logical. Should an intercept be included in the null model?

link.power

see tweedie.

...

not used now.

Value

cpglm returns an object of class cpglm. See cpglm-class for details of the return values as well as various method available for this class.

Details

The Tweedie compound Poisson distribution is a subclass of the exponential dispersion family with a power variance function, where the value of the power index lies in the interval (1,2). The distribution is composed of a probability mass at the origin accompanied by a skewed continuous distribution on the positive values. Despite its ability to handle continuous zero-inflated data, the density of the compound Poisson distribution is not analytically tractable. As a result, full maximum likelihood inference remains challenging when the power index is unknown. The cpglm function implements two methods to fit generalized linear models with the compound Poisson distribution, namely, the Monte Carlo EM [MCEM] algorithm and the profile likelihood approach, both of which yield maximum likelihood estimates for all parameters in the model, in particular, for the dispersion and the index parameter. The MCEM approach is in light of the fact that the compound Poisson distribution involves an unobserved Poisson process. The observed data is thus augmented by the latent Poisson variable so that one could work on the joint likelihood of the complete data and invoke the MCEM algorithm to locate the maximum likelihood estimations. Specifically, a Monte Carlo method is implemented in the E-step, where samples of the unobserved Poisson process is simulated from its posterior via rejection sampling (using a zero-truncated Poisson as a proposal), and the required expectation in the E-step is approximated by a Monte Carlo estimate. The M-step involves conditional component-wise maximization: scoring algorithm is used for the mean parameters, constrained optimization for the index parameter and close-form solution exists for the scale parameter. The E-step and the M-step are iterated until the following stopping rule is reached: $$\max_i \frac{|\theta_i^{(r+1)}-\theta_i^{(r)}|}{\theta_i^{(r)}+\epsilon_1}<\epsilon_2,$$ where="" $\theta_i$="" is="" the="" $i_{th}$="" element="" of="" vector="" parameters,="" $r$="" represents="" $r_{th}$="" iteration,="" and="" $\epsilon_1$="" $\epsilon_2$="" are="" predefined="" constants.="" see="" description="" for="" control parameters below.

Alternatively, one could use the profiled likelihood approach, where the index parameter is estimated based on the profiled likelihood first and then a traditional GLM is fitted. This is because the mean parameters and the index parameter ($p$) are orthogonal, meaning that the mean parameters vary slowly as $p$ changes. Specifically, to get a profiled maximum likelihood estimate of $p$, one needs to specify a grid of possible values of $p$, estimates the corresponding mean and dispersion parameters, and computes the associated likelihood. Then these likelihoods are compared to locate the value of $p$ corresponding to the maximum likelihood. To compute the likelihood, one has to resort to numerical methods provided in the tweedie package that approximate the density of the compound Poisson distribution. Indeed, the function tweedie.profile in that package makes available the profiled likelihood approach. What is included here is simply a wrapper that provides automatic estimation of $p$ to any pre-specified decimal places (see decimal in the following). Only MLE estimate for $\phi$ is included here, while tweedie.profile provides several other possibilities.

These two methods are specified in the method argument in cpglm. The MCEM algorithm is invoked by specifying method="MCEM" in cplm, while the profile likelihood approach is used when method="profile". The two methods can also be invoked by calling the function cpglm_em and cpglm_profile, however, the users are encouraged to use cpglm, which uses cpglm_em and cpglm_profile internally.

It is to be noted that the MCEM algorithm enables one to work on the more tractable joint likelihood, thus simplifying the estimation problem down to posterior simulation and block/univariate optimizations. However, it also has some overheads:

The simulation of the latent variable could be computationally demanding when a large number of samples is needed;
The inherent Monte Carlo error must be accounted for;
The EM algorithm could have slow convergence rate.

The computational demand can be largely reduced by supplementing the algorithm with importance sampling after it converges to the neighborhood of the MLEs. The importance sampling retains the latest simulated samples for use in all later iterations, thus eliminating the need for simulating new samples.

To account for the Monte Carlo error, many authors have suggests increasing the sample size as the algorithm proceeds. Booth and Hobert (1999) derives a Normal approximation for the Monte Carlo error, which is further used to construct a (1-alpha) confidence ellipsoid at each iteration. If the old parameters lie in this ellipsoid, then the sample size is increase as $m + m/k$, where $m$ is the current sample size, and $k$ is a predefined constant. This method is implemented when fixed.size in the control argument is FALSE. See the description for the control parameters below.

The control argument is a list that can supply various controlling elements used in the fitting process. Most of these elements are used in the MCEM algorithm, and those used in the profile likelihood approach are explicitly identified in the following:

[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

References

Booth, J. G., and Hobert, J. P. (1999). Maximizing Generalized Linear Mixed Model Likelihoods with an Automated Monte Carlo EM Algorithm. Journal of the Royal Statistical Society Series B, 61, 265-285.

Dunn, P.K. and Smyth, G.K. (2005). Series evaluation of Tweedie exponential dispersion models densities. Statistics and Computing, 15, 267-280.

McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association, 92, 162-170.

Zhang Y. (2011). A Latent Variable Approach for Statistical Inference in Tweedie Compound Poisson Linear Models, preprint, http://actuaryzhang.com/publication/bayesianTweedie.pdf.

Examples

Run this code

# MCEM fit
set.seed(10)
fit1 <- cpglm(RLD~ factor(Zone)*factor(Stock),
  data=fineroot,
  control=list(init.size=5,sample.iter=50,
              max.size=2000,fixed.size=FALSE),
  pstart=1.6)

# show the iteration history of p
plot(fit1$theta.all[,ncol(fit1$theta.all)],
     type="o", cex=0.5)
     
# residual and qq plot
parold <- par(mfrow=c(2,2), mar=c(5,5,2,1))
# 1. regular plot
r1 <- resid(fit1)/sqrt(fit1$phi)
plot(r1~fitted(fit1), cex=0.5)
qqnorm(r1, cex=0.5)
# 2. quantile residual plot to avoid overlappling
u <- ptweedie(fit1$y, fit1$p, fitted(fit1), fit1$phi)
u[fit1$y == 0] <- runif(sum(fit1$y == 0), 0, u[fit1$y == 0])
r2 <- qnorm(u)
plot(r2~fitted(fit1), cex=0.5)
qqnorm(r2, cex=0.5)

par(parold)

# profile likelihood         
fit2 <- cpglm(RLD~ factor(Zone)*factor(Stock),
  data=fineroot,method="profile", 
  control=list(decimal=3))      

# compare the two
summary(fit1)
summary(fit2)

Run the code above in your browser using DataLab