Fits a generally--altered, --inflated --truncated and deflated Poisson regression by MLE. The GAITD combo model having 7 types of special values is implemented. This allows mixtures of Poissons on nested and/or partitioned support as well as a multinomial logit model for (nonparametric) altered, inflated and deflated values. Truncation may include the upper tail.

```
gaitdpoisson(a.mix = NULL, i.mix = NULL, d.mix = NULL,
a.mlm = NULL, i.mlm = NULL, d.mlm = NULL,
truncate = NULL, max.support = Inf,
zero = c("pobs", "pstr", "pdip"),
eq.ap = TRUE, eq.ip = TRUE, eq.dp = TRUE,
parallel.a = FALSE, parallel.i = FALSE, parallel.d = FALSE,
llambda.p = "loglink", llambda.a = llambda.p,
llambda.i = llambda.p, llambda.d = llambda.p,
type.fitted = c("mean", "lambdas", "pobs.mlm", "pstr.mlm",
"pdip.mlm", "pobs.mix", "pstr.mix", "pdip.mix",
"Pobs.mix", "Pstr.mix", "Pdip.mix", "nonspecial",
"Numer", "Denom.p", "sum.mlm.i", "sum.mix.i",
"sum.mlm.d", "sum.mix.d", "ptrunc.p",
"cdf.max.s"), gpstr.mix = ppoints(7) / 3,
gpstr.mlm = ppoints(7) / (3 + length(i.mlm)),
imethod = 1, mux.init = c(0.75, 0.5, 0.75),
ilambda.p = NULL, ilambda.a = ilambda.p,
ilambda.i = ilambda.p, ilambda.d = ilambda.p,
ipobs.mix = NULL, ipstr.mix = NULL, ipdip.mix = NULL,
ipobs.mlm = NULL, ipstr.mlm = NULL, ipdip.mlm = NULL,
byrow.aid = FALSE, ishrinkage = 0.95, probs.y = 0.35)
```

truncate, max.support

Vector of truncated values, i.e., nonnegative integers.
For the first seven arguments (for the *special* values)
a `NULL`

stands for an empty set, and
the seven sets must be mutually disjoint.
Argument `max.support`

enables RHS-truncation,
i.e., something equivalent to
`truncate = (U+1):Inf`

for some upper support point `U`

specified by `max.support`

.

a.mix, i.mix, d.mix

Vector of altered and inflated values corresponding
to finite mixture models.
These are described as *parametric* or structured.

The parameter `lambda.p`

is always estimated.
If `length(a.mix)`

is 1 or more then the parameter
`pobs.mix`

is estimated.
If `length(i.mix)`

is 1 or more then the parameter
`pstr.mix`

is estimated.
If `length(d.mix)`

is 1 or more then the parameter
`pdip.mix`

is estimated.

If `length(a.mix)`

is 2 or more then the parameter
`lambda.a`

is estimated.
If `length(i.mix)`

is 2 or more then the parameter
`lambda.i`

is estimated.
If `length(d.mix)`

is 2 or more then the parameter
`lambda.d`

is estimated.

If `length(a.mix) == 1`

, `length(i.mix) == 1`

or
`length(d.mix) == 1`

then `lambda.a`

,
`lambda.i`

and `lambda.d`

are unidentifiable and
therefore ignored. In such cases
it would be equivalent to moving `a.mix`

into
`a.mlm`

, etc.

Due to its great flexibility, it is easy to misuse this function
and ideally the values of the above arguments should be well
justified by the application on hand.
Adding inappropriate or
unnecessary values to these arguments willy-nilly
is a recipe for disaster, especially for
`i.mix`

and `d.mix`

.
Using `a.mlm`

effectively removes a subset of the data
from the main analysis, therefore may result in a substantial
loss of efficiency.
For seeped values, `a.mix`

, `a.mlm`

,
`d.mix`

and `d.mlm`

can be used only.
Heaped values can be handled by `i.mlm`

and `i.mix`

,
as well as `a.mix`

and `a.mlm`

.
Because of the NBP reason below, it sometimes may be necessary
to specify deflated values to altered values.

a.mlm, i.mlm, d.mlm

Vector of altered, inflated and deflated values corresponding
to the multinomial logit model (MLM) probabilities of
observing those values---see `multinomial`

.
These are described as *nonparametric* or unstructured.

llambda.p, llambda.a, llambda.i, llambda.d

Link functions for the parent,
altered, inflated and deflated distributions respectively.
See `Links`

for more choices and information.

eq.ap, eq.ip, eq.dp

Single logical each.
Constrain the rate parameters to be equal?
See `CommonVGAMffArguments`

for information.
Having all three arguments `TRUE`

gives greater stability in
the estimation because of fewer parameters and therefore
fewer initial values needed,
however if so then one should try relax some of the arguments later.

For the GIT--Pois--Pois submodel,
after plotting the responses,
if the distribution of the spikes
above the nominal probabilities
has roughly the same shape
as the ordinary values then setting
`eq.ip = TRUE`

would be a good idea
so that `lambda.i == lambda.p`

.
And if `i.mix`

is of length 2 or a bit more, then
`TRUE`

should definitely be entertained.
Likewise, for heaped or seeped data, setting
`eq.ap = TRUE`

(so that `lambda.p == lambda.p`

)
would be a good idea for the
GAT--Pois--Pois if the shape of the altered probabilities
is roughly the same as the parent distribution.

parallel.a, parallel.i, parallel.d

Single logical each.
Constrain the MLM probabilities to be equal?
If so then this applies to all
`length(a.mlm)`

`pobs.mlm`

probabilities
or all
`length(i.mlm)`

`pstr.mlm`

probabilities
or all
`length(d.mlm)`

`pdip.mlm`

probabilities.
See `CommonVGAMffArguments`

for information.
The default means that the probabilities are generally
unconstrained and unstructured and will follow the shape
of the data.
See `constraints`

.

type.fitted

See `CommonVGAMffArguments`

and below for information.
The first value is the default, and this is usually the
unconditional mean.
Choosing an irrelevant value may result in
an `NA`

being returned and a warning, e.g.,
`"pstr.mlm"`

for a GAT--MLM model.

The choice `"lambdas"`

returns a matrix with at least
one column and up to three others,
corresponding to all those estimated.
In order, their `colnames`

are
`"lambda.p"`

, `"lambda.a"`

, `"lambda.i"`

and `"lambda.d"`

.
For other distributions such as `gaitdlog`

`type.fitted = "shapes"`

is permitted and the
`colnames`

are
`"shape.p"`

, `"shape.a"`

, `"shape.i"`

and
`"shape.d"`

, etc.

Option `"Pobs.mix"`

provides more detail about
`"pobs.mix"`

by returning a matrix whose columns
correspond to each altered value; the row sums
(`rowSums`

)
of this matrix is `"pobs.mix"`

.
Likewise `"Pstr.mix"`

about `"pstr.mix"`

and `"Pdip.mix"`

about `"pdip.mix"`

.

The choice `"cdf.max.s"`

is the CDF evaluated
at `max.support`

using the parent distribution,
e.g., `ppois(max.support, lambda.p)`

for
`gaitdpoisson`

.
The value should be 1 if `max.support = Inf`

(the default).
The choice `"nonspecial"`

is the probability of a
nonspecial value.
The choices `"Denom.p"`

and `"Numer"`

are quantities
found in the GAITD combo PMF and are for convenience only.

The choice `type.fitted = "pobs.mlm"`

returns
a matrix whose columns are
the altered probabilities (Greek symbol \(\omega_s\)).
The choice `"pstr.mlm"`

returns
a matrix whose columns are
the inflated probabilities (Greek symbol \(\phi_s\)).
The choice `"pdip.mlm"`

returns
a matrix whose columns are
the deflated probabilities (Greek symbol \(\psi_s\)).

The choice `"ptrunc.p"`

returns the probability of having
a truncated value with respect to the parent distribution.
It includes any truncated values in the upper tail
beyond `max.support`

.
The probability of a value less than or equal to
`max.support`

with respect to the parent distribution
is `"cdf.max.s"`

.

The choice `"sum.mlm.i"`

adds two terms.
This gives the probability of an inflated value,
and the formula can be loosely written down
as something like
`"pstr.mlm" + "Numer" * dpois(i.mlm, lambda.p) / "Denom.p"`

.
The other three `"sum.m*"`

arguments are similar.

gpstr.mix, gpstr.mlm

See `CommonVGAMffArguments`

for information.
Gridsearch values for the two parameters.
If failure occurs try a finer grid, especially closer to 0,
and/or experiment with `mux.init`

.

imethod, ipobs.mix, ipstr.mix, ipdip.mix

See `CommonVGAMffArguments`

for information.
Good initial values are difficult to compute because of
the great flexibility of GAITD regression, therefore
it is often necessary to use these arguments.
A careful examination of a `spikeplot`

of the data should lead to good choices.

ipobs.mlm, ipstr.mlm, ipdip.mlm

See `CommonVGAMffArguments`

for information.

mux.init

Numeric, of length 3. General downward multiplier for initial values for the sample proportions (MLEs actually). This is under development and more details are forthcoming. In general, 1 means unchanged and values should lie in (0, 1], and values about 0.5 are recommended. The elements apply in order to altered, inflated and deflated (no distinction between mix and MLM).

ilambda.p, ilambda.a, ilambda.i, ilambda.d

Initial values for the rate parameters;
see `CommonVGAMffArguments`

for information.

probs.y, ishrinkage

See `CommonVGAMffArguments`

for information.

byrow.aid

Details are at `Gaitdpois`

.

zero

See `CommonVGAMffArguments`

for information.
By default, all the MLM probabilities are
modelled as simple as possible (intercept-only) to
help avoid numerical problems, especially when there
are many covariates.
The Poisson means are modelled by the covariates, and
the default `zero`

vector is pruned of any irrelevant values.
To model all the MLM probabilities with covariates
set `zero = NULL`

, however, the number of regression
coefficients could be excessive.

For the MLM probabilities,
to model `pobs.mix`

only with covariates
set `zero = c('pstr', 'pobs.mlm', 'pdip')`

.
Likewise,
to model `pstr.mix`

only with covariates
set `zero = c('pobs', 'pstr.mlm', 'pdip')`

.

It is noted that, amongst other things,
`zipoisson`

and `zipoissonff`

differ
with respect to `zero`

, and ditto for
`zapoisson`

and `zapoissonff`

.

An object of class `"vglmff"`

(see `vglmff-class`

).
The object is used by modelling functions such as `vglm`

,
`rrvglm`

and `vgam`

.

The `fitted.values`

slot of the fitted object,
which should be extracted by the generic function `fitted`

,
returns the mean \(\mu\) by default.
See the information above on `type.fitted`

.

Amateurs tend to be overzealous fitting
zero-inflated models when the fitted mean is low---the
warning of `ziP`

should be heeded.
For GAITD regression the warning applies
more strongly and generally;
here to *all*
`i.mix`

, `i.mlm`

,
`d.mix`

and `d.mlm`

values,
not just 0.
Even one misspecified special value usually will cause
convergence problems.

Default values for this and similar family functions
may change in the future, e.g., `eq.ap`

and `eq.ip`

.
Important internal changes might occur too, such as the
ordering of the linear/additive predictors and
the quantities returned as the fitted values.

Using `i.mlm`

requires more caution than `a.mlm`

because
gross inflation is ideally needed for it to work safely.
Ditto for `i.mix`

versus `a.mix`

.
Data exhibiting deflation or little to no inflation will produce
numerical problems,
hence set `trace = TRUE`

to monitor convergence.
More than c.10 IRLS iterations should raise suspicion.

Ranking the four operators by difficulty, the easiest is
truncation followed by alteration, then inflation and
the most difficult is deflation.
The latter needs good initial values and the current
default will probably not work on some data sets.
Studying the spikeplot is time very well spent.
In general it is very easy to specify an *overfitting*
model so it is a good idea to split the data into
training and test sets.

This function is quite memory-hungry with respect to
`length(c(a.mix, i.mix, d.mix, a.mlm, i.mlm, d.mlm))`

.
On consuming something different,
because all values of the NBP vector need to be positive
it pays to be economical with respect to `d.mlm`

especially so that
one does not consume up probabilities unnecessarily
so to speak.

It is often a good idea to set `eq.ip = TRUE`

,
especially when `length(i.mix)`

is not much more than
2 or the values
of `i.mix`

are not spread over the range of the response.
This way the estimation can borrow strength from both the
inflated and non-inflated values.
If the `i.mix`

values form a single small
cluster then this can easily create estimation difficulties---the
idea is somewhat similar to multicollinearity.
The same holds for `d.mix`

.

The full
GAITD--Pois--Pois--MLM--Pois-MLM--Pois-MLM combo model
may be fitted with this family function.
There are seven types of special values and all arguments for these
may be used in a single model.
Here, the MLM represents the nonparametric while the Pois
refers to the Poisson mixtures.
The defaults for this function correspond to an ordinary Poisson
regression so that `poissonff`

is called instead.
A MLM with only one probability to model is equivalent to
logistic regression
(`binomialff`

and `logitlink`

).

The order of the linear/additive predictors is best explained by
an example.
Suppose a combo model has
`length(a.mix) > 2`

and
`length(i.mix) > 2`

,
`length(d.mix) > 2`

,
`a.mlm = 3:5`

,
`i.mlm = 6:9`

and
`d.mlm = 10:12`

, say.
Then `loglink(lambda.p)`

is the first.
The second is `multilogitlink(pobs.mix)`

followed
by `loglink(lambda.a)`

because `a.mix`

is long enough.
The fourth is `multilogitlink(pstr.mix)`

followed
by `loglink(lambda.i)`

because `i.mix`

is long enough.
The sixth is `multilogitlink(pdip.mix)`

followed
by `loglink(lambda.d)`

because `d.mix`

is long enough.
Next are the probabilities for the `a.mlm`

values.
Then are the probabilities for the `i.mlm`

values.
Lastly are the probabilities for the `d.mlm`

values.
All the probabilities are estimated by one big MLM and effectively
the `"(Others)"`

column of left over probabilities is
associated with the nonspecial values.
These might be called the
*nonspecial baseline probabilities* (NBP).
The dimension of the vector of linear/additive predictors here
is \(M=17\).

Two mixture submodels that may be fitted can be abbreviated
GAT--Pois--Pois or
GIT--Pois--Pois.
For the GAT model
the distribution being fitted is a (spliced) mixture
of two Poissons with differing (partitioned) support.
Likewise, for the GIT model
the distribution being fitted is a mixture
of two Poissons with nested support.
The two rate parameters may be constrained to be equal using
`eq.ap`

and `eq.ip`

.

A good first step is to apply `spikeplot`

for selecting
candidate values for altering, inflating and deflating.
Deciding between
parametrically or nonparametrically can also be determined from
examining the spike plot. Misspecified
`a.mix`

/`a.mlm`

/`i.mix`

/`i.mlm`

/`d.mix`

/`d.mlm`

will result in convergence problems
(setting `trace = TRUE`

is a *very* good idea.)
This function currently does not handle multiple responses.
Further details are at `Gaitdpois`

.

A well-conditioned data--model combination should pose no
difficulties for the automatic starting value selection
being successful.
Failure to obtain initial values from this self-starting
family function indicates the degree of inflation/deflation
may be marginal and/or a misspecified model.
If this problem is worth surmounting
the arguments to focus on especially are
`mux.init`

,
`gpstr.mix`

, `gpstr.mlm`

,
`ipdip.mix`

and `ipdip.mlm`

.
See below for the stepping-stone trick.

Apart from the order of the linear/additive predictors,
the following are (or should be) equivalent:
`gaitdpoisson()`

and `poissonff()`

,
`gaitdpoisson(a.mix = 0)`

and `zapoisson(zero = "pobs0")`

,
`gaitdpoisson(i.mix = 0)`

and `zipoisson(zero = "pstr0")`

,
`gaitdpoisson(truncate = 0)`

and `pospoisson()`

.
Likewise, if
`a.mix`

and `i.mix`

are assigned a scalar then
it effectively moves that scalar to `a.mlm`

and `i.mlm`

because there is no `lambda.a`

or `lambda.i`

being estimated.
Thus
`gaitdpoisson(a.mix = 0)`

and `gaitdpoisson(a.mlm = 0)`

are the effectively same, and ditto for
`gaitdpoisson(i.mix = 0)`

and `gaitdpoisson(i.mlm = 0)`

.

A nonparametric special case submodel is the GAIT--Pois--MLM--MLM--MLM, which is where the ordinary values have a Poisson distribution, and there are altered, inflated and deflated values having unstructured probabilities. Thus the distribution being fitted is a mixture of a Poisson and three MLMs with the support of one of the MLMs being equal to the set of altered values, another of the MLMs being equal to the set of inflated values and the last MLM for the deflated values which are subtracted. Hence the probability for each inflated value comes from two sources: the parent distribution and a MLM.

Yee, T. W. and Ma, C. (2022).
Generally--altered, --inflated, --truncated and --deflated
regression, with application to heaped and seeped data.
*In preparation*.

`Gaitdpois`

,
`multinomial`

,
`rootogram4`

,
`specials`

,
`plotdgaitd`

,
`spikeplot`

,
`meangaitd`

,
`KLD`

,
`goffset`

,
`Trunc`

,
`gaitdnbinomial`

,
`gaitdlog`

,
`gaitdzeta`

,
`multilogitlink`

,
`multinomial`

,
`poissonff`

,
`zapoisson`

,
`zipoisson`

,
`pospoisson`

,
`CommonVGAMffArguments`

,
`simulate.vlm`

.

```
# NOT RUN {
i.mix <- c(5, 10) # Inflate these values parametrically
i.mlm <- c(14, 15) # Inflate these values
a.mix <- c(1, 13) # Alter these values
tvec <- c(3, 11) # Truncate these values
pstr.mlm <- 0.1 # So parallel.i = TRUE
pobs.mix <- pstr.mix <- 0.1
max.support <- 20; set.seed(1)
gdata <- data.frame(x2 = runif(nn <- 1000))
gdata <- transform(gdata, lambda.p = exp(2 + 0.0 * x2))
gdata <- transform(gdata,
y1 = rgaitdpois(nn, lambda.p, a.mix = a.mix, i.mix = i.mix,
pobs.mix = pobs.mix, pstr.mix = pstr.mix,
i.mlm = i.mlm, pstr.mlm = pstr.mlm,
truncate = tvec, max.support = max.support))
gaitdpoisson(a.mix = a.mix, i.mix = i.mix, i.mlm = i.mlm)
with(gdata, table(y1))
fit1 <- vglm(y1 ~ 1, crit = "coef", trace = TRUE, data = gdata,
gaitdpoisson(a.mix = a.mix, i.mix = i.mix, i.mlm = i.mlm,
parallel.i = TRUE, eq.ap = TRUE, eq.ip = TRUE,
truncate = tvec, max.support = max.support))
head(fitted(fit1, type.fitted = "Pstr.mix"))
head(predict(fit1))
t(coef(fit1, matrix = TRUE)) # Easier to see with t()
summary(fit1) # No HDE test by default; HDEtest = TRUE would be the ideal
# }
# NOT RUN {
spikeplot(with(gdata, y1), lwd = 2)
plotdgaitd(fit1, new.plot = FALSE, offset.x = 0.2, all.lwd = 2)
# }
```

Run the code above in your browser using DataCamp Workspace