test.fit.iqr: Goodness-of-Fit Test

Description

Goodness-of-fit test for a model fitted with iqr. The Kolmogorov-Smirnov and Cramér-von Mises statistics are computed, and their distribution under the null hypothesis is evaluated by Monte Carlo simulation.

Usage

# S3 method for iqr
test.fit(object, R = 100, zcmodel, icmodel, trace = FALSE, ...)

Value

A matrix with columns statistic and p.value, reporting the Kolmogorov-Smirnov and Cramér-von Mises statistics and the associated p-values, evaluated by Monte Carlo simulation.

Arguments

object: an object of class “iqr”.
R: number of Monte Carlo replications. If R = 0, the function only returns the test statistics.
zcmodel: a numeric value indicating how to model the joint distribution of censoring ($C$) and truncation ($Z$). See ‘Details’.
icmodel: a list of operational parameters to simulate interval-censored data. See ‘Details’.
trace: logical. If TRUE, the progress will be printed.
...: for future arguments.

Author

Paolo Frumento paolo.frumento@unipi.it

Details

This function permits assessing goodness of fit by testing the null hypothesis that the CDF values follow a $U(0,1)$ distribution, indicating that the model is correctly specified. Since the fitted CDF values depend on estimated parameters, the distribution of the test statistic is not known. To evaluate it, the model is fitted on R simulated datasets generated under the null hypothesis.
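The Monte Carlo idea can be illustrated with a minimal base-R sketch. It is not the code used by test.fit: a simple Normal model stands in for the iqr fit, and the helper names ks_stat and fitted_cdf are hypothetical. The point is only that the model is refitted on every simulated dataset, so that the null distribution of the statistic reflects parameter estimation.

```r
set.seed(1)

# Kolmogorov-Smirnov distance between values in (0, 1) and the U(0,1) CDF.
ks_stat <- function(u) {
  u <- sort(u)
  n <- length(u)
  max(pmax(abs(u - (seq_len(n) - 1) / n), abs(seq_len(n) / n - u)))
}

# Stand-in for "fit the model and extract the fitted CDF values":
# here, a Normal model with estimated mean and standard deviation.
fitted_cdf <- function(y) pnorm(y, mean = mean(y), sd = sd(y))

y <- rnorm(200, mean = 3, sd = 2)
obs <- ks_stat(fitted_cdf(y))

# Null distribution: simulate from the fitted model, refit, recompute.
R <- 200
null_stats <- replicate(R, {
  y_sim <- rnorm(length(y), mean = mean(y), sd = sd(y))
  ks_stat(fitted_cdf(y_sim))
})

# Monte Carlo p-value: share of simulated statistics at least as large
# as the observed one.
p_value <- mean(null_stats >= obs)
```

With R = 0 replications there would be no null_stats to compare against, which is why only the statistics can be returned in that case.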

The testing procedures are described in detail by Frumento and Bottai (2016, 2017) and Frumento and Corsini (2024).

Right-censored and left-truncated data. If the data are censored and truncated, object$CDF is itself a censored and truncated outcome, and its quantiles must be computed with a suitable version of the Kaplan-Meier product-limit estimator. The fitted survival curve is then compared with that of a $U(0,1)$ distribution.
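As a rough illustration of this comparison, the sketch below handles right censoring only (no truncation) and uses the true CDF in place of a fitted one. The variable names and the simulated distributions are assumptions made for the example, and the product-limit estimator is written by hand in base R rather than taken from the package.

```r
set.seed(2)
n <- 2000
t_event <- rnorm(n)               # true response
c_cens  <- rnorm(n, mean = 1)     # censoring variable, independent of t_event
y <- pmin(t_event, c_cens)        # observed response
d <- as.numeric(t_event <= c_cens)  # 1 = event observed, 0 = censored

# CDF values under the (here, known) true model: a right-censored
# sample from U(0,1).
u <- pnorm(y)

# Kaplan-Meier estimate of P(U > u), evaluated at the sorted u values.
o <- order(u)
u <- u[o]
d <- d[o]
at_risk <- n:1
S_km <- cumprod(1 - d / at_risk)

# Kolmogorov-Smirnov-type distance from the U(0,1) survival function 1 - u.
ks <- max(abs(S_km - (1 - u)))
```

Under a correctly specified model, the distance ks is small. Ignoring censoring and applying the empirical CDF directly to u would instead show a spurious departure from uniformity.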

To run Monte Carlo simulations when data are censored or truncated, it is necessary to estimate the distribution of the censoring variable and that of the truncation variable. To this end, the function pchreg from the pch package is used, with default settings.

The joint distribution of the censoring variable ($C$) and the truncation variable ($Z$) can be specified in two ways:

If zcmodel = 1, it is assumed that $C = Z + U$, where $U$ is a positive variable and is independent of $Z$, given covariates. This is the most common situation, and holds when censoring occurs at the end of the follow-up. Under this scenario, $C$ and $Z$ are correlated with $P(C > Z) = 1$.
If zcmodel = 2, it is assumed that $C$ and $Z$ are conditionally independent. This situation is more plausible when all censoring is due to drop-out.
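The difference between the two scenarios can be seen in a small simulation. This is only a sketch without covariates; the Exponential distributions and their rates are arbitrary assumptions chosen for illustration, not what the function estimates.

```r
set.seed(3)
n <- 5000
z <- rexp(n)                 # truncation times (hypothetical distribution)

# Scenario 1 (zcmodel = 1): C = Z + U, with U > 0 independent of Z.
u  <- rexp(n, rate = 0.5)
c1 <- z + u

# Scenario 2 (zcmodel = 2): C independent of Z.
c2 <- rexp(n, rate = 0.5)

mean(c1 > z)   # equals 1 by construction
mean(c2 > z)   # strictly less than 1
cor(c1, z)     # clearly positive
cor(c2, z)     # close to 0
```

Choosing the wrong scenario changes the simulated censoring pattern, and therefore the Monte Carlo distribution of the test statistics.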

Interval-censored data.

If the data are interval-censored, object$CDF is composed of two columns, left and right. A nonparametric estimator is applied to the interval-censored pair (left, right) using the icenReg R package. The fitted quantiles are then compared with those of a $U(0,1)$ distribution.

To simulate interval-censored data, additional information is required about the censoring mechanism. This testing procedure assumes that interval censoring occurs because each individual is only examined at discrete time points, say t[1], t[2], t[3],... If this is not the mechanism that generated your data, you should not use this function.

In the ideal situation, one can use t[1], t[2], t[3],... to estimate the distribution of the time between visits, t[j + 1] - t[j]. If, however, one only knows time1 and time2, the two endpoints of the interval that contains the event, things are more complicated. The empirical distribution of time2 - time1 is NOT a good estimator of the distribution of t[j + 1] - t[j], because events are more likely to fall in longer intervals, which generates selection bias. There are two common situations: either t[j + 1] - t[j] is a constant (e.g., one month), or it is random. If t[j + 1] - t[j] is random and has an Exponential distribution with scale lambda, then time2 - time1 has a Gamma(shape = 2, scale = lambda) distribution. This follows from the memoryless property of the Exponential distribution, and may only be an approximation if there is a floor effect (i.e., if lambda is larger than the low quantiles of the time-to-event).
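The selection bias and the Gamma result can be checked with a short simulation. This is a sketch under simplifying assumptions: the event time is fixed at a value far from zero, so that the floor effect is negligible, and lambda = 1 is an arbitrary choice.

```r
set.seed(4)
lambda  <- 1     # mean time between visits
t_event <- 50    # event time, far from 0 to avoid the floor effect
n <- 5000

# For each individual, generate visit times with Exponential spacing and
# record the length of the interval that contains the event.
interval_length <- replicate(n, {
  visits <- c(0, cumsum(rexp(200, rate = 1 / lambda)))
  j <- findInterval(t_event, visits)   # visits[j] <= t_event < visits[j + 1]
  visits[j + 1] - visits[j]
})

mean(interval_length)   # close to 2 * lambda, not lambda
var(interval_length)    # close to 2 * lambda^2
```

A Gamma(shape = 2, scale = lambda) variable has mean 2 * lambda and variance 2 * lambda^2, so the simulated moments match the stated distribution, while the typical time between visits is only lambda.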

The icmodel argument must be a list with four elements, model, lambda (optional), t0, and logscale:

model. A character string, either 'constant' or 'exponential'.
lambda. If model = 'constant', lambda will be interpreted as a constant time between visits. If model = 'exponential', instead, it will be interpreted as the mean (not the rate) of the Exponential distribution that is assumed to describe the time between visits.

If you either know lambda, or you can estimate it by using additional information (e.g., individual data on all visit times t[1], t[2], t[3], ...), you can supply a scalar value, that will be used for all individuals, or a vector, allowing lambda to differ across individuals.

If, instead, lambda is not supplied or is NULL, the algorithm proceeds as follows. If model = 'constant', the time between visits is assumed to be constant and equal to lambda = mean(time2 - time1). If model = 'exponential', times between visits are generated from an Exponential distribution in which the mean, lambda, is allowed to depend on covariates according to a log-linear model, and is estimated by fitting a Gamma model on time2 - time1 as described earlier.
t0. If t0 = 0, data will be simulated assuming that the first visit occurs at time = 0 (the “onset”), i.e., when the individual enters the risk set. This mechanism cannot generate left censoring. If t0 = 1, instead, the first visit occurs after time zero. This mechanism generates left censoring whenever the event occurs before the first visit. Finally, if t0 = -1, visits start before time 0. Under this scenario, it is assumed that not only the time at the event, but also the time at onset is interval-censored. If the event occurs in the interval (time1, time2), and the onset is in (t01, t02), then the total duration is in the interval (time1 - t02, time2 - t01).
logscale. Logical: is the response variable on the log scale? If this is the case, the Monte Carlo procedure will act accordingly. Note that lambda will always be assumed to describe the time between visits on the natural scale.
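Two ways of building the icmodel list are sketched below, following the element names given above. The object names ic_constant and ic_exponential, the assumption that time is measured in months, and the fitted model fit in the commented call are hypothetical.

```r
# Visits at a known, constant spacing of one time unit (e.g., one month);
# first visit at the onset; response on the natural scale.
ic_constant <- list(model = "constant", lambda = 1, t0 = 0, logscale = FALSE)

# Exponential spacing with lambda left NULL, so that it is estimated from
# time2 - time1; first visit after time zero, allowing left censoring.
ic_exponential <- list(model = "exponential", lambda = NULL, t0 = 1,
                       logscale = FALSE)

# Hypothetical call, assuming `fit` is an iqr model fitted to
# interval-censored data:
# test.fit(fit, R = 100, icmodel = ic_exponential)
```

Note that list() keeps an element whose value is NULL, so both lists contain all four elements.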

The mechanism described above can automatically account for the presence of left censoring. In order to simulate right-censored observations (if present in the data), the distribution of the censoring variable is estimated with the function pchreg from the pch package.

References

Frumento, P., and Bottai, M. (2016). Parametric modeling of quantile regression coefficient functions. Biometrics, 72 (1), pp 74-84, doi: 10.1111/biom.12410.

Frumento, P., and Bottai, M. (2017). Parametric modeling of quantile regression coefficient functions with censored and truncated data. Biometrics, doi: 10.1111/biom.12675.

Frumento, P., and Corsini, L. (2024). Using parametric quantile regression to investigate determinants of unemployment duration. Unpublished manuscript.

Examples


library(qrcm)  # provides iqr and test.fit
y <- rnorm(1000)
m1 <- iqr(y ~ 1, formula.p = ~ I(qnorm(p))) # correct
m2 <- iqr(y ~ 1, formula.p = ~ p)  # misspecified
# \donttest{
test.fit(m1)
test.fit(m2)
# }
