mbinomial: The Matched Binomial Distribution Family Function

Description

Estimation of a binomial regression in a matched case-control study.

Usage

mbinomial(mvar=NULL, link="logit", earg=list(),
          parallel = TRUE, smallno = .Machine$double.eps^(3/4))

Arguments

mvar

Formula specifying the matching variable. This shows which observation belongs to which matching set. The intercept should be suppressed from the formula, and the term must be a factor.

link

Parameter link function applied to the probability. See Links for more choices.

earg

List. Extra arguments for the links. See earg in Links for general information.

parallel

This should always be set TRUE otherwise there will be too many parameters to estimate. See CommonVGAMffArguments for more information.

smallno

Numeric, a small positive value. For a specific observation, used to nullify the linear/additive predictors that are not needed.

Value

An object of class "vglmff" (see vglmff-class). The object is used by modelling functions such as vglm and vgam.

Warning

Both the memory requirements and computational time of this VGAM family function grows very quickly with respect to the number of matched sets. For example, the large model matrix of a data set with 100 matched sets consisting of one case and one control per set will take up at least (about) 20Mb of memory. For a constant number of cases and controls per matched set, the memory requirements are $O(C^3)$ and the the computational time is $O(C^4)$ flops.

The example below has been run successfully with n=700 (this corresponds to $C=350$) but only on a big machine and it took over 10 minutes. The large model matrix was 670Mb.

Details

By default, this VGAM family function fits a logistic regression model to a binary response from a matched case-control study. Here, each case $(Y=1$) is matched with one or more controls $(Y=0$) with respect to some matching variables (confounders). For example, the first matched set is all women aged from 20 to 25, the second matched set is women aged between 26 to 30, etc. The logistic regression has a different intercept for each matched set but the other regression coefficients are assumed to be the same across matched sets (parallel=TRUE).

Let $C$ be the number of matched sets. This VGAM family function uses a trick by allowing $M$, the number of linear/additive predictors, to be equal to $C$, and then nullifying all but one of them for a particular observation. The term specified by the mvar argument must be a factor. Consequently, the model matrix contains an intercept plus one column for each level of the factor (except the first (this is the default in R)). Altogether there are $C$ columns. The algorithm here constructs a different constraint matrix for each of the $C$ columns.

References

Section 8.2 of Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models, London: Chapman & Hall.

Pregibon, D. (1984) Data analytic methods for matched case-control studies. Biometrics, 40, 639--651.

Chapter 7 of Breslow, N. E. and Day, N. E. (1980) Statistical Methods in Cancer Research I: The Analysis of Case-Control Studies. Lyon: International Agency for Research on Cancer.

Holford, T. R. and White, C. and Kelsey, J. L. (1978) Multivariate analysis for matched case-control studies. American Journal of Epidemiology, 107, 245--256.

Examples

Run this code

# Cf. Hastie and Tibshirani (1990) p.209. The variable n must be even.
# Here, the intercept for each matched set accounts for x3 which is
# the confounder or matching variable.
n = 700 # Requires a big machine with lots of memory. Expensive wrt time
n = 100 # This requires a reasonably big machine.
mydat = data.frame(x2 = rnorm(n), x3 = rep(rnorm(n/2), each=2))
xmat = with(mydat, cbind(x2,x3))
mydat = transform(mydat, eta = -0.1 + 0.2 * x2 + 0.3 * x3)
etamat = with(mydat, matrix(eta, n/2, 2))
condmu = exp(etamat[,1]) / (exp(etamat[,1]) + exp(etamat[,2]))
y1 = ifelse(runif(n/2) < condmu, 1, 0)
y = cbind(y1, 1-y1)
mydat = transform(mydat, y = c(y1, 1-y1),
                         ID = factor(c(row(etamat))))
fit = vglm(y ~ 1 + ID + x2, trace=TRUE,
           fam = mbinomial(mvar = ~ ID - 1), data=mydat)
dimnames(coef(fit, mat=TRUE))
coef(fit, mat=TRUE)
summary(fit)
head(fitted(fit))
objsizemb = function(object) round(object.size(object) / 2^20, dig=2)
objsizemb(fit) # in Mb

VLMX = model.matrix(fit, type="vlm")  # The big model matrix
dim(VLMX)
objsizemb(VLMX) # in Mb
rm(VLMX)

Run the code above in your browser using DataLab