cvam: Log-Linear Models for Incomplete Categorical Variables

Description

Fits log-linear models to categorical variables by three methods: maximizing the loglikelihood or log-posterior density by Expectation-Maximization (EM) algorithms, simulating the posterior distribution by a Markov chain Monte Carlo (MCMC) algorithm, and creating random draws of parameters from an approximate Bayesian posterior distribution. The factors in the model may have missing or coarsened values.

Usage


cvam(obj, ...)
# S3 method for formula
cvam(obj, data, freq, weight, subPop, 
    stratum, cluster, nest = FALSE, prior = cvamPrior(),
    method = c("EM", "MCMC", "approxBayes", "mfSeen", "mfTrue",
       "mfPrior", "modelMatrix"), control = list(), omitData = FALSE,
    saturated = FALSE, modelMatrix = NULL, offset = NULL,
    strZero = NULL, startVal = NULL, estimate = NULL, ...)
# S3 method for cvam
cvam(obj, method = obj$method, control = NULL, startVal = NULL, 
    estimate = NULL, ...)

Value

If method is "EM", "MCMC" or "approxBayes", an object of class c("cvam","list") containing the results of a model fit. For other values of method, the requested object is returned without fitting a model.

Arguments

obj: an object used to select a method: either a model formula or the result from a previous call to cvam.
data: an optional data frame, list or environment (or object coercible to a data frame by as.data.frame) containing the variables in the model. If not found in data, the variables are taken from environment(obj), typically the environment from which cvam is called.
freq: an optional variable for holding integer frequencies when the observations are grouped. If freq is not given, then the observations are assumed to represent microdata, and all frequencies are set to one.
weight: an optional numeric variable containing survey weights, which are used when computing pseudo-maximum likelihood (PML) estimates from survey data. If weight is given, then the data supplied are interpreted as microdata, with each row having a frequency of one.
subPop: an optional logical variable indicating membership in a subpopulation for computing PML estimates from survey data.
stratum: an optional factor variable indicating the sampling stratum to which a unit belongs, used when computing linearized variance estimates for parameter estimates under a with-replacement (WR) survey design; see DETAILS.
cluster: an optional factor variable indicating the primary (first-stage) sampling cluster to which a unit belongs, used when computing linearized variance estimates for parameters under a with-replacement (WR) survey design; see DETAILS.
nest: if TRUE, duplicate values of the cluster variable appearing in different strata are assumed to refer to different clusters.
prior: an object produced by cvamPrior to represent prior information incorporated into the model fit.
method: a procedure for fitting the model: "EM" computes a maximum-likelihood (ML) estimate, penalized ML estimate, posterior mode, or (if survey weights are provided) a pseudo-maximum likelihood (PML) estimate; "MCMC" runs a Markov chain Monte Carlo algorithm to simulate a sequence of correlated random draws from the posterior distribution of the unknown parameters; "approxBayes" creates independent draws from an approximate posterior distribution. The other alternatives return various objects without fitting the model.
control: a named list containing control parameters which are passed to cvamControl. Control parameters determine the maximum number of iterations, criteria for judging convergence, proposal distributions for MCMC, and so on. Control parameters that are not found in this list are set to default values.
omitData: if TRUE, then the observations supplied through data and freq are ignored, and the fitted model is based only on the prior information supplied through prior. Combining omitData=TRUE with method="MCMC" will simulate random draws from the prior distribution.
saturated: if TRUE, then a saturated model is fit to the cell means without defining a model matrix or log-linear coefficients.
modelMatrix: an optional model matrix that defines the log-linear model. In ordinary circumstances, cvam creates the model matrix automatically by interpreting terms in the model formula and referring to the contrast attributes of the model factors. In rare circumstances, a user may want to supply a different model matrix. The model matrix should have one row for every cell in the complete-data table. If a model matrix is supplied, the model formula is used only to identify the variables that are included in the model, not to define the associations among them.
offset: an optional numeric vector of length NROW(modelMatrix) containing an offset for the log-linear model. If omitted, the offset is assumed to be zero for every cell.
strZero: an optional logical vector of length NROW(modelMatrix) containing TRUE for every cell to be considered a structural zero and FALSE elsewhere. Structural zeros are assumed to have zero probability and are omitted from the model fitting. If strZero is omitted, all elements are assumed to be FALSE.
startVal: an optional vector of starting values for the model parameters. If saturated=FALSE, this should be a vector of length NCOL(modelMatrix) containing log-linear coefficients; if saturated=TRUE, it should be a vector of length NROW(modelMatrix) containing cell probabilities or cell means, which are automatically rescaled to become probabilities.
estimate: an optional formula or list of formulas of the kind expected by cvamEstimate specifying marginal or conditional probabilities to be estimated, bypassing the need for a subsequent call to that function.
...: values to be passed to the methods.
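
The non-fitting values of method can be used to inspect what cvam would build from a formula before any model is fit. The following is a minimal sketch, assuming the cvam package is installed; the object names mm and mfTrue are illustrative, and the exact layout of the returned objects should be checked against the installed version.

```r
library(cvam)

# Grouped data: one row per cell of the 2 x 2 x 6 admissions table
dF <- as.data.frame(UCBAdmissions)

# Request the model matrix only; no model is fit
mm <- cvam(~ Dept*Gender + Dept*Admit, data = dF, freq = Freq,
           method = "modelMatrix")
dim(mm)        # rows correspond to cells of the complete-data table

# Request the model frame for the complete-data table; no model is fit
mfTrue <- cvam(~ Dept*Gender + Dept*Admit, data = dF, freq = Freq,
               method = "mfTrue")
head(mfTrue)
```

Inspecting NROW(mm) and NCOL(mm) this way also gives the lengths expected for offset, strZero and startVal.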

Author

Joe Schafer Joseph.L.Schafer@census.gov

Details

A log-linear model is specified by a one-sided formula that determines which associations among the variables are allowed. For example, ~ A + B + C implies that A, B and C are mutually independent; ~ A*B + A*C implies that B and C are conditionally independent given A; and so on. Variables in a model may be factors or coarsened factors, and missing values are permitted. All models are fit using a surrogate Poisson formulation which is appropriate for Poisson, multinomial or product-multinomial sampling. A formula may contain a vertical bar to specify variables to be regarded as fixed; for example, ~ A*B + A*C | A fixes the variable A. Fixing variables does not change the model fitting procedure; the only difference is that, after the model has been fit, the cell probabilities are scaled to sum to one within every combination of levels of the fixed variables.
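
For fully observed data, the surrogate Poisson idea and the effect of fixing a variable can be illustrated with base R alone. The sketch below is an analogue, not cvam's own code path: it fits the conditional-independence model as a Poisson regression with glm and then rescales the fitted cell means in the two ways described above. The names pois, mu, probJoint and probGivenDept are illustrative.

```r
# Complete-data three-way table as a data frame with a Freq column
dF <- as.data.frame(UCBAdmissions)

# Gender and Admit conditionally independent given Dept,
# fit as a Poisson log-linear model
pois <- glm(Freq ~ Dept*Gender + Dept*Admit, family = poisson, data = dF)
mu <- as.numeric(fitted(pois))      # fitted cell means

# No fixed variables: cell probabilities sum to one over the whole table
probJoint <- mu / sum(mu)

# Dept fixed (as in ~ Dept*Gender + Dept*Admit | Dept):
# probabilities sum to one within each level of Dept
probGivenDept <- mu / ave(mu, dF$Dept, FUN = sum)
tapply(probGivenDept, dF$Dept, sum)
```

The fitted means are identical in both cases; only the final normalization differs, which is why fixing variables does not change the fitting procedure.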

If cvam is called with a cvam object as its first argument, then the data, model and prior distribution will be taken from the previous run and, unless startVal is supplied, starting values will be set to the final parameter values from the previous run.
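
As a hedged illustration of this calling pattern, assuming the cvam package and its crime dataset are available, an EM fit can be passed back to cvam with a different method. The seed value and the name fitAB are arbitrary choices for this sketch.

```r
library(cvam)

# Step 1: EM fit; V1 and V2 in the crime data contain missing values
fit <- cvam(~ V1 * V2, data = crime, freq = n)

# Step 2: reuse the data, model and prior from fit, starting from its
# final parameter values, and draw from an approximate posterior
set.seed(1234)
fitAB <- cvam(fit, method = "approxBayes")
summary(fitAB)
```

Because the data and model travel with the object, only the arguments that change (here, method) need to be supplied in the second call.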

If method is "EM" and survey weights are supplied through weight, then cvam performs pseudo-maximum likelihood (PML) estimation. The target of PML is the set of parameters that would be obtained if the given model were fit to all units in the finite population (or, if subPop is given, the subpopulation). If saturated=FALSE, then standard errors for log-linear coefficients are computed using a linearization method that assumes the first stage of sampling within strata was carried out with replacement (WR). Although WR sampling is rarely done in actual surveys, it is often assumed for variance estimation, and if the first-stage sampling was actually done without replacement the resulting standard errors tend to be conservative. The WR survey design information is provided through weight, stratum and cluster. The stratum and cluster variables are coerced to factors. If stratum is omitted, then the population is regarded as a single stratum. If cluster is omitted, then each sample unit is treated as a cluster.

References

Extended descriptions and examples for all major functions are provided in two vignettes, Understanding Coarsened Factors in cvam and Log-Linear Modeling with Missing and Coarsened Values Using the cvam Package.

Examples


# convert U.C. Berkeley admissions three-way table to data frame,
# fit model of conditional independence, display summary
# compare the fit to the saturated model
dF <- as.data.frame(UCBAdmissions)
fit <- cvam( ~ Dept*Gender + Dept*Admit, data=dF, freq=Freq )
summary(fit)
fitSat <- cvam( ~ Dept*Gender*Admit, data=dF, freq=Freq )
anova(fit, fitSat, pval=TRUE)

# fit non-independence model to crime data; then run MCMC for
# 5000 iterations, creating 10 multiple imputations of the frequencies
# for the 2x2 complete-data table
fit <- cvam( ~ V1 * V2, data=crime, freq=n )
set.seed(56182)
fitMCMC <- cvam(fit, method="MCMC", 
   control=list( iterMCMC=5000, imputeEvery=500) )
get.imputedFreq(fitMCMC)
