emNorm: EM algorithm for incomplete multivariate normal data

Description

Computes maximum likelihood estimates and posterior modes from incomplete multivariate data under a normal model.

Usage


emNorm(obj, …)

# S3 method for default
emNorm(obj, x = NULL, intercept = TRUE,
   iter.max = 1000, criterion = NULL, estimate.worst = TRUE, 
   prior = "uniform", prior.df = NULL, prior.sscp = NULL,
   starting.values = NULL, …)

# S3 method for formula
emNorm(formula, data, iter.max = 1000, 
   criterion = NULL, estimate.worst = TRUE, prior = "uniform",
   prior.df = NULL, prior.sscp = NULL, starting.values = NULL, …)

# S3 method for norm
emNorm(obj, iter.max = 1000, 
   criterion = obj$criterion, estimate.worst = obj$estimate.worst,
   prior = obj$prior, prior.df = obj$prior.df,
   prior.sscp = obj$prior.sscp, starting.values = obj$param, …)

Arguments

obj

an object used to select a method. It may be y, a numeric matrix, vector or data frame containing response variables to be modeled as multivariate normal. Missing values (NAs) are allowed. If y is a data frame, any factors or ordered factors will be replaced by their internal codes, and a warning will be given. Alternatively, this first argument may be a formula as described below, or an object of class "norm" resulting from a call to emNorm or mcmcNorm; see DETAILS.

a numeric matrix, vector or data frame of covariates to be used as model predictors. Missing values (NA's) are not allowed. If x is a matrix, it must have the same number of rows as y. If x is a data frame, any factors or ordered factors will be replaced by their internal codes, and a warning is given. If NULL, it defaults to x = rep(1,nrow(y)), an intercept-only model.

intercept

if TRUE, then a column of 1's is appended to x. Ignored if x = NULL.

formula

an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model which is provided in lieu of y and x. The details of model specification are given under DETAILS.

data

an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which emNorm is called.

iter.max

maximum number of iterations to be performed. Each iteration consists of an Expectation or E-step followed by a Maximization or M-step. The procedure halts if it has not converged by this many iterations.

criterion

convergence criterion. The procedure halts if the maximum relative change in all parameters from one iteration to the next falls below this value. If NULL, then the default criterion of 1e-05 is used.

estimate.worst

if TRUE, then upon convergence of the EM algorithm, a procedure is attempted to numerically estimate the worst fraction of missing information and the worst linear function of the parameters; see DETAILS.

prior

should be "uniform", "jeffreys", "ridge" or "invwish". If "ridge" then prior.df must be supplied. If "invwish" then prior.df and prior.sscp must be supplied. For more information, see DETAILS.

prior.df

prior degrees of freedom for a ridge (prior="ridge") or inverted Wishart (prior="invwish") prior.

prior.sscp

prior sums of squares and cross-products (SSCP) matrix for an inverted Wishart prior (prior="invwish").

starting.values

optional starting values for the model parameters. This must be a list with two named components, beta and sigma, which are numeric matrices with correct dimensions; see DETAILS.

…

values to be passed to the methods.

Value

a list whose class attribute has been set to "norm". This object may be passed as the first argument in subsequent calls to emNorm, mcmcNorm, impNorm, loglikNorm or logpostNorm. The object also carries the original data and specifies the prior distribution, so that these do not need to be provided again.

To see a summary of this object, use the generic function summary, which passes the object to summary.norm.

Components of the list may also be directly accessed and examined by the user. Components which may be of interest include:

iter

number of EM iterations actually performed.

rel.diff

maximum relative difference between the parameters the last two iterations.

converged

logical value indicating whether the algorithm converged by iter iterations. Will be TRUE if rel.diff<=criterion.

loglik

a numeric vector of length iter reporting the logarithm of the observed-data likelihood function at the start of each iteration. If prior="uniform" then the loglikelihood values will be non-decreasing.

logpost

a numeric vector of length iter reporting the logarithm of the observed-data posterior density function at the start of each iteration. The log-posterior density values will be non-decreasing. If prior="uniform" then the log-posterior density and loglikelihood will be identical.

param

a list with elements beta and sigma containing the estimated parameters after the final iteration of EM. This may be supplied as starting values to emNorm or mcmcNorm, or as an argument to impNorm, loglikNorm or logpostNorm .

miss.patt

logical matrix with ncol(y) columns reporting the missingness patterns seen in y. Each row of miss.patt corresponds to a distinct missingness pattern, with TRUE indicating that the y-variable is missing and FALSE indicating that the y-variable is observed.

miss.patt.freq

integer vector of length nrow(miss.patt) indicating, for each missingness pattern, the number of cases or rows of y having that pattern.

which.patt

integer vector of length nrow(y) indicating the missingness pattern for each row of y. Thus is.na( y[i,] ) is the same thing as miss.patt[ which.patt[i], ].

Details

There are three different ways to specify the data and model when calling emNorm:

by directly supplying as the initial argument a matrix of numeric response variables y, along with an optional matrix of predictor variables x;
by supplying a model specification formula, along with an optional data frame data; or
by supplying an object of class "norm", which was produced by an earlier call to emNorm or mcmcNorm.

In the first case, the matrix y is assumed to have a multivariate normal linear regression on x with coefficients beta and covariance matrix sigma, where dim(beta)=c(ncol(x),ncol(y)) and dim(sigma)=c(ncol(y),ncol(y)). Missing values NA are allowed in y but not in x.

In the second case, formula is a formula for a (typically multivariate) linear regression model in the manner expected by lm. A formula is given as y ~ model, where y is either a single numeric variable or a matrix of numeric variables bound together with the function cbind. The right-hand side of the formula (everything to the right of ~) is a linear predictor, a series of terms separated by operators +, : or * to specify main effects and interactions. Factors are allowed on the right-hand side and will enter the model as contrasts among the levels. The intercept term 1 is included by default; to remove the intercept, use -1.

In the third case, the initial argument to emNorm is an object of class "norm" returned by a previous call to emNorm or mcmcNorm. The value of the parameters carried in this object (the estimates from the last iteration of EM or the simulated values from the last iteration of MCMC) will be used as the starting values. This can be useful, for example, if a previous run of EM did not converge by max.iter iterations. Supplying the result from that EM run as the sole argument to emNorm allows the algorithm to continue from where it was halted.

If prior="uniform" (the default), the EM algorithm computes a maximum-likelihood estimate for the parameters beta and sigma; otherwise it computes a posterior mode.

If prior="invwish" then an inverted Wishart prior distribution is applied to sigma with hyperparameters prior.df (a scalar) and prior.sscp (a symmetric, positive definite matrix of the same dimension as sigma). Using the device of imaginary results, we can interpret prior.sscp/prior.df as a prior guess for sigma, and prior.df as the prior degrees of freedom on which this guess is based.

The usual noninformative prior for the normal regression model (prior="jeffreys") is equivalent to the inverted Wishart density with prior.df equal to 0 and prior.sscp equal to a matrix of 0's.

The ridge prior (prior="ridge") is a special case of the inverted Wishart (Schafer, 1997). The prior guess for sigma is a diagonal matrix with diagonal elements estimated by regressing the observed values in each column of y on the corresponding rows of x. When prior="ridge", the user must supply a value for prior.df, which determines how strongly the estimated correlations are smoothed toward zero.

If estimate.worst, then upon convergence, a procedure is run to numerically estimate the worst fraction of missing information and the worst linear function of the parameters. The worst fraction of missing information is closely related to EM's convergence rate. Values near one correspond to slow convergence, and values near zero indicate fast convergence. If there are no missing values in the response variables, the worst fraction of missing information is exactly zero, and EM converges after one step from any starting values. The worst linear function is a linear combination of the parameters (elements of beta and sigma) for which the rate of missing information is highest.

For details of the EM algorithm, see the manual distributed with the norm2 package in the library subdirectory doc.

References

Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data. London: Chapman & Hall/CRC Press.

For more information about this function and other functions in norm2, see the manual NORM Package for R, Version 2 in the library subdirectory doc.

Examples

Run this code

# NOT RUN {
## run EM for marijuana data with strict convergence criterion
data(marijuana)
result <- emNorm(marijuana, criterion=1e-06)

## re-run with ridge prior and examine results
result <- emNorm(marijuana, prior="ridge", prior.df=0.5)
summary(result)
# }

Run the code above in your browser using DataLab