mcmcNorm: MCMC algorithm for incomplete multivariate normal data

Description

Simulates parameters and missing values from a joint posterior distribution under a normal model using Markov chain Monte Carlo.

Usage


mcmcNorm(obj, …)

# S3 method for default
mcmcNorm(obj, x = NULL, intercept = TRUE,
   starting.values, iter = 1000, multicycle = NULL,
   seeds = NULL, prior = "uniform",
   prior.df = NULL, prior.sscp = NULL, save.all.series = TRUE,
   save.worst.series = FALSE, worst.linear.coef = NULL,
   impute.every = NULL, …)

# S3 method for formula
mcmcNorm(formula, data, starting.values, 
   iter = 1000, multicycle = NULL, seeds = NULL, prior = "uniform", 
   prior.df = NULL, prior.sscp = NULL, save.all.series = TRUE, 
   save.worst.series = FALSE, worst.linear.coef = NULL,
   impute.every=NULL, …)

# S3 method for norm
mcmcNorm(obj, starting.values = obj$param,
   iter = 1000, multicycle = obj$multicycle, 
   seeds = NULL, prior = obj$prior, prior.df = obj$prior.df, 
   prior.sscp = obj$prior.sscp,
   save.all.series = !(obj$method=="MCMC" & is.null( obj$series.beta )), 
   save.worst.series = !is.null( obj$worst.linear.coef ),
   worst.linear.coef = obj$worst.linear.coef,
   impute.every = obj$impute.every, …)

Arguments

obj

an object used to select a method. It may be y, a numeric matrix, vector or data frame containing response variables to be modeled as normal. Missing values (NAs) are allowed. If y is a data frame, any factors or ordered factors will be replaced by their internal codes, and a warning will be given. Alternatively, this first argument may be a formula as described below, or an object of class "norm" resulting from a call to emNorm or mcmcNorm; see DETAILS.

a numeric matrix, vector or data frame of covariates to be used as predictors for y. Missing values (NA's) are not allowed. If x is a matrix, it must have the same number of rows as y. If x is a data frame, any factors or ordered factors are replaced by their internal codes, and a warning is given. If NULL, it defaults to x = rep(1,nrow(y)), an intercept-only model.

intercept

if TRUE, then a column of 1's is appended to x. Ignored if x = NULL.

formula

an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model which is provided in lieu of y and x. The details of model specification are given under DETAILS.

data

an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which mcmcNorm is called.

starting.values

starting values for the model parameters. This must be a list with two named components, beta and sigma, which are numeric matrices with correct dimensions. In most circumstances, the starting values will be obtained from a prior run of emNorm or mcmcNorm; see DETAILS.

iter

number of iterations to be performed. By default, each iteration consists of one Imputation or I-step followed by one Posterior or P-step, but this can be changed by multicycle.

multicycle

number of cycles per iteration, with NULL equivalent to multicycle=1. Specifying multicycle=k for some k>1 instructs mcmcNorm to perform the I-step and P-step cycle k times within each iteration; see DETAILS.

seeds

two integers to initialize the random number generator; see DETAILS.

prior

should be "uniform", "jeffreys", "ridge" or "invwish". If "ridge" then prior.df must be supplied. If "invwish" then prior.df and prior.sscp must be supplied. For more information, see DETAILS.

prior.df

prior degrees of freedom for a ridge (prior="ridge") or inverted Wishart (prior="invwish") prior.

prior.sscp

prior sums of squares and cross-products (SSCP) matrix for an inverted Wishart prior (prior="invwish").

save.all.series

if TRUE, then the simulated values of all parameters at all iterations will be saved.

save.worst.series

if TRUE, then the simulated values of the worst linear function of the parameters will be saved. Under ordinary circumstances, this function will have been estimated by emNorm after the EM algorithm converged.

worst.linear.coef

vector or coefficients that define the worst linear function of the parameters. Under ordinary circumstances, these are provided automatically in the result from emNorm.

impute.every

how many iterations to perform between imputations? If impute.every=k, then the simulated values for the missing data after every k iterations will be saved, resulting in floor(iter/impute.every) multiple imputations. If NULL, then no imputations will be saved.

…

values to be passed to the methods.

Value

a list whose class attribute has been set to "norm". This object may be passed as the first argument in subsequent calls to emNorm, mcmcNorm, impNorm, loglikNorm or logpostNorm. The object also carries the original data and specifies the prior distribution, so that these do not need to be provided again.

To see a summary of this object, use the generic function summary, which passes the object to summary.norm.

Components of the list may also be directly accessed and examined by the user. Components which may be of interest include:

iter

number of MCMC iterations performed.

param

a list with elements beta and sigma containing the estimated parameters after the final iteration of MCMC. This may be supplied as starting values to emNorm or mcmcNorm, or as an argument to impNorm, loglikNorm or logpostNorm.

loglik

a numeric vector of length iter reporting the logarithm of the observed-data likelihood function at the start of each iteration.

logpost

a numeric vector of length iter reporting the logarithm of the observed-data posterior density function at the start of each iteration.

series.worst

a time-series object (class "ts") which contains the simulated values of the worst linear function of the parameters from all iterations. This will be present if the starting values provided to mcmcNorm were close enough to the mode to provide a reliable estimate of the worst linear function. The dependence in this series tends to be higher than for other parameters, so examining the dependence by plotting the series with plot or its autocorrelation function with acf may help the user to judge how quickly the Markov chain achieves stationarity. For the definition of the worst linear function, see the manual accompanying the NORM package in the subdirectory doc.

series.beta

a multivariate time-series object (class "ts") which contains the simulated values of the coefficients beta from all iterations. This will present if save.all.series=TRUE.

series.sigma

a multivariate time-series object (class "ts") which contains the simulated values of the variances and covariances (elements of the lower triangle of sigma) from all iterations. This will be present if save.all.series=TRUE.

imp.list

a list containing the multiple imputations. Each component of this list is a data matrix resembling y, but with NA's replaced by imputed values. The length of the list depends on the values of iter and impute.every.

miss.patt

logical matrix with ncol(y) columns reporting the missingness patterns seen in y. Each row of miss.patt corresponds to a distinct missingness pattern, with TRUE indicating that the variable is missing and FALSE indicating that the variable is observed.

miss.patt.freq

integer vector of length nrow(miss.patt) indicating, for each missingness pattern, the number of cases or rows of y having that pattern.

which.patt

integer vector of length nrow(y) indicating the missingness pattern for each row of y. Thus is.na( y[i,] ) is the same thing as miss.patt[ which.patt[i], ].

Details

There are three different ways to specify the data and model when calling mcmcNorm:

by directly supplying as the initial argument a matrix of numeric response variables y, along with an optional matrix of predictor variables x;
by supplying a model specification formula, along with an optional data frame data; or
by supplying an object of class "norm", which was produced by an earlier call to emNorm or mcmcNorm.

In the first case, the matrix y is assumed to have a multivariate normal linear regression on x with coefficients beta and covariance matrix sigma, where dim(beta)=c(ncol(x),ncol(y)) and dim(sigma)=c(ncol(y),ncol(y)). Missing values NA are allowed in y but not in x.

In the second case, formula is a formula for a (typically multivariate) linear regression model in the manner expected by lm. A formula is given as y ~ model, where y is either a single numeric variable or a matrix of numeric variables bound together with the function cbind. The right-hand side of the formula (everything to the right of ~) is a linear predictor, a series of terms separated by operators +, : or * to specify main effects and interactions. Factors are allowed on the right-hand side and will enter the model as contrasts among the levels. The intercept term 1 is included by default; to remove the intercept, use -1.

In the third case, the initial argument to mcmcNorm is an object of class "norm" returned by a previous call to emNorm or mcmcNorm. The value of the parameters carried in this object (the estimates from the last iteration of EM or the simulated values from the last iteration of MCMC) will be used as the starting values.

The matrix y is assumed to have a multivariate normal linear regression on x with coefficients beta and covariance matrix sigma, where dim(beta)=c(ncol(x),ncol(y)) and dim(sigma)=c(ncol(y),ncol(y)).

Starting values for the parameters must be provided. In most cases these will be the result of a previous call to emNorm or mcmcNorm. If the starting values are close to the mode (i.e., if they are the result of an EM run that converged) then the worst linear function of the parameters will be saved at each iteration. If the starting values are the result of a previous run of MCMC, then the new run will be a continuation of the same Markov chain.

If multicycle=k for some k>1, then the length of the saved parameter series will be reduced by a factor of k, and the serial correlation in the series will also be reduced. This option is useful in large problems with many parameters and in slowly converging problems for which many iterations are needed.

norm2 functions use their own internal random number generator which is seeded by two integers, for example, seeds=c(123,456), which allows results to be reproduced in the future. If seeds=NULL then the function will seed itself with two random integers from R. Therefore, results can also be made reproducible by calling set.seed beforehand and taking seeds=NULL.

If prior="invwish" then an inverted Wishart prior distribution is applied to sigma with hyperparameters prior.df (a scalar) and prior.sscp (a symmetric, positive definite matrix of the same dimension as sigma). Using the device of imaginary results, we can interpret prior.sscp/prior.df as a prior guess for sigma, and prior.df as the prior degrees of freedom on which this guess is based.

The usual noninformative prior for the normal regression model (prior="jeffreys") is equivalent to the inverted Wishart density with prior.df equal to 0 and prior.sscp equal to a matrix of 0's.

The ridge prior (prior="ridge") is a special case of the inverted Wishart (Schafer, 1997). The prior guess for sigma is a diagonal matrix with diagonal elements estimated by regressing the observed values in each column of y on the corresponding rows of x. When prior="ridge", the user must supply a value for prior.df, which determines how strongly the estimated correlations are smoothed toward zero.

If the first argument to mcmcNorm is an object of class "norm", then the parameter values stored in that object will automatically be used as starting values.

For details of the MCMC algorithm, see the manual distributed with the NORM package in the subdirectory doc.

References

Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data. London: Chapman & Hall/CRC Press.

For more information about this function and other functions in the NORM package, see User's Guide for norm2 in the library subdirectory doc.

Examples

Run this code

# NOT RUN {
## run EM for marijuana data with ridge prior
data(marijuana)
emResult <- emNorm(marijuana, prior="ridge", prior.df=0.5)

## run MCMC for 5,000 iterations starting from the 
## posterior mode using the same prior
mcmcResult <- mcmcNorm(emResult, iter=5000)

## summarize and plot worst linear function
summary(mcmcResult)
plot(mcmcResult$series.worst)
acf(mcmcResult$series.worst, lag.max=50)

## generate 25 multiple imputations, taking 
## 100 steps between imputations, and look st
## the first imputed dataset
mcmcResult <- mcmcNorm(emResult, iter=2500, impute.every=100)
mcmcResult$imp.list[[1]]

# }

Run the code above in your browser using DataLab