monomvn (version 1.2)

monomvn: Maximum Likelihood Estimation for Multivariate Normal Data with Monotone Missingness

Description

Maximum likelihood estimation of the mean and covariance matrix of multivariate normal (MVN) distributed data with a monotone missingness pattern. Through the use of parsimonious/shrinkage regressions (e.g., plsr, pcr, ridge, lasso, etc.), where standard regressions fail, this function can handle an (almost) arbitrary amount of missing data.

Usage

monomvn(y, pre = TRUE, method = c("plsr", "pcr", "lasso", "lar",
        "forward.stagewise", "stepwise", "ridge"), p = 0.9,
        ncomp.max = Inf, validation = c("CV", "LOO", "Cp"),
        obs = FALSE, verb = 0, quiet = TRUE)

Arguments

y
data matrix where each row is interpreted as a random sample from an MVN distribution with missing values indicated by NA
pre
logical indicating whether pre-processing of the y is to be performed. This sorts the columns so that the number of NAs is non-decreasing with the column index
method
describes the type of parsimonious (or shrinkage) regression to be performed when standard least squares regression fails. From the pls package there are "plsr" (partial least squares, the default) and "pcr" (principal component regression); from the lars package there are "lasso", "lar" (least angle regression), "forward.stagewise" and "stepwise"; and "ridge" uses lm.ridge from the MASS package
p
when performing regressions, p is the proportion of the number of columns to rows in the design matrix at or above which one of the alternative regression methods (above) is used, as if least-squares regression had failed
ncomp.max
maximal number of (principal) components to include in a method---only meaningful for the "plsr" or "pcr" methods. Large settings can cause the execution to be slow as they drastically increase the cross-validation effort
validation
method for cross validation when applying a parsimonious regression method. The default setting of "CV" (randomized 10-fold cross-validation) is the faster method, but does not yield a deterministic result and does not apply when nrow(y1) <= 10 (see Details), in which case "LOO" (leave-one-out cross-validation) is used instead
obs
logical indicating whether or not to (additionally) compute a mean vector and covariance matrix based only on the observed data, without regressions. I.e., means are calculated as averages of the non-NA entries in each column of y, and each entry of the covariance matrix is calculated by applying cov to the jointly non-NA entries of the corresponding pair of columns of y
verb
whether or not to print progress indicators. The default (verb = 0) keeps quiet, while any positive number causes a brief statement about the dimensions of each regression to print to the screen as it happens; higher settings produce progressively more detailed output
quiet
causes warnings about regressions to be silenced when TRUE
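For intuition, the pre and obs options above can be sketched in a few lines of base R. This is illustrative code only, not the package internals; the matrix y and its missingness pattern are made up:

```r
## Illustrative sketches of the 'pre' and 'obs' options (not monomvn
## internals), on a small matrix with some missing entries.
y <- matrix(rnorm(40), nrow = 10, ncol = 4)
y[1:3, 2] <- NA                     # column 2 has three NAs
y[1, 3] <- NA                       # column 3 has one NA

## pre = TRUE: sort columns so the NA counts are non-decreasing
na <- colSums(is.na(y))
o <- order(na)
y.sorted <- y[, o]

## obs = TRUE: observed-data estimates without any regressions
mu.obs <- colMeans(y, na.rm = TRUE)
S.obs <- cov(y, use = "pairwise.complete.obs")
```

The pairwise-complete covariance computed this way is exactly why S.obs (see Value below) is usually not positive definite.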

Value

  • monomvn returns an object of class "monomvn", which is a list containing a subset of the components below.
  • call: a copy of the function call as used
  • mu: estimated mean vector with entries corresponding to the columns of y
  • S: estimated covariance matrix with rows and columns corresponding to the columns of y
  • na: when pre = TRUE this is a vector containing the number of NA entries in each column of y
  • o: when pre = TRUE this is a vector containing the index of each column in the sorting of the columns of y obtained by o <- order(na)
  • method: the method of regression used on each column, or "complete" indicating that no regression was necessary
  • ncomp: number of components in a plsr or pcr regression, or NA if such a method was not used. This field is used to record lambda when lm.ridge is used
  • mu.obs: when obs = TRUE this is the observed mean vector
  • S.obs: when obs = TRUE this is the observed covariance matrix, as described above. Note that S.obs is usually not positive definite

Details

If pre = TRUE then monomvn first re-arranges the columns of y into nondecreasing order with respect to the number of missing (NA) entries. Then (at least) the first column should be completely observed. The mean components and covariances between the first set of complete columns are obtained through the standard mean and cov routines.
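The complete-data starting point described above amounts to nothing more than the following sketch (with made-up data, not package code):

```r
## Starting point: mean and covariance of the fully observed columns
## via the standard routines (illustrative data).
y <- matrix(rnorm(50), nrow = 10, ncol = 5)
y[1:2, 4:5] <- NA                      # last two columns partly missing
complete <- colSums(is.na(y)) == 0     # fully observed columns
mu1 <- colMeans(y[, complete])         # standard mean
S11 <- cov(y[, complete])              # standard covariance
```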

Next, each successive group of columns with the same missingness pattern is processed in sequence. Suppose a total of j-1 columns have been processed this way already. Let y2 represent the non-missing contingent of the next group of k columns of y with identical missingness pattern, and let y1 be the previously processed j-1 columns of y containing only the rows corresponding to each non-NA entry in y2. I.e., nrow(y1) = nrow(y2). Note that y1 contains no NA entries since the missing data pattern is monotone. The next k entries (indices j:(j+k-1)) of the mean vector, and the j:(j+k-1) rows and columns of the covariance matrix, are obtained by multivariate regression of y2 on y1.

The regression method used depends on the number of rows and columns in y1 and on the p parameter. Whenever ncol(y1) < p*nrow(y1), least-squares regression is used; otherwise the parsimonious method given by the method argument (e.g., "plsr" or "pcr") is used. If a least-squares regression fails due to co-linearity, then one of the other methods is tried instead.

All of these methods require a scheme for estimating the amount of variability explained by increasing numbers of coefficients (or principal components) in the model. Towards this end, the pls and lars packages support 10-fold cross validation (CV) or leave-one-out (LOO) CV estimates of root mean squared error (RMSE). See pls and lars for more details. monomvn uses CV in all cases except when nrow(y1) <= 10, in which case CV fails and LOO is used. Whenever nrow(y1) <= 3, pcr fails, so plsr is used instead. If quiet = FALSE then a warning is given whenever the first choice for a regression fails. For pls methods, RMSEs are calculated for a number of components in 1:ncomp.max, where ncomp.max is first limited by

ncomp.max <- min(ncomp.max, ncol(y2), nrow(y1)-1)

which is the max allowed by the pls package. Simple heuristics are used to select a small number of components (ncomp for pls) or coefficients (for lars) that explain a large amount of the variability (low RMSE). The lars methods use the one-standard-error rule outlined in Section 7.10, page 216 of HTF, in the reference below. The pls package does not currently support the calculation of standard errors for CV estimates of RMSE, so a simple linear penalty for increasing ncomp is used instead. The ridge constant (lambda) for lm.ridge is set using the optimize function on the GCV output.
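As a rough sketch of the one-standard-error rule mentioned above: pick the smallest model whose CV error is within one standard error of the minimizer. The function and inputs here are hypothetical, not lars or monomvn code:

```r
## Hypothetical sketch of the one-standard-error rule (HTF, Sec. 7.10):
## choose the smallest model size whose CV error is within one standard
## error of the minimum.  'cv.err' and 'cv.se' are illustrative vectors
## indexed by model size (number of coefficients).
one.se.choice <- function(cv.err, cv.se) {
  best <- which.min(cv.err)
  min(which(cv.err <= cv.err[best] + cv.se[best]))
}
one.se.choice(cv.err = c(10, 6, 4, 3.9, 3.8),
              cv.se  = c(1, 1, 0.5, 0.5, 0.5))   # the third model suffices
```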

Based on the ML ncol(y1)+1 regression coefficients (including the intercept) obtained for each of the columns of y2, on the corresponding matrix of residual sums of squares, and on the previous j-1 means and rows/cols of the covariance matrix, the j:(j+k-1) entries and rows/cols can be filled in as described by Little and Rubin, section 7.4.3.
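The Little and Rubin extension step can be sketched as follows. All names here (extend.mvn, y1, y2, mu1, S11) are illustrative rather than monomvn internals, and S11 is assumed to use the ML (divide-by-n) scaling so that the blocks fit together exactly:

```r
## Sketch of the Little & Rubin (Sec. 7.4.3) extension step: given ML
## estimates (mu1, S11) for the columns of y1, extend them using a
## multivariate regression of y2 on y1 (illustrative, not monomvn code).
extend.mvn <- function(y1, y2, mu1, S11) {
  n <- nrow(y1)
  fit <- lm(y2 ~ y1)                     # multivariate least squares
  b0 <- coef(fit)[1, ]                   # intercepts (length k)
  B <- coef(fit)[-1, , drop = FALSE]     # slopes ((j-1) x k)
  E <- crossprod(resid(fit)) / n         # ML residual covariance (k x k)
  mu2 <- b0 + drop(t(B) %*% mu1)         # new mean entries
  S21 <- t(B) %*% S11                    # new cross-covariance block
  S22 <- E + t(B) %*% S11 %*% B          # new covariance block
  list(mu = c(mu1, mu2),
       S = rbind(cbind(S11, t(S21)), cbind(S21, S22)))
}
```

On fully observed data this reproduces the usual ML estimates of the joint mean and covariance, which is the standard check on the monotone factorization.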

Once every column has been processed the entries of the mean vector, and rows/cols of the covariance matrix are re-arranged into their original order.
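This final re-ordering amounts to inverting the column permutation, e.g. (with made-up values):

```r
## Undoing the 'pre' sort (illustrative): o <- order(na) gives the column
## permutation used internally; invert it to restore the original order.
na <- c(0, 2, 0, 1)                  # NA counts per original column
o <- order(na)                       # columns in sorted order: 1, 3, 4, 2
mu.sorted <- c(10, 30, 40, 20)       # made-up estimates, sorted order
mu <- numeric(length(mu.sorted))
mu[o] <- mu.sorted                   # back to the original column order
S.sorted <- diag(mu.sorted)          # made-up covariance, sorted order
S <- matrix(0, 4, 4)
S[o, o] <- S.sorted                  # rows/cols back in original order
```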

References

Robert B. Gramacy and Joo Hee Lee (2007). On estimating covariances between many assets with histories of highly variable length. Preprint available on arXiv:0710.5837: http://arxiv.org/abs/0710.5837

Roderick J.A. Little and Donald B. Rubin (2002). Statistical Analysis with Missing Data, Second Edition. Wiley.

Bjorn-Helge Mevik and Ron Wehrens (2007). The pls Package: Principal Component and Partial Least Squares Regression in R. Journal of Statistical Software 18(2)

Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004). Least Angle Regression (with discussion). Annals of Statistics 32(2); see also http://www-stat.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf

Trevor Hastie, Robert Tibshirani and Jerome Friedman (2001). The Elements of Statistical Learning. Springer, NY. [HTF]

Some of the code for monomvn and its subroutines was inspired by code found on the world wide web, written by Daniel Heitjan: http://www.cceb.upenn.edu/pages/heitjan/courses/bsta782/examples/fcn.q

http://www.statslab.cam.ac.uk/~bobby/monomvn.html

See Also

bmonomvn, em.norm in the norm package, and mlest in the mvnmle package

Examples

## standard usage, duplicating the results in
## Little and Rubin, section 7.4.3 -- try adding 
## verb=3 argument for a step-by-step breakdown
data(cement.miss)
out <- monomvn(cement.miss)
out
out$mu
out$S

##
## A bigger example, comparing the various methods
##

## generate N=1000 samples from a random MVN
xmuS <- randmvn(1000, 100)

## randomly impose monotone missingness
xmiss <- rmono(xmuS$x)

## plsr
out.plsr <- monomvn(xmiss)
out.plsr
kl.norm(out.plsr$mu, out.plsr$S, xmuS$mu, xmuS$S)

## pcr
out.pcr <- monomvn(xmiss, method="pcr")
kl.norm(out.pcr$mu, out.pcr$S, xmuS$mu, xmuS$S)

## ridge regression
out.ridge <- monomvn(xmiss, method="ridge")
kl.norm(out.ridge$mu, out.ridge$S, xmuS$mu, xmuS$S)

## lasso
out.lasso <- monomvn(xmiss, method="lasso")
kl.norm(out.lasso$mu, out.lasso$S, xmuS$mu, xmuS$S)

## lar
out.lar <- monomvn(xmiss, method="lar")
kl.norm(out.lar$mu, out.lar$S, xmuS$mu, xmuS$S)

## forward.stagewise
out.fs <- monomvn(xmiss, method="forward.stagewise")
kl.norm(out.fs$mu, out.fs$S, xmuS$mu, xmuS$S)

## stepwise
out.step <- monomvn(xmiss, method="stepwise")
kl.norm(out.step$mu, out.step$S, xmuS$mu, xmuS$S)
