Perform the modified EM algorithm imputation on a multivariate normal dataset
mnimput(formula, dataset, by = NULL, log = FALSE, log.offset = 1,
eps = 1e-3, maxit = 1e2, ts = TRUE, method = "spline",
sp.control = list(df = NULL, weights = NULL), ar.control =
list(order = NULL, period = NULL), ga.control = list(formula,
weights = NULL), f.eps = 1e-6, f.maxit = 1e3, ga.bf.eps = 1e-6,
ga.bf.maxit = 1e3, verbose = FALSE, digits = getOption("digits"))
formula: formula indicating the variables of the data frame with missing values, for instance ~X1+X2+X3+...+Xp
dataset: data with missing values to be imputed
by: factor for variance windows. Default is NULL for a single variance matrix
log: logical. If TRUE, data will be transformed into log scale. Default is FALSE
log.offset: if log is TRUE, log values will be shifted by this offset. Default is 1
eps: stopping criterion
maxit: maximum number of iterations
ts: logical. TRUE if the data are time series
method: method for univariate time series filtering. It may be "spline", "gam" or "arima". See Details
sp.control: list for spline smooth control. See Details
ar.control: list for ARIMA fitting control. See Details
ga.control: list for GAM fitting control. See Details
f.eps: convergence criterion for the ARIMA filter. See arima
f.maxit: maximum number of iterations for the ARIMA filter. See arima
ga.bf.eps: convergence criterion for the backfitting algorithm of GAM models. See gam
ga.bf.maxit: maximum number of iterations for the backfitting algorithm of GAM models. See gam
verbose: if TRUE, convergence information on each iteration is printed. Default is FALSE
digits: an integer indicating the decimal places. If not supplied, it is taken from options
The function returns an object of class mtsdi containing
function call
imputed dataset
estimated mean vector
estimated covariance matrix
vector holding the number of missing values on each row
number of iterations until convergence or until maxit is reached
convergence value. See Details
a logical indicating if the algorithm converged
elapsed time of the process
This is a modified version of the EM algorithm for imputation of missing values. It is also applicable to time series data. When the time series attribute is made explicit through the argument ts, missing values are estimated accounting for both the correlation between the time series and the time structure of each series itself. Several filters can be used for prediction of the mean vector in the E-step.
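As a rough illustration of the E-step idea, the sketch below fills the missing entries of one row with their conditional mean under a multivariate normal model. This is the textbook conditional-mean formula, not the package's modified algorithm; `impute_row`, `mu` and `sigma` are hypothetical names introduced only for this example.

```r
# Conditional-mean fill for one row of a multivariate normal sample.
# x: numeric vector with NAs; mu: current mean vector;
# sigma: current covariance matrix (all illustrative inputs).
impute_row <- function(x, mu, sigma) {
  m <- is.na(x)
  if (!any(m)) return(x)   # nothing to impute
  if (all(m)) return(mu)   # nothing observed: fall back to the mean
  o <- !m
  # E[x_m | x_o] = mu_m + Sigma_mo %*% solve(Sigma_oo) %*% (x_o - mu_o)
  x[m] <- mu[m] + sigma[m, o, drop = FALSE] %*%
    solve(sigma[o, o, drop = FALSE], x[o] - mu[o])
  x
}

mu    <- c(0, 0)
sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
filled <- impute_row(c(2, NA), mu, sigma)
```

In a full EM loop the mean vector and covariance matrix would be re-estimated from the filled data and the step repeated until the change falls below a tolerance such as eps.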
One can select the method for the univariate time series filtering through the argument method. The default method is "spline". In this case a smoothing spline is fitted to each of the time series at each iteration. Some parameters can be passed to smooth.spline through sp.control. df is a vector as long as the number of columns in dataset holding the fixed degrees of freedom of the splines. If NULL, the degrees of freedom of each spline are chosen by cross-validation. If df has length 1, this value is recycled for all the covariates. weights must be a matrix of the same size as dataset with the weights for smooth.spline. If NULL, all the observations will have weights equal to \(1\).
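To make the spline filter concrete, here is a minimal base R sketch of smoothing one series that has gaps, once with fixed degrees of freedom and once letting smooth.spline choose the smoothing itself. The simulated series and the variable names are assumptions made for illustration, and cv = TRUE (leave-one-out cross-validation) is only one possible reading of "chosen by cross-validation"; the package's internal call may differ.

```r
set.seed(1)
n  <- 48
tt <- seq_len(n)
y  <- sin(2 * pi * tt / 12) + rnorm(n, sd = 0.2)  # simulated series
y[c(5, 17, 30)] <- NA                             # pretend these are missing
obs <- !is.na(y)

# Fixed degrees of freedom, as when df is supplied in sp.control
fit_fixed <- smooth.spline(tt[obs], y[obs], df = 7)

# No df given: smooth.spline selects the smoothing parameter;
# cv = TRUE requests ordinary leave-one-out cross-validation
fit_cv <- smooth.spline(tt[obs], y[obs], cv = TRUE)

# Smoothed level at every time point, including the gaps
level <- predict(fit_fixed, x = tt)$y
```

The vector level has no missing values, so it can stand in for the series mean at the time points where y is NA.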
Another possibility for time series filtering is to fit an ARIMA model to each of the time series by setting method to "arima". The ARIMA models must nonetheless be identified before using this function. The arima function can be partially controlled through ar.control. Each column of order must hold the corresponding \((p,d,q)\) parameters for each univariate time series if period is NULL. If period is not NULL, order must also hold the multiplicative seasonality parameters, so each column of order takes the form \((p,d,q,P,D,Q)\). period is the multiplicative seasonality period. f.eps and f.maxit control the convergence of the ARIMA fitting algorithm. Convergence problems due to non-stationarity may arise when using this option.
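The layout of order can be sketched as follows. The orders themselves are made up for illustration (they are not identified models for any real series), the column names are borrowed from the example dataset, and the last few lines only show how one column would map onto the arguments of stats::arima, which is an assumption about how the values are consumed.

```r
# Hypothetical non-seasonal orders, one column (p, d, q) per series
order_ns <- matrix(c(1, 0, 0,
                     1, 0, 1,
                     2, 0, 0,
                     1, 1, 0,
                     0, 0, 1),
                   nrow = 3,
                   dimnames = list(c("p", "d", "q"), paste0("c3", 1:5)))
ar_ns <- list(order = order_ns, period = NULL)

# With a seasonal period, each column carries (p, d, q, P, D, Q)
order_s <- rbind(order_ns, P = 1, D = 0, Q = 0)
ar_s <- list(order = order_s, period = 12)

# How the first column would translate into stats::arima arguments
col1 <- order_s[, 1]
arima_args <- list(order = unname(col1[1:3]),
                   seasonal = list(order = unname(col1[4:6]),
                                   period = ar_s$period))
```

Either list would then be passed as ar.control together with method = "arima".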
Last but not least, a very interesting approach to modelling temporal patterns is to use a full-fledged regression model. Generalised additive (or linear) models with exogenous variates can be used to filter the time patterns properly. To use regression models for the level, set method to "gam" and supply a vector of formulas in ga.control, one formula for each covariate. Using covariates that are part of the formula of the imputation model may yield some collinearity among the variates. See gam and glm for details.
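A set of formulas for ga.control might be assembled as in the sketch below, with one formula per variable in the imputation model. The regressor time and the smooth term s(time) are hypothetical: they assume the dataset has a column named time and that s() from the gam package is the intended smoother. Note that combining formulas in R yields a list, which is what the sketch builds.

```r
# Variables of the imputation model (names taken from the example dataset)
vars <- c("c31", "c32", "c33", "c34", "c35")

# One formula per variable; `time` is a hypothetical exogenous regressor
ga_formulas <- lapply(vars, function(v) as.formula(paste(v, "~ s(time)")))

# Control list in the shape shown in the usage signature
ga_ctrl <- list(formula = ga_formulas, weights = NULL)
```

The list ga_ctrl would then be passed as ga.control together with method = "gam".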
Simulations have shown that the algorithm is stable and yields good results on imputation of normal data.
Junger, W.L. and Ponce de Leon, A. (2015) Imputation of Missing Data in Time Series for Air Pollutants. Atmospheric Environment, 102, 96-104.
Johnson, R., Wichern, D. (1998) Applied Multivariate Statistical Analysis. Prentice Hall.
Dempster, A., Laird, N., Rubin, D. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1-38.
McLachlan, G. J., Krishnan, T. (1997) The EM algorithm and extensions. John Wiley and Sons.
Box, G., Jenkins, G., Reinsel, G. (1994) Time Series Analysis: Forecasting and Control. 3rd ed. Prentice Hall.
Hastie, T. J., Tibshirani, R. J. (1990) Generalized Additive Models. Chapman and Hall.
# NOT RUN {
library(mtsdi)
data(miss)
f <- ~ c31 + c32 + c33 + c34 + c35
## one-window covariance
i <- mnimput(f, miss, eps = 1e-3, ts = TRUE, method = "spline",
             sp.control = list(df = c(7, 7, 7, 7, 7)))
summary(i)
## two-window covariances
b <- c(rep("year1", 12), rep("year2", 12))
ii <- mnimput(f, miss, by = b, eps = 1e-3, ts = TRUE, method = "spline",
              sp.control = list(df = c(7, 7, 7, 7, 7)))
summary(ii)
# }