TestMCARNormality: Testing Homoscedasticity, Multivariate Normality, and Missing Completely at Random

Description

The main purpose of this package is to test whether the missing data mechanism for an incompletely observed data set is missing completely at random (MCAR). As a by-product, the package can also impute incomplete data, test whether data have a multivariate normal distribution, test equality of covariances across groups, and obtain normal-theory maximum likelihood estimates of the mean and covariance when data are incomplete.

The test of MCAR follows the methodology proposed by Jamshidian and Jalal (2010). It is based on testing equality of covariances between groups having identical missing data patterns. The data are first imputed, either assuming normality or with a distribution-free method, and the test of equality of covariances between groups with identical missing data patterns is then performed either assuming normality (Hawkins test) or nonparametrically. Users can optionally supply data imputed by their own method.

Multiple imputation is an additional feature that can be used as a diagnostic tool to help identify cases or variables that contribute to rejection of MCAR when the MCAR hypothesis is rejected (see Jamshidian and Jalal, 2010, for details). As explained in Jamshidian, Jalal, and Jansen (2014), this package can also be used for imputing missing data, testing multivariate normality, and testing equality of covariances between several groups when data are completely observed.

Usage

TestMCARNormality(
  data,
  del.lesscases = 6,
  imputation.number = 1,
  method = "Auto",
  imputation.method = "Dist.Free",
  nrep = 10000,
  n.min = 30,
  seed = 110,
  alpha = 0.05,
  imputed.data = NA
)

Value

analyzed.data: The data that were used in the analysis. If del.lesscases = 0, this is the same as the original input data. If del.lesscases > 0, this is the data with cases removed.
imputed.data: The analyzed.data after imputation. If imputation.number > 1, the first imputed data set is returned.
ordered.data: The analyzed.data ordered according to missing data pattern, using the function OrderMissing.
caseorder: A mapping of case number indices from ordered.data to the original data. More specifically, the j-th row of the ordered.data is the caseorder[j]-th (the j-th element of caseorder) row of the original data.
pnormality: p-value for the nonparametric test. When imputation.number > 1, this is a vector with one element for each imputed data set.
adistar: A matrix consisting of the Anderson-Darling test statistic for each group (columns) and each imputation (rows).
adstar: Sum of adistar. When imputation.number > 1, this is a vector with one element for each imputed data set.
pvalcomb: p-value for the Hawkins test. When imputation.number > 1, this is a vector with one element for each imputed data set.
pvalsn: A matrix consisting of Hawkins test statistics for each group (columns) and each imputation (rows).
g: Number of patterns used in the analysis.
combp: Hawkins test statistic. When imputation.number > 1, this is a vector with one element for each imputed data set.
alpha: The significance level at which the hypothesis tests are performed.
patcnt: A vector consisting of the number of cases corresponding to each pattern in patused.
patused: A matrix indicating the missing data patterns in the data set, using 1's and NA's.
imputation.number: A value greater than or equal to 1. If a value larger than 1 is used, data will be imputed imputation.number times.
mu: The normal-theory maximum likelihood estimate of the variable means.
sigma: The normal-theory maximum likelihood estimate of the covariance matrix of the variables.
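
The relationship between ordered.data, caseorder, patused, and patcnt can be illustrated with a small base-R sketch. This is an emulation written for illustration under the descriptions above, not the code of OrderMissing; the names pattern_key, ordered_data, and first_row are hypothetical, and the package may order the patterns differently.

```r
# Illustrative only: base-R emulation of ordering cases by missing-data pattern.
# This is not the OrderMissing implementation; the pattern order may differ.
set.seed(1)
y <- matrix(rnorm(40), nrow = 10)
y[sample(length(y), 8)] <- NA

# One key per row: "1" where a value is observed, "0" where it is missing
pattern_key <- apply(!is.na(y), 1, function(r) paste(as.integer(r), collapse = ""))

# caseorder holds row indices into the original data
caseorder <- order(pattern_key)
ordered_data <- y[caseorder, , drop = FALSE]

# Row j of the ordered data is row caseorder[j] of the original data
j <- 3
identical(ordered_data[j, ], y[caseorder[j], ])

# Number of cases per pattern, and one representative row per pattern
# recoded as 1 (observed) or NA (missing)
patcnt <- as.vector(table(pattern_key))
first_row <- caseorder[!duplicated(pattern_key[caseorder])]
patused <- ifelse(is.na(y[first_row, , drop = FALSE]), NA, 1)
```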

Arguments

data

A matrix or data frame consisting of at least two columns. Values must be numerical with missing data indicated by NA.

del.lesscases

Missing data patterns with del.lesscases or fewer cases are removed from the data set.
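
As a rough illustration of this rule, the following base-R sketch drops every pattern that has del.lesscases or fewer cases. It is an assumption-based emulation, not the package's internal code; pattern_key, pattern_size, keep, and analyzed are hypothetical names.

```r
# Illustrative only: drop missing-data patterns with del.lesscases or fewer cases.
set.seed(2)
y <- matrix(rnorm(200), nrow = 50)
y[matrix(runif(200), nrow = 50) < 0.15] <- NA

del.lesscases <- 6

# One key per row describing which entries are missing
pattern_key <- apply(is.na(y), 1, function(r) paste(as.integer(r), collapse = ""))
pattern_size <- table(pattern_key)

# Keep only rows whose pattern occurs more than del.lesscases times
keep <- as.vector(pattern_size[pattern_key]) > del.lesscases
analyzed <- y[keep, , drop = FALSE]

c(original = nrow(y), analyzed = nrow(analyzed))
```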

imputation.number

Number of imputations to be used, if data are to be multiply imputed.

method

method allows the user to select either the Hawkins or the nonparametric method for the test. If the user is certain that the data have a multivariate normal distribution, method = "Hawkins" should be selected. If the data are not normally distributed, method = "Nonparametric" should be used. If the user is unsure, the default method = "Auto" should be used, in which case both the Hawkins and the nonparametric tests are run, and the default output follows the recommendation of Jamshidian and Jalal (2010) outlined in the flowchart in Figure 7 of their paper.

imputation.method

"Dist.Free": Missing data are imputed nonparametrically using the method of Srivastava and Dolatabadi (2009); also see Jamshidian and Jalal (2010).

"Normal": Missing data are imputed assuming that the data come from a multivariate normal distribution. The maximum likelihood estimate of the mean and covariance obtained from Mls is used for generating imputed values. The imputed values are based on the conditional distribution of the missing variables given the observed variables; see Jamshidian and Jalal (2010) for more details.

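The idea behind the "Normal" option, drawing the missing entries of a case from the conditional normal distribution given its observed entries, can be sketched in base R as follows. This is a minimal illustration and not the package's implementation: impute_row_normal is a hypothetical helper, mu and sigma are assumed known here (in the package they would be the maximum likelihood estimates), and the row is assumed to have at least one observed value.

```r
# Illustrative only: impute one row by drawing from the conditional normal
# distribution of the missing variables given the observed ones.
impute_row_normal <- function(x, mu, sigma) {
  m <- is.na(x)
  if (!any(m)) return(x)
  if (all(m)) stop("row has no observed values")
  o <- !m
  s_oo_inv <- solve(sigma[o, o, drop = FALSE])
  s_mo <- sigma[m, o, drop = FALSE]
  # Conditional mean and covariance of x[m] given x[o]
  cond_mean <- mu[m] + s_mo %*% s_oo_inv %*% (x[o] - mu[o])
  cond_cov <- sigma[m, m, drop = FALSE] - s_mo %*% s_oo_inv %*% t(s_mo)
  cond_cov <- (cond_cov + t(cond_cov)) / 2  # guard against numerical asymmetry
  # Draw z ~ N(0, I) and transform it with the Cholesky factor of cond_cov
  z <- rnorm(sum(m))
  x[m] <- as.vector(cond_mean + t(chol(cond_cov)) %*% z)
  x
}

set.seed(3)
mu <- c(0, 0, 0)
sigma <- 0.3 * (matrix(1, 3, 3) - diag(3)) + diag(3)
x <- c(1.2, NA, -0.5)
x_imp <- impute_row_normal(x, mu, sigma)
x_imp
```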
nrep

Number of replications used to simulate the Neyman distribution in order to determine the cutoff value for the Neyman test in the program SimNey. Larger values increase the accuracy of the Neyman test.

n.min

The minimum number of cases in a group that triggers the use of the asymptotic chi-squared distribution in place of the empirical distribution in the Neyman test of uniformity.

seed

An initial seed for the random number generator. The default is 110, which can be reset to a user-selected number. If the value is set to NA, a system-selected seed is used.

alpha

The significance level at which tests are performed.

imputed.data

The user can optionally provide an imputed data set. In this case the program will not impute the data and will use the imputed data set for the tests performed. Note that the order of cases in the imputed data set should be the same as that of the incomplete data set.
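
Before supplying a user-imputed data set, it may help to confirm that it lines up with the incomplete data. The sketch below is one possible check, assuming only the requirements stated above; check_imputed is a hypothetical helper, not a function of the package, and column-mean imputation is used purely as a short stand-in for the user's own method.

```r
# Illustrative only: check that an imputed data set matches the incomplete one.
check_imputed <- function(data, imputed) {
  data <- as.matrix(data)
  imputed <- as.matrix(imputed)
  obs <- !is.na(data)
  identical(dim(data), dim(imputed)) &&          # same number of cases and variables
    !anyNA(imputed) &&                           # no missing values remain
    isTRUE(all.equal(data[obs], imputed[obs]))   # observed values unchanged, same row order
}

set.seed(4)
y <- matrix(rnorm(300 * 4), nrow = 300)
y[matrix(runif(300 * 4), nrow = 300) < 0.2] <- NA

# Stand-in imputation: replace each missing value by its column mean
y_imputed <- y
for (j in seq_len(ncol(y))) {
  y_imputed[is.na(y[, j]), j] <- mean(y[, j], na.rm = TRUE)
}

check_imputed(y, y_imputed)                 # rows in the same order
check_imputed(y, y_imputed[nrow(y):1, ])    # rows reversed
```

A data set that passes such a check could then be supplied through the imputed.data argument, for example TestMCARNormality(data = y, imputed.data = y_imputed).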

Author

Mortaza Jamshidian, Siavash Jalal, and Camden Jansen

Details

Theoretical, technical, and practical details about this program and its uses can be found in Jamshidian and Jalal (2010) and Jamshidian, Jalal, and Jansen (2014).

References

Jamshidian, M. and Bentler, P. M. (1999). "ML estimation of mean and covariance structures with missing data using complete data routines." Journal of Educational and Behavioral Statistics, 24, 21-41. doi:10.2307/1165260.

Jamshidian, M. and Jalal, S. (2010). "Tests of homoscedasticity, normality, and missing completely at random for incomplete multivariate data." Psychometrika, 75, 649-674. doi:10.1007/s11336-010-9175-3.

Jamshidian, M., Jalal, S., and Jansen, C. (2014). "MissMech: An R Package for Testing Homoscedasticity, Multivariate Normality, and Missing Completely at Random (MCAR)." Journal of Statistical Software, 56(6), 1-31. doi:10.18637/jss.v056.i06.

Examples

#-- Example 1: Data are MCAR and normally distributed
n <- 300
p <- 5
pctmiss <- 0.2
set.seed(1010)
y <- matrix(rnorm(n * p),nrow = n)
missing <- matrix(runif(n * p), nrow = n) < pctmiss
y[missing] <- NA
out <- TestMCARNormality(data=y)
print(out)

# --- Prints the p-value for both the Hawkins and the nonparametric test
summary(out)

# --- Uses more cases
out1 <- TestMCARNormality(data=y, del.lesscases = 1)
print(out1)

# --- Performs multiple imputation
Out <- TestMCARNormality(data = y, imputation.number = 10)
summary(Out)
boxplot(Out)
#-- Example 2: Data are MCAR and non-normally distributed (t distributed with d.f. = 5)
n <- 300
p <- 5
pctmiss <- 0.2
set.seed(1010)
y <- matrix(rt(n * p, 5), nrow = n)
missing <- matrix(runif(n * p), nrow = n) < pctmiss
y[missing] <- NA
out <- TestMCARNormality(data=y)
print(out)

# --- Performs multiple imputation
Out_m <- TestMCARNormality(data = y, imputation.number = 20)
boxplot(Out_m)
#-- Example 3: Data are MAR (not MCAR), but are normally distributed
n <- 300
p <- 5
r <- 0.3
mu <- rep(0, p)
sigma <- r * (matrix(1, p, p) - diag(1, p))+ diag(1, p)
set.seed(110)
eig <- eigen(sigma)
sig.sqrt <- eig$vectors %*%  diag(sqrt(eig$values)) %*%  solve(eig$vectors)
# Symmetrize the matrix square root to remove numerical asymmetry
sig.sqrt <- (sig.sqrt + t(sig.sqrt)) / 2
y <- matrix(rnorm(n * p), nrow = n) %*%  sig.sqrt
tmp <- y
for (j in 2:p){
  y[tmp[, j - 1] > 0.8, j] <- NA 
}
out <- TestMCARNormality(data = y, alpha =0.1)
print(out)
#-- Example 4: Multiple imputation; data are MAR (not MCAR), but are normally distributed
n <- 300
p <- 5
pctmiss <- 0.2
set.seed(1010)
y <- matrix(rnorm(n * p), nrow = n)
missing <- matrix(runif(n * p), nrow = n) < pctmiss
y[missing] <- NA
Out <- OrderMissing(y)
y <- Out$data
spatcnt <- Out$spatcnt
g2 <- seq(spatcnt[1] + 1, spatcnt[2])
g4 <- seq(spatcnt[3] + 1, spatcnt[4])
y[c(g2, g4), ] <- 2 * y[c(g2, g4), ]
out <- TestMCARNormality(data = y, imputation.number = 20)
print(out)
boxplot(out)

# Removing Groups 2 and 4 (the two groups whose rows were rescaled above)
y1 <- y[-c(g2, g4), ]
out <- TestMCARNormality(data = y1, imputation.number = 20)
print(out)
boxplot(out)
#-- Example 5: Test of homoscedasticity for complete data
n <- 50
p <- 5
r <- 0.4
sigma <- r * (matrix(1, p, p) - diag(1, p)) + diag(1, p)
set.seed(1010)
eig <- eigen(sigma)
sig.sqrt <- eig$vectors %*%  diag(sqrt(eig$values)) %*%  solve(eig$vectors)
# Symmetrize the matrix square root to remove numerical asymmetry
sig.sqrt <- (sig.sqrt + t(sig.sqrt)) / 2
y1 <- matrix(rnorm(n * p), nrow = n) %*%  sig.sqrt
n <- 75
p <- 5
y2 <- matrix(rnorm(n * p), nrow = n)
n <- 25
p <- 5
r <- 0
sigma <- r * (matrix(1, p, p) - diag(1, p)) + diag(2, p)
y3 <- matrix(rnorm(n * p), nrow = n) %*%  sqrt(sigma)
ycomplete <- rbind(y1, y2, y3)
y1[, 1] <- NA
y2[, c(1, 3)] <- NA
y3[, 2] <- NA
ygroup <- rbind(y1, y2, y3)
out <- TestMCARNormality(data = ygroup, method = "Hawkins", imputed.data = ycomplete)
print(out)
#-- Example 6: Real data
data(agingdata)
TestMCARNormality(agingdata, del.lesscases = 1)
