grmsd: Generalized Root Mean Square Distance Between Observed and Imputed Values

Description

Computes the root mean square distance between predicted and corresponding observed values in an orthogonal multivariate space. This value is the mean Mahalanobis distance between observed and imputed values in a space defined by the observations and variables where both observed and predicted values are defined. The statistic provides a way to compare imputation (or prediction) results. While it is designed to work with imputation, the function can be used with objects that inherit from lm or with matrices and data frames that follow the column naming convention described in the details.

Usage

grmsd(...,ancillaryData=NULL,vars=NULL,wts=NULL,rtnVectors=FALSE,imputeMethod="closest")

Value

When rtnVectors=FALSE, a sorted named vector of mean distances is returned; the names are taken from the arguments.

When rtnVectors=TRUE, the function returns vectors of distances, sorted and named as done when this argument is FALSE.

Arguments

...: objects created by any combination of yai, impute.yai, ensembleImpute, buildConsensus, lm and data frames or matrices that follow the column naming convention described in the details below. If an object is of class yai, a call to impute.yai is generated internally.
ancillaryData: a data frame that defines variables, passed to impute.yai.
vars: a list of variable names you want to include; if NULL all available variables are included (note that when impute.yai is used, the Y-variables are returned when vars=NULL).
wts: a vector of weights used to compute the mean distances, see details below.
rtnVectors: if TRUE, the vectors of individual distances are returned (see Value) rather than the mean Mahalanobis distance.
imputeMethod: passed as method to impute.yai.

Author

Nicholas L. Crookston ncrookston.fs@gmail.com

Details

This function is designed to compute the root mean square distance between observed and predicted observations over several variables at once. It is the Mahalanobis distance between observed and predicted but the name emphasizes the similarities to root mean square difference (or error, see rmsd). Here are some notable characteristics.

In the univariate case this function returns the same value as rmsd with scale=TRUE. In that case the root mean square difference is computed after scale has been called on the variable.
Like rmsd, grmsd is zero if the imputed values are exactly the same as the observed values over all variables.
Like rmsd, grmsd is ~1.0 when the mean of each variable is imputed in place of a near neighbor. It would be exactly 1.0 if the maximum likelihood estimate of the covariance were used rather than the unbiased estimate, and it approaches 1 as n gets large. This situation corresponds to regression where the slope is zero. It indicates that the imputation error is, overall, the same as it would be if the means of the variables were imputed rather than near neighbors (it does not signal that the means were imputed).
Like rmsd, values of grmsd > 1.0 indicate that, on average, the errors in the imputation are greater than they would be if the mean of the corresponding variables were imputed for each observation.
Note that individual rmsd values can be computed even when the variance of the variable is zero. In contrast, grmsd can only be computed in the situation where the observed data matrix is full rank. Rank is determined using qr and columns are removed from the analysis to create this condition if necessary (with a warning).
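The characteristics above can be checked with a small base R sketch. It is not the package's implementation: grmsd_sketch is a hypothetical helper, and it assumes the statistic is the square root of the mean squared Mahalanobis distance divided by the number of variables, with the unbiased covariance of the observed values defining the space. That assumption is consistent with the properties listed above but is not taken from the yaImpute source, and the sketch does no rank checking.

```r
# Hypothetical helper (not part of yaImpute): root mean square Mahalanobis
# distance between observed and predicted values, scaled by the number of
# variables. Assumes the observed matrix is full rank.
grmsd_sketch <- function(obs, pred) {
  obs  <- as.matrix(obs)
  pred <- as.matrix(pred)
  diff <- obs - pred
  Sinv <- solve(cov(obs))                  # inverse of the unbiased covariance
  d2   <- rowSums((diff %*% Sinv) * diff)  # squared Mahalanobis distance per row
  sqrt(mean(d2) / ncol(obs))
}

obs <- iris[, 3:4]                         # Petal.Length, Petal.Width
n   <- nrow(obs)

# Predictions identical to the observations give zero.
exact <- grmsd_sketch(obs, obs)

# Predicting the column means gives sqrt((n - 1) / n), which is close to 1.
pred_mean <- matrix(colMeans(obs), nrow = n, ncol = ncol(obs), byrow = TRUE)
at_mean   <- grmsd_sketch(obs, pred_mean)

# Univariate case: equals the root mean square difference divided by sd().
o   <- iris$Petal.Length
p   <- o + 0.5                             # arbitrary illustrative predictions
uni <- grmsd_sketch(o, p)
uni_scaled_rmsd <- sqrt(mean((o - p)^2)) / sd(o)

print(c(exact = exact, at_mean = at_mean, uni = uni, scaled_rmsd = uni_scaled_rmsd))
```

Under this assumption, the gap between at_mean and 1 comes only from using the unbiased covariance estimate, which is why the value approaches 1 as n grows.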

Observed and predicted values are matched using the column names. Column names that carry the ".o" suffix (observed) are matched to the same names without it (imputed or predicted). Columns that do not have matching observed and imputed (predicted) values are ignored.
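As an illustration of the naming convention, the base R sketch below builds a toy data frame and pairs its columns by name. The pairing code is only an assumption about how such matching could be done (it treats ".o" as a suffix at the end of the name); it is not the code used inside grmsd, and the Extra column is made up for the example.

```r
# Toy data frame: two observed/predicted pairs plus two unpaired columns.
d <- data.frame(
  Petal.Length.o = iris$Petal.Length,        # observed
  Petal.Length   = mean(iris$Petal.Length),  # predicted (column mean)
  Petal.Width.o  = iris$Petal.Width,         # observed
  Petal.Width    = mean(iris$Petal.Width),   # predicted (column mean)
  Sepal.Length.o = iris$Sepal.Length,        # observed only, no prediction
  Extra          = 1                         # hypothetical, no observed partner
)

obs_cols  <- grep("\\.o$", colnames(d), value = TRUE)  # names ending in ".o"
pred_cols <- sub("\\.o$", "", obs_cols)                # drop the suffix
keep      <- pred_cols %in% colnames(d)                # pairs present in d

paired <- data.frame(observed  = obs_cols[keep],
                     predicted = pred_cols[keep],
                     stringsAsFactors = FALSE)
print(paired)
```

Here Sepal.Length.o and Extra have no partner, so they fall outside the pairs, which mirrors the rule that unmatched columns are ignored.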

Several objects may be passed as "...". Function impute.yai is called for any objects that were created by yai; ancillaryData and vars are passed to impute.yai when it is used.

When objects inherit from lm, a suitable matrix is formed by calling the predict and resid functions.
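A minimal sketch of how such a matrix could be assembled from an lm fit follows. It assumes the observed values are recovered as prediction plus residual and given the ".o" suffix; the exact steps inside grmsd may differ, and lm_matrix is a name chosen for this example.

```r
# Multivariate linear model on the iris data (base R only).
fit <- lm(cbind(Petal.Length, Petal.Width) ~ Sepal.Length + Sepal.Width,
          data = iris)

pred <- predict(fit)                 # fitted values, one column per response
obs  <- pred + resid(fit)            # observed = fitted + residual
colnames(obs) <- paste0(colnames(pred), ".o")

# Observed (".o") and predicted columns side by side, following the
# column naming convention described above.
lm_matrix <- cbind(obs, pred)
print(head(lm_matrix, 3))
```

A matrix in this layout can be passed to grmsd like any other matrix or data frame that follows the naming convention.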

Factors, if found, are removed (with a warning).

When argument wts is defined there must be one value for each pair of observed and predicted variables. If the values are named (preferred), then the names are matched to the names of predicted variables (no .o suffix). The matched values effectively scale the axes in which distances are computed. When this is done, the resulting distances are not Mahalanobis distances.

Examples

library(yaImpute)

data(iris)
set.seed(12345)

# form some test data
refs <- sample(rownames(iris),50)
x <- iris[,1:2]      # Sepal.Length Sepal.Width
y <- iris[refs,3:4]  # Petal.Length Petal.Width

# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")

# compute the average distances between observed and imputed (predicted)
grmsd(msn,mal,lmFit=lm(as.matrix(y) ~ ., data=x[refs,]))

# use all the variables and observations in iris
# Species is a factor and is automatically deleted with a warning
grmsd(msn,mal,ancillaryData=iris)

# here is an example using lm, and another using column
# means as predictions.

impMean <- y 
colnames(impMean) <- paste0(colnames(impMean),".o")
impMean <- cbind(impMean,y)
# set the predictions to the means of the variables
impMean[,"Petal.Length"] <- mean(impMean[,"Petal.Length"])
impMean[,"Petal.Width"]  <- mean(impMean[,"Petal.Width"])

grmsd(msn, mal, lmFit=lm(as.matrix(y) ~ ., data=x[refs,]), impMean )

# compare to using function rmsd (values match):
msnimp <- na.omit(impute(msn))
grmsd(msnimp[,c("Petal.Length","Petal.Length.o")])   
rmsd(msnimp[,c("Petal.Length","Petal.Length.o")],scale=TRUE)

# these are multivariate cases and they don't match
# because the covariance of the two variables is > 0.
grmsd(msnimp)
colSums(rmsd(msnimp,scale=TRUE))/2

# get the vectors and make a boxplot, identify outliers
stats <- boxplot(grmsd(msn,mal,ancillaryData=iris[,-5],rtnVectors=TRUE),
                 ylab="Mahalanobis distance")
stats$out
#     118      132 
#2.231373 1.990961