notablyDifferent: Finds observations with large differences between observed and imputed values

Description

This routine identifies observations with large errors as measured by scaled root mean square error (see rmsd.yai). A threshold is used to detect observations with large differences.

Usage

notablyDifferent(object,vars=NULL,threshold=NULL,p=.05,...)

Value

A named list of several items. In all cases, vectors are named using the observation ids, which are the row names of the data used to build the yai object.

call: The call.
vars: The variables used (may be fewer than requested).
threshold: The threshold value.
notablyDifferent.refs: A sorted named vector of references that exceed the threshold.
notablyDifferent.trgs: A sorted named vector of targets that exceed the threshold.
rmsdS.refs: A sorted named vector of scaled RMSD references.
rmsdS.trgs: A sorted named vector of scaled RMSD targets.

Arguments

object: an object of class yai.
vars: a vector of character strings naming the variables to use; if NULL, the X-variables from object are used.
threshold: a threshold value; observations that exceed it are listed as notably different.
p: (1-p)*100 is the percentile point in the distribution of differences used to compute the threshold (used when threshold is NULL).
...: additional arguments passed to impute.yai.

Author

Nicholas L. Crookston ncrookston.fs@gmail.com

Details

The scaled differences are computed as follows:

A matrix of differences between observed and imputed values is computed for each observation (rows) and each variable (columns).
These differences are scaled by dividing by the standard deviation of the observed values among the reference observations.
The scaled differences are squared.
Row means are computed resulting in one value for each observation.
The square root of each of these values is taken.

These values are Euclidean distances between the target observations and their nearest references as measured using the specified variables. All the variables that are used must have both observed and imputed values. Generally, these will be the X-variables and not the Y-variables.
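
The steps above can be sketched in base R. This is a minimal illustration, not the package's own code: the matrices observed and imputed and their values are hypothetical, and every row is assumed to be a reference observation when the standard deviations are computed.

```r
# Hypothetical observed and imputed values: three observations (rows),
# two variables (columns). The numbers are illustrative only.
observed <- matrix(c(5.1, 6.0, 6.9,
                     1.4, 4.5, 5.7),
                   ncol = 2,
                   dimnames = list(c("a", "b", "c"), c("v1", "v2")))
imputed <- matrix(c(5.0, 6.4, 6.3,
                    1.5, 4.0, 5.9),
                  ncol = 2,
                  dimnames = dimnames(observed))

# Assumption: all rows are treated as reference observations here
ref_sd <- apply(observed, 2, sd)

diffs   <- observed - imputed            # differences, one row per observation
scaled  <- sweep(diffs, 2, ref_sd, "/")  # divide each column by its std. dev.
squared <- scaled^2                      # square the scaled differences
rmsdS   <- sqrt(rowMeans(squared))       # row means, then square roots

sort(rmsdS)                              # one scaled RMSD per observation id
```

Because rowMeans keeps the row names, the result is a vector named by observation id, matching the naming convention described under Value.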

When threshold is NULL, the function computes one using the quantile function with its default arguments and probs=1-p.
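
As a rough sketch of that default, the following base R code derives a threshold from a vector of scaled RMSD values and keeps the observations above it. The vector rmsdS and its values are hypothetical, and which set of values the function itself passes to quantile is not stated here, so treat this as an illustration only.

```r
# Hypothetical scaled RMSD values, named by observation id
rmsdS <- c(obs1 = 0.12, obs2 = 0.45, obs3 = 0.08, obs4 = 1.90, obs5 = 0.30)

p <- 0.05

# quantile() with its default arguments, evaluated at probs = 1 - p
threshold <- quantile(rmsdS, probs = 1 - p)

# Observations whose scaled RMSD exceeds the threshold, sorted
notable <- sort(rmsdS[rmsdS > threshold])
notable
```

A smaller p moves the threshold further into the upper tail, so fewer observations are flagged.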

Examples


library(yaImpute)

data(iris)

set.seed(12345)

# form some test data
refs <- sample(rownames(iris),50)
x <- iris[,1:3]      # Sepal.Length Sepal.Width Petal.Length
y <- iris[refs,4:5]  # Petal.Width Species

# build an msn run, first build dummy variables for species.

sp1 <- as.integer(iris$Species=="setosa")
sp2 <- as.integer(iris$Species=="versicolor")
y2 <- data.frame(cbind(iris[,4],sp1,sp2),row.names=rownames(iris))
y2 <- y2[refs,]

names(y2) <- c("Petal.Width","Sp1","Sp2")

msn <- yai(x=x,y=y2,method="msn")

notablyDifferent(msn)
