simEval: A function for evaluating similarity/dissimilarity matrices (simEval)

Description

This function searches for the most similar sample of each sample in a given data set based on a similarity/dissimilarity matrix (e.g. a distance matrix). The samples are compared against their corresponding most similar samples in terms of the side information provided. For continuous variables, the root mean square of differences and the correlation coefficient are computed; for discrete variables, the kappa index is calculated.
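
The nearest-neighbour search described above can be sketched in base R. This is an illustration only, not the package's internal implementation: the helper name `nearest_neighbour` and the toy data are assumptions made for this example, and the input is assumed to be a square dissimilarity matrix (smaller values mean more similar).

```r
# Hypothetical helper (not part of the package): for each row of a square
# dissimilarity matrix, return the index of the most similar other sample.
nearest_neighbour <- function(d) {
  d <- as.matrix(d)
  diag(d) <- Inf            # a sample must not be selected as its own neighbour
  apply(d, 1, which.min)    # position of the smallest dissimilarity in each row
}

# Toy data: four samples located on a line
x <- c(a = 0, b = 1, c = 5, d = 7)
d <- as.matrix(dist(x))

nn <- nearest_neighbour(d)
nn
```

For a similarity matrix (larger values mean more similar), the same idea would apply with the diagonal set to `-Inf` and `which.max` in place of `which.min`.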

Usage

simEval(d, sideInf, lower.tri = FALSE, cores = 1, ...)

Arguments

d

a vector or a square symmetric matrix (or data.frame) of similarity/dissimilarity scores between the samples of a given data set (see lower.tri).

sideInf

a vector containing the side information corresponding to the samples in the dataset from which the similarity/dissimilarity matrix was computed. It can be either a numeric vector (continuous variable) or a factor (discrete variable). If it i

lower.tri

a logical indicating whether the input similarities/dissimilarities are given as a vector of the lower triangle of the distance matrix (as returned e.g. by base::dist) or as a square symmetric matrix (or

cores

number of cores used to find the nearest neighbours of similarity/dissimilarity scores (default = 1). See Details.

...

additional parameters (for internal use only).

Value

simEval returns a list with the following components:
- eval: either the RMSD (and the correlation coefficient) or the kappa index
- firstNN: a data.frame containing the original side informative variable in the first column and the side informative values of the corresponding nearest neighbours in the second column

Details

For the evaluation of similarity/dissimilarity matrices this function uses side information (information about one variable which is available for a group of samples, Ramirez-Lopez et al., 2013). It is assumed that there is a correlation (or at least an indirect or secondary correlation) between this side informative variable and the spectra. In other words, this approach is based on the assumption that the similarity measures between the spectra of a given group of samples should also reflect their similarity in terms of the side informative variable (e.g. compositional similarity).

If sideInf is a numeric vector, the root mean square of differences (RMSD) is used for assessing the similarity between the samples and their corresponding most similar samples in terms of the side information provided. It is computed as follows:

RMSD = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \ddot{y}_{i})^{2}}

where $y_i$ is the value of the side variable of the $i$th sample, $\ddot{y}_i$ is the value of the side variable of the nearest neighbour of the $i$th sample and $n$ is the total number of observations. If sideInf is a factor, the kappa index ($\kappa$) is used instead of the RMSD. It is computed as follows:

\kappa = \frac{p_{o} - p_{e}}{1 - p_{e}}

where $p_o$ and $p_e$ are two different agreement indices between the side information of the samples and the side information of their corresponding nearest samples (i.e. most similar samples). While $p_o$ is the relative agreement, $p_e$ is the agreement expected by chance.

Multi-threading for the computation of dissimilarities (see the cores parameter) is based on OpenMP and hence works only on Windows and Linux.
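
The RMSD and kappa formulas above can be checked by hand with a few lines of base R. This is a minimal sketch under stated assumptions, not the code used inside simEval: the toy vectors and the helper name `kappa_index` are made up for illustration, and $p_e$ is assumed to be computed from the marginal class proportions, as in Cohen's kappa.

```r
# Toy side information of each sample (y) and of its nearest neighbour (y_nn).
# These values are invented for illustration only.
y    <- c(1.0, 2.0, 3.0, 4.0)
y_nn <- c(2.0, 1.0, 4.0, 3.0)

# Continuous side information: RMSD and correlation coefficient
rmsd <- sqrt(mean((y - y_nn)^2))
r    <- cor(y, y_nn)

# Discrete side information: kappa index.
# Hypothetical helper; p_e from marginal proportions is an assumption.
kappa_index <- function(f, f_nn) {
  lev <- union(as.character(f), as.character(f_nn))
  tab <- table(factor(f, levels = lev), factor(f_nn, levels = lev))
  p   <- tab / sum(tab)
  p_o <- sum(diag(p))                   # relative (observed) agreement
  p_e <- sum(rowSums(p) * colSums(p))   # agreement expected by chance
  (p_o - p_e) / (1 - p_e)
}

f    <- factor(c("clay", "clay", "sand", "sand"))
f_nn <- factor(c("clay", "sand", "sand", "sand"))
k    <- kappa_index(f, f_nn)

c(rmsd = rmsd, r = r, kappa = k)
```

A lower RMSD (or a higher correlation or kappa) indicates that the nearest neighbours found with the dissimilarity matrix are also close in terms of the side information.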

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J.A.M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Examples

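library(prospectr)  # provides the NIRsoil data set
library(resemble)   # provides simEval, orthoDiss, orthoProjection, fDiss, corDiss and mcorDiss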

data(NIRsoil)

Yr <- NIRsoil$Nt[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train),]

# Example 1
# Compute a principal components distance
pca.d <- orthoDiss(Xr = Xr, pcSelection = list("cumvar", 0.999), 
                   method = "pca", 
                   local = FALSE, 
                   center = TRUE, scaled = TRUE)

# The final number of pcs used for computing the distance 
# matrix of objects in Xr
pca.d$n.components

# The final distance matrix 
ds <- pca.d$dissimilarity

# Example 1.1
# Evaluate the distance matrix on the basis of the 
# side information (Yr) associated with Xr
se <- simEval(d = ds, sideInf = Yr)

# The final evaluation results
se$eval

# The final values of the side information (Yr) and the values of 
# the side information corresponding to the first nearest neighbours 
# found by using the distance matrix
se$firstNN

# Example 1.2
# Evaluate the distance matrix on the basis of two side 
# information variables (Yr and Yr2) associated with Xr
Yr2 <- NIRsoil$CEC[as.logical(NIRsoil$train)]
se2 <- simEval(d = ds, sideInf = cbind(Yr, Yr2))

# The final evaluation results
se2$eval

# The final values of the side information variables and the values 
# of the side information variables corresponding to the first 
# nearest neighbours found by using the distance matrix
se2$firstNN

###
# Example 2
# Evaluate the distances produced by retaining different numbers of 
# principal components (this is the same principle used in the 
# optimized principal components approach ("opc"))

# first project the data
pca <- orthoProjection(Xr = Xr, method = "pca", 
                       pcSelection = list("manual", 30), 
                       center = TRUE, scaled = TRUE)

# standardize the scores
scores.s <- sweep(pca$scores, MARGIN = 2, 
                  STATS = pca$sc.sdv, FUN = "/")
rslt <-  matrix(NA, ncol(scores.s), 3)
colnames(rslt) <- c("pcs", "rmsd", "r")
rslt[,1] <- 1:ncol(scores.s)
for (i in seq_len(ncol(scores.s))) {
  sc.ipcs <- scores.s[ ,1:i, drop = FALSE]
  di <- fDiss(Xr = sc.ipcs, method = "euclid", 
              center = FALSE, scaled = FALSE)
  se <- simEval(d = di, sideInf = Yr)
  rslt[i,2:3] <- unlist(se$eval)
}
plot(rslt) 

###
# Example 3
# Example 3.1
# Evaluate a dissimilarity matrix computed using a moving window 
# correlation method
mwcd <- mcorDiss(Xr = Xr, ws = 35, center = FALSE, scaled = FALSE)
se.mw <- simEval(d = mwcd, sideInf = Yr)
se.mw$eval

# Example 3.2
# Evaluate a dissimilarity matrix computed using the correlation 
# method
cd <- corDiss(Xr = Xr, center = FALSE, scaled = FALSE)
se.nc <- simEval(d = cd, sideInf = Yr)
se.nc$eval