resemble (version 1.2.2)

orthoDiss: A function for computing dissimilarity matrices from orthogonal projections (orthoDiss)

Description

This function computes dissimilarities (in an orthogonal space) either between observations in a given set or between observations in two different sets. The dissimilarities are based on either a principal component projection or a partial least squares projection of the data. After projecting the data, the Mahalanobis distance is computed in the projected space.
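The idea of projecting first and then computing a Mahalanobis distance can be sketched in base R. This is only an illustrative sketch, not the package's internal code: the simulated matrix `X`, the fixed number of components `n_comp` and the exact scaling of the distance are assumptions, and the package may scale its dissimilarities differently.

```r
# Minimal base-R sketch (assumed data and settings, not the resemble internals)
set.seed(1)
X <- matrix(rnorm(20 * 6), nrow = 20)  # hypothetical data: 20 observations, 6 variables

# Principal component projection via the singular value decomposition
Xc <- scale(X, center = TRUE, scale = FALSE)
n_comp <- 3                            # number of retained components (assumed)
scores <- Xc %*% svd(Xc)$v[, seq_len(n_comp), drop = FALSE]

# The scores are uncorrelated, so the Mahalanobis distance in the score
# space equals the Euclidean distance between unit-variance scores
scaled_scores <- sweep(scores, 2, apply(scores, 2, sd), "/")
d_euclid <- as.matrix(dist(scaled_scores))

# Cross-check the first row with stats::mahalanobis (returns squared distances)
d_check <- sqrt(mahalanobis(scores, center = scores[1, ], cov = cov(scores)))
print(all.equal(unname(d_euclid[1, ]), unname(d_check)))
```

Working on standardised scores keeps the computation to a plain Euclidean distance, which is why an orthogonal projection is convenient here.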

Usage

orthoDiss(Xr, X2 = NULL, Yr = NULL, pcSelection = list("cumvar", 0.99), method = "pca", local = FALSE, k0, center = TRUE, scaled = FALSE, return.all = FALSE, cores = 1, ...)

Arguments

Xr
a matrix (or data.frame) containing the (reference) data.
X2
an optional matrix (or data.frame) containing data of a second set of observations (samples).
Yr
required if the method specified in the pcSelection argument is "opc" or if method = "pls". In these cases it must contain the side information corresponding to the spectra in Xr. It is equivalent to the sideInf parameter of the simEval function. It can be a numeric vector or matrix (for one or more continuous variables). The root mean square of differences (rmsd) is used for assessing the similarity between the samples and their corresponding most similar samples in terms of the side information provided. When a principal component projection is used, this parameter can also be a single discrete variable of class factor. In such a case the kappa index is used. See the simEval function for more details.
pcSelection
a list which specifies the method to be used for identifying the number of components (principal components or partial least squares components, depending on the method argument) to be retained for computing the Mahalanobis distances. This list must contain two objects in the following order:
  • method: the method for selecting the number of components. Possible options are: "opc" (optimized pc selection based on Ramirez-Lopez et al. (2013a, 2013b); see the orthoProjection function for more details); "cumvar" (for selecting the number of principal components based on a given cumulative amount of explained variance); "var" (for selecting the number of principal components based on a given amount of explained variance); and "manual" (for specifying manually the desired number of principal components).
  • value: a numerical value that complements the selected method. If "opc" is chosen, it must be a value indicating the maximal number of principal components to be tested (see Ramirez-Lopez et al., 2013a, 2013b). If "cumvar" is chosen, it must be a value (higher than 0 and lower than 1) indicating the maximum amount of cumulative variance that the retained components should explain. If "var" is chosen, it must be a value (higher than 0 and lower than 1) indicating that components that explain (individually) a variance lower than this threshold must be excluded. If "manual" is chosen, it must be a value specifying the desired number of principal components to retain.

As shown in Usage, the default for the pcSelection argument is list("cumvar", 0.99), i.e. components are retained based on a cumulative explained variance of 0.99. Optionally, the pcSelection argument admits "opc", "cumvar", "var" or "manual" as a single character string. In such a case the default "value" is 40 when either "opc" or "manual" is used, 0.99 when "cumvar" is used and 0.01 when "var" is used.
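The "cumvar", "var" and "manual" rules described above can be illustrated with a small base-R sketch. The helper `select_n_components` and the vector of explained variances are hypothetical and exist only for this illustration. How the package treats values that fall exactly on a threshold is not stated here, so the `<=` and `>=` comparisons are assumptions. The "opc" method is not sketched because it relies on side information.

```r
# Hypothetical helper illustrating the selection rules (not the package's code)
select_n_components <- function(expl_var,
                                method = c("cumvar", "var", "manual"),
                                value) {
  method <- match.arg(method)
  n <- switch(method,
    # retain components while the cumulative variance stays within the threshold
    cumvar = sum(cumsum(expl_var) <= value),
    # retain components that individually explain at least the threshold
    var = sum(expl_var >= value),
    # retain the requested number, capped at the number available
    manual = min(as.integer(value), length(expl_var))
  )
  max(1L, n)  # always keep at least one component
}

# Assumed proportions of explained variance, sorted in decreasing order
expl_var <- c(0.60, 0.25, 0.10, 0.04, 0.008, 0.002)

select_n_components(expl_var, "cumvar", 0.97)  # 3 (cumulative: 0.60, 0.85, 0.95)
select_n_components(expl_var, "var", 0.01)     # 4 (0.008 and 0.002 are excluded)
select_n_components(expl_var, "manual", 2)     # 2
```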

method
the method for projecting the data. Options are: "pca" (principal component analysis using the singular value decomposition algorithm), "pca.nipals" (principal component analysis using the non-linear iterative partial least squares algorithm) and "pls" (partial least squares). See the orthoProjection function for further details on the projection methods.
local
a logical indicating whether or not to compute the distances locally (i.e. projecting locally the data) by using the $k0$ nearest neighbour samples of each sample. Default is FALSE. See details.
k0
if local = TRUE, an integer value indicating the number of nearest neighbours ($k0$) to retain in order to recompute the local orthogonal distances.
center
a logical indicating if the spectral data Xr (and X2 if specified) must be centered. If X2 is specified the data is centered on the basis of $Xr \cup X2$. For dissimilarity computations based on pls, the data is always centered for the projections.
scaled
a logical indicating if Xr (and X2 if specified) must be scaled. If X2 is specified the data is scaled on the basis of $Xr \cup X2$.
return.all
a logical. In case X2 is specified it indicates whether or not the distances between all the elements resulting from $Xr \cup X2$ must be computed.
cores
number of cores used when method in pcSelection is "opc" (which can be computationally intensive) and local = FALSE (default = 1). See the multi-threading note below.
...
additional arguments to be passed to the orthoProjection function.

Value

a list of class orthoDiss with the following components:
  • n.components the number of components (either principal components or partial least squares components) used for computing the global distances.
  • global.variance.info the information about the explained variance(s) of the projection. When local = TRUE, the information corresponds to the global projection done prior to computing the local projections.
  • loc.n.components if local = TRUE, a data.frame which specifies the number of local components (either principal components or partial least squares components) used for computing the dissimilarity between each target sample and its neighbour samples.
  • dissimilarity the computed dissimilarity matrix. If local = FALSE a distance matrix. If local = TRUE a matrix of class orthoDiss. In this case each column represents the dissimilarity between a target sample and its neighbourhood.

Multi-threading for the computation of dissimilarities (see cores parameter) is based on OpenMP and hence works only on Windows and Linux.

Details

When local = TRUE, a global distance matrix is first computed based on the parameters specified. Then, by using this matrix, a given set of nearest neighbours ($k0$) is identified for each target observation. These neighbours (together with the target observation) are projected (from the original data space) onto a (local) orthogonal space (using the same parameters specified in the function). In this projected space the Mahalanobis distance between the target sample and the neighbours is recomputed. A missing value is assigned to the samples that do not belong to this set of neighbours (non-neighbour samples). In this case the dissimilarity matrix cannot be considered as a distance metric since it does not necessarily satisfy the symmetry condition for distance matrices (i.e. given two samples $x_i$ and $x_j$, the local dissimilarity ($d$) between them is relative since generally $d(x_i, x_j) \neq d(x_j, x_i)$). On the other hand, when local = FALSE, the dissimilarity matrix obtained can be considered as a distance matrix.
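The local procedure described above, and the resulting loss of symmetry, can be sketched in base R. This is an illustrative sketch under stated assumptions rather than the package's implementation: the helper `pc_mahalanobis`, the simulated data, the fixed number of components and the values of `k0` and `n_comp` are all hypothetical.

```r
# Hypothetical helper: Mahalanobis distance matrix between all rows of X,
# computed on the first n_comp principal component scores
pc_mahalanobis <- function(X, n_comp) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  scores <- Xc %*% svd(Xc)$v[, seq_len(n_comp), drop = FALSE]
  as.matrix(dist(sweep(scores, 2, apply(scores, 2, sd), "/")))
}

set.seed(1)
X <- matrix(rnorm(40 * 5), nrow = 40)  # hypothetical data: 40 observations
k0 <- 10                               # number of nearest neighbours (assumed)
n_comp <- 2                            # components per projection (assumed)

# Step 1: global distance matrix
d_global <- pc_mahalanobis(X, n_comp)

# Steps 2 and 3: per target, find its neighbours, reproject them locally
# and recompute the distances; non-neighbours keep a missing value
d_local <- matrix(NA_real_, nrow(X), nrow(X))
for (i in seq_len(nrow(X))) {
  nn <- order(d_global[, i])[seq_len(k0 + 1)]  # target plus its k0 neighbours
  d_loc_i <- pc_mahalanobis(X[nn, , drop = FALSE], n_comp)
  d_local[nn, i] <- d_loc_i[, match(i, nn)]    # column i holds target i
}

# Each target gets its own projection, so symmetry is not guaranteed
print(isSymmetric(d_local))
```

Each column is computed in its own local space, so the value stored for the pair (i, j) in column j comes from a different projection than the one in column i.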

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J.A.M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

See Also

orthoProjection, simEval

Examples

## Not run: 
# require(prospectr)
# 
# data(NIRsoil)
# 
# Xu <- NIRsoil$spc[!as.logical(NIRsoil$train),]
# Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
# Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
# Xr <- NIRsoil$spc[as.logical(NIRsoil$train),]
# 
# Xu <- Xu[!is.na(Yu),]
# Yu <- Yu[!is.na(Yu)]
# 
# Xr <- Xr[!is.na(Yr),]
# Yr <- Yr[!is.na(Yr)] 
# 
# # Computation of the orthogonal dissimilarity matrix using the 
# # default parameters
# ex1 <- orthoDiss(Xr = Xr, X2 = Xu)
# 
# # Computation of a principal component dissimilarity matrix using 
# # the "opc" method for the selection of the principal components
# ex2 <- orthoDiss(Xr = Xr, X2 = Xu, 
#                  Yr = Yr, 
#                  pcSelection = list("opc", 40), 
#                  method = "pca", 
#                  return.all = TRUE)
# 
# # Computation of a partial least squares (PLS) dissimilarity 
# # matrix using the "opc" method for the selection of the PLS 
# # components
# ex3 <- orthoDiss(Xr = Xr, X2 = Xu, 
#                  Yr = Yr, 
#                  pcSelection = list("opc", 40), 
#                  method = "pls")
# 
# # Computation of a partial least squares (PLS) local dissimilarity 
# # matrix using the "opc" method for the selection of the PLS 
# # components
# ex4 <- orthoDiss(Xr = Xr, X2 = Xu, 
#                  Yr = Yr, 
#                  pcSelection = list("opc", 40), 
#                  method = "pls",
#                  local = TRUE,
#                  k0 = 200)
# ## End(Not run)
