neigCleaning: A function for identifying samples that do not belong to any of the neighbourhoods of a given set of samples (neigCleaning)

Description

This function can be used to identify the samples in a spectral dataset $Xr$ that do not belong to the neighbourhood of any sample in another spectral dataset $Xu$.

Usage

neigCleaning(Xr, Xu, sm = "pc", pcSelection = list("cumvar", 0.99), pcMethod = "svd", Yr = NULL, ws, k0, center = TRUE, scaled = TRUE, k.thr, k.dist.thr, k.range, returnDiss = FALSE, cores = 1)

Arguments

Xr

input (spectral) matrix (or data.frame) in which the neighbours of the samples in Xu shall be searched.

Xu

input (spectral) matrix (or data.frame) containing the samples whose neighbours will be searched in Xr.

sm

a character string indicating the spectral dissimilarity metric to be used in the selection of the nearest neighbours of each observation for which a prediction is required (see mbl). Options are:

"euclid": Euclidean dissimilarity.
"cosine": Cosine dissimilarity.
"sidF": Spectral information divergence computed on the spectral variables.
"sidD": Spectral information divergence computed on the density distributions of the spectra.
"cor": Correlation dissimilarity.
"movcor": Moving window correlation dissimilarity.
"pc": Principal components dissimilarity: Mahalanobis dissimilarity computed on the principal components space.
"loc.pc": Dissimilarity estimation based on local principal components.
"pls": Partial least squares dissimilarity: Mahalanobis dissimilarity computed on the partial least squares space.
"loc.pls": Dissimilarity estimation based on local partial least squares.

The "pc" spectral dissimilarity metric is the default. If "sidD" is chosen, the default parameters of the sid function are used; however, they can be modified by specifying them as additional arguments in the mbl function. This argument can also be set to NULL; in that case, a dissimilarity matrix must be specified in the dissimilarityM argument of the mbl function.
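As a rough illustration of what the "pc" option describes (a Mahalanobis dissimilarity computed in the principal components space), the following base R sketch uses simulated matrices in place of real spectra. The number of components (ncomp) is an arbitrary choice made for the example, and the package's internal computation may differ in its details, for instance in how the distances are scaled.

```r
# Simulated stand-ins for the spectral matrices (not real spectra)
set.seed(1)
Xr <- matrix(rnorm(50 * 20), nrow = 50)  # reference set
Xu <- matrix(rnorm(5 * 20), nrow = 5)    # target set
n.r <- nrow(Xr)
n.u <- nrow(Xu)

# PCA on the pooled, centred and scaled data
pca <- prcomp(rbind(Xr, Xu), center = TRUE, scale. = TRUE)
ncomp <- 3  # arbitrary number of components, for illustration only
scores <- pca$x[, seq_len(ncomp), drop = FALSE]

# PC scores are uncorrelated, so dividing each score column by its standard
# deviation makes the Euclidean distance on the result equal to the
# Mahalanobis distance in the PC space
z <- sweep(scores, 2, pca$sdev[seq_len(ncomp)], "/")

# Dissimilarity matrix: rows are Xr samples, columns are Xu samples
d <- as.matrix(dist(z))[seq_len(n.r), n.r + seq_len(n.u), drop = FALSE]
```

Each column of d then holds the dissimilarities between one sample of Xu and all the samples of Xr.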

pcSelection

if sm = "pc", sm = "loc.pc", sm = "pls" or sm = "loc.pls", a list which specifies the method to be used for identifying the number of components to be retained for computing the Mahalanobis distance of each sample in Xu to the centre of Xr. This list must contain two objects in the following order:

method: the method for selecting the number of components. Possible options are: "opc" (optimized pc selection based on Ramirez-Lopez et al. (2013a, 2013b); see the orthoProjection function for more details); "cumvar" (for selecting the number of principal components based on a given cumulative amount of explained variance); "var" (for selecting the number of principal components based on a given amount of explained variance); and "manual" (for specifying manually the desired number of principal components).
value: a numerical value that complements the selected method. If "opc" is chosen, it must be a value indicating the maximal number of principal components to be tested (see Ramirez-Lopez et al., 2013a, 2013b). If "cumvar" is chosen, it must be a value (higher than 0 and lower than 1) indicating the maximum amount of cumulative variance that the retained components should explain. If "var" is chosen, it must be a value (higher than 0 and lower than 1) indicating that components that explain (individually) a variance lower than this threshold must be excluded. If "manual" is chosen, it must be a value specifying the desired number of principal components to retain.

As shown under Usage, the default for the pcSelection argument in this function is list("cumvar", 0.99). Optionally, the pcSelection argument admits "opc", "cumvar", "var" or "manual" as a single character string. In that case, the default "value" is 40 for "opc" and "manual", 0.99 for "cumvar" and 0.01 for "var".
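The "cumvar" and "var" rules can be sketched in base R with prcomp on simulated data. This is only one plausible reading of the descriptions above (in particular, "cumvar" is taken here as the smallest number of components whose cumulative explained variance reaches the value); the package may resolve boundary cases differently.

```r
# Simulated stand-in for a spectral matrix (not real spectra)
set.seed(1)
X <- matrix(rnorm(60 * 10), nrow = 60)

pca <- prcomp(X, center = TRUE, scale. = TRUE)
expl <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance per component

# "cumvar": first component count at which the cumulative variance reaches 0.99
cumvar.value <- 0.99
n.cumvar <- which(cumsum(expl) >= cumvar.value)[1]

# "var": drop components that individually explain less than 0.01
var.value <- 0.01
n.var <- sum(expl >= var.value)
```

With "manual", the second element of the list would simply be used as the number of components.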

pcMethod

a character string indicating the principal component analysis algorithm to be used. Options are: "svd" (default) and "nipals". See orthoDiss.

Yr

if the method used in the pcSelection argument is "opc", or if the sm argument is either "pls" or "loc.pls", a vector containing the side information corresponding to the spectra in Xr. It is equivalent to the sideInf parameter of the simEval function. It can be a numeric vector or matrix (regarding one or more continuous variables). The root mean square of differences (rmsd) is used for assessing the similarity between the samples and their corresponding most similar samples in terms of the side information provided. When sm = "pc", this parameter can also be a single discrete variable of class factor. In such a case the kappa index is used. See the simEval function for more details.

ws

an odd integer value which specifies the window size when the moving window correlation similarity/dissimilarity is used (i.e. sm = "movcor"). The default value is 41.

k0

if any of the local similarity/dissimilarity methods is used (i.e. either sm = "loc.pc" or sm = "loc.pls"), an integer value. This argument controls the number of initial neighbours ($k0$) to retain in order to compute the local principal components (at each neighbourhood).

center

a logical indicating if Xr and Xu must be centered (on the basis of $Xr \cup Xu$).

scaled

a logical indicating if Xr and Xu must be scaled (on the basis of $Xr \cup Xu$).
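Centering and scaling "on the basis of $Xr \cup Xu$" can be read as computing the column means and standard deviations on the pooled matrix and applying them to both sets. A minimal base R sketch with simulated data, under that assumption:

```r
# Simulated stand-ins for Xr and Xu (not real spectra)
set.seed(1)
Xr <- matrix(rnorm(30 * 4, mean = 5), nrow = 30)
Xu <- matrix(rnorm(10 * 4, mean = 6), nrow = 10)

# Column means and standard deviations come from the pooled data
pooled <- scale(rbind(Xr, Xu), center = TRUE, scale = TRUE)

# Split the pooled matrix back into the two sets
Xr.s <- pooled[seq_len(nrow(Xr)), , drop = FALSE]
Xu.s <- pooled[-seq_len(nrow(Xr)), , drop = FALSE]
```

Because the statistics are shared, neither Xr.s nor Xu.s is necessarily centred at zero on its own; only the pooled matrix is.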

k.thr

an integer value indicating the number of nearest neighbours (k) of each sample in Xu that must be selected from Xr.
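The effect of k.thr can be sketched in base R as follows: for each sample in Xu, take its k nearest samples in Xr, then pool the results. The data are simulated, a plain Euclidean distance stands in for whichever metric is set in sm, and the names d, nn, select and reject are illustrative rather than taken from the package internals.

```r
# Simulated stand-ins for Xr and Xu (not real spectra)
set.seed(1)
Xr <- matrix(rnorm(40 * 5), nrow = 40)
Xu <- matrix(rnorm(3 * 5), nrow = 3)
n.r <- nrow(Xr)

# Euclidean dissimilarities: rows are Xr samples, columns are Xu samples
d <- as.matrix(dist(rbind(Xr, Xu)))[seq_len(n.r), n.r + seq_len(nrow(Xu)), drop = FALSE]

k <- 5  # plays the role of k.thr

# For each Xu sample (column of d), the indices of its k nearest Xr samples
nn <- apply(d, 2, function(col) order(col)[seq_len(k)])

# Xr samples that are a neighbour of at least one Xu sample, and the rest
select <- sort(unique(as.vector(nn)))
reject <- setdiff(seq_len(n.r), select)
```

A sample of Xr ends up in reject only if it is not among the k nearest neighbours of any sample in Xu.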

k.dist.thr

a numeric value indicating a distance threshold. When the distance between a sample in Xr and a sample in Xu is below the given threshold, the sample in Xr is retained, otherwise it is ignored. The threshold depends on the corresponding similarity/dissimilarity metric specified in sm. Either k.thr or k.dist.thr must be specified.

k.range

a vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbours allowed when the k.dist.thr argument is used.
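The interplay between k.dist.thr and k.range can be sketched in base R. The threshold value, the Euclidean metric and the simulated data are assumptions made for the example; the clamping step follows the description of the neighbours.used column, where the number of neighbours is reset to one of the k.range limits when it falls outside them.

```r
# Simulated stand-ins for Xr and Xu (not real spectra)
set.seed(1)
Xr <- matrix(rnorm(40 * 5), nrow = 40)
Xu <- matrix(rnorm(3 * 5), nrow = 3)
n.r <- nrow(Xr)
d <- as.matrix(dist(rbind(Xr, Xu)))[seq_len(n.r), n.r + seq_len(nrow(Xu)), drop = FALSE]

thr <- 2.5            # plays the role of k.dist.thr (arbitrary value)
k.range <- c(3, 10)   # minimum and maximum number of neighbours allowed

neigh <- lapply(seq_len(ncol(d)), function(j) {
  n.below <- sum(d[, j] < thr)  # neighbours found below the threshold
  # Reset the count to the nearest limit when it falls outside k.range
  k.used <- min(max(n.below, k.range[1]), k.range[2])
  order(d[, j])[seq_len(k.used)]
})

nk <- lengths(neigh)  # number of neighbours finally used per Xu sample
select <- sort(unique(unlist(neigh)))
reject <- setdiff(seq_len(n.r), select)
```

Without k.range, a very strict threshold could leave some samples in Xu with no neighbours at all, and a very loose one could retain nearly all of Xr.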

returnDiss

a logical indicating if the similarity/dissimilarity matrix must be returned. Default is FALSE.

cores

number of cores used when method in pcSelection is "opc" (which can be computationally intensive) (default = 1).

Value

neigCleaning returns a list containing the following objects:

select: the indices of the observations in Xr that belong to the neighbourhood of the samples in Xu.
reject: the indices of the observations in Xr that do not belong to the neighbourhood of the samples in Xu.
rn.lower.k.dist: a data.frame that is returned only if the k.dist.thr argument was used. It comprises three columns: the first one (sampleIndex) indicates the index of the samples in Xu, the second column (nk) indicates the number of neighbours found in Xr for each sample in Xu, and the third column (neighbours.used) indicates whether the original number of neighbours (below the distance threshold) was used or whether the number of neighbours was reset to one of the range values specified in the k.range argument.
dissimilarity: the distance matrix used.

Details

This function may be especially useful when the reference set (Xr) is very large. In some cases the number of observations in the reference set can be reduced by removing irrelevant samples (i.e. samples that are not neighbours of a particular target set). If Xr is very large, it is recommended to consider using this function prior to using the mbl function.
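A sketch of how the output might be used to shrink the reference set before modelling is given below. The list res is a hypothetical stand-in for the object returned by neigCleaning, built by hand so that the snippet runs without the package, and Xr and Yr are simulated.

```r
# Simulated reference set and side information (not real data)
set.seed(1)
Xr <- matrix(rnorm(20 * 4), nrow = 20)
Yr <- rnorm(20)

# Hypothetical stand-in for the list returned by neigCleaning()
keep <- c(1, 2, 5, 8, 13)
res <- list(select = keep, reject = setdiff(seq_len(nrow(Xr)), keep))

# Keep only the reference samples that are neighbours of the target set
Xr.clean <- Xr[res$select, , drop = FALSE]
Yr.clean <- Yr[res$select]
```

The reduced Xr.clean and Yr.clean could then be supplied as the reference data in a subsequent call to mbl.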

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J. A. M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Examples

## Not run: 
# require(resemble)   # provides neigCleaning
# require(prospectr)  # provides the NIRsoil dataset
# 
# data(NIRsoil)
# 
# Xu <- NIRsoil$spc[!as.logical(NIRsoil$train),]
# Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
# Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
# Xr <- NIRsoil$spc[as.logical(NIRsoil$train),]
# 
# Xu <- Xu[!is.na(Yu),]
# Yu <- Yu[!is.na(Yu)]
# 
# Xr <- Xr[!is.na(Yr),]
# Yr <- Yr[!is.na(Yr)] 
# 
# # Identify the non-neighbour samples using the default parameters
# # (In this example all the samples in Xr belong at least to the 
# # first 100 neighbours of one sample in Xu)
# ex1 <- neigCleaning(Xr = Xr, Xu = Xu, 
#                             k.thr = 100)
# 
# # Identify the non-neighbour samples using principal component(PC) 
# # and partial least squares (PLS) distances, and using the "opc" 
# # approach for selecting the number of components
# ex2 <- neigCleaning(Xr = Xr, Xu = Xu, 
#                             Yr = Yr,
#                             sm = "pc",
#                             pcSelection = list("opc", 40),
#                             k.thr = 150)
# 
# ex3 <- neigCleaning(Xr = Xr, Xu = Xu, 
#                             Yr = Yr,
#                             sm = "pls",
#                             pcSelection = list("opc", 40),
#                             k.thr = 150)
# 
# # Identify the non-neighbour samples using distances computed 
# # based on local PC analysis and using the "cumvar" and "var" 
# # approaches for selecting the number of PCs
# ex4 <- neigCleaning(Xr = Xr, Xu = Xu, 
#                             sm = "loc.pc",
#                             pcSelection = list("cumvar", 0.999),
#                             k0 = 200,
#                             k.thr = 150)
# 
# ex5 <- neigCleaning(Xr = Xr, Xu = Xu, 
#                             sm = "loc.pc",
#                             pcSelection = list("var", 0.001),
#                             k0 = 200,
#                             k.thr = 150)
# ## End(Not run)
