mbl: A function for memory-based learning (mbl)

Description

This function implements memory-based learning (a.k.a. instance-based learning or local regression), a non-linear lazy learning approach for predicting a given response variable from a set of (spectral) predictor variables. For each sample in a prediction set, a specific local regression is carried out based on a subset of similar samples (nearest neighbours) selected from a reference set. The local model is then used to predict the response value of the target (prediction) sample. Therefore, this function does not yield a global regression model.
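The idea described above can be illustrated with a minimal base R sketch. It is not the implementation used by mbl: the helper local_predict is hypothetical, the neighbours are found with a plain Euclidean distance, and an ordinary linear model stands in for the local regression methods offered by the package.

```r
# Toy data: 60 reference samples, 5 target samples, 4 predictors
set.seed(1)
Xr <- matrix(rnorm(60 * 4), ncol = 4)
Yr <- sin(Xr[, 1]) + 0.5 * Xr[, 2] + rnorm(60, sd = 0.05)
Xu <- matrix(rnorm(5 * 4), ncol = 4)

# Hypothetical helper (not part of the package): one local model per target sample
local_predict <- function(Xr, Yr, Xu, k = 20) {
  ref <- data.frame(y = Yr, Xr)  # columns: y, X1, ..., Xp
  vapply(seq_len(nrow(Xu)), function(j) {
    # Euclidean distance from the j-th target sample to every reference sample
    d <- sqrt(rowSums(sweep(Xr, 2, Xu[j, ])^2))
    nn <- order(d)[seq_len(k)]          # indices of the k nearest neighbours
    fit <- lm(y ~ ., data = ref[nn, ])  # local model fitted on the neighbours only
    newdata <- setNames(as.data.frame(t(Xu[j, ])), colnames(ref)[-1])
    unname(predict(fit, newdata = newdata))
  }, numeric(1))
}

pred <- local_predict(Xr, Yr, Xu, k = 20)
pred
```

Each target sample gets its own model, which is discarded after the prediction; no single model covering all samples is ever built.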

Usage

mbl(Yr, Xr, Yu = NULL, Xu,
    mblCtrl = mblControl(), 
    dissimilarityM,
    group = NULL,
    dissUsage = "predictors", 
    k, k.diss, k.range,
    method, 
    pls.c, pls.max.iter = 1, pls.tol = 1e-6,
    noise.v = 0.001,
    ...)

Arguments

Yr

a numeric vector containing the values of the response variable corresponding to the reference data.

Xr

input matrix (or data.frame) of predictor variables of the reference data (observations in rows and variables in columns).

Yu

an optional numeric vector containing the values of the response variable corresponding to the data to be predicted.

Xu

input matrix (or data.frame) of predictor variables of the data to be predicted (observations in rows and variables in columns).

mblCtrl

a list (created with the mblControl function) which contains some parameters that control some aspects of the mbl function. See the mblControl

dissimilarityM

(optional) a dissimilarity matrix. This argument can be used in case a user-defined dissimilarity matrix is preferred over the automatic dissimilarity matrix computation specified in the sm argument of the

group

an optional factor (or vector that can be coerced to a factor by as.factor) to be taken into account for internal validations. The length of the vector must be equal to nrow(Xr)

dissUsage

specifies how the dissimilarity information shall be used. The possible options are: "predictors", "weights" and "none" (see details below). Default is "predictors".

k

a numeric (integer) vector containing the sequence of k nearest neighbours to be tested. Either k or k.diss must be specified. Numbers with decimal values will be coerced to their next higher integer values. This

k.diss

a vector containing the sequence of dissimilarity thresholds to be tested. When the dissimilarity between a sample in Xr and a sample in Xu is below the given threshold, the sample in Xr is treate

k.range

a vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbours allowed when the k.diss argument is used.

method

a character string indicating the method to be used at each local multivariate regression. Options are: "pls", "wapls1" and "gpr" (see details below). Note: "wapls2" from the previous version of the packa

pls.c

the number of pls components to be used in the regressions if either "pls" or "wapls1" is used. When "pls" is used, this argument must be a single numerical value. When "wapls1" is used, this argument m

pls.max.iter

maximum number of iterations for the partial least squares methods.

pls.tol

limit for convergence in the partial orthogonal scores partial least squares regressions using the nipals algorithm. Default is 1e-6.

noise.v

a value indicating the variance of the noise for Gaussian process regression. Default is 0.001.

...

additional arguments to be passed to other functions.
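The interplay between k.diss and k.range described above can be sketched in base R. The helper select_neighbours is hypothetical and not part of the package, and the strict "below the threshold" comparison is an assumption based on the wording of the k.diss argument.

```r
# Hypothetical helper (not part of the package)
select_neighbours <- function(d, k.diss, k.range) {
  k.org <- sum(d < k.diss)                      # neighbours below the threshold
  k <- min(max(k.org, k.range[1]), k.range[2])  # force k into [minimum, maximum]
  list(index = order(d)[seq_len(k)],            # the k most similar reference samples
       k.org = k.org,
       k = k,
       out.of.range = k != k.org)               # TRUE if k had to be adjusted
}

# Dissimilarities between one sample in Xu and six samples in Xr (toy values)
d <- c(0.05, 0.40, 0.10, 0.90, 0.20, 0.70)
sel <- select_neighbours(d, k.diss = 0.15, k.range = c(3, 5))
sel
```

Here only two reference samples fall below the threshold, so the neighbourhood is enlarged to the minimum of three allowed by k.range.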

Value

a list of class mbl with the following components (sorted by either k or k.diss according to the case):
- call: the call used.
- cntrlParam: the list with the control parameters used. If one or more control parameters were reset automatically, then a list containing a list with the initial control parameters specified and a list with the parameters which were finally used.
- dissimilarities: a list with the method used to obtain the dissimilarity matrices and the dissimilarity matrices corresponding to $D(Xr, Xu)$ and $D(Xr,Xr)$ if dissUsage = "predictors". This object is returned only if the returnDiss argument in the mblCtrl list was set to TRUE in the call used.
- totalSamplesPredicted: the total number of samples predicted.
- pcAnalysis: a list containing the results of the principal component analysis. The first two objects (scores_Xr and scores_Xu) are the scores of the Xr and Xu matrices. It also contains the number of principal components used (n.componentsUsed) and another object which is a vector containing the standardized Mahalanobis dissimilarities (also called GH, Global H distance) between each sample in Xu and the centre of Xr.
- components: a list containing either the number of principal components or partial least squares components used for the computation of the orthogonal dissimilarities. This object is only returned if the dissimilarity measure specified in mblCtrl is any of the following options: "pc", "loc.pc", "pls", "loc.pls". If any of the local orthogonal dissimilarities was used ("loc.pc" or "loc.pls"), a data.frame is also returned in this list. This object is equivalent to the loc.n.components object returned by the orthoDiss function. It specifies the number of local components (either principal components or partial least squares components) used for computing the dissimilarity between each query sample and its neighbour samples.
- nnValStats: a data frame containing the statistics of the nearest neighbour cross-validation for each k or k.diss value, depending on the arguments specified in the call. It is returned only if 'NNv' or 'both' were selected as validation method.
- localCrossValStats: a data frame containing the statistics of the local leave-group-out cross-validation for each k or k.diss value, depending on the arguments specified in the call. It is returned only if 'loc_crossval' or 'both' were selected as validation method.
- YuPredictionStats: a data frame containing the statistics of the cross-validation of the prediction of Yu for each k or k.diss value, depending on the arguments specified in the call. It is returned only if Yu was provided.
- results: a list of data frames which contains the results of the predictions for each k or k.diss value. Each data.frame contains the following columns:
  - o.index: The index of the sample predicted in the input matrix.
  - k.diss: This column is only output if the k.diss argument is used. It indicates the corresponding dissimilarity threshold for selecting the neighbors used to predict a given sample.
  - distance: This column is only output if the k.diss argument is used. It is a logical that indicates whether the neighbors selected by the given dissimilarity threshold were outside the boundaries specified in the k.range argument. In that case the number of neighbors used is coerced to one of the boundaries.
  - k.org: This column is only output if the k.diss argument is used. It indicates the number of neighbors that are retained when the given dissimilarity threshold is used.
  - pls.comp: This column is only output if pls regression was used. It indicates the final number of pls components used. If no optimization was set, it retrieves the original pls components specified in the pls.c argument.
  - min.pls: This column is only output if wapls1 regression was used. It indicates the final number of minimum pls components used. If no optimization was set, it retrieves the original minimum pls components specified in the pls.c argument.
  - max.pls: This column is only output if wapls1 regression was used. It indicates the final number of maximum pls components used. If no optimization was set, it retrieves the original maximum pls components specified in the pls.c argument.
  - yu.obs: This column is only output if the Yu argument is used. It indicates the input values given in Yu (the response variable corresponding to the data to be predicted).
  - pred: The predicted values.
  - yr.min.obs: The minimum reference value (of the response variable) in the neighborhood.
  - yr.max.obs: The maximum reference value (of the response variable) in the neighborhood.
  - index.nearest.in.ref: The index in Xr of the nearest neighbor.
  - y.nearest: The reference value (of the response variable) of the nearest neighbor in Xr.
  - y.nearest.pred: This column is only output if the validation method (selected with the mblControl function) is equal to 'NNv'. It represents the predicted value of the nearest neighbor sample in Xr using the neighborhood of the predicted sample in Xu.
  - loc.rmse.cv: This column is only output if the validation method (selected with the mblControl function) is equal to 'loc_crossval'. It represents the cross-validation RMSE value computed for the neighborhood of the predicted sample in Xu.
  - loc.st.rmse.cv: This column is only output if the validation method (selected with the mblControl function) is equal to 'loc_crossval'. It represents the cross-validation standardized RMSE value computed for the neighborhood of the predicted sample in Xu.
  - dist.nearest: The distance to the nearest neighbor.
  - dist.k.farthest: The distance to the farthest neighbor selected.
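
As an illustration of how the results component could be used, the sketch below computes the RMSE of the predictions for each tested neighbourhood size from the pred and yu.obs columns. The results object here is a hand-made stand-in with invented numbers, not output of mbl; with a real fit the list would be taken from the object returned by mbl, assuming Yu was supplied.

```r
# Stand-in for the 'results' component: one data.frame per tested k.
# The numbers are invented for illustration only.
results <- list(
  data.frame(o.index = 1:4, yu.obs = c(10, 12, 9, 15), pred = c(11, 12, 8, 13)),
  data.frame(o.index = 1:4, yu.obs = c(10, 12, 9, 15), pred = c(10, 13, 9, 14))
)

# Root mean square error of the predictions in each data.frame
rmse <- sapply(results, function(r) sqrt(mean((r$yu.obs - r$pred)^2)))

# Position (in the list) of the data.frame with the lowest RMSE
best <- which.min(rmse)
rmse
best
```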

Details

By using the group argument one can specify groups of observations (spectra) that have something in common (e.g. spectra collected from the same batch of measurements, from the same sample, from samples with very similar origin, etc.) and which could produce biased cross-validation results due to pseudo-replication. This argument allows the selection of calibration points that are independent of the validation ones in cross-validation. In this regard, when valMethod = "loc_crossval" (used in the mblControl function), the p argument refers to the percentage of groups of samples (rather than single samples) to be retained in each resampling iteration at each local segment.

The dissUsage argument is used to specify whether the dissimilarity information must be used within the local regressions and, if so, how. When dissUsage = "predictors", the local (square symmetric) dissimilarity matrix corresponding to the selected neighbourhood is used as a source of additional predictors (i.e. the columns of this local matrix are treated as predictor variables). In some cases this may result in an improvement of the prediction performance (Ramirez-Lopez et al., 2013a). If dissUsage = "weights", the neighbours of the query point ($xu_{j}$) are weighted according to their dissimilarity (e.g. distance) to $xu_{j}$ prior to carrying out each local regression. The following tricubic function (Cleveland and Devlin, 1988; Naes et al., 1990) is used for computing the final weights based on the measured dissimilarities:

$$W_{j} = (1 - v_{j}^{3})^{3}$$

where if $xr_{i} \in$ neighbours of $xu_{j}$:

$$v_{j}(xu_{j}) = d(xr_{i}, xu_{j})$$

otherwise:

$$v_{j}(xu_{j}) = 0$$

In the above formulas $d(xr_{i}, xu_{j})$ represents the dissimilarity between the query point and each object in $Xr$. When dissUsage = "none" is chosen, the dissimilarity information is not used.

The possible options for performing regressions at each local segment implemented in the mbl function are described as follows:

Partial least squares ("pls"):

It uses the orthogonal scores (non-linear iterative partial least squares, nipals) algorithm. The only parameter which needs to be optimized is the number of pls components. This can be done by cross-validation at each local segment.

Weighted average pls ("wapls1"):

It uses multiple models generated by multiple pls components (i.e. between a minimum and a maximum number of pls components). At each local partition the final predicted value is a weighted average of all the predicted values generated by the multiple pls models. The weight for each component is calculated as follows:

$$w_{j} = \frac{1}{s_{1:j}\times g_{j}}$$

where $s_{1:j}$ is the root mean square of the spectral residuals of the unknown (or target) sample when a total of $j$ pls components are used and $g_{j}$ is the root mean square of the regression coefficients corresponding to the $j$th pls component (see Shenk et al., 1997 for more details). "wapls1" is not compatible with valMethod = "loc_crossval" since the weights are computed based on the sample to be predicted at each local iteration.

Gaussian process with dot product covariance ("gpr"):

Gaussian process regression is a probabilistic and non-parametric Bayesian approach. It is commonly described as a collection of random variables which have a joint Gaussian distribution and it is characterized by both a mean and a covariance function (Williams and Rasmussen, 1996). The covariance function used in the implemented method is the dot product, which implies that there are no parameters to be optimized for the computation of the covariance. Here, the process for predicting the response variable of a new sample ($y_{new}$) from its predictor variables ($x_{new}$) is carried out first by computing a prediction vector ($A$). It is derived from a set of reference spectra ($X$) and their respective response vector ($Y$) as follows:

$$A = (X X^\textup{T} + \sigma^2 I)^{-1} Y$$

where $\sigma^2$ denotes the variance of the noise and $I$ the identity matrix (with dimensions equal to the number of observations in $X$). The prediction of $y_{new}$ is then carried out by:

$$y_{new} = (x_{new} X^\textup{T}) A$$
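
The Gaussian process prediction with a dot product covariance can be sketched in a few lines of base R. This is an illustration on simulated data, not the code used inside mbl: the data, the object names and the absence of any centring or scaling are assumptions. The covariance between the new sample and the reference spectra is taken as $x_{new} X^\textup{T}$, so that its length matches that of $A$.

```r
# Simulated reference set: 30 observations, 5 predictors (toy data)
set.seed(42)
X <- matrix(rnorm(30 * 5), nrow = 30)
beta <- c(1, -2, 0.5, 0, 3)
Y <- drop(X %*% beta) + rnorm(30, sd = 0.01)
x_new <- matrix(rnorm(5), nrow = 1)
noise.v <- 0.001  # variance of the noise (sigma^2)

# Prediction vector: A = (X X^T + sigma^2 I)^{-1} Y
K <- tcrossprod(X)  # dot product covariance between the reference samples
A <- solve(K + noise.v * diag(nrow(X)), Y)

# Prediction: covariances between the new sample and the references, times A
y_new <- drop(tcrossprod(x_new, X) %*% A)

# The same value obtained as a ridge regression with penalty sigma^2
y_ridge <- drop(x_new %*% solve(crossprod(X) + noise.v * diag(ncol(X)),
                                crossprod(X, Y)))
c(gpr = y_new, ridge = y_ridge)
```

The agreement with the ridge solution shows why no covariance parameters need tuning: with a dot product covariance, the noise variance is the only quantity that has to be set.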

References

Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, 596-610.

Fernandez Pierna, J.A., Dardenne, P. 2008. Soil parameter quantification by NIRS as a Chemometric challenge at "Chimiométrie 2006". Chemometrics and Intelligent Laboratory Systems 91, 94-98.

Naes, T., Isaksson, T., Kowalski, B. 1990. Locally weighted regression and scatter correction for near-infrared reflectance data. Analytical Chemistry 62, 664-673.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Rasmussen, C.E., Williams, C.K. 2006. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology: MIT-Press.

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

Examples

require(prospectr)  # NIRsoil data and savitzkyGolay()
require(resemble)   # mbl(), mblControl(), orthoDiss() and sid()

data(NIRsoil)

# Filter the data using the Savitzky and Golay smoothing filter with 
# a window size of 11 spectral variables and a polynomial order of 3 
# (no differentiation).
sg <- savitzkyGolay(NIRsoil$spc, p = 3, w = 11, m = 0) 

# Replace the original spectra with the filtered ones
NIRsoil$spc <- sg

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train),]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]

Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train),]

Xu <- Xu[!is.na(Yu),]
Xr <- Xr[!is.na(Yr),]

Yu <- Yu[!is.na(Yu)]
Yr <- Yr[!is.na(Yr)]

# Example 1
# An mbl implemented in Ramirez-Lopez et al. (2013, 
# the spectrum-based learner)
# Example 1.1
# An example where Yu is supposed to be unknown, but the Xu 
# (spectral variables) are known 
ctrl1 <- mblControl(sm = "pc", pcSelection = list("opc", 40), 
                    valMethod = "NNv", 
                    scaled = FALSE, center = TRUE)

sbl.u <- mbl(Yr = Yr, Xr = Xr, Yu = NULL, Xu = Xu,
             mblCtrl = ctrl1, 
             dissUsage = "predictors", 
             k = seq(40, 150, by = 10), 
             method = "gpr")
sbl.u
plot(sbl.u)


 
# Example 1.2
# If Yu is actually known... 
sbl.u2 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
              mblCtrl = ctrl1, 
              dissUsage = "predictors", 
              k = seq(40, 150, by = 10), 
              method = "gpr")
sbl.u2

# Example 1.3
# A variation of the spectrum-based learner implemented in 
# Ramirez-Lopez et al. (2013) where the dissimilarity matrices are 
# recomputed based on partial least squares scores
ctrl_1.3 <- mblControl(sm = "pls", pcSelection = list("opc", 40), 
                       valMethod = "NNv", 
                       scaled = FALSE, center = TRUE)
                          
sbl_1.3 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
               mblCtrl = ctrl_1.3,
               dissUsage = "predictors",
               k = seq(40, 150, by = 10), 
               method = "gpr")
sbl_1.3

# Example 2
# A mbl similar to the ones implemented in 
# Ramirez-Lopez et al. (2013) 
# and Fernandez Pierna and Dardenne (2008)
ctrl.mbl <- mblControl(sm = "cor", 
                       pcSelection = list("cumvar", 0.999), 
                       valMethod = "NNv", 
                       scaled = FALSE, center = TRUE)
                          
local.mbl <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                 mblCtrl = ctrl.mbl,
                 dissUsage = "none",
                 k = seq(40, 150, by = 10), 
                 pls.c = c(5, 15),
                 method = "wapls1")
local.mbl

# Example 3
# A variation of the previous example (using the optimized pc 
# dissimilarity matrix) using the control list of Example 1
                         
local.mbl2 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                  mblCtrl = ctrl1,
                  dissUsage = "none",
                  k = seq(40, 150, by = 10), 
                  pls.c = c(5, 15),
                  method = "wapls1")
local.mbl2

# Example 4
# Using the function with user-defined dissimilarities:
# Examples 4.1 - 4.2: Compute a square symmetric matrix of 
# dissimilarities between 
# all the elements in Xr and Xu (dissimilarities will be used as 
# additional predictor variables later in the mbl function)

# Examples 4.3 - 4.4: Derive a dissimilarity value of each element 
# in Xu to each element in Xr (in this case dissimilarities will 
# not be used as additional predictor variables later in the 
# mbl function)

# Example 4.1
# the manhattan distance 
manhattanD <- dist(rbind(Xr, Xu), method = "manhattan") 
manhattanD <- as.matrix(manhattanD)

ctrl.udd <- mblControl(sm = "none", 
                       pcSelection = list("cumvar", 0.999), 
                       valMethod = c("NNv", "loc_crossval"), 
                       resampling = 10, p = 0.75,
                       scaled = FALSE, center = TRUE)

mbl.udd1 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                mblCtrl = ctrl.udd, 
                dissimilarityM = manhattanD,
                dissUsage = "predictors",
                k = seq(40, 150, by = 10), 
                method = "gpr")
mbl.udd1

# Example 4.2
# first derivative spectra
Xr.der.sp <- t(diff(t(rbind(Xr, Xu)), lag = 7, differences = 1)) 
Xu.der.sp <- t(diff(t(Xu), lag = 7, differences = 1)) 

# The orthogonal (pls) dissimilarity on the derivative spectra 
der.ortho <- orthoDiss(Xr = Xr.der.sp, X2 = Xu.der.sp,
                       Yr = Yr,
                       pcSelection = list("opc", 40),
                       method = "pls",
                       center = FALSE, scale = FALSE) 

der.ortho.diss <- der.ortho$dissimilarity

# mbl applied to the absorbance spectra
mbl.udd2 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                mblCtrl = ctrl.udd, 
                dissimilarityM = der.ortho.diss,
                dissUsage = "none",
                k = seq(40, 150, by = 10), 
                method = "gpr")
                                
# Example 4.3
# first derivative spectra
der.Xr <- t(diff(t(Xr), lag = 1, differences = 1)) 
der.Xu <- t(diff(t(Xu), lag = 1, differences = 1))
# the sid on the derivative spectra
der.sid <- sid(Xr = der.Xr, X2 = der.Xu, mode = "density", 
               center = TRUE, scaled = FALSE) 
der.sid <- der.sid$sid

mbl.udd3 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                mblCtrl = ctrl.udd, 
                dissimilarityM = der.sid,
                dissUsage = "none",
                k = seq(40, 150, by = 10), 
                method = "gpr")
mbl.udd3

# Example 5
# For running the mbl function in parallel
require(parallel)  # provides detectCores()
n.cores <- detectCores() - 1
if (n.cores == 0) n.cores <- 1

# Set the number of cores according to the OS
if (.Platform$OS.type == "windows") {
  require(doParallel)
  clust <- makeCluster(n.cores)   
  registerDoParallel(clust)
}else{
  require(doSNOW)
  clust <- makeCluster(n.cores, type = "SOCK")
  registerDoSNOW(clust)
  ncores <- getDoParWorkers()
}

ctrl <- mblControl(sm = "pc", pcSelection = list("opc", 40), 
                   valMethod = "NNv",
                   scaled = FALSE, center = TRUE)

mbl.p <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
             mblCtrl = ctrl, 
             dissUsage = "none",
             k = seq(40, 150, by = 10), 
             method = "gpr")
registerDoSEQ()
try(stopCluster(clust))
mbl.p
