regr: Regression functions for soil spectral analysis

Description

Different regression functions (partial least-squares, boosted regression trees, support vector machines) can be chosen for calibrations of one or more constituents. Function settings are optimized for soil spectral analysis, but can be varied. Possible spectral transformations are described in the trans function. The next update of this package will fix some problems observed for this method.

Arguments

x

a numerical data.frame or matrix containing the raw spectra in regr. An object of class "regr" in plot.regr.

y

a numerical data.frame or matrix containing the constituents.

a character vector giving the path where the function output is saved. If "NL" (default), the current working directory is used. Also used in pred.regr.

a character giving the desired spectral transformation. Available are "raw" (raw spectra), "derivative" (derivative spectra), "continuum removed" (continuum removed spectra) and "wt" (wavelet coefficients).

r

a character giving the regression method. Available are "pls" (partial least-squares), "brt" (boosted regression trees) and "svm" (support vector machines).

p

a logical indicating whether validation samples should be chosen as a percentage of x (given in pn). If "FALSE", the number given in n is used.

pn

a numeric between 0 and 1 giving the proportion of validation samples to choose.

n

an integer giving the number of validation samples when p is "FALSE".

a character naming the model output.

drv

an integer between 0 and 3 giving the order of derivative. The value 0 performs smoothing based on bandwidth.

bandwidth

an integer between 1 and 30 defining the smoothing interval in wavebands.

val

a character defining the type of cross-validation procedure when r is equal to "pls". Available are "none" (no cross-validation procedure), "CV" (cross-validation in 10 segments) and "LOO" (leave-one-out cross-validation).

a character defining the wavelet filter (ft) used by the dwt function in the wavelets package.

a character defining the level (lv) of wavelet coefficient extraction (1 to 10 possible; 1 yields 512 coefficients, 2 yields 256 coefficients, and so on).

dis

a character giving the distribution in the gbm.fit function.

an integer giving the total number of trees to fit in the gbm.fit function.

a character giving the shrinkage parameter (s) in the gbm.fit function.

a character giving the kernel (k) used in the svm function.

...

additional arguments.

object

an object of class "regr".

new

a numerical data frame or matrix containing the new spectra.

model

an object of class "regr".

output.name

a character naming the prediction output csv-file in pred.regr.

Value

regr returns a list with class "regr" containing the following components excluding the last four ones. pred.regr returns a list with class "pred.regr" containing the last four components output.name, predicted.values, method, and spectral.transformation (see below):

model

a character naming how the model output was named.

model

a list containing the regression output of class "mvr", "gbm" or "svm".

x.tr

a matrix containing the transformed spectra.

spectral.transformation

a character naming the spectral transformation.

constituents

a character naming the constituents.

constituents.transformation

a character naming the constituent transformations. Needed for pred.regr.

lambda

a numeric giving the lambda values in case the box-cox transformation was chosen as constituents transformation. Needed for pred.regr.

method

a character naming the used regression method.

cal.samples

a list containing the row names of the calibration samples for each soil constituent.

val.samples

a list containing the row names of the validation samples for each soil constituent.

cal.statistics

a matrix containing the calibration statistics for all constituents. See details.

cal.mea.pre

a data frame containing the calibration set measured and predicted values for all constituents.

val.statistics

a matrix containing the validation statistics for all constituents. See details.

val.mea.pre

a data frame containing the validation set measured and predicted values for all constituents.

cal.pca

a list containing objects of the class "prcomp" for each constituent calibration set. Needed for pred.regr.

mahalanobis

a list containing numeric vectors having the spectral mahalanobis distance of the constituents calibration sets. Needed for pred.regr.

cal.range

a list containing numeric vectors having the ranges of the constituents calibration sets. Needed for pred.regr.

rmsep

a list containing numeric vectors having for each constituent the root median square error of prediction for each validation set sample. See details for further explanation. Needed for pred.regr.

lm

a list containing numeric vectors having for each constituent validation set the fitted values calculated by linear regression of measured against predicted values. Needed for pred.regr.

wavebands

a numeric vector containing the wavebands of x. Needed for pred.regr.

drv

an integer giving the order of derivative. Needed for pred.regr.

bandwidth

an integer defining the smoothing interval in wavebands. Needed for pred.regr.

ft

a character defining the wavelet ft. Needed for pred.regr.

lv

a character defining the lv of wavelet coefficients extraction. Needed for pred.regr.

output.name

a character string giving the name of the saved csv-file from pred.regr.

predicted.values

a matrix containing the predicted values and its respective confidence interval limits.

method

a character naming the used regression method.

spectral.transformation

a character naming the used spectral transformation method.

Note

Please note that the usage section was removed because the function has too many model parameters, which exceeds the character limit for a single line. The explanation of the concept adopted in the model is included here to inform readers about the capabilities of this function. We are committed to revising the usage section so that it conforms to the CRAN system. The code has been tested and works. Those who are interested can contact the author for the code file, which can be loaded locally in the same way as any other local function in R.

Details

Missing values in y are allowed.

regr uses the mvr function in the pls package for partial least-squares regression, the gbm.fit function in the gbm package for boosted regression trees and the svm function in the e1071 package for support vector machine regression. The number of important PLS latent variables is selected, and the svm parameters are optimized, automatically based on experience with soil spectra.

For spectral transformation, sp uses (i) the locpoly function in the KernSmooth package for derivative calculation, (ii) the chull function in the grDevices package and the approx function in the stats package for continuum removal and (iii) the dwt function in the wavelets package for extraction of wavelet coefficients. Experience with wavelet decomposition showed that the best trade-off between prediction performance and sparse spectral representation is reached when all 128 wavelet coefficients from decomposition level three are taken (which is the default).
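As an illustration of the continuum removal step, the following is a minimal sketch in base R that builds the upper convex hull of a spectrum with chull, interpolates it with approx and divides the spectrum by it. The helper name continuum_removed, the anchor-point trick and the simulated spectrum are assumptions made for this example; the package's internal code may differ.

```r
# Continuum removal sketch: divide a spectrum by its upper convex hull.
# `continuum_removed` is a hypothetical helper, not a function of this package.
continuum_removed <- function(wavebands, reflectance) {
  n <- length(wavebands)
  # Anchor points far below the spectrum at both ends, so that the hull
  # consists of the two anchors plus the upper boundary of the spectrum.
  low <- min(reflectance) - 10 * (diff(range(reflectance)) + 1)
  hull <- chull(c(wavebands[1], wavebands, wavebands[n]),
                c(low, reflectance, low))
  # Drop the two anchors and shift indices back to the original spectrum.
  vertices <- sort(setdiff(hull, c(1, n + 2))) - 1
  # Linear interpolation of the hull (the continuum) at every waveband.
  continuum <- approx(wavebands[vertices], reflectance[vertices],
                      xout = wavebands)$y
  reflectance / continuum
}

# Simulated spectrum with an absorption feature near 1400 (made-up data).
wavebands <- seq(400, 2500, by = 10)
reflectance <- 0.3 + 0.1 * sin(wavebands / 300) -
  0.05 * exp(-((wavebands - 1400) / 40)^2)
cr <- continuum_removed(wavebands, reflectance)
```

The sketch assumes increasing wavebands and positive reflectance values. Continuum removed values are at most 1 and drop below 1 inside absorption features.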

Settings in the functions used for regression and transformation are chosen based on experience with soil spectra calibrations. It is recommended to keep the given default values. Nevertheless, the settings can be adapted to a certain degree. If you need the complete functionality, use the named functions directly. If r is "brt", the number of samples has to be more than 70.

Column names of x and new must contain the wavebands. Wavebands are made compatible automatically if needed (see details in read.spc).

Constituent values are not always normally distributed, which can violate prerequisites of regression methods. Transforming the values prior to regression can solve this problem. The regr function offers log, square root and Box-Cox transformations alongside the untransformed values and lets the user decide graphically which transformation to use for each constituent.
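The candidate transformations can be sketched in base R as follows. The helpers boxcox, boxcox_inverse and boxcox_lambda, the search interval for lambda and the simulated data are assumptions made for this example; how regr estimates lambda internally is not documented here beyond its use of optimize.

```r
# Candidate transformations of a skewed, strictly positive constituent.
# `boxcox`, `boxcox_inverse` and `boxcox_lambda` are hypothetical helpers.
boxcox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

boxcox_inverse <- function(z, lambda) {
  if (abs(lambda) < 1e-8) return(exp(z))
  base <- lambda * z + 1
  # Non-positive bases have no valid back-transform, so they become NA.
  ifelse(base > 0, base^(1 / lambda), NA_real_)
}

# Profile log-likelihood of lambda for a mean-only model, maximised
# with optimize() over an assumed search interval of [-2, 2].
boxcox_lambda <- function(y, interval = c(-2, 2)) {
  n <- length(y)
  loglik <- function(lambda) {
    z <- boxcox(y, lambda)
    -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(y))
  }
  optimize(loglik, interval = interval, maximum = TRUE)$maximum
}

set.seed(1)
y <- rlnorm(200, meanlog = 0.5, sdlog = 0.8)  # simulated, right-skewed
lambda <- boxcox_lambda(y)
candidates <- list(none = y, log = log(y), sqrt = sqrt(y),
                   boxcox = boxcox(y, lambda))
```

Plotting a histogram of each element of candidates is one way to compare the transformations graphically. The NA branch in boxcox_inverse shows why back-transformed predictions can be missing after a Box-Cox transformation.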

Predictions from pred.regr are returned with the prediction uncertainty for each individual sample (based on the validation set prediction error). The prediction uncertainty is calculated as the root median square error of prediction (RMedianSEP) using a moving window of at most 50 samples with similar predicted values. The confidence interval is calculated from the RMedianSEP.
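A local RMedianSEP of this kind could look roughly like the sketch below. The helper local_rmediansep, the nearest-neighbour selection of "similar" predicted values and the simulated validation data are assumptions made for this example. The step from RMedianSEP to confidence interval limits is not specified here and is therefore left out.

```r
# Sketch of a local RMedianSEP: for a new predicted value, take the
# validation samples with the most similar predicted values (at most
# `window` of them) and summarise their squared prediction errors.
# `local_rmediansep` is a hypothetical helper, not part of the package.
local_rmediansep <- function(new_pred, val_measured, val_predicted,
                             window = 50) {
  k <- min(window, length(val_predicted))
  vapply(new_pred, function(p) {
    nearest <- order(abs(val_predicted - p))[seq_len(k)]
    sqrt(median((val_measured[nearest] - val_predicted[nearest])^2))
  }, numeric(1))
}

set.seed(42)
val_predicted <- runif(120, min = 0, max = 10)
# Simulated measurements whose error grows with the predicted value.
val_measured <- val_predicted + rnorm(120, sd = 0.1 + 0.1 * val_predicted)
u <- local_rmediansep(c(1, 9), val_measured, val_predicted)
```

Because the window follows the predicted value, samples predicted in a poorly fitted part of the constituent range receive a larger uncertainty than samples in a well fitted part.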

Predictions are only made if (i) the new spectrum lies within the mahalanobis space of the calibration set, (ii) there is a local neighbor within of 5 and (iii) the predicted value lies within the calibration set range. Otherwise they are set to NA values. Mahalanobis distance can only be calculated when the number of calibration samples is higher than the number of wavebands/variables.
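Checks (i) and (iii) can be sketched in base R as shown below. All object names, the use of five principal components, the "largest calibration distance" threshold and the simulated data are assumptions made for this example; the neighbor check (ii) is not covered.

```r
# Sketch of two of the checks: Mahalanobis distance in the space of the
# leading principal components, and the calibration range of the
# constituent. Names and thresholds are assumptions for illustration.
set.seed(7)
cal_spectra <- matrix(rnorm(60 * 20), nrow = 60)  # 60 samples, 20 wavebands
pca <- prcomp(cal_spectra, center = TRUE, scale. = FALSE)

# Three new spectra that are averages of calibration spectra, and one
# that is pushed far out along the first principal component.
typical <- rbind(colMeans(cal_spectra[1:10, ]),
                 colMeans(cal_spectra[11:20, ]),
                 colMeans(cal_spectra[21:30, ]))
outlier <- pca$center + 50 * pca$sdev[1] * pca$rotation[, 1]
new_spectra <- rbind(typical, outlier)
predicted <- c(2.1, 9.9, 3.4, 2.8)  # made-up predictions
cal_range <- c(0.5, 6.0)            # made-up calibration range

n_pc <- 5
cal_scores <- pca$x[, seq_len(n_pc)]
new_scores <- predict(pca, new_spectra)[, seq_len(n_pc), drop = FALSE]

# mahalanobis() returns squared distances; comparing squared values
# gives the same decision as comparing the distances themselves.
centre <- colMeans(cal_scores)
covariance <- cov(cal_scores)
cal_dist <- mahalanobis(cal_scores, centre, covariance)
new_dist <- mahalanobis(new_scores, centre, covariance)

inside_space <- new_dist <= max(cal_dist)
inside_range <- predicted >= cal_range[1] & predicted <= cal_range[2]
predicted[!(inside_space & inside_range)] <- NA
```

In this example the second prediction is set to NA because it lies outside the calibration range, and the fourth because its spectrum lies outside the calibration space.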

Calibration statistics contain for each constituent (i) n, the number of samples used in calibration, (ii) r2, the coefficient of determination for the linear regression of measured against predicted values, (iii) a, the slope of the regression line, (iv) bias, the bias, (v) RMSEC, the root mean square error of calibration, (vi) RPD, the ratio of constituent standard deviation to RMSEC, (vii) n LV, the number of latent variables used when r is equal to "pls", (viii) n bc out, the number of back-transformed values that are NA after Box-Cox transformation and (ix) n trees, the number of trees when r is equal to "brt". Validation statistics contain for each constituent points (i) to (vi), where the RMSEC is replaced by the RMSEP, the root mean square error of prediction.
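Statistics (i) to (vi) can be computed from measured and predicted values roughly as in the following sketch. The helper fit_statistics, the sign convention of the bias (predicted minus measured) and the example values are assumptions made for this example.

```r
# Sketch of statistics (i) to (vi) for one constituent.
# `fit_statistics` is a hypothetical helper; the sign convention of the
# bias (predicted minus measured) is an assumption.
fit_statistics <- function(measured, predicted) {
  keep <- complete.cases(measured, predicted)
  measured <- measured[keep]
  predicted <- predicted[keep]
  fit <- lm(measured ~ predicted)  # measured regressed against predicted
  rmse <- sqrt(mean((predicted - measured)^2))
  c(n = length(measured),
    r2 = summary(fit)$r.squared,
    a = unname(coef(fit)[2]),
    bias = mean(predicted - measured),
    RMSE = rmse,
    RPD = sd(measured) / rmse)
}

# Made-up values; the missing measurement is dropped before fitting.
measured <- c(1.2, 2.5, 3.1, 4.8, 5.0, NA, 7.4)
predicted <- c(1.0, 2.9, 3.0, 4.5, 5.6, 6.1, 7.0)
stats <- fit_statistics(measured, predicted)
```

Applied to calibration samples the RMSE element corresponds to the RMSEC, and applied to validation samples it corresponds to the RMSEP.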

The calibration and validation regressions of all constituents are plotted and the statistics printed in the Console.

Nearly every run of regr yields the following warning message: 1: in optimize(f = function(lambda).... It is related to the Box-Cox transformation and has no negative side effects.