
cvq2 (version 1.1.0)

cvq2-package: Calculate the predictive squared correlation coefficient.

Description

This package calculates the predictive squared correlation coefficient, $q^2$, in comparison to the well-known conventional squared correlation coefficient, $r^2$. For a given model M, $q^2$ indicates the prediction performance of M, whereas $r^2$ is a measure of its calibration performance.

Details

Package: cvq2
Type: Package
Version: 1.1.0
Date: 2013-03-13
Depends: stats
Encoding: latin1
License: GPL v3
LazyLoad: yes

The calculation procedure is as follows. The model M is described by a data set in which the parameters $x_1 \ldots x_n$ describe an observation $y$.

First, a general linear regression is applied to M. From it, the conventional squared correlation coefficient, $r^2$, is calculated: $$r^2 = 1-\frac{\sum\limits_{i=1}^N\left( y_i^{fit} - y_i\right)^2}{\sum\limits_{i=1}^N\left( y_i - y_{mean}\right)^2} \equiv 1 - \frac{RSS}{SS}$$ The numerator of the fraction is the Residual Sum of Squares, RSS, the sum of squared differences between the fitted values $y_i^{fit}$ and the observed values $y_i$. The denominator is the Sum of Squares, SS, the sum of squared differences between the observed values $y_i$ and their mean $y_{mean}$.

To compare the calibration of M with its prediction power, M is applied to an external data set. The comparison of the predicted values $y_i^{pred}$ with the observed values $y_i$ leads to the predictive squared correlation coefficient, $q^2$: $$q^2 = 1-\frac{\sum\limits_{i=1}^N\left( y_i^{pred} - y_i\right)^2}{\sum\limits_{i=1}^N\left( y_i - y_{mean}\right)^2} \equiv 1 - \frac{PRESS}{SS}$$ The PREdictive residual Sum of Squares, PRESS, sums the squared differences between the predicted values $y_i^{pred}$ and the observed values $y_i$. The Sum of Squares, SS, again refers to the squared differences between the observed values $y_i$ and their mean $y_{mean}$. To avoid any bias, $y_{mean}$ is here the arithmetic mean of the $y_i$ from the external data set. For comparison, $q^2_{tr}$ is also calculated; it differs slightly from the previous $q^2$ equation in that it uses the arithmetic mean of the observed values in the training data set, $y_{mean}^{training}$: $$q^2_{tr} = 1-\frac{\sum\limits_{i=1}^N\left( y_i^{pred} - y_i\right)^2}{\sum\limits_{i=1}^N\left( y_i - y_{mean}^{training}\right)^2}$$

If no external data set is available, one can perform a cross-validation to evaluate the prediction performance. The cross-validation splits the model data set ($N$ elements) into a training set ($N-k$ elements) and a test set ($k$ elements). Each training set yields an individual model M', which is used to predict the missing $k$ value(s); each model M' is slightly different from M. In the end, every observed value has been predicted once, and the comparison between the observations and the predictions yields $q^2_{cv}$: $$q^2_{cv} = 1-\frac{\sum\limits_{i=1}^N\left( y_i^{pred(N-k)} - y_i\right)^2}{\sum\limits_{i=1}^N\left( y_i - y_{mean}^{N-k,i}\right)^2}$$ The arithmetic mean used in this equation, $y_{mean}^{N-k,i}$, is calculated individually for each test set, from the observed values of the corresponding training set.

If $k > 1$, the composition of the training and test sets may affect the calculated predictive squared correlation coefficient. To reduce this bias, the calculation can be repeated with different compositions of the training and test sets; each observed value is then predicted several times, once per run.

Note that if the prediction performance is evaluated with cross-validation, the predictive squared correlation coefficient, $q^2$, is a more accurate measure of model performance than the conventional squared correlation coefficient, $r^2$.
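As a quick illustration of these quantities, the following base-R sketch computes $r^2$, the external $q^2$ (test-set mean), and $q^2_{tr}$ (training-set mean) by hand. The simulated data, variable names, and split are illustrative assumptions and do not reproduce the package's internal implementation.

  # simulated training and external (test) data; names x1, x2, y are illustrative
  set.seed(1)
  train <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
  train$y <- 1 + 2 * train$x1 - train$x2 + rnorm(20, sd = 0.3)
  test <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
  test$y <- 1 + 2 * test$x1 - test$x2 + rnorm(10, sd = 0.3)

  M <- lm(y ~ x1 + x2, data = train)       # model M, calibrated on the training set

  RSS <- sum((fitted(M) - train$y)^2)
  SS  <- sum((train$y - mean(train$y))^2)
  r2  <- 1 - RSS / SS                      # conventional r^2 = 1 - RSS/SS

  y_pred <- predict(M, newdata = test)
  PRESS  <- sum((y_pred - test$y)^2)
  q2     <- 1 - PRESS / sum((test$y - mean(test$y))^2)    # q^2: y_mean from the external set
  q2_tr  <- 1 - PRESS / sum((test$y - mean(train$y))^2)   # q^2_tr: y_mean from the training set

  c(r2 = r2, q2 = q2, q2_tr = q2_tr)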

References

  1. Cramer RD III. 1980. BC(DEF) Parameters. 2. An Empirical Structure-Based Scheme for the Prediction of Some Physical Properties. J. Am. Chem. Soc. 102:1849-1859.
  2. Cramer RD III, Bunce JD, Patterson DE, Frank IE. 1988. Crossvalidation, Bootstrapping, and Partial Least Squares Compared with Multiple Linear Regression in Conventional QSAR Studies. Quant. Struct.-Act. Relat. 1988:18-25.
  3. Organisation for Economic Co-operation and Development. 2007. Guidance document on the validation of (quantitative) structure-activity relationship [(Q)SAR] models. OECD Series on Testing and Assessment 69. OECD Document ENV/JM/MONO(2007)2, pp 55 (paragraph no. 198) and 65 (Table 5.7).
  4. Schüürmann G, Ebert R-U, Chen J, Wang B, Kühne R. 2008. External validation and prediction employing the predictive squared correlation coefficient - test set activity mean vs training set activity mean. J. Chem. Inf. Model. 48:2140-2145.
  5. Tropsha A, Gramatica P, Gombar VK. 2003. The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models. QSAR Comb. Sci. 22:69-77.

Examples

library(cvq2)

# Cross-validation of the linear model y ~ x1 + x2 on data set A
data(cvq2.setA)
result <- cvq2( cvq2.setA, y ~ x1 + x2 )
result

# 3-fold cross-validation on data set B
data(cvq2.setB)
result <- cvq2( cvq2.setB, y ~ x, nFold = 3 )
result

# 3-fold cross-validation, repeated in 5 runs with different group compositions
data(cvq2.setB)
result <- cvq2( cvq2.setB, y ~ x, nFold = 3, nRun = 5 )
result

# External validation: calibrate on data set A, predict the external data set A_pred
data(cvq2.setA)
data(cvq2.setA_pred)
result <- q2( cvq2.setA, cvq2.setA_pred, y ~ x1 + x2 )
result
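
The cross-validated $q^2_{cv}$ from the Details section can also be written out by hand. The following leave-one-out (k = 1) sketch uses simulated data and base R only; the data and column names are illustrative assumptions, and the loop does not reproduce the package's internal implementation.

  set.seed(2)
  d <- data.frame(x = rnorm(15))
  d$y <- 0.5 + 1.5 * d$x + rnorm(15, sd = 0.4)

  press <- 0   # sum of squared prediction errors
  ss    <- 0   # squared deviations from the fold-specific training mean
  for (i in seq_len(nrow(d))) {
    Mi     <- lm(y ~ x, data = d[-i, ])            # model M' fitted without observation i
    y_pred <- as.numeric(predict(Mi, newdata = d[i, , drop = FALSE]))
    press  <- press + (y_pred - d$y[i])^2
    ss     <- ss + (d$y[i] - mean(d$y[-i]))^2      # y_mean^{N-k,i}: mean of the training fold
  }
  q2_cv <- 1 - press / ss
  q2_cv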
