
cvq2 (version 1.0.2)

cvq2-package: Calculate the predictive squared correlation coefficient.

Description

This package calculates the predictive squared correlation coefficient, $q^2$, in comparison to the well-known conventional squared correlation coefficient, $r^2$. For a given model M, $q^2$ indicates the prediction performance of M, whereas $r^2$ is a measure of its calibration performance.

Encoding

latin1

Details

Package: cvq2
Type: Package
Version: 1.0.2
Date: 2012-12-03
Depends: stats
License: GPL v3
LazyLoad: yes

The calculation procedure is as follows: for a given data set, a general linear regression is performed to calculate the conventional squared correlation coefficient, $r^2$: $$r^2 = 1-\frac{\sum\limits_{i=1}^N\left( y_i^{fit} - y_i\right)^2}{\sum\limits_{i=1}^N\left( y_i - y_{mean}\right)^2} \equiv 1 - \frac{RSS}{SS}$$ The observed values $y_i$ are compared to the fitted values $y_i^{fit}$ obtained from the linear regression and yield the calibration performance, $r^2$, of the described model. The numerator is the Residual Sum of Squares (RSS), the squared difference between the fitted values $y_i^{fit}$ and the observed values $y_i$. The denominator is the Sum of Squares (SS), the squared difference between the observed values $y_i$ and their mean $y_{mean}$.

To compare the calibration of the model with its prediction power, the model is applied to an external data set. The comparison of the predicted values $y_i^{pred}$ with the observed values $y_i$ leads to the predictive squared correlation coefficient, $q^2$: $$q^2 = 1-\frac{\sum\limits_{i=1}^N\left( y_i^{pred} - y_i\right)^2}{\sum\limits_{i=1}^N\left( y_i - y_{mean}\right)^2} \equiv 1 - \frac{PRESS}{SS}$$ The PREdictive residual Sum of Squares (PRESS) is the squared difference between the predicted values $y_i^{pred}$ and the observed values $y_i$. The Sum of Squares (SS) refers to the squared difference between the observed values $y_i$ and their mean $y_{mean}$. To avoid any bias, $y_{mean}$ is the arithmetic mean of the $y_i$ from the external data set.
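The two coefficients can be sketched directly from their definitions. A minimal sketch with hypothetical data (`train`, `test`, and the linear model are illustrative, not part of the package): $r^2$ is computed from the fitted values on the calibration data, $q^2$ from predictions on the external data set, with $y_{mean}$ taken from the external set.

```r
# Hypothetical data: a linear relationship with noise
set.seed(1)
train <- data.frame(x = 1:10)
train$y <- 2 * train$x + rnorm(10)
test <- data.frame(x = 11:15)
test$y <- 2 * test$x + rnorm(5)

fit <- lm(y ~ x, data = train)

# Calibration: r^2 = 1 - RSS/SS on the training data
rss <- sum((fitted(fit) - train$y)^2)
ss  <- sum((train$y - mean(train$y))^2)
r2  <- 1 - rss / ss

# Prediction: q^2 = 1 - PRESS/SS on the external data,
# with y_mean taken from the external (test) set
press  <- sum((predict(fit, test) - test$y)^2)
ss_ext <- sum((test$y - mean(test$y))^2)
q2     <- 1 - press / ss_ext
```

Both coefficients are bounded above by 1; $q^2$ can become negative when the model predicts worse than the external mean.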
Alternatively, the arithmetic mean of the observed values in the training set, $y_{mean}^{training}$, can be used to determine the prediction performance. The resulting $q^2_{tr}$ equation is slightly different from the previous $q^2$ equation: $$q^2_{tr} = 1-\frac{\sum\limits_{i=1}^N\left( y_i^{pred} - y_i\right)^2}{\sum\limits_{i=1}^N\left( y_i - y_{mean}^{training}\right)^2}$$

Furthermore, if no external data set is available, one can perform a cross-validation to evaluate the prediction performance. The cross-validation splits the data set ($N$ elements) into a training set ($N-k$ elements) and a test set ($k$ elements). Each training set yields a model, which is used to predict the $k$ missing value(s). In the end, every observed value has been predicted once, and the comparison between observations and predictions yields $q^2_{cv}$: $$q^2_{cv} = 1-\frac{\sum\limits_{i=1}^N\left( y_i^{pred(N-k)} - y_i\right)^2}{\sum\limits_{i=1}^N\left( y_i - y_{mean}^{N-k,i}\right)^2}$$ The arithmetic mean used in this equation, $y_{mean}^{N-k,i}$, is calculated individually for each test set from the observed values in the corresponding training set. If $k > 1$, the partition into training and test sets may have an impact on the calculation of the predictive squared correlation coefficient. To avoid bias, one can repeat the calculation with different partitions; every observed value is then predicted multiple times, once per run. Note that in the case of cross-validation, the calculation of the predictive squared correlation coefficient, $q^2$, is more accurate than the calculation of the conventional squared correlation coefficient, $r^2$.
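The cross-validation loop above can be sketched for the leave-one-out case ($k = 1$). This is an illustrative sketch with hypothetical data, not the package's internal implementation: each of the $N$ models is fit on $N-1$ points, and the training-set mean $y_{mean}^{N-k,i}$ enters the denominator individually for each left-out point.

```r
# Hypothetical data set
set.seed(2)
d <- data.frame(x = 1:12)
d$y <- 1.5 * d$x + rnorm(12)
N <- nrow(d)

pred  <- numeric(N)
denom <- numeric(N)
for (i in seq_len(N)) {
  fit_i    <- lm(y ~ x, data = d[-i, ])                 # model on the N-1 training points
  pred[i]  <- predict(fit_i, d[i, , drop = FALSE])      # predict the left-out observation
  denom[i] <- (d$y[i] - mean(d$y[-i]))^2                # mean of the training set, not of all y
}
q2_cv <- 1 - sum((pred - d$y)^2) / sum(denom)
```

For $k > 1$ the loop would run over groups instead of single observations, and repeated runs with different group compilations would average out the partitioning effect.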

References

  1. Cramer RD III. 1980. BC(DEF) Parameters. 2. An Empirical Structure-Based Scheme for the Prediction of Some Physical Properties. J. Am. Chem. Soc. 102:1849-1859.
  2. Cramer RD III, Bunce JD, Patterson DE, Frank IE. 1988. Crossvalidation, Bootstrapping, and Partial Least Squares Compared with Multiple Linear Regression in Conventional QSAR Studies. Quant. Struct.-Act. Relat. 1988:18-25.
  3. Organisation for Economic Co-operation and Development. 2007. Guidance document on the validation of (quantitative) structure-activity relationship [(Q)SAR] models. OECD Series on Testing and Assessment 69. OECD Document ENV/JM/MONO(2007)2, pp 55 (paragraph no. 198) and 65 (Table 5.7).
  4. Schüürmann G, Ebert R-U, Chen J, Wang B, Kühne R. 2008. External validation and prediction employing the predictive squared correlation coefficient - test set activity mean vs training set activity mean. J. Chem. Inf. Model. 48:2140-2145.
  5. Tropsha A, Gramatica P, Gombar VK. 2003. The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models. QSAR Comb. Sci. 22:69-77.

Examples

library(cvq2)

# model with two descriptors, default cross-validation settings
data(cvq2.setA)
result <- cvq2( cvq2.setA, y ~ x1 + x2 )
result

# split the data set into 3 groups for the cross-validation
data(cvq2.setB)
result <- cvq2( cvq2.setB, y ~ x, nGroup = 3 )
result

# repeat the 3-group cross-validation in 5 runs
# with different compilations of training and test set
result <- cvq2( cvq2.setB, y ~ x, nGroup = 3, nRun = 5 )
result
