cvq2: Calculation of the predictive squared correlation coefficient without an external data set. A cross validation is applied to the data set on which the model is based.

Description

A cross validation is performed on the data set that is used to construct the model. The cross validation splits the data set ($N$ elements) into a training set ($N-k$ elements) and a test set ($k$ elements). The model built from the training set is used to predict the $k$ held-out value(s). Each observed value is predicted exactly once within one run. If $k > 1$, the calculation can be repeated with different distributions of training and test sets to avoid bias from one particular split. The comparison between predicted and observed values yields the predictive squared correlation coefficient $q^2_{cv}$.
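The split described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's own code: `make_test_sets` is a hypothetical helper, the round-robin assignment after a random shuffle is only one possible strategy, and cvq2 may distribute the elements differently.

```r
# Hypothetical helper (not part of cvq2): distribute N row indices over
# nGroup disjoint test sets, so that every index is held out exactly once.
make_test_sets <- function(N, nGroup) {
  shuffled <- sample.int(N)                          # random order of 1..N
  group_id <- rep(seq_len(nGroup), length.out = N)   # 1, 2, ..., nGroup, 1, 2, ...
  split(shuffled, group_id)                          # list with one index vector per group
}

set.seed(1)
N <- 10
nGroup <- 3
k <- ceiling(N / nGroup)          # largest test set size

test_sets <- make_test_sets(N, nGroup)
sizes <- lengths(test_sets)       # each entry is k or k - 1
print(sizes)

# Repeating the split (the idea behind several runs) yields other distributions.
nRun <- 2
runs <- lapply(seq_len(nRun), function(r) make_test_sets(N, nGroup))
```

Each run covers every index exactly once, so each observed value is predicted once per run.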

Usage

cvq2( data, formula = NULL, nGroup = N, nRun = 1,
round = 4, extOut = FALSE, extOutFile = NULL )

Arguments

data

The data set, consisting of the parameters $x_1$, $x_2$, ..., $x_n$ and an observed value $y$

formula

The formula used to predict the observed value, e.g. $y$ ~ $x_1 + x_2 + \ldots + x_n$, DEFAULT: NULL. If NULL, a generic formula is derived from the data set, assuming that the last column contains the observed value.

nGroup

Number of individual test sets generated from the data set, DEFAULT: N, $1

nRun

Number of predictions per observed value, DEFAULT: 1

round

The rounding value, DEFAULT: 4

extOut

Extended output, DEFAULT: FALSE. If extOutFile is not specified, the output is written to stdout().

extOutFile

Write the extended output to a file (implies extOut = TRUE), DEFAULT: NULL

Value

Returns two lists:

result$cv

Contains the cross validation results.

result$fit

Contains the linear regression results.

The cross validation result list result$cv has the following elements:

nTrainingSet

The number of elements in the training set ($N-k$).

nTestSet

The number of elements in the test set ($k$).

nGroup

The number of individual sets generated from the data.

variableSplit

TRUE if some groups consist of $k-1$ elements.

datatable

For each value, the model parameters, the arithmetic mean of the training set, the observed value and the predicted value.

datatable_columns

The explanation of the datatable's column names.

nRun

The number of times each value is predicted.

q2

The predictive squared correlation coefficient.

rmse

The root mean square error. It is calculated with Bessel's correction, using $N-1$ in the denominator instead of $N$.

TestSet

The composition of the individual test sets.

The linear regression result list result$fit has the following elements:

r2

The conventional squared correlation coefficient.

rmse

The linear regression root mean square error.

n

The number of elements in the data set.

model

The linear regression model.

observed_mean

The arithmetic mean of the observed values.

datatable

The observed and predicted values.

datatable_columns

The explanation of the datatable's column names.
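The cross validation rmse described above uses $N-1$ rather than $N$ in the denominator. A minimal sketch of that calculation follows. `rmse_bessel` is a hypothetical helper and the numbers are made up for illustration; the exact residuals that cvq2 feeds into this formula are an assumption here.

```r
# Hypothetical helper (not part of cvq2): root mean square error with
# N - 1 in the denominator instead of N.
rmse_bessel <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted), length(observed) > 1)
  n <- length(observed)
  sqrt(sum((predicted - observed)^2) / (n - 1))
}

# Made-up illustrative values, not taken from any cvq2 data set.
observed  <- c(1.0, 2.0, 3.0, 4.0)
predicted <- c(1.1, 1.9, 3.2, 3.8)

rmse_value <- rmse_bessel(observed, predicted)
print(rmse_value)
```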

Details

This method performs a cross validation for a data set with nGroup training and test sets. The given data set is split into nGroup groups; in turn, each group serves as the test set while the remaining groups are merged into the training set. Each group consists of $k$ elements:

$$k = \left\lceil\frac{N}{nGroup}\right\rceil$$

In general, each test set has size $k$ and the corresponding training set has size $N-k$. If $\frac{N}{nGroup}$ is not an integer, some groups consist of $k-1$ elements.

For each test set, the training set with the remaining values is used to construct a model that predicts the observed values of the test set. Because the $k$ test values are missing, this model differs slightly from the model used for the $r^2$ calculation. The difference between the prediction and the observation is used to calculate the PREdictive residual Sum of Squares (PRESS). Furthermore, for each training set the mean of the observed values, $y_{mean}^{N-k,i}$, is calculated. With PRESS and $y_{mean}^{N-k,i}$, the modified $q^2_{cv}$ equation yields the predictive squared correlation coefficient.
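The procedure above can be sketched with base R's `lm()` and `predict()`. The text does not spell out the modified equation, so this sketch assumes the form $$q^2_{cv} = 1 - \frac{PRESS}{\sum_{i=1}^{N}\left(y_i - y_{mean}^{N-k,i}\right)^2}$$ in which each observed value is compared with the mean of its own training set. `cv_q2` is a hypothetical helper, the data are synthetic, and the random group assignment is not necessarily the one cvq2 uses.

```r
# Hypothetical helper (not part of cvq2): manual cross validation with lm().
# Assumes q2 = 1 - PRESS / sum((y_i - training-set mean for i)^2).
cv_q2 <- function(data, formula, nGroup) {
  N <- nrow(data)
  y <- model.response(model.frame(formula, data))
  # Random assignment of every row to one of nGroup groups (sizes k or k - 1).
  group_id <- sample(rep(seq_len(nGroup), length.out = N))

  press  <- 0   # predictive residual sum of squares
  ss_ref <- 0   # squared deviations from the training-set means
  for (g in seq_len(nGroup)) {
    test  <- which(group_id == g)
    train <- data[-test, , drop = FALSE]
    fit   <- lm(formula, data = train)                  # model without the test values
    pred  <- predict(fit, newdata = data[test, , drop = FALSE])
    y_mean_train <- mean(model.response(model.frame(formula, train)))
    press  <- press  + sum((pred - y[test])^2)
    ss_ref <- ss_ref + sum((y[test] - y_mean_train)^2)
  }
  1 - press / ss_ref
}

# Synthetic data for illustration only: a noisy straight line.
set.seed(42)
toy <- data.frame(x = 1:20)
toy$y <- 2 * toy$x + 1 + rnorm(20, sd = 0.5)

q2 <- cv_q2(toy, y ~ x, nGroup = 5)
print(q2)
```

Using the training-set mean instead of the overall mean keeps the held-out values from contributing to the reference term.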

Additionally, the conventional squared correlation coefficient, $r^2$, is calculated with a linear regression for the entire data set.
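For comparison, the conventional $r^2$ of a linear regression on the entire data set can be computed as follows. This is a sketch with synthetic data. It uses only base R and is not taken from the cvq2 source.

```r
# Synthetic data for illustration only.
set.seed(7)
toy <- data.frame(x = 1:15)
toy$y <- 3 * toy$x - 2 + rnorm(15)

# Fit one model to all N elements; nothing is held out.
fit   <- lm(y ~ x, data = toy)
y_obs <- toy$y
y_fit <- fitted(fit)

# r2 = 1 - residual sum of squares / total sum of squares around the overall mean.
r2 <- 1 - sum((y_obs - y_fit)^2) / sum((y_obs - mean(y_obs))^2)
print(r2)
```

For a model with an intercept, this value agrees with `summary(fit)$r.squared`.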

Examples

library(cvq2)
data(cvq2.setA)
result <- cvq2( cvq2.setA, y ~ x1 + x2 )
result

data(cvq2.setB)
result <- cvq2( cvq2.setB, y ~ x, nGroup = 3 )
result

data(cvq2.setB)
result <- cvq2( cvq2.setB, y ~ x, nGroup = 3, nRun = 5 )
result
