rda: Regularized Discriminant Analysis (RDA)

Description

Builds a classification rule using regularized group covariance matrices that are supposed to be more robust against multicollinearity in the data.

Usage

rda(x, ...)
# S3 method for default
rda(x, grouping = NULL, prior = NULL, gamma = NA, 
    lambda = NA, regularization = c(gamma = gamma, lambda = lambda), 
    crossval = TRUE, fold = 10, train.fraction = 0.5, 
    estimate.error = TRUE, output = FALSE, startsimplex = NULL, 
    max.iter = 100, trafo = TRUE, simAnn = FALSE, schedule = 2, 
    T.start = 0.1, halflife = 50, zero.temp = 0.01, alpha = 2, 
    K = 100, ...)
# S3 method for formula
rda(formula, data, ...)

Arguments

x

Matrix or data frame containing the explanatory variables (required if formula is not given).

formula

Formula of the form ‘groups ~ x1 + x2 + ...’.

data

A data frame (or matrix) containing the explanatory variables.

grouping

(Optional) a vector specifying the class for each observation; if not specified, the first column of ‘data’ is taken.

prior

(Optional) prior probabilities for the classes. Default: proportional to training sample sizes. “prior=1” indicates equally likely classes.

gamma, lambda, regularization

One or both of the rda-parameters may be fixed manually. Unspecified parameters are determined by minimizing the estimated error rate (see below).

crossval

Logical. If TRUE, in the optimization step the error rate is estimated by Cross-Validation, otherwise by drawing several training- and test-samples.

fold

The number of Cross-Validation- or Bootstrap-samples to be drawn.

train.fraction

In case of Bootstrapping: the fraction of the data to be used for training in each Bootstrap-sample; the remainder is used to estimate the misclassification rate.

estimate.error

Logical. If TRUE, the apparent error rate for the final parameter set is estimated.

output

Logical flag to indicate whether text output during computation is desired.

startsimplex

(Optional) a starting simplex for the Nelder-Mead-minimization.

max.iter

Maximum number of iterations for Nelder-Mead.

trafo

Logical; indicates whether minimization is carried out using transformed parameters.

simAnn

Logical; indicates whether Simulated Annealing shall be used.

schedule

Annealing schedule 1 or 2 (exponential or polynomial).

T.start

Starting temperature for Simulated Annealing.

halflife

Number of iterations until temperature is reduced to a half (schedule 1).

zero.temp

Temperature at which it is set to zero (schedule 1).

alpha

Power of temperature reduction (linear, quadratic, cubic,...) (schedule 2).

K

Number of iterations until temperature = 0 (schedule 2).

...

currently unused

Value

A list of class rda containing the following components:

call

The (matched) function call.

regularization

vector containing the two regularization parameters (gamma, lambda)

classes

the names of the classes

prior

the prior probabilities for the classes

error.rate

apparent error rate (if computation was not suppressed), and, if any optimization took place, the final (cross-validated or bootstrapped) error rate estimate as well.

means

Group means.

covariances

Array of group covariances.

covpooled

Pooled covariance.

converged

(Logical) indicator of convergence (only for Nelder-Mead).

iter

Number of iterations actually performed (only for Nelder-Mead).

More details

The explicit definition of $\gamma$, $\lambda$ and the resulting covariance estimates is as follows:

The pooled covariance estimate $\hat{\Sigma}$ is given as well as the individual covariance estimates $\hat{\Sigma}_k$ for each group.

First, using $\lambda$, a convex combination of these two is computed: $$\hat{\Sigma}_k (\lambda) := (1-\lambda) \hat{\Sigma}_k + \lambda \hat{\Sigma}.$$ Then, another convex combination is constructed using the above estimate and a (scaled) identity matrix: $$\hat{\Sigma}_k (\lambda,\gamma) = (1-\gamma)\hat{\Sigma}_k(\lambda)+ \gamma\frac{1}{d}\mathrm{tr}[\hat{\Sigma}_k(\lambda)]\mathrm{I}.$$ The factor $\frac{1}{d}\mathrm{tr}[\hat{\Sigma}_k(\lambda)]$ in front of the identity matrix I is the mean of the diagonal elements of $\hat{\Sigma}_k(\lambda)$, so it is the mean variance of all $d$ variables assuming the group covariance $\hat{\Sigma}_k(\lambda)$.
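A minimal base R sketch of the two convex combinations above, not the package's internal code. The function name `reg_cov` and the degrees-of-freedom weighting used for the pooled covariance are assumptions made for illustration.

```r
# Regularized covariance for one group:
# first shrink towards the pooled covariance (lambda),
# then towards a scaled identity matrix (gamma).
reg_cov <- function(S_k, S_pooled, gamma, lambda) {
  d <- ncol(S_k)
  S_lambda <- (1 - lambda) * S_k + lambda * S_pooled
  # mean(diag(S_lambda)) equals tr(S_lambda) / d
  (1 - gamma) * S_lambda + gamma * mean(diag(S_lambda)) * diag(d)
}

X <- as.matrix(iris[, 1:4])
g <- iris$Species

# Individual covariance estimates, one per group
S_groups <- lapply(split(as.data.frame(X), g), cov)

# Pooled covariance, weighted by degrees of freedom (assumed convention)
n_k <- as.vector(table(g))
S_pooled <- Reduce(`+`, Map(function(S, n) (n - 1) * S, S_groups, n_k)) /
  (nrow(X) - nlevels(g))

S_reg <- reg_cov(S_groups$setosa, S_pooled, gamma = 0.05, lambda = 0.2)
print(round(S_reg, 4))
```

Setting `gamma = 0` and `lambda = 0` returns the group covariance unchanged, while `gamma = 0` and `lambda = 1` returns the pooled covariance.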

For the four extremes of ($\gamma$,$\lambda$) the covariance structure reduces to special cases:

($\gamma=0$, $\lambda=0$): QDA - individual covariance for each group.
($\gamma=0$, $\lambda=1$): LDA - a common covariance matrix.
($\gamma=1$, $\lambda=0$): Conditionally independent variables - similar to Naive Bayes, but the variable variances within a group (main diagonal elements) are equal.
($\gamma=1$, $\lambda=1$): Classification using Euclidean distance - as in the previous case, but the variances are the same for all groups. Objects are assigned to the group with the nearest mean.
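The last case can be checked numerically. The sketch below uses base R only and does not call rda(); the common variance `c0` and the helper names `means`, `d2` and `score` are illustrative assumptions. With a covariance of the form c0 * I shared by all groups and equal priors, the Gaussian discriminant score is a decreasing function of the squared Euclidean distance to the group mean, so both rules pick the same group.

```r
X <- as.matrix(iris[, 1:4])
g <- iris$Species

# Group means, one row per class (rows follow the factor levels)
means <- t(sapply(split(as.data.frame(X), g), colMeans))

# Squared Euclidean distance of every observation to every group mean
d2 <- sapply(rownames(means), function(k) rowSums(sweep(X, 2, means[k, ])^2))

# Gaussian log-density score with covariance c0 * I and equal priors
c0 <- 0.3  # hypothetical common variance, chosen only for illustration
score <- -0.5 * d2 / c0 - 0.5 * ncol(X) * log(c0) + log(1 / nlevels(g))

pred_score <- colnames(score)[max.col(score, ties.method = "first")]
pred_dist  <- colnames(d2)[apply(d2, 1, which.min)]

# Both rules assign every observation to the same group
print(identical(pred_score, pred_dist))
```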

Details

J.H. Friedman (see references below) suggested a method to fix almost singular covariance matrices in discriminant analysis. Basically, individual covariances as in QDA are used, but depending on two parameters ($\gamma$ and $\lambda$), these can be shifted towards a diagonal matrix and/or the pooled covariance matrix. For ($\gamma=0$, $\lambda=0$) it equals QDA, for ($\gamma=0$, $\lambda=1$) it equals LDA.

You may fix these parameters at certain values or leave it to the function to try to find “optimal” values. If one parameter is given, the other one is determined using the R-function ‘optimize’. If no parameter is given, both are determined numerically by a Nelder-Mead-(Simplex-)algorithm with the option of using Simulated Annealing. The goal function to be minimized is the (estimated) misclassification rate; the misclassification rate is estimated either by Cross-Validation or by repeatedly dividing the data into training- and test-sets (Bootstrapping).
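The three modes can be compared side by side. This is a sketch that assumes rda() is provided by the klaR package; it uses only the arguments and components documented on this page, and the object names are illustrative.

```r
# Assumes the klaR package supplies rda(); skipped if it is not installed
if (requireNamespace("klaR", quietly = TRUE)) {
  set.seed(1)  # the error-rate estimate depends on random sample splits

  # Both parameters fixed: no optimization takes place
  fit_fixed <- klaR::rda(Species ~ ., data = iris, gamma = 0.05, lambda = 0.2)

  # Only gamma fixed: lambda is chosen by minimizing the estimated error rate
  fit_lambda <- klaR::rda(Species ~ ., data = iris, gamma = 0.05)

  # Neither fixed: both parameters are determined by Nelder-Mead
  fit_both <- klaR::rda(Species ~ ., data = iris)

  print(fit_fixed$regularization)
  print(fit_lambda$regularization)
  print(fit_both$regularization)
}
```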

Warning: If these sets are small, optimization is expected to produce almost random results. In such a case we recommend adjusting the parameters manually. In all other cases, run the optimization several times to see whether the results are stable.
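One way to check stability is to repeat the optimization under different random seeds and compare the selected parameters. The sketch below again assumes rda() comes from the klaR package; the number of repetitions and the name `runs` are arbitrary choices for illustration.

```r
# Assumes the klaR package supplies rda(); skipped if it is not installed
if (requireNamespace("klaR", quietly = TRUE)) {
  # Repeat the full optimization five times with different seeds
  runs <- t(sapply(1:5, function(i) {
    set.seed(i)
    klaR::rda(Species ~ ., data = iris)$regularization
  }))

  print(runs)                # one row per run: selected parameter values
  print(apply(runs, 2, sd))  # a large spread suggests unstable optimization
}
```

If the selected values vary widely between runs, fixing gamma and lambda manually is the safer option.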

Since the Nelder-Mead-algorithm is actually intended for continuous functions while the observed error rate by its nature is discrete, a greater number of Bootstrap-samples might improve the optimization by increasing the smoothness of the response surface (and, of course, by reducing variance and bias). If a set of parameters leads to singular covariance matrices, a penalty term is added to the misclassification rate which will hopefully help to maneuver back out of singularity (so do not worry about error rates greater than one during optimization).

References

Friedman, J.H. (1989): Regularized Discriminant Analysis. In: Journal of the American Statistical Association 84, 165-175.

Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T. (1992): Numerical Recipes in C. Cambridge: Cambridge University Press.

Examples

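library(klaR)  # assumes rda() is provided by the klaR package
data(iris)
# Fit RDA with both regularization parameters fixed manually
x <- rda(Species ~ ., data = iris, gamma = 0.05, lambda = 0.2)
# Predict classes for the training data
predict(x, iris)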