caret (version 4.87)

findCorrelation: Determine highly correlated variables

Description

This function searches through a correlation matrix and returns a vector of integers corresponding to columns to remove to reduce pair-wise correlations.

Usage

findCorrelation(x, cutoff = .90, verbose = FALSE)

Arguments

x
A correlation matrix
cutoff
A numeric value for the pairwise absolute correlation cutoff
verbose
A logical value; if TRUE, the details are printed
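
As an illustration of the kind of input x expects, a correlation matrix can be computed with base R's cor(). The built-in mtcars data set is used here only as an assumed example; any data frame of numeric columns would do.

```r
# Pairwise Pearson correlations between the numeric columns of a built-in data set.
# mtcars is an illustrative choice, not something this help page prescribes.
corr_mat <- cor(mtcars)

dim(corr_mat)          # square: one row and one column per variable
isSymmetric(corr_mat)  # a correlation matrix is symmetric
```

The resulting matrix could then be passed as the x argument.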

Value

  • A vector of indices denoting the columns to remove. If no correlations meet the criteria, numeric(0) is returned.
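
Because the result can be empty, some care is needed when using it for negative indexing: in R, indexing with a negated empty vector selects nothing rather than everything. The following is a minimal sketch of a guard; drop_flagged is a hypothetical helper written for this example and is not part of caret.

```r
# Hypothetical helper (not part of caret): drop the flagged columns,
# returning the data unchanged when nothing was flagged.
drop_flagged <- function(data, idx) {
  if (length(idx) == 0) {
    return(data)
  }
  data[, -idx, drop = FALSE]
}

df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

kept_all  <- drop_flagged(df, integer(0))  # nothing flagged: all columns kept
kept_some <- drop_flagged(df, 2L)          # column "b" removed

# The pitfall the guard avoids: a negated empty index selects no columns
unguarded <- df[, -integer(0), drop = FALSE]
```

Here idx stands in for the vector returned by findCorrelation.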

Details

The absolute values of pair-wise correlations are considered. If two variables have a high correlation, the function looks at the mean absolute correlation of each variable and removes the variable with the largest mean absolute correlation.
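
The heuristic described above can be sketched in base R as follows. This is a simplified greedy illustration, not caret's actual implementation: the function name sketch_find_correlation, the strict "greater than cutoff" comparison, the order in which pairs are visited, and the handling of ties are all assumptions made for the example.

```r
# Hypothetical sketch, NOT caret's code: repeatedly take the most correlated
# remaining pair and drop the member with the larger mean absolute correlation.
sketch_find_correlation <- function(x, cutoff = 0.90) {
  abs_x <- abs(x)
  diag(abs_x) <- NA                 # ignore self-correlations
  removed <- integer(0)
  repeat {
    keep <- setdiff(seq_len(ncol(abs_x)), removed)
    if (length(keep) < 2) break
    sub <- abs_x[keep, keep, drop = FALSE]
    if (!any(sub > cutoff, na.rm = TRUE)) break
    # Position (within the kept variables) of the most correlated pair
    pair <- which(sub == max(sub, na.rm = TRUE), arr.ind = TRUE)[1, ]
    # Mean absolute correlation of each kept variable with the others
    mean_abs <- rowMeans(sub, na.rm = TRUE)
    drop_pos <- if (mean_abs[pair[1]] >= mean_abs[pair[2]]) pair[1] else pair[2]
    removed <- c(removed, keep[drop_pos])
  }
  sort(removed)
}

# Same structure as the matrix used in the Examples section
m <- diag(5)
m[2, 3] <- m[3, 2] <- 0.7
m[5, 3] <- m[3, 5] <- -0.7
m[4, 1] <- m[1, 4] <- -0.67

drop_idx <- sketch_find_correlation(m, cutoff = 0.65)
none_idx <- sketch_find_correlation(m, cutoff = 0.99)
```

With the lower cutoff, variable 3 is flagged because it is correlated with both 2 and 5 and so has the largest mean absolute correlation; with the higher cutoff, nothing is flagged. The indices returned by findCorrelation itself may differ from this sketch when ties occur.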

Examples

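library(caret)    # provides findCorrelation()
library(lattice)  # provides levelplot()

# Build a 5 x 5 correlation matrix with three correlated pairs
corrMatrix <- diag(rep(1, 5))
corrMatrix[2, 3] <- corrMatrix[3, 2] <- .7
corrMatrix[5, 3] <- corrMatrix[3, 5] <- -.7
corrMatrix[4, 1] <- corrMatrix[1, 4] <- -.67

# Reshape to long format and plot the matrix
corrDF <- expand.grid(row = 1:5, col = 1:5)
corrDF$correlation <- as.vector(corrMatrix)
levelplot(correlation ~ row + col, corrDF)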

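# Cutoff below the largest absolute correlations: columns are flagged
findCorrelation(corrMatrix, cutoff = .65, verbose = TRUE)

# Cutoff above every off-diagonal correlation: nothing is flagged
findCorrelation(corrMatrix, cutoff = .99, verbose = TRUE)

# Sanity checks: the retained variables are uncorrelated, and a high cutoff flags nothing
removeCols <- findCorrelation(corrMatrix, cutoff = .65, verbose = FALSE)
if (!isTRUE(all.equal(corrMatrix[-removeCols, -removeCols], diag(rep(1, 3))))) stop("test 1 failed")
if (!isTRUE(all.equal(findCorrelation(corrMatrix, .99, verbose = FALSE), numeric(0)))) stop("test 2 failed")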
