findCorrelation

Determine highly correlated variables

This function searches through a correlation matrix and returns a vector of integers corresponding to columns to remove to reduce pair-wise correlations.

Keywords
manip
Usage
findCorrelation(x, cutoff = 0.9, verbose = FALSE, names = FALSE,
  exact = ncol(x) < 100)
Arguments
x

A correlation matrix

cutoff

A numeric value for the pair-wise absolute correlation cutoff

verbose

A logical; should the details be printed?

names

a logical; should the column names be returned (TRUE) or the column index (FALSE)?

exact

a logical; should the average correlations be recomputed at each step? See Details below.
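
As a minimal sketch of preparing x: a correlation matrix is usually built with cor() on numeric columns only. The built-in iris data set is used here purely as a stand-in and is not part of the original documentation.

```r
# Illustration only: iris is a stand-in data set.
num_cols <- vapply(iris, is.numeric, logical(1))   # drop the factor column
x <- cor(iris[, num_cols], use = "pairwise.complete.obs")

dim(x)            # one row and one column per numeric variable
isSymmetric(x)    # a correlation matrix is symmetric
```

The use = "pairwise.complete.obs" setting only matters when the data contain missing values; whether it is the right choice depends on the data.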

Details

The absolute values of pair-wise correlations are considered. If two variables have a high correlation, the function looks at the mean absolute correlation of each variable and removes the variable with the largest mean absolute correlation.
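
The decision for a single pair can be illustrated in base R. This is a sketch of the rule as described above, not caret's code; the toy matrix R and the choice to exclude the diagonal from the mean are assumptions made for the illustration.

```r
# Toy 3 x 3 correlation matrix (hypothetical values).
R <- matrix(c(1.0, 0.8, 0.3,
              0.8, 1.0, 0.1,
              0.3, 0.1, 1.0),
            nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

A <- abs(R)
diag(A) <- NA                          # ignore each variable's correlation with itself
mean_abs <- colMeans(A, na.rm = TRUE)  # a = 0.55, b = 0.45, c = 0.20

# "a" and "b" are the highly correlated pair (|r| = 0.8);
# the one with the larger mean absolute correlation is flagged.
pair <- c("a", "b")
to_drop <- pair[which.max(mean_abs[pair])]
to_drop                                # "a"
```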

Using exact = TRUE will cause the function to re-evaluate the average correlations at each step while exact = FALSE uses all the correlations regardless of whether they have been eliminated or not. The exact calculations will remove a smaller number of predictors but can be much slower when the problem dimensions are "big".
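
The difference between recomputing the averages and using fixed averages can be sketched with a simplified greedy filter. This is not caret's implementation and may flag different columns than findCorrelation; the function name greedy_corr_filter, the recompute argument, and the choice to process the most correlated remaining pair first are all assumptions made for illustration.

```r
# Simplified greedy filter; hypothetical helper, NOT caret's algorithm.
greedy_corr_filter <- function(x, cutoff = 0.9, recompute = TRUE) {
  a <- abs(x)
  diag(a) <- NA                              # ignore self-correlations
  removed <- rep(FALSE, ncol(a))
  fixed_means <- colMeans(a, na.rm = TRUE)   # used when recompute = FALSE

  repeat {
    active <- a
    active[removed, ] <- NA                  # blank out already-flagged variables
    active[, removed] <- NA
    if (!any(active > cutoff, na.rm = TRUE)) break

    # Most correlated remaining pair (row index, column index).
    idx <- which(active == max(active, na.rm = TRUE), arr.ind = TRUE)[1, ]

    # Either re-evaluate the averages over remaining variables, or reuse the originals.
    means <- if (recompute) colMeans(active, na.rm = TRUE) else fixed_means
    removed[idx[which.max(means[idx])]] <- TRUE
  }
  which(removed)
}

# The 5 x 5 matrix from the Examples section.
R1 <- structure(c(1, 0.86, 0.56, 0.32, 0.85, 0.86, 1, 0.01, 0.74, 0.32,
                  0.56, 0.01, 1, 0.65, 0.91, 0.32, 0.74, 0.65, 1, 0.36,
                  0.85, 0.32, 0.91, 0.36, 1),
                .Dim = c(5L, 5L))

greedy_corr_filter(R1, cutoff = 0.6, recompute = TRUE)
greedy_corr_filter(R1, cutoff = 0.6, recompute = FALSE)
greedy_corr_filter(R1, cutoff = 0.99)        # nothing exceeds the cutoff
```

With recompute = TRUE the column means are taken over the variables still in play, which costs one extra pass over the matrix per removal; that extra pass is the source of the slowdown on large problems.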

There are several functions in the subselect package (leaps, genetic, anneal) that can also be used to accomplish the same goal but tend to retain more predictors.

Value

A vector of indices denoting the columns to remove (when names = FALSE); otherwise a vector of column names. If no correlations meet the criteria, integer(0) is returned.
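
The integer(0) case deserves care when the result is used for negative indexing, because in R an empty negative index selects no columns rather than all of them. The sketch below uses a hypothetical helper, safe_drop, and a stand-in value for the returned indices; neither is part of caret.

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Stand-in for the value returned when no pair exceeds the cutoff.
drop_idx <- integer(0)

ncol(df[, -drop_idx, drop = FALSE])   # 0: every column was lost

# Hypothetical helper: only drop when there is something to drop.
safe_drop <- function(data, idx) {
  if (length(idx) == 0) return(data)
  data[, -idx, drop = FALSE]
}

ncol(safe_drop(df, drop_idx))         # 3: data returned unchanged
names(safe_drop(df, c(1L, 3L)))       # "b"
```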

See Also

leaps, genetic, anneal, findLinearCombos

Aliases
  • findCorrelation
Examples
# NOT RUN {
library(caret)  # also attaches lattice, which provides levelplot()
R1 <- structure(c(1, 0.86, 0.56, 0.32, 0.85, 0.86, 1, 0.01, 0.74, 0.32, 
                  0.56, 0.01, 1, 0.65, 0.91, 0.32, 0.74, 0.65, 1, 0.36,
                  0.85, 0.32, 0.91, 0.36, 1), 
                .Dim = c(5L, 5L))
colnames(R1) <- rownames(R1) <- paste0("x", 1:ncol(R1))
R1

findCorrelation(R1, cutoff = .6, exact = FALSE)
findCorrelation(R1, cutoff = .6, exact = TRUE)
findCorrelation(R1, cutoff = .6, exact = TRUE, names = FALSE)


R2 <- diag(rep(1, 5))
R2[2, 3] <- R2[3, 2] <- .7
R2[5, 3] <- R2[3, 5] <- -.7
R2[4, 1] <- R2[1, 4] <- -.67

corrDF <- expand.grid(row = 1:5, col = 1:5)
corrDF$correlation <- as.vector(R2)
levelplot(correlation ~ row + col, corrDF)

findCorrelation(R2, cutoff = .65, verbose = TRUE)

findCorrelation(R2, cutoff = .99, verbose = TRUE)

# }
Documentation reproduced from package caret, version 6.0-80, License: GPL (>= 2)

Community examples

soymari125 at Jul 3, 2018 caret v6.0-80

# *Another example*
# calculate correlation matrix
> correlationMatrix <- cor(data4[,3:18])
> dim(correlationMatrix)
[1] 16 16
> # summarize the correlation matrix
> # find attributes that are highly correlated (ideally >0.75)
> print(correlationMatrix)
             basal         kurt           cl            th          pk           ra
basal  1.000000000 -0.047181714  0.431029995  0.3729320332 -0.01265766  0.373203406
kurt  -0.047181714  1.000000000 -0.014940999  0.1216412366  0.11793816  0.120720327
cl     0.431029995 -0.014940999  1.000000000  0.9576454368  0.10334894  0.956572659
th     0.372932033  0.121641237  0.957645437  1.0000000000  0.07389917  0.998605988
pk    -0.012657655  0.117938158  0.103348945  0.0738991747  1.00000000  0.073040802
ra     0.373203406  0.120720327  0.956572659  0.9986059878  0.07304080  1.000000000
ne     0.377067319  0.089514020  0.921709799  0.9438973869  0.06957701  0.944794152
zc     0.151460899 -0.025275545  0.351339622  0.2540941745  0.72209923  0.246130636
sbi    0.021139002  0.151290286  0.108751493  0.1599702995  0.06275704  0.158537149
spi   -0.007213781 -0.081052683 -0.041535791 -0.0732224698  0.00877137 -0.073350844
spr   -0.003240699 -0.035526419 -0.040623858 -0.0628132859 -0.02007257 -0.063255435
sc     0.022524454  0.231214505  0.264454041  0.3883356437  0.25904458  0.385160694
smad  -0.003918369 -0.003893825  0.002837923 -0.0008341529  0.01591010 -0.000796614
ssd   -0.011171919 -0.019674459 -0.045167014 -0.0742436729  0.02843364 -0.074679163
scr    0.022524454  0.231214505  0.264454041  0.3883356437  0.25904458  0.385160694
sf     0.002199715  0.232919191  0.102390876  0.1857397940  0.10900790  0.184206491
               ne           zc          sbi          spi          spr           sc
basal  0.37706732  0.151460899  0.021139002 -0.007213781 -0.003240699  0.022524454
kurt   0.08951402 -0.025275545  0.151290286 -0.081052683 -0.035526419  0.231214505
cl     0.92170980  0.351339622  0.108751493 -0.041535791 -0.040623858  0.264454041
th     0.94389739  0.254094175  0.159970299 -0.073222470 -0.062813286  0.388335644
pk     0.06957701  0.722099233  0.062757041  0.008771370 -0.020072568  0.259044576
ra     0.94479415  0.246130636  0.158537149 -0.073350844 -0.063255435  0.385160694
ne     1.00000000  0.196797081  0.107280715 -0.069163325 -0.052599448  0.285765198
zc     0.19679708  1.000000000  0.080471076  0.011499973 -0.006075214  0.179013205
sbi    0.10728071  0.080471076  1.000000000 -0.192516195 -0.004037006  0.424772641
spi   -0.06916333  0.011499973 -0.192516195  1.000000000  0.305876755 -0.063102361
spr   -0.05259945 -0.006075214 -0.004037006  0.305876755  1.000000000 -0.085859834
sc     0.28576520  0.179013205  0.424772641 -0.063102361 -0.085859834  1.000000000
smad   0.00331882  0.001497509 -0.017840083  0.015031269 -0.010531983 -0.004939186
ssd   -0.08686502  0.060474407  0.233744239  0.293263846  0.230609514  0.014486054
scr    0.28576520  0.179013205  0.424772641 -0.063102361 -0.085859834  1.000000000
sf     0.12856858  0.077335109  0.858225089 -0.202560755 -0.086086306  0.532576231
               smad          ssd          scr           sf
basal -0.0039183695 -0.011171919  0.022524454  0.002199715
kurt  -0.0038938248 -0.019674459  0.231214505  0.232919191
cl     0.0028379226 -0.045167014  0.264454041  0.102390876
th    -0.0008341529 -0.074243673  0.388335644  0.185739794
pk     0.0159101028  0.028433644  0.259044576  0.109007897
ra    -0.0007966140 -0.074679163  0.385160694  0.184206491
ne     0.0033188199 -0.086865024  0.285765198  0.128568582
zc     0.0014975095  0.060474407  0.179013205  0.077335109
sbi   -0.0178400832  0.233744239  0.424772641  0.858225089
spi    0.0150312687  0.293263846 -0.063102361 -0.202560755
spr   -0.0105319834  0.230609514 -0.085859834 -0.086086306
sc    -0.0049391857  0.014486054  1.000000000  0.532576231
smad   1.0000000000 -0.007504293 -0.004939186 -0.013226721
ssd   -0.0075042926  1.000000000  0.014486054  0.167001635
scr   -0.0049391857  0.014486054  1.000000000  0.532576231
sf    -0.0132267209  0.167001635  0.532576231  1.000000000
> highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.75,names=TRUE)
> # print indexes of highly correlated attributes
> print(highlyCorrelated)
[1] "th" "ra" "cl" "sc" "sf"