findCorrelation
Determine highly correlated variables
This function searches through a correlation matrix and returns a vector of integers corresponding to columns to remove to reduce pair-wise correlations.
- Keywords
- manip
Usage
findCorrelation(
x,
cutoff = 0.9,
verbose = FALSE,
names = FALSE,
exact = ncol(x) < 100
)
Arguments
- x
A correlation matrix
- cutoff
A numeric value for the pair-wise absolute correlation cutoff
- verbose
A boolean for printing the details
- names
A logical; should the column names be returned (TRUE) or the column index (FALSE)?
- exact
A logical; should the average correlations be recomputed at each step? See Details below.
Details
The absolute values of pair-wise correlations are considered. If two variables have a high correlation, the function looks at the mean absolute correlation of each variable and removes the variable with the largest mean absolute correlation.
Using exact = TRUE will cause the function to re-evaluate the average correlations at each step, while exact = FALSE uses all the correlations regardless of whether they have been eliminated or not. The exact calculations will remove a smaller number of predictors but can be much slower when the problem dimensions are "big".
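The pairwise rule described above can be illustrated with a short base R sketch. This is a simplified illustration written for this page, not the package's implementation: the function name sketch_find_correlation and the order in which pairs are visited are assumptions, so its results may differ from findCorrelation on larger matrices.

```r
# Simplified sketch of the heuristic (hypothetical helper, not caret's code).
# For every pair above the cutoff, drop the variable with the larger mean
# absolute correlation. With exact = TRUE the means are recomputed over the
# variables that are still retained; otherwise they are computed once.
sketch_find_correlation <- function(x, cutoff = 0.9, exact = TRUE) {
  a <- abs(x)
  diag(a) <- NA                                 # ignore self-correlations
  p <- ncol(a)
  removed <- rep(FALSE, p)
  static_means <- colMeans(a, na.rm = TRUE)     # used when exact = FALSE
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      if (removed[i] || removed[j]) next        # pair already resolved
      if (a[i, j] <= cutoff) next               # not highly correlated
      if (exact) {
        keep <- !removed
        mean_i <- mean(a[i, keep], na.rm = TRUE)
        mean_j <- mean(a[j, keep], na.rm = TRUE)
      } else {
        mean_i <- static_means[i]
        mean_j <- static_means[j]
      }
      if (mean_i > mean_j) removed[i] <- TRUE else removed[j] <- TRUE
    }
  }
  which(removed)
}

# Variables 1 and 2 are correlated at 0.95. Their mean absolute correlations
# are 0.575 and 0.775, so variable 2 is flagged for removal.
m <- matrix(c(1,    0.95, 0.2,
              0.95, 1,    0.6,
              0.2,  0.6,  1), nrow = 3, byrow = TRUE)
sketch_find_correlation(m, cutoff = 0.9, exact = TRUE)   # 2
sketch_find_correlation(m, cutoff = 0.99)                # integer(0)
```

On a matrix this small, both settings of exact give the same answer. Differences only appear once earlier removals change the averages used for later pairs.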
There are several functions in the subselect package (leaps, genetic, anneal) that can also be used to accomplish the same goal but tend to retain more predictors.
Value
A vector of indices denoting the columns to remove (when names = FALSE); otherwise, a vector of column names. If no correlations meet the criteria, integer(0) is returned.
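Because the result can be empty, some care is needed when using it to drop columns: negative indexing with an empty vector selects no columns at all. The sketch below shows one way to guard against that in base R. The helper name drop_flagged is hypothetical, and flagged stands in for a value returned by findCorrelation (either indices or column names).

```r
# Hypothetical helper: remove the flagged columns, leaving the data
# untouched when nothing was flagged.
drop_flagged <- function(data, flagged) {
  if (length(flagged) == 0) return(data)        # nothing to remove
  if (is.character(flagged)) {
    # names = TRUE style result: keep every column not named in `flagged`
    data[, setdiff(colnames(data), flagged), drop = FALSE]
  } else {
    # names = FALSE style result: negative integer indexing
    data[, -flagged, drop = FALSE]
  }
}

df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# The pitfall: -integer(0) is still integer(0), which selects zero columns
ncol(df[, -integer(0), drop = FALSE])   # 0

ncol(drop_flagged(df, integer(0)))      # 3
colnames(drop_flagged(df, 2L))          # "a" "c"
colnames(drop_flagged(df, "b"))         # "a" "c"
```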
Examples
# NOT RUN {
library(caret)
library(lattice)  # provides levelplot()

# A 5 x 5 correlation matrix with several pairs above the cutoff
R1 <- structure(c(1, 0.86, 0.56, 0.32, 0.85,
                  0.86, 1, 0.01, 0.74, 0.32,
                  0.56, 0.01, 1, 0.65, 0.91,
                  0.32, 0.74, 0.65, 1, 0.36,
                  0.85, 0.32, 0.91, 0.36, 1),
                .Dim = c(5L, 5L))
colnames(R1) <- rownames(R1) <- paste0("x", 1:ncol(R1))
R1

# Compare the fast and exact searches; column indices are returned by default
findCorrelation(R1, cutoff = .6, exact = FALSE)
findCorrelation(R1, cutoff = .6, exact = TRUE)
findCorrelation(R1, cutoff = .6, exact = TRUE, names = FALSE)

# An identity matrix with three correlated pairs added
R2 <- diag(rep(1, 5))
R2[2, 3] <- R2[3, 2] <- .7
R2[5, 3] <- R2[3, 5] <- -.7
R2[4, 1] <- R2[1, 4] <- -.67

# Plot the correlations
corrDF <- expand.grid(row = 1:5, col = 1:5)
corrDF$correlation <- as.vector(R2)
levelplot(correlation ~ row + col, corrDF)

# verbose = TRUE prints the details of the search
findCorrelation(R2, cutoff = .65, verbose = TRUE)
findCorrelation(R2, cutoff = .99, verbose = TRUE)
# }
Community examples
# *Another example*
# calculate correlation matrix
> correlationMatrix <- cor(data4[,3:18])
> dim(correlationMatrix)
[1] 16 16
> # summarize the correlation matrix
> # find attributes that are highly correlated (ideally >0.75)
> print(correlationMatrix)
             basal         kurt           cl            th          pk           ra
basal  1.000000000 -0.047181714  0.431029995  0.3729320332 -0.01265766  0.373203406
kurt  -0.047181714  1.000000000 -0.014940999  0.1216412366  0.11793816  0.120720327
cl     0.431029995 -0.014940999  1.000000000  0.9576454368  0.10334894  0.956572659
th     0.372932033  0.121641237  0.957645437  1.0000000000  0.07389917  0.998605988
pk    -0.012657655  0.117938158  0.103348945  0.0738991747  1.00000000  0.073040802
ra     0.373203406  0.120720327  0.956572659  0.9986059878  0.07304080  1.000000000
ne     0.377067319  0.089514020  0.921709799  0.9438973869  0.06957701  0.944794152
zc     0.151460899 -0.025275545  0.351339622  0.2540941745  0.72209923  0.246130636
sbi    0.021139002  0.151290286  0.108751493  0.1599702995  0.06275704  0.158537149
spi   -0.007213781 -0.081052683 -0.041535791 -0.0732224698  0.00877137 -0.073350844
spr   -0.003240699 -0.035526419 -0.040623858 -0.0628132859 -0.02007257 -0.063255435
sc     0.022524454  0.231214505  0.264454041  0.3883356437  0.25904458  0.385160694
smad  -0.003918369 -0.003893825  0.002837923 -0.0008341529  0.01591010 -0.000796614
ssd   -0.011171919 -0.019674459 -0.045167014 -0.0742436729  0.02843364 -0.074679163
scr    0.022524454  0.231214505  0.264454041  0.3883356437  0.25904458  0.385160694
sf     0.002199715  0.232919191  0.102390876  0.1857397940  0.10900790  0.184206491
               ne           zc          sbi          spi          spr           sc
basal  0.37706732  0.151460899  0.021139002 -0.007213781 -0.003240699  0.022524454
kurt   0.08951402 -0.025275545  0.151290286 -0.081052683 -0.035526419  0.231214505
cl     0.92170980  0.351339622  0.108751493 -0.041535791 -0.040623858  0.264454041
th     0.94389739  0.254094175  0.159970299 -0.073222470 -0.062813286  0.388335644
pk     0.06957701  0.722099233  0.062757041  0.008771370 -0.020072568  0.259044576
ra     0.94479415  0.246130636  0.158537149 -0.073350844 -0.063255435  0.385160694
ne     1.00000000  0.196797081  0.107280715 -0.069163325 -0.052599448  0.285765198
zc     0.19679708  1.000000000  0.080471076  0.011499973 -0.006075214  0.179013205
sbi    0.10728071  0.080471076  1.000000000 -0.192516195 -0.004037006  0.424772641
spi   -0.06916333  0.011499973 -0.192516195  1.000000000  0.305876755 -0.063102361
spr   -0.05259945 -0.006075214 -0.004037006  0.305876755  1.000000000 -0.085859834
sc     0.28576520  0.179013205  0.424772641 -0.063102361 -0.085859834  1.000000000
smad   0.00331882  0.001497509 -0.017840083  0.015031269 -0.010531983 -0.004939186
ssd   -0.08686502  0.060474407  0.233744239  0.293263846  0.230609514  0.014486054
scr    0.28576520  0.179013205  0.424772641 -0.063102361 -0.085859834  1.000000000
sf     0.12856858  0.077335109  0.858225089 -0.202560755 -0.086086306  0.532576231
               smad          ssd          scr           sf
basal -0.0039183695 -0.011171919  0.022524454  0.002199715
kurt  -0.0038938248 -0.019674459  0.231214505  0.232919191
cl     0.0028379226 -0.045167014  0.264454041  0.102390876
th    -0.0008341529 -0.074243673  0.388335644  0.185739794
pk     0.0159101028  0.028433644  0.259044576  0.109007897
ra    -0.0007966140 -0.074679163  0.385160694  0.184206491
ne     0.0033188199 -0.086865024  0.285765198  0.128568582
zc     0.0014975095  0.060474407  0.179013205  0.077335109
sbi   -0.0178400832  0.233744239  0.424772641  0.858225089
spi    0.0150312687  0.293263846 -0.063102361 -0.202560755
spr   -0.0105319834  0.230609514 -0.085859834 -0.086086306
sc    -0.0049391857  0.014486054  1.000000000  0.532576231
smad   1.0000000000 -0.007504293 -0.004939186 -0.013226721
ssd   -0.0075042926  1.000000000  0.014486054  0.167001635
scr   -0.0049391857  0.014486054  1.000000000  0.532576231
sf    -0.0132267209  0.167001635  0.532576231  1.000000000
> highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.75,names=TRUE)
> # print names of highly correlated attributes
> print(highlyCorrelated)
[1] "th" "ra" "cl" "sc" "sf"