bicor: Biweight Midcorrelation

Description

Calculate biweight midcorrelation efficiently for matrices.

Usage

bicor(x, y = NULL, 
      robustX = TRUE, robustY = TRUE, 
      use = "all.obs", 
      maxPOutliers = 1,
      quick = 0,
      pearsonFallback = "individual",
      nThreads = 0, 
      verbose = 0, indent = 0)

Arguments

x, y

vectors or matrix-like numeric objects. If y is NULL (the default), correlations among the columns of x are calculated.

robustX

use robust calculation for x?

robustY

use robust calculation for y?

use

specifies handling of NAs. One of (unique abbreviations of) "all.obs", "pairwise.complete.obs".

maxPOutliers

specifies the maximum percentile of the data that can be considered outliers on either side of the median separately. For each side of the median, if a higher percentile than maxPOutliers is considered an outlier by the weight function based on

quick

real number between 0 and 1 that controls the handling of missing data in the calculation of correlations. See details.

nThreads

non-negative integer specifying the number of parallel threads to be used by certain parts of correlation calculations. This option only has an effect on systems on which a POSIX thread library is available (which currently includes Linux and Mac OSX, b

pearsonFallback

Specifies whether the bicor calculation should revert to Pearson when the median absolute deviation (mad) is zero. Recognized values are (abbreviations of) "none", "individual", "all". If set to "none", zero mad will result in <

verbose

if non-zero, the underlying C function will print some diagnostics.

indent

indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces.
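The robustX, robustY and pearsonFallback arguments all deal with the case in which the median absolute deviation of a column is zero. The base R check below is a minimal sketch of why that case is a problem. It assumes the usual biweight scaling constant of 9 and the raw (unscaled) MAD, neither of which is stated on this page, and the vector b is made up for illustration.

```r
# A binary variable in which more than half of the values are identical:
# its median is 0, and so is its median absolute deviation.
b <- c(0, 0, 0, 0, 0, 0, 1, 1, 1)
mad_b <- mad(b, constant = 1)  # raw MAD, without the normal-consistency factor

# The biweight weights are built from the scaled deviations
# (b - median) / (9 * MAD). Assumption: the constant 9 is the common choice.
# With a zero MAD this divides by zero, so no weights can be computed.
u <- (b - median(b)) / (9 * mad_b)

print(mad_b)
print(u)
```

For a column like this, setting the corresponding robust argument to FALSE, or relying on pearsonFallback, avoids the undefined weights.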

Value

A matrix of biweight midcorrelations. Dimnames on the result are set appropriately.
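A short usage sketch, assuming bicor is the function exported by the WGCNA package and that the package is installed. The matrices and their column names (gene1, traitA and so on) are invented for illustration; the call is skipped when the package is unavailable.

```r
set.seed(42)
# Hypothetical data: 100 observations of 4 variables and of 2 traits.
x <- matrix(rnorm(100 * 4), nrow = 100,
            dimnames = list(NULL, paste0("gene", 1:4)))
y <- matrix(rnorm(100 * 2), nrow = 100,
            dimnames = list(NULL, c("traitA", "traitB")))
x[3, 2] <- NA  # one missing value, so the default use = "all.obs" is not suitable

if (requireNamespace("WGCNA", quietly = TRUE)) {
  # Correlations among the columns of x.
  r_xx <- WGCNA::bicor(x, use = "pairwise.complete.obs")
  # Correlations between the columns of x and the columns of y.
  r_xy <- WGCNA::bicor(x, y, use = "pairwise.complete.obs",
                       maxPOutliers = 0.05)
  print(dim(r_xx))
  print(dimnames(r_xy))
}
```

The row and column names of the result should follow the column names of x and y.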

Details

This function implements the biweight midcorrelation calculation (see References). If y is not supplied, the midcorrelation between the columns of x is calculated; otherwise, the midcorrelation between the columns of x and the columns of y is calculated. Thus, bicor(x) is equivalent to bicor(x, x) but is more efficient.

The options robustX and robustY allow the user to revert the calculation to the standard correlation calculation. This is important, for example, if any of the variables is binary (or, more generally, discrete), since in such cases the robust methods produce meaningless results. If both robustX and robustY are set to FALSE, the function calculates the standard Pearson correlation (but is slower than the function cor).

The argument quick specifies the precision with which missing data are handled in the correlation calculations. The value quick = 0 causes all calculations to be executed accurately, which may be significantly slower than calculations without missing data. Progressively higher values speed up the calculations but introduce progressively larger errors.

Without missing data, all column medians and median absolute deviations (MADs) can be pre-calculated before the covariances are calculated. When missing data are present, exact calculations require the column medians and MADs to be recalculated for each covariance. The approximate calculation uses the pre-calculated median and MAD and simply ignores missing data in the covariance calculation. If the number of missing entries is high, the pre-calculated medians and MADs may be very different from the actual ones, potentially introducing large errors. The quick value times the number of rows specifies the maximum difference that will be tolerated, before a recalculation is triggered, between the number of missing entries in the median and MAD calculations on the one hand and in the covariance calculation on the other. The hope is that if only a few missing data are treated approximately, the error introduced will be small while the potential speedup can be significant.

The choice "all" for pearsonFallback is not fully implemented, in the sense that there are rare but possible cases in which the calculation is equivalent to "individual". This may happen if the use option is set to "pairwise.complete.obs" and the missing data are arranged such that each individual mad is non-zero but, when two columns are analyzed together, the missing data from both columns make a mad zero. In such a case, the affected calculation is treated as Pearson, while other columns are still treated as bicor.
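To make the quantity being computed concrete, here is a minimal base R sketch of the biweight midcorrelation of two complete numeric vectors. It follows the common textbook definition, assuming a tuning constant of 9 and the raw (unscaled) MAD; it is not the package's C implementation and ignores missing data, maxPOutliers and the Pearson fallback. The name bicor_sketch and the simulated data are hypothetical.

```r
# Biweight midcorrelation of two numeric vectors without missing values.
# Assumptions: tuning constant 9 and raw MAD (constant = 1).
bicor_sketch <- function(x, y) {
  bw_center <- function(v) {
    med <- median(v)
    md <- mad(v, center = med, constant = 1)  # raw median absolute deviation
    if (md == 0) stop("MAD is zero; the biweight weights are undefined")
    u <- (v - med) / (9 * md)
    w <- ifelse(abs(u) < 1, (1 - u^2)^2, 0)   # far outliers get weight 0
    a <- (v - med) * w                        # weighted, median-centered values
    a / sqrt(sum(a^2))                        # scale to unit length
  }
  sum(bw_center(x) * bw_center(y))
}

set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.3)
y_out <- y
y_out[1] <- 100  # a single gross outlier

# Compare the standard and the robust correlation on the contaminated data.
print(c(pearson = cor(x, y_out), bicor = bicor_sketch(x, y_out)))
```

Because the outlier receives weight zero, it drops out of the robust estimate, whereas it dominates the Pearson correlation.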

References

"Dealing with Outliers in Bivariate Data: Robust Correlation", Rich Herrington, http://www.unt.edu/benchmarks/archives/2001/december01/rss.htm

"Introduction to Robust Estimation and Hypothesis Testing", Rand Wilcox, Academic Press, 1997.

"Data Analysis and Regression: A Second Course in Statistics", Mosteller and Tukey, Addison-Wesley, 1977, pp. 203-209.