matrixCorr (version 0.10.0)

biserial: Biserial Correlation Between Continuous and Binary Variables

Description

Computes biserial correlations between continuous variables in data and binary variables in y. Both pairwise vector mode and rectangular matrix/data-frame mode are supported.

Usage

biserial(data, y, check_na = TRUE)

# S3 method for biserial_corr
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

# S3 method for biserial_corr
plot(
  x,
  title = "Biserial correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  show_value = TRUE,
  ...
)

# S3 method for biserial_corr
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Value

If both data and y are vectors, a numeric scalar. Otherwise a numeric matrix of class biserial_corr with rows corresponding to the continuous variables in data and columns to the binary variables in y. Matrix outputs carry attributes method, description, and package = "matrixCorr".

Arguments

data

A numeric vector, matrix, or data frame containing continuous variables.

y

A binary vector, matrix, or data frame. In data-frame mode, only two-level columns are retained.

check_na

Logical (default TRUE). If TRUE, missing values are rejected. If FALSE, pairwise complete cases are used.

x

An object of class biserial_corr.

digits

Integer; number of decimal places to print.

n

Optional row threshold for compact preview output.

topn

Optional number of leading/trailing rows to show when truncated.

max_vars

Optional maximum number of visible columns; NULL derives this from console width.

width

Optional display width; defaults to getOption("width").

show_ci

One of "yes" or "no".

...

Additional arguments passed to print().

title

Plot title. Default is "Biserial correlation heatmap".

low_color

Color for the minimum correlation.

high_color

Color for the maximum correlation.

mid_color

Color for zero correlation.

value_text_size

Font size used in tile labels.

show_value

Logical; if TRUE (default), overlay numeric values on the heatmap tiles.

object

An object of class biserial_corr.

Author

Thiago de Paula Oliveira

Details

The biserial correlation is the special two-category case of the polyserial model. It assumes that a binary variable \(Y\) arises by thresholding an unobserved standard-normal variable \(Z\) that is jointly normal with a continuous variable \(X\). Writing \(p = P(Y = 1)\) and \(q = 1-p\), let \(z_p = \Phi^{-1}(p)\) and \(\phi(z_p)\) be the standard-normal density evaluated at \(z_p\). If \(\bar x_1\) and \(\bar x_0\) denote the sample means of \(X\) in the two observed groups and \(s_x\) is the sample standard deviation of \(X\), the usual biserial estimator is

$$ r_b = \frac{\bar x_1 - \bar x_0}{s_x} \, \frac{pq}{\phi(z_p)}. $$

This is exactly the estimator implemented in the underlying C++ kernel.
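The estimator above can be reproduced in a few lines of base R. The following is a minimal sketch, not the package's implementation: the helper name biserial_by_hand is hypothetical, it assumes y is coded 0/1 with no missing values, and it uses stats::sd() (the n - 1 denominator) for \(s_x\), which may differ from the convention used internally.

```r
# Hypothetical helper, not part of matrixCorr: the closed-form biserial estimator.
biserial_by_hand <- function(x, y) {
  stopifnot(length(x) == length(y), all(y %in% c(0, 1)))
  p  <- mean(y == 1)                  # sample proportion with Y = 1
  q  <- 1 - p
  zp <- stats::qnorm(p)               # z_p = Phi^{-1}(p)
  mean_diff <- mean(x[y == 1]) - mean(x[y == 0])
  (mean_diff / stats::sd(x)) * (p * q / stats::dnorm(zp))
}

# Simulate from the latent threshold model with a known latent correlation.
set.seed(1)
n   <- 5000
rho <- 0.5
z <- stats::rnorm(n)                               # latent standard-normal variable
x <- rho * z + sqrt(1 - rho^2) * stats::rnorm(n)   # cor(x, z) is approximately rho
y <- as.integer(z > stats::qnorm(0.6))             # threshold giving P(Y = 1) near 0.4

r_b <- biserial_by_hand(x, y)
r_b  # expected to land near the latent correlation rho
```

Because the data are generated from the assumed latent-normal model, the estimate should recover the latent correlation up to sampling error.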

In vector mode a single biserial correlation is returned. In matrix/data-frame mode, every numeric column of data is paired with every binary column of y, producing a rectangular matrix of continuous-by-binary biserial correlations.

Unlike the point-biserial correlation, which is just Pearson correlation on a 0/1 coding of the binary variable, the biserial coefficient explicitly assumes an underlying latent normal threshold model and rescales the mean difference accordingly.
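The rescaling can be made concrete by comparing the two coefficients on the same data. The sketch below uses only base R and simulated data; the variable names are illustrative. It relies on the algebraic identity that Pearson correlation on a 0/1 coding equals \((\bar x_1 - \bar x_0)\sqrt{pq}\) divided by the standard deviation of \(X\) computed with an \(n\) denominator, so the two coefficients differ by the factor \(\sqrt{pq}/\phi(z_p)\), plus a small \(\sqrt{(n-1)/n}\) correction when \(s_x\) is computed with stats::sd().

```r
set.seed(2)
n <- 2000
z <- stats::rnorm(n)                              # latent variable
x <- 0.4 * z + sqrt(1 - 0.4^2) * stats::rnorm(n)  # continuous variable
y <- as.integer(z > 0.5)                          # observed binary variable

p <- mean(y)
q <- 1 - p
phi_zp <- stats::dnorm(stats::qnorm(p))           # phi(z_p)

# Point-biserial: plain Pearson correlation on the 0/1 coding.
r_pb <- stats::cor(x, y)

# Biserial from the closed-form estimator, with s_x from sd() (n - 1 denominator).
r_b_formula <- (mean(x[y == 1]) - mean(x[y == 0])) / stats::sd(x) * p * q / phi_zp

# The same value obtained by rescaling the point-biserial coefficient.
# sqrt((n - 1) / n) appears only because sd() uses n - 1 while Pearson's r
# does not depend on the choice of denominator.
r_b_via_pb <- r_pb * sqrt(p * q) / phi_zp * sqrt((n - 1) / n)

c(point_biserial = r_pb, biserial = r_b_formula, rescaled = r_b_via_pb)
```

The factor \(\sqrt{pq}/\phi(z_p)\) is greater than 1, so the biserial coefficient is larger in magnitude than the point-biserial coefficient computed from the same data.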

Computational complexity. If data has \(p_x\) continuous columns and y has \(p_y\) binary columns, the matrix path computes \(p_x p_y\) closed-form estimates with negligible extra memory beyond the output matrix.
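The pairing of every continuous column with every binary column can be sketched as a plain double loop. This is an illustration of the \(p_x p_y\) structure under stated assumptions, not the package's C++ kernel: pair_biserial is a hypothetical helper, the binary columns are assumed to be logical, and no missing values are present.

```r
# Hypothetical helper (not the package's kernel): closed-form estimate for one pair.
# Assumes y is a logical vector with both TRUE and FALSE present.
pair_biserial <- function(x, y) {
  p <- mean(y)
  (mean(x[y]) - mean(x[!y])) / stats::sd(x) *
    p * (1 - p) / stats::dnorm(stats::qnorm(p))
}

set.seed(3)
n <- 500
X <- data.frame(x1 = stats::rnorm(n), x2 = stats::rnorm(n), x3 = stats::rnorm(n))
Y <- data.frame(
  g1 = X$x1 + stats::rnorm(n) > 0,     # binary variable driven by x1
  g2 = X$x2 + stats::rnorm(n) > 0.3    # binary variable driven by x2
)

# One closed-form estimate per (continuous, binary) pair: p_x * p_y cells,
# with the output matrix as the only allocation that grows with the input.
out <- matrix(NA_real_, nrow = ncol(X), ncol = ncol(Y),
              dimnames = list(names(X), names(Y)))
for (i in seq_along(X)) {
  for (j in seq_along(Y)) {
    out[i, j] <- pair_biserial(X[[i]], Y[[j]])
  }
}
out
```

Rows follow the continuous columns and columns follow the binary columns, matching the layout described for the matrix output.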

References

Olsson, U., Drasgow, F., & Dorans, N. J. (1982). The polyserial correlation coefficient. Psychometrika, 47(3), 337-347.

Examples

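library(matrixCorr)

set.seed(126)
n <- 1000

# Correlation matrix of four latent standard-normal variables.
Sigma <- matrix(c(
  1.00, 0.35, 0.50, 0.25,
  0.35, 1.00, 0.30, 0.55,
  0.50, 0.30, 1.00, 0.40,
  0.25, 0.55, 0.40, 1.00
), 4, 4, byrow = TRUE)

# Requires the mnormt package for multivariate normal draws.
Z <- mnormt::rmnorm(n = n, mean = rep(0, 4), varcov = Sigma)

# Columns 1 and 2 are kept as continuous variables.
X <- data.frame(x1 = Z[, 1], x2 = Z[, 2])

# Columns 3 and 4 are thresholded into binary variables.
Y <- data.frame(
  g1 = Z[, 3] > stats::qnorm(0.65),
  g2 = Z[, 4] > stats::qnorm(0.55)
)

# 2 x 2 matrix: rows x1, x2 and columns g1, g2.
bs <- biserial(X, Y)
print(bs, digits = 3)
summary(bs)
plot(bs)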