biserial: Biserial Correlation Between Continuous and Binary Variables

Description

Computes biserial correlations between the continuous variables in data and the binary variables in y. Two modes are supported: a pairwise mode for two vectors, and a rectangular mode for matrices or data frames.

Usage

biserial(data, y, check_na = TRUE)
# S3 method for biserial_corr
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)
# S3 method for biserial_corr
plot(
  x,
  title = "Biserial correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  show_value = TRUE,
  ...
)
# S3 method for biserial_corr
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Value

If both data and y are vectors, a numeric scalar. Otherwise a numeric matrix of class biserial_corr with rows corresponding to the continuous variables in data and columns to the binary variables in y. Matrix outputs carry attributes method, description, and package = "matrixCorr".

Arguments

data: A numeric vector, matrix, or data frame containing continuous variables.
y: A binary vector, matrix, or data frame. In data-frame mode, only two-level columns are retained.
check_na: Logical (default TRUE). If TRUE, missing values are rejected. If FALSE, pairwise complete cases are used.
x: An object of class biserial_corr.
digits: Integer; number of decimal places to print.
n: Optional row threshold for compact preview output.
topn: Optional number of leading/trailing rows to show when truncated.
max_vars: Optional maximum number of visible columns; NULL derives this from console width.
width: Optional display width; defaults to getOption("width").
show_ci: One of "yes" or "no".
...: Additional arguments passed to print().
title: Plot title. Default is "Biserial correlation heatmap".
low_color: Color for the minimum correlation.
high_color: Color for the maximum correlation.
mid_color: Color for zero correlation.
value_text_size: Font size used in tile labels.
show_value: Logical; if TRUE (default), overlay numeric values on the heatmap tiles.
object: An object of class biserial_corr.

Author

Thiago de Paula Oliveira

Details

The biserial correlation is the special two-category case of the polyserial model. It assumes that a binary variable $Y$ arises by thresholding an unobserved standard-normal variable $Z$ that is jointly normal with a continuous variable $X$.

Writing $p = P(Y = 1)$ and $q = 1-p$, let $z_p = \Phi^{-1}(p)$ and let $\phi(z_p)$ be the standard-normal density evaluated at $z_p$. If $\bar x_1$ and $\bar x_0$ denote the sample means of $X$ in the two observed groups and $s_x$ is the sample standard deviation of $X$, the usual biserial estimator is

$$ r_b = \frac{\bar x_1 - \bar x_0}{s_x} \frac{pq}{\phi(z_p)}. $$

This is exactly the estimator implemented in the underlying C++ kernel.
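
The estimator can be reproduced by hand in base R. The following is only a sketch: the helper name biserial_by_hand is hypothetical and not part of the package, y is assumed to be logical or coded 0/1 with no missing values, and stats::sd() (the n - 1 denominator) is assumed for $s_x$, which may differ from the convention used in the C++ kernel.

```r
# Hypothetical helper (not part of matrixCorr): closed-form biserial
# estimate for one continuous vector x and one binary vector y.
# Assumes y is logical or 0/1 and that neither vector contains NA.
biserial_by_hand <- function(x, y) {
  y <- as.integer(y)
  p <- mean(y == 1)                # sample proportion with Y = 1
  q <- 1 - p
  z_p <- stats::qnorm(p)           # z_p = Phi^{-1}(p)
  mean_diff <- mean(x[y == 1]) - mean(x[y == 0])
  (mean_diff / stats::sd(x)) * (p * q / stats::dnorm(z_p))
}

# Simulate the latent-threshold model with a latent correlation of 0.6.
set.seed(1)
z <- stats::rnorm(2000)                                # latent variable
x <- 0.6 * z + sqrt(1 - 0.6^2) * stats::rnorm(2000)    # continuous variable
y <- z > stats::qnorm(0.6)                             # thresholded latent

r_b <- biserial_by_hand(x, y)
r_b   # an estimate of the latent correlation used in the simulation
```

Because y is generated by thresholding z, the estimate targets the correlation between x and the latent z rather than the correlation between x and the observed 0/1 indicator.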

In vector mode a single biserial correlation is returned. In matrix/data-frame mode, every numeric column of data is paired with every binary column of y, producing a rectangular matrix of continuous-by-binary biserial correlations.

Unlike the point-biserial correlation, which is just Pearson correlation on a 0/1 coding of the binary variable, the biserial coefficient explicitly assumes an underlying latent normal threshold model and rescales the mean difference accordingly.
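
The rescaling can be checked numerically in base R. This is a sketch under assumptions: it computes the biserial value from the closed-form estimator with stats::sd() rather than by calling biserial(), and the factor sqrt((n - 1) / n) appears only because sd() divides by n - 1.

```r
set.seed(2)
n <- 500
z <- stats::rnorm(n)                                  # latent variable
x <- 0.5 * z + sqrt(1 - 0.5^2) * stats::rnorm(n)      # continuous variable
y <- as.integer(z > stats::qnorm(0.7))                # 0/1 indicator

p <- mean(y)
q <- 1 - p
phi <- stats::dnorm(stats::qnorm(p))                  # phi(z_p)

# Point-biserial: plain Pearson correlation with the 0/1 coding.
r_pb <- stats::cor(x, y)

# Biserial: closed-form estimator with the n - 1 standard deviation.
r_b <- (mean(x[y == 1]) - mean(x[y == 0])) / stats::sd(x) * p * q / phi

# The two coefficients differ only by a multiplicative rescaling.
rescaled <- r_pb * sqrt(p * q) / phi * sqrt((n - 1) / n)
isTRUE(all.equal(r_b, rescaled))
```

Since sqrt(pq) / phi(z_p) is always greater than 1, the biserial coefficient is larger in magnitude than the point-biserial coefficient computed on the same data.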

Computational complexity. If data has $p_x$ continuous columns and y has $p_y$ binary columns, the matrix path computes $p_x p_y$ closed-form estimates with negligible extra memory beyond the output matrix.
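
The shape of the matrix path can be illustrated with a plain R double loop. This is not the package implementation, which runs in C++; the helpers biserial_pair and biserial_grid are hypothetical names, and dropping incomplete pairs with stats::complete.cases() is an assumption about what pairwise complete cases means here.

```r
# Hypothetical helper: biserial estimate for one pair, using only the
# observations where both x and y are present.
biserial_pair <- function(x, y) {
  keep <- stats::complete.cases(x, y)
  x <- x[keep]
  y <- as.integer(y[keep])
  p <- mean(y)
  (mean(x[y == 1]) - mean(x[y == 0])) / stats::sd(x) *
    p * (1 - p) / stats::dnorm(stats::qnorm(p))
}

# Hypothetical helper: one closed-form estimate per (continuous, binary)
# pair, so p_x * p_y estimates and no storage beyond the output matrix.
biserial_grid <- function(X, Y) {
  out <- matrix(NA_real_, ncol(X), ncol(Y),
                dimnames = list(names(X), names(Y)))
  for (i in seq_len(ncol(X))) {
    for (j in seq_len(ncol(Y))) {
      out[i, j] <- biserial_pair(X[[i]], Y[[j]])
    }
  }
  out
}

set.seed(3)
z <- stats::rnorm(300)
X <- data.frame(
  x1 = z + stats::rnorm(300),
  x2 = stats::rnorm(300),
  x3 = -z + stats::rnorm(300)
)
Y <- data.frame(g1 = z > 0, g2 = z > 1)

res <- biserial_grid(X, Y)
dim(res)   # rows are continuous variables, columns are binary variables
```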

References

Olsson, U., Drasgow, F., & Dorans, N. J. (1982). The polyserial correlation coefficient. Psychometrika, 47(3), 337-347.

Examples

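set.seed(126)
n <- 1000

# Correlation matrix for four jointly normal variables
Sigma <- matrix(c(
  1.00, 0.35, 0.50, 0.25,
  0.35, 1.00, 0.30, 0.55,
  0.50, 0.30, 1.00, 0.40,
  0.25, 0.55, 0.40, 1.00
), 4, 4, byrow = TRUE)

# Columns 1-2 are kept as continuous variables; columns 3-4 are
# thresholded to give the binary variables
Z <- mnormt::rmnorm(n = n, mean = rep(0, 4), varcov = Sigma)
X <- data.frame(x1 = Z[, 1], x2 = Z[, 2])
Y <- data.frame(
  g1 = Z[, 3] > stats::qnorm(0.65),
  g2 = Z[, 4] > stats::qnorm(0.55)
)

# 2 x 2 matrix of continuous-by-binary biserial correlations
bs <- biserial(X, Y)
print(bs, digits = 3)
summary(bs)
plot(bs)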
