Computes biserial correlations between continuous variables in data
and binary variables in y. Both pairwise vector mode and rectangular
matrix/data-frame mode are supported.
biserial(data, y, check_na = TRUE)
# S3 method for biserial_corr
print(
x,
digits = 4,
n = NULL,
topn = NULL,
max_vars = NULL,
width = NULL,
show_ci = NULL,
...
)
# S3 method for biserial_corr
plot(
x,
title = "Biserial correlation heatmap",
low_color = "indianred1",
high_color = "steelblue1",
mid_color = "white",
value_text_size = 4,
show_value = TRUE,
...
)
# S3 method for biserial_corr
summary(
object,
n = NULL,
topn = NULL,
max_vars = NULL,
width = NULL,
show_ci = NULL,
...
)
If both data and y are vectors, a numeric scalar. Otherwise a
numeric matrix of class biserial_corr with rows corresponding to
the continuous variables in data and columns to the binary variables
in y. Matrix outputs carry attributes method,
description, and package = "matrixCorr".
data: A numeric vector, matrix, or data frame containing continuous variables.
y: A binary vector, matrix, or data frame. In data-frame mode, only two-level columns are retained.
check_na: Logical (default TRUE). If TRUE, missing values are rejected. If FALSE, pairwise complete cases are used.
x: An object of class biserial_corr.
digits: Integer; number of decimal places to print.
n: Optional row threshold for compact preview output.
topn: Optional number of leading/trailing rows to show when truncated.
max_vars: Optional maximum number of visible columns; NULL derives this from console width.
width: Optional display width; defaults to getOption("width").
show_ci: One of "yes" or "no".
...: Additional arguments passed to print().
title: Plot title. Default is "Biserial correlation heatmap".
low_color: Color for the minimum correlation.
high_color: Color for the maximum correlation.
mid_color: Color for zero correlation.
value_text_size: Font size used in tile labels.
show_value: Logical; if TRUE (default), overlay numeric values on the heatmap tiles.
object: An object of class biserial_corr.
Thiago de Paula Oliveira
The biserial correlation is the special two-category case of the polyserial model. It assumes that a binary variable \(Y\) arises by thresholding an unobserved standard-normal variable \(Z\) that is jointly normal with a continuous variable \(X\). Writing \(p = P(Y = 1)\) and \(q = 1-p\), let \(z_p = \Phi^{-1}(p)\) and \(\phi(z_p)\) be the standard-normal density evaluated at \(z_p\). If \(\bar x_1\) and \(\bar x_0\) denote the sample means of \(X\) in the two observed groups and \(s_x\) is the sample standard deviation of \(X\), the usual biserial estimator is
$$
r_b = \frac{\bar x_1 - \bar x_0}{s_x} \frac{pq}{\phi(z_p)}.
$$
This is exactly the estimator implemented in the underlying C++ kernel.
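The closed-form estimator can be reproduced in a few lines of base R. The following is a minimal sketch rather than the package's implementation: `biserial_by_hand()` is a hypothetical helper, and it assumes that \(p\) is estimated by the sample proportion of `Y = 1` and that \(s_x\) uses the \(n - 1\) denominator of `stats::sd()`.

```r
# Hypothetical helper (not part of matrixCorr): biserial estimate from the
# closed-form expression r_b = (xbar1 - xbar0) / s_x * p * q / phi(z_p).
biserial_by_hand <- function(x, y) {
  y <- as.logical(y)
  p <- mean(y)             # sample proportion with Y = 1
  q <- 1 - p
  z_p <- stats::qnorm(p)   # Phi^{-1}(p)
  s_x <- stats::sd(x)      # assumption: n - 1 denominator
  (mean(x[y]) - mean(x[!y])) / s_x * (p * q) / stats::dnorm(z_p)
}

# Simulate a latent-threshold model with latent correlation 0.5.
set.seed(1)
n <- 5000
rho <- 0.5
z <- stats::rnorm(n)                              # latent variable Z
x <- rho * z + sqrt(1 - rho^2) * stats::rnorm(n)  # continuous variable X
y <- z > stats::qnorm(0.6)                        # observed binary Y

r_b <- biserial_by_hand(x, y)
r_b
```

Under this simulation the estimate should land close to the latent correlation of 0.5, which is the quantity the biserial coefficient targets.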
In vector mode a single biserial correlation is returned. In
matrix/data-frame mode, every numeric column of data is paired with every
binary column of y, producing a rectangular matrix of
continuous-by-binary biserial correlations.
Unlike the point-biserial correlation, which is simply the Pearson correlation computed on a 0/1 coding of the binary variable, the biserial coefficient explicitly assumes an underlying latent normal threshold model and rescales the mean difference accordingly.
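The rescaling can be made concrete: the point-biserial correlation equals \(\sqrt{pq}\,(\bar x_1 - \bar x_0)/s_x\) when \(s_x\) uses the \(n\) denominator, so the biserial coefficient is approximately the point-biserial value multiplied by \(\sqrt{pq}/\phi(z_p)\). The sketch below illustrates this in base R; the variable names are illustrative, and the small \(n\) versus \(n - 1\) difference in \(s_x\) is ignored.

```r
# Simulate a latent-threshold model with latent correlation 0.6.
set.seed(2)
n <- 2000
z <- stats::rnorm(n)
x <- 0.6 * z + 0.8 * stats::rnorm(n)
y <- z > stats::qnorm(0.7)   # about 30% of cases have Y = 1

p <- mean(y)
q <- 1 - p

# Point-biserial: plain Pearson correlation on the 0/1 coding.
r_pb <- stats::cor(x, as.numeric(y))

# Rescaling factor implied by the latent-normal threshold model.
scale_factor <- sqrt(p * q) / stats::dnorm(stats::qnorm(p))
r_b <- r_pb * scale_factor

c(point_biserial = r_pb, biserial = r_b, factor = scale_factor)
```

The factor is smallest at \(p = 0.5\) (about 1.25) and grows as the split becomes more unbalanced, so the biserial coefficient is always larger in magnitude than the point-biserial one.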
Computational complexity. If data has \(p_x\) continuous
columns and y has \(p_y\) binary columns, the matrix path computes
\(p_x p_y\) closed-form estimates with negligible extra memory beyond the
output matrix.
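The rectangular layout can be sketched as a double loop that fills a \(p_x \times p_y\) matrix, one closed-form estimate per cell. This is an illustration in base R, not the package's C++ kernel; `biserial_pair()` and `biserial_grid()` are hypothetical helpers, and missing values are assumed to be absent.

```r
# Hypothetical helper: closed-form biserial estimate for one pair.
biserial_pair <- function(x, y) {
  y <- as.logical(y)
  p <- mean(y)
  (mean(x[y]) - mean(x[!y])) / stats::sd(x) *
    p * (1 - p) / stats::dnorm(stats::qnorm(p))
}

# Hypothetical helper: pair every column of X with every column of Y.
biserial_grid <- function(X, Y) {
  out <- matrix(NA_real_, nrow = ncol(X), ncol = ncol(Y),
                dimnames = list(names(X), names(Y)))
  for (i in seq_len(ncol(X))) {
    for (j in seq_len(ncol(Y))) {
      out[i, j] <- biserial_pair(X[[i]], Y[[j]])
    }
  }
  out
}

set.seed(3)
n <- 500
X <- data.frame(x1 = stats::rnorm(n), x2 = stats::rnorm(n), x3 = stats::rnorm(n))
Y <- data.frame(
  g1 = X$x1 + stats::rnorm(n) > 0,    # related to x1
  g2 = X$x2 + stats::rnorm(n) > 0.5   # related to x2
)

res <- biserial_grid(X, Y)
dim(res)
```

Here the result has three rows (one per continuous column) and two columns (one per binary column), and the only storage beyond the inputs is the output matrix itself.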
Olsson, U., Drasgow, F., & Dorans, N. J. (1982). The polyserial correlation coefficient. Psychometrika, 47(3), 337-347.
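set.seed(126)
n <- 1000
# Correlation matrix of four jointly normal latent variables
Sigma <- matrix(c(
  1.00, 0.35, 0.50, 0.25,
  0.35, 1.00, 0.30, 0.55,
  0.50, 0.30, 1.00, 0.40,
  0.25, 0.55, 0.40, 1.00
), 4, 4, byrow = TRUE)
Z <- mnormt::rmnorm(n = n, mean = rep(0, 4), varcov = Sigma)
# Columns 1 and 2 are kept as the observed continuous variables
X <- data.frame(x1 = Z[, 1], x2 = Z[, 2])
# Columns 3 and 4 are dichotomised at fixed normal quantiles,
# so about 35% and 45% of cases are TRUE, respectively
Y <- data.frame(
  g1 = Z[, 3] > stats::qnorm(0.65),
  g2 = Z[, 4] > stats::qnorm(0.55)
)
# 2 x 2 matrix: rows x1, x2 and columns g1, g2
bs <- biserial(X, Y)
print(bs, digits = 3)
summary(bs)
plot(bs)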