Computes the tetrachoric correlation for either a pair of binary variables or all pairwise combinations of binary columns in a matrix/data frame.
tetrachoric(data, y = NULL, correct = 0.5, check_na = TRUE)
# S3 method for tetrachoric_corr
print(
x,
digits = 4,
n = NULL,
topn = NULL,
max_vars = NULL,
width = NULL,
show_ci = NULL,
...
)
# S3 method for tetrachoric_corr
plot(
x,
title = "Tetrachoric correlation heatmap",
low_color = "indianred1",
high_color = "steelblue1",
mid_color = "white",
value_text_size = 4,
show_value = TRUE,
...
)
# S3 method for tetrachoric_corr
summary(
object,
n = NULL,
topn = NULL,
max_vars = NULL,
width = NULL,
show_ci = NULL,
...
)
If y is supplied, a numeric scalar with attributes
diagnostics and thresholds. Otherwise a symmetric matrix of
class tetrachoric_corr with attributes method,
description, package = "matrixCorr", diagnostics,
thresholds, and correct.
data: A binary vector, matrix, or data frame. In matrix/data-frame mode, only binary columns are retained.
y: Optional second binary vector. When supplied, the function returns a single tetrachoric correlation estimate.
correct: Non-negative continuity correction added to zero-count cells. Default is 0.5.
check_na: Logical (default TRUE). If TRUE, missing values are rejected. If FALSE, pairwise complete cases are used.
x: An object of class tetrachoric_corr.
digits: Integer; number of decimal places to print.
n: Optional row threshold for compact preview output.
topn: Optional number of leading/trailing rows to show when truncated.
max_vars: Optional maximum number of visible columns; NULL derives this from console width.
width: Optional display width; defaults to getOption("width").
show_ci: One of "yes" or "no".
...: Additional arguments passed to print().
title: Plot title. Default is "Tetrachoric correlation heatmap".
low_color: Color for the minimum correlation.
high_color: Color for the maximum correlation.
mid_color: Color for zero correlation.
value_text_size: Font size used in tile labels.
show_value: Logical; if TRUE (default), overlay numeric values on the heatmap tiles.
object: An object of class tetrachoric_corr.
Thiago de Paula Oliveira
The tetrachoric correlation assumes that the observed binary variables arise by dichotomising latent standard-normal variables. Let \(Z_1, Z_2 \sim N(0, 1)\) with latent correlation \(\rho\), and define observed binary variables by thresholds \(\tau_1, \tau_2\):
$$ X = \mathbf{1}\{Z_1 > \tau_1\}, \qquad Y = \mathbf{1}\{Z_2 > \tau_2\}. $$
If the observed \(2 \times 2\) table has counts \(n_{ij}\) for \(i,j \in \{0,1\}\), the marginal proportions determine the thresholds:
$$ \tau_1 = \Phi^{-1}\!\big(P(X = 0)\big), \qquad \tau_2 = \Phi^{-1}\!\big(P(Y = 0)\big). $$
The estimator returned here is the maximum-likelihood estimate of the latent correlation \(\rho\), obtained by maximizing the multinomial log-likelihood built from the rectangle probabilities of the bivariate normal distribution:
$$ \ell(\rho) = \sum_{i=0}^1 \sum_{j=0}^1 n_{ij}\log \pi_{ij}(\rho;\tau_1,\tau_2), $$
where \(\pi_{ij}\) are the four bivariate-normal cell probabilities implied by \(\rho\) and the fixed thresholds. The implementation, written in C++, maximizes the likelihood over \(\rho \in (-1,1)\) with a coarse search followed by Brent refinement.
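The estimation steps described above can be sketched in base R. This is an illustrative sketch only, not the package's C++ routine: the helper names pbvn, cell_probs, and tetrachoric_sketch are hypothetical, the bivariate-normal probability is obtained by one-dimensional numerical integration, and stats::optimize (which is based on Brent's method) stands in for the coarse search plus Brent refinement. No continuity correction is applied here.

```r
# Hypothetical helper: P(Z1 <= a, Z2 <= b) for a standard bivariate normal
# with correlation rho, via one-dimensional numerical integration.
pbvn <- function(a, b, rho) {
  integrand <- function(z) {
    stats::dnorm(z) * stats::pnorm((b - rho * z) / sqrt(1 - rho^2))
  }
  stats::integrate(integrand, lower = -Inf, upper = a)$value
}

# Hypothetical helper: 2 x 2 matrix of cell probabilities pi_ij,
# rows indexed by X = 0, 1 and columns by Y = 0, 1.
cell_probs <- function(rho, tau1, tau2) {
  p00 <- pbvn(tau1, tau2, rho)        # P(Z1 <= tau1, Z2 <= tau2)
  p01 <- stats::pnorm(tau1) - p00     # X = 0, Y = 1
  p10 <- stats::pnorm(tau2) - p00     # X = 1, Y = 0
  p11 <- 1 - p00 - p01 - p10          # X = 1, Y = 1
  matrix(c(p00, p01, p10, p11), 2, 2, byrow = TRUE)
}

# Hypothetical sketch of the estimator for a 2 x 2 table of counts
# (rows: X = 0, 1; columns: Y = 0, 1).
tetrachoric_sketch <- function(tab) {
  tab <- as.matrix(tab)
  n <- sum(tab)
  tau1 <- stats::qnorm(sum(tab[1, ]) / n)   # Phi^{-1}(P(X = 0))
  tau2 <- stats::qnorm(sum(tab[, 1]) / n)   # Phi^{-1}(P(Y = 0))
  neg_loglik <- function(rho) {
    p <- cell_probs(rho, tau1, tau2)
    -sum(tab * log(pmax(p, 1e-12)))         # guard against log(0)
  }
  # The search interval is an assumption chosen to keep the integrand smooth
  fit <- stats::optimize(neg_loglik, interval = c(-0.99, 0.99))
  list(rho = fit$minimum, thresholds = c(tau1, tau2))
}

tab <- matrix(c(40, 10,
                15, 35), 2, 2, byrow = TRUE)
tetrachoric_sketch(tab)
```

Because the thresholds are fixed from the margins before the search, the optimisation is one-dimensional, which is why a bracketing method such as Brent's is sufficient.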
The argument correct adds a continuity correction only to zero-count
cells before threshold estimation and likelihood evaluation. This stabilises
the estimator for sparse tables and mirrors the conventional
correct = 0.5 behaviour used in several psychometric implementations.
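As a rough illustration of a zero-cell-only correction, the sketch below replaces each zero count in a 2 x 2 table with the value of correct and leaves non-zero cells untouched. The helper name apply_zero_cell_correction is hypothetical, and whether the package applies the adjustment in exactly this way is an assumption.

```r
# Hypothetical helper: add `correct` to zero-count cells only
apply_zero_cell_correction <- function(tab, correct = 0.5) {
  stopifnot(is.numeric(correct), length(correct) == 1L, correct >= 0)
  tab <- as.matrix(tab)
  storage.mode(tab) <- "double"   # allow fractional counts
  tab[tab == 0] <- correct        # 0 + correct; other cells are unchanged
  tab
}

tab <- matrix(c(30,  0,
                12, 18), 2, 2, byrow = TRUE)
apply_zero_cell_correction(tab)               # zero cell is replaced by 0.5
apply_zero_cell_correction(tab, correct = 0)  # table is left as observed
```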
When correct = 0 and the observed contingency table contains zero
cells, the fit is non-regular and may be boundary-driven. In those cases the
returned object stores sparse-fit diagnostics, including whether the fit was
classified as boundary or near_boundary.
In matrix/data-frame mode, all pairwise tetrachoric correlations are computed
between binary columns. Diagonal entries are 1 for non-degenerate
columns and NA for columns with fewer than two observed levels.
Variable-specific latent thresholds are stored in the thresholds
attribute, and pairwise sparse-fit diagnostics are stored in
diagnostics.
Computational complexity. For \(p\) binary variables, the matrix path evaluates \(p(p-1)/2\) pairwise likelihoods. Each pair uses a one-dimensional optimisation with negligible memory overhead beyond the output matrix.
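The matrix path can be pictured as a loop over the \(p(p-1)/2\) column pairs. The sketch below is an assumption-laden illustration, not the package's code: pairwise_binary_corr and cos_pi_pair are hypothetical names, and the pair function is the crude cos-pi approximation \(\cos\{\pi / (1 + \sqrt{n_{00} n_{11} / (n_{01} n_{10})})\}\), used only as a cheap stand-in for the maximum-likelihood estimate. Columns with fewer than two observed levels get NA on the diagonal and in every pair.

```r
# Hypothetical stand-in for the pairwise estimator: cos-pi approximation
# computed from the 2 x 2 table (NOT the maximum-likelihood estimate).
cos_pi_pair <- function(x, y) {
  tab <- table(factor(as.logical(x), levels = c(FALSE, TRUE)),
               factor(as.logical(y), levels = c(FALSE, TRUE)))
  n <- as.numeric(tab)                 # column-major: n00, n10, n01, n11
  odds <- (n[1] * n[4]) / (n[2] * n[3])
  cos(pi / (1 + sqrt(odds)))
}

# Hypothetical sketch of the matrix path: fill a symmetric matrix pair by pair
pairwise_binary_corr <- function(X, pair_fun = cos_pi_pair) {
  X <- as.data.frame(X)
  p <- ncol(X)
  # A column is usable only if it shows exactly two observed levels
  ok <- vapply(X, function(col) length(unique(col[!is.na(col)])) == 2L,
               logical(1))
  R <- matrix(NA_real_, p, p, dimnames = list(names(X), names(X)))
  diag(R)[ok] <- 1
  if (p > 1) {
    for (i in seq_len(p - 1)) {
      for (j in (i + 1):p) {
        if (ok[i] && ok[j]) {
          R[i, j] <- R[j, i] <- pair_fun(X[[i]], X[[j]])
        }
      }
    }
  }
  R
}

X <- data.frame(
  a = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  b = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE),
  k = rep(TRUE, 8)                     # degenerate column: one level only
)
pairwise_binary_corr(X)
choose(ncol(X), 2)                     # number of pairs visited by the loop
```

Only the output matrix is stored between iterations, which matches the negligible memory overhead noted above.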
Pearson, K. (1900). Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable. Philosophical Transactions of the Royal Society A, 195, 1-47.
Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44(4), 443-460.
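library(matrixCorr)

# Simulate three latent standard-normal variables with known correlations
set.seed(123)
n <- 1000
Sigma <- matrix(c(
  1.00, 0.55, 0.35,
  0.55, 1.00, 0.45,
  0.35, 0.45, 1.00
), 3, 3, byrow = TRUE)
Z <- mnormt::rmnorm(n = n, mean = rep(0, 3), varcov = Sigma)

# Dichotomise each latent variable at its own threshold
X <- data.frame(
  item1 = Z[, 1] > stats::qnorm(0.70),
  item2 = Z[, 2] > stats::qnorm(0.60),
  item3 = Z[, 3] > stats::qnorm(0.50)
)

# Estimate all pairwise tetrachoric correlations, then inspect the result
tc <- tetrachoric(X)
print(tc, digits = 3)
summary(tc)
plot(tc)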
# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
view_corr_shiny(tc)
}
# latent Pearson correlations used to generate the binary items
round(stats::cor(Z), 2)