matrixCorr (version 0.10.0)

tetrachoric: Pairwise Tetrachoric Correlation

Description

Computes the tetrachoric correlation for either a pair of binary variables or all pairwise combinations of binary columns in a matrix/data frame.

Usage

tetrachoric(data, y = NULL, correct = 0.5, check_na = TRUE)

# S3 method for tetrachoric_corr
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

# S3 method for tetrachoric_corr
plot(
  x,
  title = "Tetrachoric correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  show_value = TRUE,
  ...
)

# S3 method for tetrachoric_corr
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Value

If y is supplied, a numeric scalar with attributes diagnostics and thresholds. Otherwise a symmetric matrix of class tetrachoric_corr with attributes method, description, package = "matrixCorr", diagnostics, thresholds, and correct.

Arguments

data

A binary vector, matrix, or data frame. In matrix/data-frame mode, only binary columns are retained.

y

Optional second binary vector. When supplied, the function returns a single tetrachoric correlation estimate.

correct

Non-negative continuity correction added to zero-count cells. Default is 0.5.

check_na

Logical (default TRUE). If TRUE, missing values are rejected. If FALSE, pairwise complete cases are used.

x

An object of class tetrachoric_corr.

digits

Integer; number of decimal places to print.

n

Optional row threshold for compact preview output.

topn

Optional number of leading/trailing rows to show when truncated.

max_vars

Optional maximum number of visible columns; NULL derives this from console width.

width

Optional display width; defaults to getOption("width").

show_ci

One of "yes" or "no".

...

Additional arguments passed to print().

title

Plot title. Default is "Tetrachoric correlation heatmap".

low_color

Color for the minimum correlation.

high_color

Color for the maximum correlation.

mid_color

Color for zero correlation.

value_text_size

Font size used in tile labels.

show_value

Logical; if TRUE (default), overlay numeric values on the heatmap tiles.

object

An object of class tetrachoric_corr.

Author

Thiago de Paula Oliveira

Details

The tetrachoric correlation assumes that the observed binary variables arise by dichotomising latent standard-normal variables. Let \(Z_1, Z_2 \sim N(0, 1)\) with latent correlation \(\rho\), and define observed binary variables by thresholds \(\tau_1, \tau_2\):

$$ X = \mathbf{1}\{Z_1 > \tau_1\}, \qquad Y = \mathbf{1}\{Z_2 > \tau_2\}. $$

If the observed \(2 \times 2\) table has counts \(n_{ij}\) for \(i,j \in \{0,1\}\), the marginal proportions determine the thresholds:

$$ \tau_1 = \Phi^{-1}\!\big(P(X = 0)\big), \qquad \tau_2 = \Phi^{-1}\!\big(P(Y = 0)\big). $$

The estimator returned here is the maximum-likelihood estimate of the latent correlation \(\rho\), obtained by maximizing the multinomial log-likelihood built from the rectangle probabilities of the bivariate normal distribution:

$$ \ell(\rho) = \sum_{i=0}^1 \sum_{j=0}^1 n_{ij}\log \pi_{ij}(\rho;\tau_1,\tau_2), $$

where \(\pi_{ij}\) are the four bivariate-normal cell probabilities implied by \(\rho\) and the fixed thresholds. The implementation evaluates the likelihood over \(\rho \in (-1,1)\) by a coarse search followed by Brent refinement in C++.
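
The estimator described above can be sketched in base R. This is an illustration only, not the package's C++ code: the helper names `pbvn()`, `tetra_loglik()`, and `tetra_sketch()` are hypothetical, the bivariate-normal probability is obtained by one-dimensional numerical integration, and `stats::optimize()` (a Brent-type one-dimensional optimiser) stands in for the coarse search plus Brent refinement. No continuity correction is applied.

```r
# P(Z1 <= a, Z2 <= b) for a standard bivariate normal with correlation rho.
# Conditional on Z1 = z, Z2 is N(rho * z, 1 - rho^2), so integrate over z.
pbvn <- function(a, b, rho) {
  integrand <- function(z) {
    stats::dnorm(z) * stats::pnorm((b - rho * z) / sqrt(1 - rho^2))
  }
  stats::integrate(integrand, lower = -Inf, upper = a, rel.tol = 1e-8)$value
}

# Multinomial log-likelihood for a 2 x 2 table `tab`
# (rows: X = 0, 1; columns: Y = 0, 1) with thresholds held fixed.
tetra_loglik <- function(rho, tab, tau1, tau2) {
  p00 <- pbvn(tau1, tau2, rho)          # X = 0, Y = 0
  p01 <- stats::pnorm(tau1) - p00       # X = 0, Y = 1
  p10 <- stats::pnorm(tau2) - p00       # X = 1, Y = 0
  p11 <- 1 - p00 - p01 - p10            # X = 1, Y = 1
  probs <- pmax(c(p00, p01, p10, p11), 1e-12)  # guard against log(0)
  counts <- c(tab[1, 1], tab[1, 2], tab[2, 1], tab[2, 2])
  sum(counts * log(probs))
}

# Thresholds from the margins, then a one-dimensional search over rho.
tetra_sketch <- function(tab) {
  n <- sum(tab)
  tau1 <- stats::qnorm(sum(tab[1, ]) / n)  # Phi^{-1}(P(X = 0))
  tau2 <- stats::qnorm(sum(tab[, 1]) / n)  # Phi^{-1}(P(Y = 0))
  fit <- stats::optimize(
    tetra_loglik, interval = c(-0.999, 0.999),
    tab = tab, tau1 = tau1, tau2 = tau2, maximum = TRUE
  )
  fit$maximum
}

tab <- matrix(c(40, 10,
                10, 40), 2, 2, byrow = TRUE)
rho_hat <- tetra_sketch(tab)
rho_hat
```

For this balanced table both thresholds are zero, where \(P(Z_1 \le 0, Z_2 \le 0) = 1/4 + \arcsin(\rho)/(2\pi)\), so the estimate should land close to \(\sin(0.3\pi) \approx 0.809\).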

The argument correct adds a continuity correction only to zero-count cells before threshold estimation and likelihood evaluation. This stabilises the estimator for sparse tables and mirrors the conventional correct = 0.5 behaviour used in several psychometric implementations. When correct = 0 and the observed contingency table contains zero cells, the fit is non-regular and may be boundary-driven. In those cases the returned object stores sparse-fit diagnostics, including whether the fit was classified as boundary or near_boundary.
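
As a minimal sketch of what "only to zero-count cells" means, the hypothetical helper `correct_zero_cells()` below replaces each empty cell of a 2 x 2 table with the correction value and leaves the other counts untouched. How the package treats the adjusted total internally is not documented here, so this shows the idea rather than the exact implementation.

```r
# Replace zero counts by `correct`; non-zero cells are left as observed.
correct_zero_cells <- function(tab, correct = 0.5) {
  stopifnot(is.numeric(correct), length(correct) == 1, correct >= 0)
  tab[tab == 0] <- correct
  tab
}

# A sparse table: no observations with X = 0 and Y = 1.
sparse_tab <- matrix(c(30,  0,
                        5, 15), 2, 2, byrow = TRUE)
tab_adj <- correct_zero_cells(sparse_tab)
tab_adj

# Thresholds computed from the adjusted margins stay finite.
tau1_adj <- stats::qnorm(sum(tab_adj[1, ]) / sum(tab_adj))
tau2_adj <- stats::qnorm(sum(tab_adj[, 1]) / sum(tab_adj))
```

With `correct = 0` the table is returned unchanged, which corresponds to the non-regular case described above.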

In matrix/data-frame mode, all pairwise tetrachoric correlations are computed between binary columns. Diagonal entries are 1 for non-degenerate columns and NA for columns with fewer than two observed levels. Variable-specific latent thresholds are stored in the thresholds attribute, and pairwise sparse-fit diagnostics are stored in diagnostics.

Computational complexity. For \(p\) binary variables, the matrix path evaluates \(p(p-1)/2\) pairwise likelihoods. Each pair uses a one-dimensional optimisation with negligible memory overhead beyond the output matrix.
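
The pair count can be illustrated with a small loop over the upper triangle. The helper `pairwise_matrix()` is hypothetical, and `stats::cor()` (the Pearson correlation of the 0/1 codes) is used purely as a stand-in for the pairwise tetrachoric fit, so the values are not tetrachoric correlations; only the loop structure and the number of evaluations are the point.

```r
# Fill a symmetric p x p matrix by evaluating `pair_fun` once per pair (i < j).
pairwise_matrix <- function(X, pair_fun) {
  p <- ncol(X)
  out <- diag(1, p)
  dimnames(out) <- list(colnames(X), colnames(X))
  n_evals <- 0L
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      out[i, j] <- out[j, i] <- pair_fun(X[, i], X[, j])
      n_evals <- n_evals + 1L
    }
  }
  list(R = out, n_evals = n_evals)
}

set.seed(1)
X_demo <- matrix(stats::rbinom(200 * 4, size = 1, prob = 0.5), 200, 4,
                 dimnames = list(NULL, paste0("item", 1:4)))
res <- pairwise_matrix(X_demo, stats::cor)  # stand-in pairwise function
res$n_evals                                 # one evaluation per pair
```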

References

Pearson, K. (1900). Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable. Philosophical Transactions of the Royal Society A, 195, 1-47.

Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44(4), 443-460.

Examples

set.seed(123)
n <- 1000
Sigma <- matrix(c(
  1.00, 0.55, 0.35,
  0.55, 1.00, 0.45,
  0.35, 0.45, 1.00
), 3, 3, byrow = TRUE)

Z <- mnormt::rmnorm(n = n, mean = rep(0, 3), varcov = Sigma)
X <- data.frame(
  item1 = Z[, 1] > stats::qnorm(0.70),
  item2 = Z[, 2] > stats::qnorm(0.60),
  item3 = Z[, 3] > stats::qnorm(0.50)
)

tc <- tetrachoric(X)
print(tc, digits = 3)
summary(tc)
plot(tc)

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(tc)
}

# latent Pearson correlations used to generate the binary items
round(stats::cor(Z), 2)