Compute pairwise correlations between two or more variables in a survey
design, with design-based standard errors and confidence intervals. Returns
results in long or wide format. The estimator is selected by method:
"pearson" (default) for two numeric variables, "polychoric" for two
ordinal variables under a bivariate-normal latent model (Olsson 1979),
or "polyserial" for one ordinal + one continuous variable (Cox 1974).
The survey-weighted polychoric and polyserial estimators (point estimates
and design-based variance) are implemented from scratch following
Mannan (2025); they are not derived from the survey package, which does
not provide these estimators.
get_corr(
design,
x,
group = NULL,
format = c("long", "wide"),
redundant = FALSE,
diagonal = FALSE,
variance = "ci",
conf_level = 0.95,
n_weighted = FALSE,
decimals = NULL,
min_cell_n = 30L,
na.rm = TRUE,
label_values = TRUE,
label_vars = TRUE,
name_style = "surveycore",
method = "pearson",
...,
.id = NULL,
.if_missing_var = NULL
)A survey_corr tibble (also inheriting survey_result).
When group is active, group variable columns are prepended before all
other columns in both long and wide formats.
Long format columns:
[group_cols...] — group variable columns (when active), first.
var1, var2 — variable names (or labels when
label_vars = TRUE).
r — Pearson correlation coefficient.
Variance columns (se, var, cv, ci_low, ci_high, moe,
deff) — only those requested via variance.
p_value — two-tailed p-value.
statistic — t-statistic.
df — degrees of freedom for the t-test (n minus 2).
n — pairwise unweighted count.
n_weighted — pairwise sum of weights (only when requested).
Wide format columns:
[group_cols...] — group variable columns (when active), first.
variable — row variable names (or labels).
One column per focal variable, containing r values.
Use meta(result) to access design type, variable labels, and
method ("pearson", "polychoric", or "polyserial"). For
method != "pearson", meta(result)$bivariate_normal_cdf is
"pbivnorm" (the bivariate-normal CDF used internally). When the
replicate variance path observed one or more non-converged replicates,
meta(result)$n_failed_replicates_total carries the scalar total.
A survey design object: survey_taylor, survey_replicate,
survey_twophase, or survey_nonprob. method values "polychoric"
and "polyserial" are supported on survey_taylor and
survey_replicate only; other design classes raise
surveycore_error_polychoric_design_unsupported.
<tidy-select> Two or more unquoted
variable names. For method = "pearson", non-numeric columns are dropped
with a warning. For method = "polychoric", every selected column must
classify as ordinal (ordered factor, unordered factor, or integer with
<= 10 distinct values) — non-ordinal columns raise
surveycore_error_polychoric_requires_ordinal. For
method = "polyserial", each pair is canonicalized by type (one ordinal
one continuous); logical / character / high-cardinality integer
columns raise surveycore_error_polyserial_canonicalization_ambiguous.
<tidy-select> Optional grouping
variable(s). Combined with any grouping set by group_by(). Default
NULL.
"long" (default) or "wide". Long format returns one row
per variable pair with inference statistics. Wide format returns the
correlation matrix (r values only — no variance or inference columns).
When group is active, group columns are prepended in both formats.
Case-sensitive.
Logical. If FALSE (default), each pair appears once
(lower triangle: pairs where var1 precedes var2 in input order). If
TRUE, both (A, B) and (B, A) are included (full directed pairs).
Only affects long format; wide format always shows the full symmetric
matrix.
Logical. If FALSE (default), self-correlations are
excluded (diagonal is NA in wide format). If TRUE, self-correlations
(r equals 1) are included.
NULL or a character vector of one or more of "se",
"ci", "var", "cv", "moe", "deff". Default "ci". CI bounds
use the Fisher Z transform (guaranteeing bounds in (-1, 1)). Only
applies to long format.
Numeric scalar in (0, 1). Default 0.95.
Logical. If TRUE, add an n_weighted column with the
pairwise sum of weights (both variables non-NA). Default FALSE.
Integer or NULL. If an integer, rounds all numeric output
columns (e.g., r, se, ci_low, ci_high) to this many decimal
places. Default NULL (no rounding).
Integer. Minimum pairwise unweighted count before
surveycore_warning_small_cell fires. Default 30L (AAPOR guidance).
Logical. If TRUE (default), pairs use complete cases for
each variable pair separately (pairwise deletion), and observations where
any group variable is NA are excluded from the output. If FALSE,
pairwise complete cases are still used for each variable pair, and
observations where a group variable is NA are collected into their own
group row in the output (appearing after all non-NA group rows).
Logical. If TRUE (default) and the grouping variable
has value labels, the group column is converted to a labelled factor.
Has no visible effect when no groups are active.
Logical. If TRUE (default) and variable labels are
set in metadata, var1/var2 columns (long) and variable column
(wide) show labels instead of raw names. Falls back to raw names if
labels are unset.
"surveycore" (default) or "broom". When "broom",
renames r → estimate, se → std.error, etc. Only affects long
format.
Character(1). Estimator applied to every pair. One of
"pearson" (default, sample-based product-moment correlation),
"polychoric" (MLE under a bivariate-normal latent model for two
ordinal variables), or "polyserial" (MLE for one ordinal + one
continuous variable). The same method applies to every pair; it
cannot be vectorised. Non-matching values raise the standard
base::match.arg() signal.
Unused. Reserved so that .id and .if_missing_var remain
named-only when a survey_collection is passed as design.
Character(1) or NULL. Column name used to identify each
survey when design is a survey_collection. For collection inputs,
NULL (the default) resolves to the collection's stored @id property.
Pass a non-NULL value to override. Ignored when design is a single
survey.
"error", "skip", or NULL. How to handle
surveys in a collection that lack one of the requested NSE variables.
For collection inputs, NULL (the default) resolves to the collection's
stored @if_missing_var property. Pass a non-NULL value to override.
Ignored when design is a single survey.
Polychoric / polyserial semantics. For method != "pearson", each pair
is fit by a two-step MLE: weighted marginal thresholds (and, for
polyserial, a weighted standardization of the continuous side) are
estimated first, then rho is maximised over the weighted
log-likelihood via stats::optimize() on (-1 + 1e-6, 1 - 1e-6).
Confidence intervals are constructed on the Fisher-z scale
(atanh(rho)) and back-transformed via tanh with truncation to
[-1, 1]. The Wald statistic zeta.hat / SE(zeta.hat) is referred to
a standard normal distribution, so df = NA_integer_ — distinct from
the Pearson case where df = n - 2 and the t-distribution is used.
Column label attributes are method-neutral (e.g. "statistic", not
"t-statistic" / "z-statistic"); check meta(result)$method to
interpret the values.
Bivariate-normal assumption. The polychoric / polyserial MLEs assume the underlying latent variables are jointly bivariate-normal. This is an unverified assumption; no runtime diagnostic is performed.
Taylor-path cost. On a survey_taylor design, the variance path
for method != "pearson" is O(n) re-optimisations per variable pair
(a perturbation-based influence function). For large n and many
pairs, passing a survey_replicate design (one re-fit per replicate,
not per respondent) is substantially faster.
Replicate-type caveat. Mannan (2025) verifies the replicate-weight
variance formula for jackknife and bootstrap replicates. BRR and Fay
replicates are admitted mechanically via the design's stored scale
/ rscales coefficients, but the paper does not validate their
behaviour for this non-linear pseudo-likelihood estimator.
Cox, N. R. (1974). Estimation of the correlation between a continuous and a discrete variable. Biometrics, 30(1), 171-178.
Mannan, H. (2025). SAS programs for estimation of weighted polychoric and weighted polyserial correlations in a complex survey. SSRN. tools:::Rd_expr_doi("10.2139/ssrn.6580480")
Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44(4), 443-460.
Other analysis:
clean(),
get_anova(),
get_covariance(),
get_diffs(),
get_freqs(),
get_means(),
get_pairwise(),
get_quantiles(),
get_ratios(),
get_t_test(),
get_totals(),
get_variance(),
meta()
d <- as_survey(nhanes_2017, ids = sdmvpsu, weights = wtint2yr,
strata = sdmvstra, nest = TRUE)
get_corr(d, x = c(ridageyr, bpxsy1))
# Wide correlation matrix
get_corr(d, x = c(ridageyr, bpxsy1), format = "wide")
# AAPOR-compliant
get_corr(d, x = c(ridageyr, bpxsy1),
variance = c("ci", "moe"), n_weighted = TRUE)
# Polychoric correlation between two ordinal variables
df <- data.frame(
id = 1:200,
wt = runif(200, 0.5, 2),
o1 = factor(sample(1:4, 200, replace = TRUE), ordered = TRUE),
o2 = factor(sample(1:4, 200, replace = TRUE), ordered = TRUE)
)
d_ord <- as_survey(df, weights = wt)
get_corr(d_ord, x = c(o1, o2), method = "polychoric")
Run the code above in your browser using DataLab