stddiff.category: Compute Standardized Differences for Categorical Variables (Spark)

Description

Calculates standardized differences between two groups for categorical variables stored in a Spark DataFrame. Equivalent to stddiff::stddiff.category, but operates on Spark data rather than a local data frame.

Usage

stddiff.category(data, gcol, vcol, verbose = FALSE)

Value

A numeric matrix with one row per category level and columns:

p.c: Proportion in control group
p.t: Proportion in treatment group
missing.c: Number of missing values in control group (first row only)
missing.t: Number of missing values in treatment group (first row only)
stddiff: Standardized difference (first row only)
stddiff.l: Lower CI bound (first row only)
stddiff.u: Upper CI bound (first row only)

Row names are formatted as "variable_name level_name".

Arguments

data: A Spark DataFrame (tbl_spark) containing the variables.
gcol: Integer; column index of the binary grouping variable.
vcol: Integer vector; column indices of the categorical variables to analyze.
verbose: Logical; if TRUE, prints progress messages. Default is FALSE.
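
Because gcol and vcol take column positions rather than names, it can help to look the positions up by name instead of hard-coding them. The sketch below does this on a local data frame with the same columns as the Spark table used in the Examples; the variable names are illustrative, and using the same lookup on a tbl_spark via colnames() is an assumption rather than documented behavior of this function.

```r
# Local stand-in for the Spark table; columns are Class, Sex, Age, Survived, Freq
df <- as.data.frame(Titanic)

# Look up positions by name so the call survives column reordering
gcol <- match("Survived", names(df))        # grouping variable
vcol <- match(c("Class", "Sex"), names(df)) # categorical variables

# match() returns NA for unknown names, so fail early on a typo
stopifnot(!anyNA(c(gcol, vcol)))
```

The resulting gcol and vcol can then be passed to stddiff.category() in place of literal indices.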

Details

For a categorical variable with K levels, the standardized difference is computed with a multivariate approach that handles K-1 levels simultaneously, excluding the reference level. Category levels are sorted lexicographically, and the first level in that order serves as the reference.
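
The multivariate approach described above can be sketched in base R on local proportions. This is a minimal illustration that assumes the function uses the same formula as the stddiff package (Yang and Dalton, 2012); it is not the Spark implementation, and the helper name std_diff_categorical is hypothetical.

```r
# Illustrative only: multivariate standardized difference for one categorical
# variable, computed from local (non-Spark) proportions.
std_diff_categorical <- function(p_c, p_t) {
  # p_c, p_t: proportions for all K levels in control / treatment,
  # in the same level order, with the reference level first
  stopifnot(length(p_c) == length(p_t), length(p_c) >= 2)

  # Drop the reference level, leaving K - 1 proportions per group
  c_vec <- p_c[-1]
  t_vec <- p_t[-1]

  # Pooled covariance matrix of the K - 1 level indicators:
  #   off-diagonal: -(p_t[k] * p_t[l] + p_c[k] * p_c[l]) / 2
  #   diagonal:      (p_t[k] * (1 - p_t[k]) + p_c[k] * (1 - p_c[k])) / 2
  s <- -(outer(t_vec, t_vec) + outer(c_vec, c_vec)) / 2
  diag(s) <- (t_vec * (1 - t_vec) + c_vec * (1 - c_vec)) / 2

  # Mahalanobis-type distance between the two proportion vectors
  d <- t_vec - c_vec
  sqrt(drop(t(d) %*% solve(s, d)))
}

# Usage on the Titanic data, treating Survived == "No" as control (an assumption)
props <- prop.table(xtabs(Freq ~ Class + Survived, data = as.data.frame(Titanic)), margin = 2)
std_diff_categorical(p_c = props[, "No"], p_t = props[, "Yes"])
```

With K = 2 the expression reduces to the familiar standardized difference for a binary variable, which gives a quick sanity check on the matrix form.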

Examples

if (requireNamespace("sparklyr", quietly = TRUE) && interactive()) {
  # Start a local Spark session
  sc <- sparklyr::spark_connect(master = "local")

  # Copy the Titanic data to Spark; columns are Class, Sex, Age, Survived, Freq
  spark_df <- sparklyr::copy_to(sc, as.data.frame(Titanic), name = "titanic", overwrite = TRUE)

  result <- stddiff.category(
    data = spark_df,
    gcol = 4,    # Survived: binary grouping variable
    vcol = c(1)  # Class: categorical variable to analyze
  )
  print(result)

  sparklyr::spark_disconnect(sc)
}