stddiff.numeric: Compute Standardized Differences for Numeric Variables (Spark)

Description

Calculates standardized differences between two groups for continuous numeric variables stored in a Spark DataFrame. Equivalent to stddiff::stddiff.numeric, but operates on Spark data.

Usage

stddiff.numeric(data, gcol, vcol, verbose = FALSE)

Value

A numeric matrix with one row per variable and columns:

mean.c: Mean in control group
sd.c: Standard deviation in control group
mean.t: Mean in treatment group
sd.t: Standard deviation in treatment group
missing.c: Number of missing values in control group
missing.t: Number of missing values in treatment group
stddiff: Standardized difference
stddiff.l: Lower bound of 95% confidence interval
stddiff.u: Upper bound of 95% confidence interval

Arguments

data: A Spark DataFrame (tbl_spark) containing the variables.
gcol: Integer; column index of the binary grouping variable.
vcol: Integer vector; column indices of the numeric variables to analyze.
verbose: Logical; if TRUE, prints progress messages. Default is FALSE.
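
Because gcol and vcol are positions rather than names, it can help to derive them from column names instead of counting by hand. The following is a minimal sketch on the local mtcars data frame; it assumes the Spark table keeps the same column names and order after copying, which is worth checking with colnames() on the Spark DataFrame itself.

```r
# Column names of the local data frame used in the Examples section.
cols <- colnames(mtcars)

# Translate names into the integer positions that gcol and vcol expect.
gcol <- match("vs", cols)
vcol <- match(c("mpg", "cyl", "drat"), cols)

# Fail early if any name is misspelled or absent.
stopifnot(!anyNA(c(gcol, vcol)))
```

The resulting gcol and vcol can then be passed to stddiff.numeric in place of hard-coded indices.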

Details

The standardized difference for continuous variables is computed as:

$$d = \frac{|\bar{x}_t - \bar{x}_c|}{\sqrt{(s_t^2 + s_c^2)/2}}$$

where $\bar{x}_t$ and $\bar{x}_c$ are the means, and $s_t^2$ and $s_c^2$ the variances, in the treatment and control groups.

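This has the same form as Cohen's d, except that the two group variances are averaged with equal weight rather than weighted by group size, and the absolute value of the mean difference is taken. The two coincide (up to sign) when the groups are the same size.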
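
To make the formula concrete, here is a minimal base R sketch that applies it to two in-memory vectors. The helper std_diff is hypothetical and not part of this package; dropping missing values before computing and treating vs == 0 as the control group are assumptions made for illustration only.

```r
# Hypothetical helper (not part of the package): standardized difference
# between a treatment vector and a control vector, per the formula above.
std_diff <- function(x_t, x_c) {
  # Assumption: missing values are dropped before computing.
  x_t <- x_t[!is.na(x_t)]
  x_c <- x_c[!is.na(x_c)]
  abs(mean(x_t) - mean(x_c)) / sqrt((var(x_t) + var(x_c)) / 2)
}

# Toy check: means 4 and 2, both variances 1, so d = 2 / sqrt(1) = 2.
d_toy <- std_diff(c(3, 4, 5), c(1, 2, 3))

# Local analogue of the Examples section: mpg split by vs
# (assumption: vs == 1 is treatment, vs == 0 is control).
d_mpg <- std_diff(mtcars$mpg[mtcars$vs == 1], mtcars$mpg[mtcars$vs == 0])
```

Because of the absolute value, swapping the two arguments leaves the result unchanged.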

Examples

if (FALSE) { # requireNamespace("sparklyr", quietly = TRUE) && interactive()
sc <- sparklyr::spark_connect(master = "local")

spark_df <- sparklyr::copy_to(sc, mtcars)

result <- stddiff.numeric(
  data = spark_df,
  gcol = 8,          # column index of the binary grouping variable (vs)
  vcol = c(1, 2, 5)  # column indices of the numeric variables (mpg, cyl, drat)
)

sparklyr::spark_disconnect(sc)
}