stddiff.spark (version 1.0)

stddiff.numeric: Compute Standardized Differences for Numeric Variables (Spark)

Description

Calculates standardized differences for continuous numeric variables using a Spark DataFrame. Equivalent to stddiff::stddiff.numeric but operates on Spark data.

Usage

stddiff.numeric(data, gcol, vcol, verbose = FALSE)

Value

A numeric matrix with one row per variable and columns:

  • mean.c: Mean in control group

  • sd.c: Standard deviation in control group

  • mean.t: Mean in treatment group

  • sd.t: Standard deviation in treatment group

  • missing.c: Number of missing values in control group

  • missing.t: Number of missing values in treatment group

  • stddiff: Standardized difference

  • stddiff.l: Lower bound of 95% confidence interval

  • stddiff.u: Upper bound of 95% confidence interval

Arguments

data

A Spark DataFrame (tbl_spark) containing the variables.

gcol

Integer; column index of the binary grouping variable.

vcol

Integer vector; column indices of the numeric variables to analyze.

verbose

Logical; if TRUE, prints progress messages. Default is FALSE.
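Because gcol and vcol are positional, it can help to derive the indices from column names rather than counting by hand. The following is a minimal sketch using the local mtcars data frame; the variable names group_name and value_names are illustrative only, and it assumes the Spark table keeps the same column order as the local data.

```r
# Hypothetical helper variables; not part of the package.
group_name  <- "vs"
value_names <- c("mpg", "cyl", "drat")

# match() returns the integer position of each name in the column list.
gcol <- match(group_name, names(mtcars))
vcol <- match(value_names, names(mtcars))

# Stop early if any name was not found (match() yields NA in that case).
stopifnot(!anyNA(c(gcol, vcol)))
```

The resulting gcol and vcol can then be passed to stddiff.numeric as in the Examples section.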

Details

The standardized difference for continuous variables is computed as:

$$d = \frac{|\bar{x}_t - \bar{x}_c|}{\sqrt{(s_t^2 + s_c^2)/2}}$$

where \(\bar{x}_t\) and \(\bar{x}_c\) are the treatment and control group means, and \(s_t^2\) and \(s_c^2\) are the corresponding group variances.

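This corresponds to the absolute value of Cohen's d, with the standard deviation pooled as the unweighted average of the two group variances. It matches the sample-size-weighted pooled form when the two groups are the same size.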
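As a rough illustration of the formula above, the following base R sketch computes the point estimate on a local data frame. It is not the package's Spark implementation: the function name std_diff is hypothetical, dropping missing values before summarizing is an assumption, and treating vs == 1 as the treatment group is arbitrary (the absolute value makes the result symmetric in the two groups).

```r
# Hypothetical local helper mirroring the formula; not part of the package.
std_diff <- function(x_t, x_c) {
  # Assumption: missing values are dropped before computing summaries.
  x_t <- x_t[!is.na(x_t)]
  x_c <- x_c[!is.na(x_c)]
  # |mean difference| divided by the root of the averaged group variances.
  abs(mean(x_t) - mean(x_c)) / sqrt((var(x_t) + var(x_c)) / 2)
}

# Example: mpg split by vs, with vs == 1 taken as "treatment".
d <- std_diff(mtcars$mpg[mtcars$vs == 1], mtcars$mpg[mtcars$vs == 0])
```

This covers only the stddiff column; the confidence bounds and missing-value counts returned by stddiff.numeric are not reproduced here.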

See Also

stddiff.binary, stddiff.category

Examples

if (FALSE) { # requireNamespace("sparklyr", quietly = TRUE) && interactive()
sc <- sparklyr::spark_connect(master = "local")

spark_df <- sparklyr::copy_to(sc, mtcars)

result <- stddiff.numeric(
  data = spark_df,
  gcol = 8,          # column index of the grouping variable (vs)
  vcol = c(1, 2, 5)  # column indices of the numeric variables (mpg, cyl, drat)
)

sparklyr::spark_disconnect(sc)
}