stddiff.spark (version 1.0)

stddiff.binary: Compute Standardized Differences for Binary Variables (Spark)

Description

Calculates standardized differences for binary variables using a Spark DataFrame. Equivalent to stddiff::stddiff.binary but operates on Spark data.

Usage

stddiff.binary(data, gcol, vcol, verbose = FALSE)

Value

A numeric matrix with one row per variable and columns:

  • p.c: Proportion in control group (first level alphabetically)

  • p.t: Proportion in treatment group (second level alphabetically)

  • missing.c: Number of missing values in control group

  • missing.t: Number of missing values in treatment group

  • stddiff: Standardized difference

  • stddiff.l: Lower bound of 95% confidence interval

  • stddiff.u: Upper bound of 95% confidence interval

Arguments

data

A Spark DataFrame (tbl_spark) containing the variables.

gcol

Integer; column index of the binary grouping variable (e.g., treatment vs control).

vcol

Integer vector; column indices of the binary variables to analyze.

verbose

Logical; if TRUE, prints progress messages. Default is FALSE.

Details

Because Spark has no factor type, each variable's two levels are encoded by lexicographic order: the first level alphabetically becomes 0 and the second becomes 1.
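The encoding rule can be illustrated in base R, without Spark. This is only a sketch of the idea, not the package's implementation: `encode_binary` is a hypothetical helper, and it assumes that byte-wise (C-locale) sorting, requested here with `method = "radix"`, approximates the ordering Spark applies to strings.

```r
# Hypothetical helper (not part of stddiff.spark): map a two-level vector
# to 0/1 by lexicographic order of its levels. Missing values stay NA.
encode_binary <- function(x) {
  # method = "radix" sorts character vectors in the C locale (byte order)
  lv <- sort(unique(x[!is.na(x)]), method = "radix")
  stopifnot(length(lv) == 2)
  as.integer(x == lv[2])  # first level -> 0, second level -> 1
}

group <- c("treated", "control", "treated", NA)
encoded <- encode_binary(group)
# "control" sorts before "treated", so it is coded 0 and "treated" is coded 1
```

Under this rule the labels themselves decide which group is treated as control, so renaming a level (for example "A" and "B" versus "placebo" and "drug") can swap the roles of p.c and p.t.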

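The standardized difference is computed as:

$$d = \frac{|p_t - p_c|}{\sqrt{(p_t(1-p_t) + p_c(1-p_c))/2}}$$

where $p_t$ and $p_c$ are the treatment-group and control-group proportions returned as p.t and p.c.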
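The formula can be checked by hand in base R on a local copy of the data. The sketch below is illustrative only: `std_diff_binary` is a hypothetical helper, not a function exported by this package, and it uses the same mtcars columns as the example in this topic (am as the grouping variable, vs as the binary variable).

```r
# Hypothetical helper (not part of stddiff.spark): standardized difference
# between two proportions, following the formula in Details.
std_diff_binary <- function(p_t, p_c) {
  abs(p_t - p_c) / sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / 2)
}

# Proportion of vs == 1 within each level of am
p_c <- mean(mtcars$vs[mtcars$am == 0])  # control: first level (am = 0)
p_t <- mean(mtcars$vs[mtcars$am == 1])  # treatment: second level (am = 1)

d <- std_diff_binary(p_t, p_c)
```

Because of the absolute value in the numerator, the result is non-negative and does not change if the two groups are swapped.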

See Also

stddiff.category, stddiff.numeric

Examples

if (FALSE) { # requireNamespace("sparklyr", quietly = TRUE) && interactive()
sc <- sparklyr::spark_connect(master = "local")

spark_df <- sparklyr::copy_to(sc, mtcars)

result <- stddiff.binary(
  data = spark_df,
  gcol = 9,   # column 9 of mtcars is am, used as the grouping variable
  vcol = c(8) # column 8 of mtcars is vs, the binary variable to analyze
)

sparklyr::spark_disconnect(sc)
}
