stddiff.spark (version 1.0)

stddiff.category: Compute Standardized Differences for Categorical Variables (Spark)

Description

Calculates standardized differences for categorical variables stored in a Spark DataFrame. Equivalent to stddiff::stddiff.category, but operates on Spark data rather than a local data frame.

Usage

stddiff.category(data, gcol, vcol, verbose = FALSE)

Value

A numeric matrix with one row per category level and columns:

  • p.c: Proportion in control group

  • p.t: Proportion in treatment group

  • missing.c: Number of missing values in control group (first row only)

  • missing.t: Number of missing values in treatment group (first row only)

  • stddiff: Standardized difference (first row only)

  • stddiff.l: Lower CI bound (first row only)

  • stddiff.u: Upper CI bound (first row only)

Row names are formatted as "variable_name level_name".

Arguments

data

A Spark DataFrame (tbl_spark) containing the variables.

gcol

Integer; column index of the binary grouping variable.

vcol

Integer vector; column indices of the categorical variables to analyze.

verbose

Logical; if TRUE, prints progress messages. Default is FALSE.
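
Because gcol and vcol are positional indices, it can be safer to look them up by column name than to hard-code them. The following is a minimal sketch using the local Titanic data frame from the Examples section; it assumes, without verifying, that the Spark copy keeps the same column order as the local data frame.

```r
# Local stand-in for the Spark table; assumed to share its column order.
local_df <- as.data.frame(Titanic)

# match() returns the position of each name in the vector of column names.
gcol <- match("Survived", colnames(local_df))
vcol <- match(c("Class"), colnames(local_df))

print(gcol)
print(vcol)
```

The resulting gcol and vcol can then be passed to stddiff.category in place of literal integers.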

Details

For categorical variables with K levels, the standardized difference is computed using a multivariate approach that accounts for all K-1 levels simultaneously (excluding the reference level). Category levels are sorted lexicographically; the first level alphabetically serves as the reference.
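
To make the multivariate approach concrete, here is a sketch of one common formulation: a Mahalanobis-type distance between the two groups' proportion vectors over the K-1 non-reference levels, using a pooled covariance matrix. The helper name stddiff_from_props and the input proportions are hypothetical, and it is an assumption (not confirmed by this page) that stddiff.category uses exactly this formula.

```r
# Hypothetical helper for illustration; not part of stddiff.spark.
# p_treat, p_ctrl: per-level proportions (same level order, each summing to 1).
stddiff_from_props <- function(p_treat, p_ctrl) {
  # Drop the reference level (first element), keeping K-1 levels.
  pt <- p_treat[-1]
  pc <- p_ctrl[-1]

  # Pooled covariance matrix of the level indicators:
  #   off-diagonal: -(pt_k * pt_l + pc_k * pc_l) / 2
  #   diagonal:      (pt_k * (1 - pt_k) + pc_k * (1 - pc_k)) / 2
  s <- -(outer(pt, pt) + outer(pc, pc)) / 2
  diag(s) <- (pt * (1 - pt) + pc * (1 - pc)) / 2

  # Mahalanobis-type distance between the two proportion vectors.
  d <- pt - pc
  sqrt(drop(t(d) %*% solve(s) %*% d))
}

# Made-up proportions for a three-level variable.
d3 <- stddiff_from_props(c(0.2, 0.3, 0.5), c(0.4, 0.3, 0.3))
print(d3)
```

With only two levels, this reduces to the usual standardized difference for a binary variable, |p.t - p.c| divided by the square root of the average of the two binomial variances.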

See Also

stddiff.binary, stddiff.numeric

Examples

# Not run by default; requires sparklyr and a local Spark installation.
if (FALSE) { # requireNamespace("sparklyr", quietly = TRUE) && interactive()
  sc <- sparklyr::spark_connect(master = "local")

  # as.data.frame(Titanic) has columns Class, Sex, Age, Survived, Freq
  spark_df <- sparklyr::copy_to(sc, as.data.frame(Titanic))

  result <- stddiff.category(
    data = spark_df,
    gcol = 4,   # column index of the grouping variable (Survived)
    vcol = c(1) # column indices of the categorical variables (Class)
  )

  sparklyr::spark_disconnect(sc)
}