Introduction

stddiff.spark provides Spark-compatible implementations of the standardized difference calculations from the stddiff package. The interface matches stddiff, so you can swap your existing calls in place without changing your workflow.

Because Spark DataFrames do not have native factor types, categorical variables are encoded using alphabetic ordering: the first level alphabetically becomes 0, the second becomes 1, and so on. This ensures consistent, deterministic calculations for binary and multi-level categorical variables.
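
The encoding rule can be illustrated locally in base R. This is only a sketch of the idea, not the package's internal code: the variable names are made up for the example, and it assumes that Spark's ordering agrees with byte-wise (C-locale) ordering for the labels involved, which may not hold for mixed-case or non-ASCII labels.

```r
# Illustration only: mimic "first level alphabetically becomes 0" in base R.
group <- c("Treatment", "Control", "Control", "Treatment")

# method = "radix" sorts character vectors in C-locale (byte) order,
# which avoids locale-dependent collation on the local machine.
levels_sorted <- sort(unique(group), method = "radix")

# match() returns 1-based positions; subtract 1 to get 0-based codes.
codes <- match(group, levels_sorted) - 1L

# Show each label next to the code it would receive.
data.frame(group, code = codes)
```

In this example "Control" sorts before "Treatment", so "Control" receives code 0 and acts as the reference level.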

Note: To choose a specific reference category, update the values in your Spark DataFrame so that the desired reference level sorts first alphabetically. For example:

library(dplyr)
# Suppose original category: "Control", "Treatment"
spark_df <- spark_df %>%
  mutate(group = ifelse(group == "Treatment", "A_Treatment", group))

Here, prefixing "Treatment" with "A_" makes it sort before "Control", so it becomes the reference level for standardized difference calculations. A prefix only works if the recoded value sorts ahead of every other level in the column.
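
A prefix such as "A_" is not guaranteed to sort first if other levels start with digits or with "A". The sketch below checks this on a local character vector before the recoding is applied to Spark data. `make_reference_first` is a hypothetical helper written for this example and is not part of stddiff.spark; it also assumes byte-wise ordering.

```r
# Hypothetical helper (not part of stddiff.spark): recode `ref` with a prefix
# and verify that the recoded value really sorts first.
make_reference_first <- function(x, ref, prefix = "A_") {
  stopifnot(ref %in% x)
  new_label <- paste0(prefix, ref)
  recoded <- ifelse(x == ref, new_label, x)
  # C-locale ordering; assumed here to match the ordering used on Spark.
  first_level <- sort(unique(recoded), method = "radix")[1]
  if (first_level != new_label) {
    stop("Prefix '", prefix, "' does not sort '", ref,
         "' first; choose another prefix.")
  }
  recoded
}

group <- c("Control", "Treatment", "Treatment", "Control")
recoded <- make_reference_first(group, ref = "Treatment")

# List the levels in the order they would be encoded.
sort(unique(recoded), method = "radix")
```

The same `ifelse()` recoding can then be applied to the Spark DataFrame inside `mutate()`, as in the snippet above.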

Functions automatically dispatch to the stddiff package when non-Spark data is supplied, so the same code works seamlessly on both local R data frames and Spark DataFrames.
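
How the package decides between the two code paths is not described here. As a rough sketch, a check of this kind could rely on the `tbl_spark` class that sparklyr attaches to Spark tables. `backend_for` below is a hypothetical helper for illustration only and is not the package's actual dispatch mechanism.

```r
# Hypothetical helper: report which backend a data object would be routed to.
# Assumption: Spark tables created by sparklyr inherit from "tbl_spark".
backend_for <- function(data) {
  if (inherits(data, "tbl_spark")) "spark" else "local"
}

# A plain data frame is not a Spark table, so it takes the local path.
local_df <- data.frame(treatment = c(1, 0), age = c(34, 28))
backend_for(local_df)
```

In practice this means a call such as `stddiff.numeric(my_data, gcol = 1, vcol = 2:4)` should accept either a local data frame or a Spark DataFrame without changes to the call itself.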

Installation

CRAN

install.packages("stddiff.spark")

GitHub

# install.packages("remotes") # if you don’t have it
remotes::install_github("alicja-januszkiewicz/stddiff.spark")

Usage

library(sparklyr)
library(dplyr)
library(stddiff.spark)

# connect to Spark
sc <- spark_connect(master = "local")

# create example local data
my_data <- data.frame(
  treatment = c(1, 0, 1, 0, 1, 0),
  age       = c(34, 28, 45, 30, 50, 33),
  bmi       = c(22.1, 24.3, 27.8, 23.5, 28.2, 25.0),
  weight    = c(70, 65, 85, 68, 90, 72)
)

# copy data to Spark
spark_df <- copy_to(sc, my_data, overwrite = TRUE)

# compute standardized differences for numeric variables
stddiff.numeric(spark_df, gcol = 1, vcol = 2:4)

# disconnect Spark
spark_disconnect(sc)
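
To sanity-check the Spark result, the standardized difference for one variable can be computed by hand on the local data. The sketch below uses the common definition (difference in group means divided by the square root of the average of the two group variances). Whether stddiff.spark reports the absolute value, and which additional columns it returns, is not stated here, so treat this as an independent cross-check rather than a description of the package's output.

```r
# Local cross-check for one variable (age), using the example data above.
my_data <- data.frame(
  treatment = c(1, 0, 1, 0, 1, 0),
  age       = c(34, 28, 45, 30, 50, 33)
)

treated <- my_data$age[my_data$treatment == 1]
control <- my_data$age[my_data$treatment == 0]

# Pooled standard deviation: square root of the mean of the group variances.
pooled_sd <- sqrt((var(treated) + var(control)) / 2)

# Standardized difference in means (sign kept; take abs() if needed).
smd_age <- (mean(treated) - mean(control)) / pooled_sd
smd_age
```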

Requirements

  • Apache Spark (tested with 3.4.4)
  • sparklyr (>= 1.8.0)

Version

1.0

License

GPL (>= 3)

Maintainer

Alicja Januszkiewicz

Last Published

January 15th, 2026

Functions in stddiff.spark (1.0)

  • spark_to_r_type: Map Spark to base R types
  • stddiff.numeric: Compute Standardized Differences for Numeric Variables (Spark)
  • stddiff.binary: Compute Standardized Differences for Binary Variables (Spark)
  • check_binary_group: Check if group variable has exactly 2 levels
  • stddiff.category: Compute Standardized Differences for Categorical Variables (Spark)
  • validate_stddiff_inputs: Validate inputs for stddiff functions