tidysynthesis

This repository contains the tidysynthesis R package for generating synthetic data. Complete documentation is available at the tidysynthesis documentation website.

Installation

# The easiest way to get tidysynthesis is from CRAN:
install.packages("tidysynthesis")

# Or the development version from GitHub:
# install.packages("pak")
pak::pak("UrbanInstitute/tidysynthesis")

Overview

tidysynthesis is a “metapackage” for creating synthetic data sets for statistical disclosure limitation that shares the underlying design philosophy, grammar, and data structures of the tidyverse and tidymodels. tidysynthesis flexibly supports sequential synthesis modeling and sampling specifications with different formal and empirical privacy properties.

Note that the privacy and security properties of tidysynthesis’s outputs rely on many different technical assumptions. Our goal is to make the package agnostic to many of these assumptions, which places greater responsibility on users to evaluate synthetic data prior to dissemination. For more information, see the notes on security principles in our documentation.

presynth

Syntheses ultimately depend on a presynth object with two main components: a roadmap that outlines the macroscopic workflow (e.g., the order in which variables are synthesized, how variables are defined, and how they relate to one another) and a synth_spec that details how individual variables are synthesized (e.g., how the modeling workflow for each output variable is specified and how new synthetic samples are generated). Here is the general workflow, with required objects in blue and optional objects in magenta.

flowchart TD
  A[conf_data]:::required --> B[roadmap]:::required
  C[start_data]:::required --> B
  D[start_method]:::optional --> B
  E[schema]:::optional --> B 
  F[visit_sequence]:::optional --> B
  G[replicates]:::optional --> B
  H[constraints]:::optional --> B
  I[models]:::required --> J[synth_spec]:::required
  K[samplers]:::required --> J
  L[steps]:::optional --> J
  M[noise]:::optional --> J
  N[tuners]:::optional --> J
  O[extractors]:::optional --> J
  B --> P[presynth]:::required 
  J --> P
  P --> Q[postsynth]:::required
  classDef required fill:#1696d2,stroke:#1696d2;
  classDef optional fill:#ec008b,stroke:#ec008b;
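
To make the division of labor between the two components concrete, here is a minimal base R sketch of sequential synthesis. It is not the tidysynthesis implementation: the mtcars data, the variable names, and the choice of linear models are all stand-ins chosen for illustration. The visit sequence plays the role of the roadmap, and the model-plus-sampler pair inside the loop plays the role of the synth_spec.

```r
# Sequential synthesis sketch in base R (illustrative only, not the package's code).
set.seed(1)
conf <- mtcars[, c("cyl", "wt", "hp", "mpg")]  # stand-in "confidential" data

# Starting data: resample one variable to initialize the sequence.
synth <- data.frame(cyl = sample(conf$cyl, size = 10, replace = TRUE))

# Visit sequence: the order in which the remaining variables are synthesized.
visit_sequence <- c("wt", "hp", "mpg")

for (v in visit_sequence) {
  # Predictors are the starting variables plus everything synthesized so far.
  predictors <- names(synth)
  fit <- lm(reformulate(predictors, response = v), data = conf)

  # Sampler: draw around the fitted mean using the residual standard error.
  synth[[v]] <- rnorm(
    n = nrow(synth),
    mean = predict(fit, newdata = synth),
    sd = sigma(fit)
  )
}

str(synth)
```

Each model is fit on the confidential data but predicts from the synthetic values generated so far, which is why the order of the visit sequence matters.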

roadmap Components

roadmap objects require data inputs. All other inputs can be optionally supplied as S3 objects or constructed by default. roadmaps also have a tidymodels-style API that lets you update objects, for example:

# create an example roadmap
roadmap(conf_data = example_conf_data, 
        start_data = example_start_data) |>
  # add an example visit_sequence using the API
  add_sequence_manual(var1, var2) |>
  # update the example schema using the API
  update_schema(col_schema = list("var1" = list("dtype" = "fct")))

See the documentation website for more API examples.

Required roadmap objects:

  • conf_data: the data.frame of confidential data to synthesize.
  • start_data: the data.frame of starting data to initialize sequential models.

Optional roadmap objects:

  • start_method(): S3 object that specifies randomized transformations on the start_data, such as resampling, noise infusion, or joint modeling. Defaults to no transformation.
  • schema(): S3 column schema with specifications describing variables, such as data types and NA values. Defaults to inferred types from the provided conf_data.
  • visit_sequence(): S3 object that specifies the order in which variables get synthesized. Defaults to the same order they appear in the confidential data.
  • constraints(): S3 object that controls imposed constraints during the synthesis process, such as maxima and minima for numeric variables. Defaults to no constraints.
  • replicates(): S3 object controlling the synthesis component repetition for producing multiple synthetic datasets. Defaults to one synthetic dataset.

synth_spec Components

synth_spec S3 objects allow you to specify each component either through defaults for regression and classification models or through custom specifications that map individual variables to components. Here is an example:

synth_spec(
  default_regression_sampler = tidysynthesis::sample_lm,
  default_classification_sampler = tidysynthesis::sample_rpart,
  custom_samplers = list(
    list("vars" = c("var1", "var2"), "sampler" = sample_ranger)
  ),
  ...
)

These components also support updating via a tidymodels-style API; see the documentation website for more API examples.

Required synth_spec components:

  • models: parsnip model specifications.
  • samplers: sampler functions (many provided in tidysynthesis).

Optional synth_spec components:

  • steps: functions that transform predictors using recipes::step_* functions.
  • noise: S3 object for specifying additive noise for synthesis outputs.
  • tuners: list specifications for tuning hyperparameters with cross-validation.
  • extractors: parsnip functions for extracting fit model information.

Examples

The following examples synthesize the penguins data set from the palmerpenguins package, with missing values removed.

library(palmerpenguins)
library(tidyverse)
library(tidysynthesis)

penguins_complete <- penguins |>
  select(-year) |>
  drop_na() |>
  mutate(
    flipper_length_mm = as.numeric(flipper_length_mm),
    body_mass_g = as.numeric(body_mass_g)
  )

All of the examples use the same starting data and visit sequence, as specified by the roadmap below.

set.seed(20220218)

# create "starting data"
starting_data <- penguins_complete |> 
  group_by(island) |>
  slice_sample(n = 5) |>
  select(species, island, sex) |>
  ungroup()

# create roadmap
rm <- roadmap(
  conf_data = penguins_complete,
  start_data = starting_data
) |> 
  add_sequence_numeric(
    dplyr::where(is.numeric), 
    method = "correlation", 
    cor_var = "bill_length_mm"
  )
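
As a rough illustration of what a correlation-based visit sequence might look like, the base R sketch below orders numeric variables by their absolute correlation with a reference variable. This is an assumption about the spirit of method = "correlation", not a description of the package's exact rule; the helper name order_by_correlation() is hypothetical, and iris stands in for the penguins data so the block runs without extra packages.

```r
# Hypothetical helper: order numeric variables by absolute correlation with
# a reference variable. The package's actual ordering rule may differ.
order_by_correlation <- function(data, cor_var) {
  numeric_vars <- names(data)[vapply(data, is.numeric, logical(1))]
  others <- setdiff(numeric_vars, cor_var)

  # Absolute correlation of each remaining variable with the reference variable.
  cors <- vapply(
    others,
    function(v) abs(cor(data[[cor_var]], data[[v]], use = "complete.obs")),
    numeric(1)
  )

  # Reference variable first, then the others from most to least correlated.
  c(cor_var, names(sort(cors, decreasing = TRUE)))
}

visit_order <- order_by_correlation(iris, cor_var = "Sepal.Length")
visit_order
#> [1] "Sepal.Length" "Petal.Length" "Petal.Width"  "Sepal.Width"
```

Synthesizing the most strongly related variables first lets later models condition on the predictors that carry the most information.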

Example 1

Example 1 uses linear regression to synthesize the numeric data in the penguins data set. sample_lm() samples from normal distributions centered on the regression line with standard deviation equal to the residual standard error.

# synth_spec

lm_mod <- parsnip::linear_reg() |> 
  parsnip::set_engine(engine = "lm") |>
  parsnip::set_mode(mode = "regression")

synth_spec1 <- synth_spec(
  default_regression_model = lm_mod,
  default_regression_sampler = tidysynthesis::sample_lm
)

# create a presynth object
# use defaults for noise, constraints, and replicates
presynth1 <- presynth(
  roadmap = rm,
  synth_spec = synth_spec1
)

# synthesize!
set.seed(1)
postsynth1 <- synthesize(presynth = presynth1)
#> Synthesizing 1/4 bill_length_mm...

#> Synthesizing 2/4 flipper_length_mm...

#> Synthesizing 3/4 body_mass_g...

#> Synthesizing 4/4 bill_depth_mm...
postsynth1$synthetic_data
#> # A tibble: 15 × 7
#>    species   island    sex    bill_length_mm flipper_length_mm body_mass_g
#>    <fct>     <fct>     <fct>           <dbl>             <dbl>       <dbl>
#>  1 Gentoo    Biscoe    male             47.9              219.       5764.
#>  2 Gentoo    Biscoe    male             49.8              221.       5400.
#>  3 Gentoo    Biscoe    female           43.7              218.       4895.
#>  4 Gentoo    Biscoe    male             53.1              227.       5605.
#>  5 Gentoo    Biscoe    female           46.4              217.       4432.
#>  6 Chinstrap Dream     female           45.1              196.       3308.
#>  7 Chinstrap Dream     female           48.1              197.       3399.
#>  8 Adelie    Dream     male             42.1              195.       4075.
#>  9 Chinstrap Dream     female           48.3              182.       3565.
#> 10 Adelie    Dream     female           35.9              189.       3624.
#> 11 Adelie    Torgersen male             44.4              197.       4110.
#> 12 Adelie    Torgersen female           38.1              188.       3322.
#> 13 Adelie    Torgersen female           35.8              179.       3391.
#> 14 Adelie    Torgersen male             35.8              189.       3997.
#> 15 Adelie    Torgersen female           39.8              192.       3307.
#> # ℹ 1 more variable: bill_depth_mm <dbl>
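
The sampling rule that Example 1 attributes to sample_lm() can be sketched in a few lines of base R: fit a linear model, predict the conditional mean for each new row, and draw from a normal distribution whose standard deviation is the residual standard error. This is a sketch of the idea only; the mtcars data and the new_data values are made up for illustration.

```r
# Sketch of sampling around a regression line (not the package's code).
set.seed(1)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical rows for which we want synthetic mpg values.
new_data <- data.frame(wt = c(2.5, 3.0, 3.5), hp = c(100, 150, 200))

mu <- predict(fit, newdata = new_data)  # points on the regression line
rse <- sigma(fit)                       # residual standard error

# One normal draw per row, centered on the prediction.
synthetic_mpg <- rnorm(n = nrow(new_data), mean = mu, sd = rse)
synthetic_mpg
```

Because each value is a random draw rather than the prediction itself, repeated runs with different seeds give different synthetic values for the same predictors.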

Example 2

synth_spec() can accept different model types. The example below is a regression tree model. Notice how all of the other objects from example 1 can be reused.

dt_mod <- parsnip::decision_tree() |>
  parsnip::set_engine(engine = "rpart") |>
  parsnip::set_mode(mode = "regression")

synth_spec2 <- synth_spec(
  default_regression_model = dt_mod,
  default_regression_sampler = tidysynthesis::sample_rpart
)

# create a presynth object
presynth2 <- presynth(
  roadmap = rm,
  synth_spec = synth_spec2
)

# synthesize!
set.seed(1)
postsynth2 <- synthesize(presynth = presynth2)
#> Synthesizing 1/4 bill_length_mm...

#> Synthesizing 2/4 flipper_length_mm...

#> Synthesizing 3/4 body_mass_g...

#> Synthesizing 4/4 bill_depth_mm...
postsynth2$synthetic_data
#> # A tibble: 15 × 7
#>    species   island    sex    bill_length_mm flipper_length_mm body_mass_g
#>    <fct>     <fct>     <fct>           <dbl>             <dbl>       <dbl>
#>  1 Gentoo    Biscoe    male             49.6               220        5350
#>  2 Gentoo    Biscoe    male             50                 215        5550
#>  3 Gentoo    Biscoe    female           45.2               222        4700
#>  4 Gentoo    Biscoe    male             59.6               216        5500
#>  5 Gentoo    Biscoe    female           42.8               218        5000
#>  6 Chinstrap Dream     female           46.6               195        3550
#>  7 Chinstrap Dream     female           40.9               185        3700
#>  8 Adelie    Dream     male             39.2               186        3800
#>  9 Chinstrap Dream     female           47.7               190        2900
#> 10 Adelie    Dream     female           36                 176        3850
#> 11 Adelie    Torgersen male             41.4               178        4050
#> 12 Adelie    Torgersen female           37                 174        3350
#> 13 Adelie    Torgersen female           37                 180        3775
#> 14 Adelie    Torgersen male             41.6               195        3550
#> 15 Adelie    Torgersen female           36.8               191        3200
#> # ℹ 1 more variable: bill_depth_mm <dbl>
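
One common way to sample from the conditional distribution implied by a regression tree is to draw an observed outcome from the training rows that fall in the same terminal node as the new row. The sketch below shows that idea with rpart and mtcars. Whether sample_rpart() works exactly this way is an assumption; the data and the new_data values are illustrative.

```r
# Sketch of donor sampling from the leaves of a regression tree.
library(rpart)

set.seed(1)
fit <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")

# Hypothetical rows for which we want synthetic mpg values.
new_data <- data.frame(wt = c(2.5, 3.0, 3.5), hp = c(100, 150, 200))

# For a regression tree, the prediction is the mean of the terminal node,
# so equal predictions identify rows that share a leaf.
train_leaf_mean <- predict(fit)
new_leaf_mean <- predict(fit, newdata = new_data)

synthetic_mpg <- vapply(
  new_leaf_mean,
  function(m) {
    donors <- mtcars$mpg[train_leaf_mean == m]  # outcomes in the same leaf
    donors[sample.int(length(donors), 1)]       # pick one donor at random
  },
  numeric(1)
)
synthetic_mpg
```

A sampler of this kind only ever returns values that appear in the confidential data, which is relevant when evaluating disclosure risk.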

Example 3

Sometimes prediction error alone is not enough, and additional noise is added to predictions. A noise object controls this additional noise: add_noise toggles whether variables receive additional noise, and noise_func names the function that adds it. Here, we use add_noise_kde() and pass it the additional arguments exclusions and n_ntiles (see ?add_noise_kde for details).

# noise
# this turns on noise for all variables and adds 0 as an exclusion for body_mass_g
noise_spec <- noise(
  add_noise = TRUE,
  noise_func = add_noise_kde,
  exclusions = 0,
  n_ntiles = 20
)

synth_spec3 <- synth_spec2 |>
  update_synth_spec(
    default_regression_noise = noise_spec
  )

presynth3 <- presynth(
  roadmap = rm,
  synth_spec = synth_spec3
)

# synthesize!
set.seed(1)
postsynth3 <- synthesize(presynth = presynth3)
#> Synthesizing 1/4 bill_length_mm...

#> Synthesizing 2/4 flipper_length_mm...

#> Synthesizing 3/4 body_mass_g...

#> Synthesizing 4/4 bill_depth_mm...
postsynth3$synthetic_data
#> # A tibble: 15 × 7
#>    species   island    sex    bill_length_mm flipper_length_mm body_mass_g
#>    <fct>     <fct>     <fct>           <dbl>             <dbl>       <dbl>
#>  1 Gentoo    Biscoe    male             49.5              217.       5283.
#>  2 Gentoo    Biscoe    male             49.8              218.       5254.
#>  3 Gentoo    Biscoe    female           45.1              208.       4420.
#>  4 Gentoo    Biscoe    male             59.9              215.       5881.
#>  5 Gentoo    Biscoe    female           42.4              210.       4467.
#>  6 Chinstrap Dream     female           47.4              195.       3321.
#>  7 Chinstrap Dream     female           41.0              208.       3289.
#>  8 Adelie    Dream     male             39.9              195.       3914.
#>  9 Chinstrap Dream     female           47.4              198.       3463.
#> 10 Adelie    Dream     female           36.5              184.       3072.
#> 11 Adelie    Torgersen male             40.8              199.       3548.
#> 12 Adelie    Torgersen female           37.6              196.       3841.
#> 13 Adelie    Torgersen female           37.3              170.       3582.
#> 14 Adelie    Torgersen male             41.4              189.       3461.
#> 15 Adelie    Torgersen female           37.7              182.       3400.
#> # ℹ 1 more variable: bill_depth_mm <dbl>
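
The simplest form of additive noise is constant-variance Gaussian noise with mean 0. The sketch below shows that idea in base R. The helper add_gaussian_noise_sketch() is hypothetical, and its treatment of exclusions (values listed there are left untouched) is an assumption made for illustration; it does not reproduce the ntile-based variances that add_noise_kde() uses.

```r
# Hypothetical helper: add mean-zero Gaussian noise to predictions, leaving
# any values listed in `exclusions` unchanged (assumed meaning of exclusions).
add_gaussian_noise_sketch <- function(pred, sd, exclusions = NULL) {
  noisy <- pred + rnorm(length(pred), mean = 0, sd = sd)
  excluded <- pred %in% exclusions
  noisy[excluded] <- pred[excluded]
  noisy
}

set.seed(1)
pred <- c(0, 3500, 4200, 0, 5100)  # made-up predicted values
noisy <- add_gaussian_noise_sketch(pred, sd = 50, exclusions = 0)
noisy
```

Excluding a value such as 0 is useful when it carries structural meaning and should not be perturbed.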

Example 4

tidysynthesis contains a system for specifying constraints that are applied during the synthesis process. This means constraints imposed on earlier variables are realized before later variables are synthesized. The constraints can be unconditional (e.g. penguin weight must be positive) or conditional (e.g. a Gentoo penguin must weigh at least 6,000 grams).

Constraints can be specified for numeric and/or categorical variables. Depending on the variable type, one of two constraints_df_* arguments can be specified, either constraints_df_num or constraints_df_cat (see the documentation ?constraints for examples). max_z_num and max_z_cat control the number of times a value should be resampled if it violates a constraint before the constraints are enforced by modifying synthesized values.

Below is an example using constraints_df_num and max_z_num.

# create a tibble of constraints
# skipped variables will have a minimum of -Inf and a maximum of Inf
constraints_df_num <- 
  tibble::tribble(
    ~var, ~min, ~max, ~conditions,
    "bill_length_mm", 0, Inf, "TRUE",
    "bill_length_mm", 0, Inf, "TRUE",
    "flipper_length_mm", 0, Inf, "TRUE",
    "body_mass_g", 0, Inf, "TRUE",
    "body_mass_g", 4000, 10000, "flipper_length_mm > 190",
    "body_mass_g", 6000, Inf, "species == 'Gentoo'"
  ) 

# create a constraints object
constraints4 <- constraints(
  schema = rm$schema,
  constraints_df_num = constraints_df_num,
  max_z_num = list(0, 1, 2, 3)
)

presynth4 <- presynth(
  roadmap = rm |>
    add_constraints(constraints4),
  synth_spec = synth_spec3
)

# synthesize!
set.seed(1)
postsynth4 <- synthesize(presynth = presynth4)
#> Synthesizing 1/4 bill_length_mm...

#> Synthesizing 2/4 flipper_length_mm...

#> Synthesizing 3/4 body_mass_g...

#> Synthesizing 4/4 bill_depth_mm...
postsynth4$synthetic_data
#> # A tibble: 15 × 7
#>    species   island    sex    bill_length_mm flipper_length_mm body_mass_g
#>    <fct>     <fct>     <fct>           <dbl>             <dbl>       <dbl>
#>  1 Gentoo    Biscoe    male             49.5              217.       6001.
#>  2 Gentoo    Biscoe    male             49.8              218.       6309.
#>  3 Gentoo    Biscoe    female           45.1              208.       6000 
#>  4 Gentoo    Biscoe    male             59.9              215.       6000 
#>  5 Gentoo    Biscoe    female           42.4              210.       6000 
#>  6 Chinstrap Dream     female           47.4              195.       4000 
#>  7 Chinstrap Dream     female           41.0              208.       4000 
#>  8 Adelie    Dream     male             39.9              195.       4000 
#>  9 Chinstrap Dream     female           47.4              198.       4000 
#> 10 Adelie    Dream     female           36.5              184.       3072.
#> 11 Adelie    Torgersen male             40.8              199.       4000 
#> 12 Adelie    Torgersen female           37.6              196.       4000 
#> 13 Adelie    Torgersen female           37.3              170.       3582.
#> 14 Adelie    Torgersen male             41.4              189.       3461.
#> 15 Adelie    Torgersen female           37.7              182.       3400.
#> # ℹ 1 more variable: bill_depth_mm <dbl>
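
The resample-then-enforce behavior described for max_z_num can be sketched in base R as follows. The helper enforce_bounds_sketch() and its arguments are hypothetical, and the normal sampler is a stand-in for a fitted model; the sketch only illustrates the general idea of retrying violating draws a limited number of times and then moving any remaining violations onto the nearest bound.

```r
# Hypothetical helper: resample violating values up to max_z times, then
# clamp whatever still violates the bounds.
enforce_bounds_sketch <- function(draw, n, lower, upper, max_z) {
  x <- draw(n)

  for (z in seq_len(max_z)) {
    violates <- x < lower | x > upper
    if (!any(violates)) break
    x[violates] <- draw(sum(violates))  # resample only the violating values
  }

  # After max_z attempts, force remaining violations onto the nearest bound.
  pmin(pmax(x, lower), upper)
}

set.seed(1)
mass <- enforce_bounds_sketch(
  draw = function(n) rnorm(n, mean = 5000, sd = 800),  # stand-in sampler
  n = 15,
  lower = 6000,
  upper = Inf,
  max_z = 3
)
range(mass)
```

When a bound sits far from the bulk of the sampling distribution, many values end up exactly on the bound, which matches the repeated 6000 and 4000 values in the output above.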

Example 5

tidysynthesis can generate multiple replicates. This means that all input conditions are the same, but, due to random sampling, the syntheses themselves differ. The replicates() functionality allows for the creation of these replicates (see the documentation ?replicates for the different kinds of replicate specifications).

replicates5 <- replicates(model_sample_replicates = 5)

presynth5 <- presynth(
  roadmap = rm |>
    add_replicates(replicates5),
  synth_spec = synth_spec2
)

# synthesize!
set.seed(1)
suppressMessages(synth5 <- synthesize(presynth = presynth5))

glimpse(synth5[[1]]$synthetic_data)
#> Rows: 15
#> Columns: 7
#> $ species           <fct> Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Chinstrap, C…
#> $ island            <fct> Biscoe, Biscoe, Biscoe, Biscoe, Biscoe, Dream, Dream…
#> $ sex               <fct> male, male, female, male, female, female, female, ma…
#> $ bill_length_mm    <dbl> 49.6, 50.0, 45.2, 59.6, 42.8, 46.6, 40.9, 39.2, 47.7…
#> $ flipper_length_mm <dbl> 220, 215, 222, 216, 218, 195, 185, 186, 190, 176, 17…
#> $ body_mass_g       <dbl> 5350, 5550, 4700, 5500, 5000, 3550, 3700, 3800, 2900…
#> $ bill_depth_mm     <dbl> 15.3, 15.9, 14.5, 15.9, 13.7, 18.0, 16.6, 17.2, 18.7…
glimpse(synth5[[2]]$synthetic_data)
#> Rows: 15
#> Columns: 7
#> $ species           <fct> Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Chinstrap, C…
#> $ island            <fct> Biscoe, Biscoe, Biscoe, Biscoe, Biscoe, Dream, Dream…
#> $ sex               <fct> male, male, female, male, female, female, female, ma…
#> $ bill_length_mm    <dbl> 50.4, 51.4, 44.9, 51.9, 47.5, 46.5, 48.4, 37.6, 50.1…
#> $ flipper_length_mm <dbl> 228, 220, 208, 230, 209, 189, 198, 187, 202, 190, 18…
#> $ body_mass_g       <dbl> 6000, 5550, 4850, 5700, 5200, 3200, 3500, 4500, 3700…
#> $ bill_depth_mm     <dbl> 16.4, 15.7, 13.8, 15.3, 14.3, 19.4, 16.8, 19.5, 17.3…
glimpse(synth5[[3]]$synthetic_data)
#> Rows: 15
#> Columns: 7
#> $ species           <fct> Gentoo, Gentoo, Gentoo, Gentoo, Gentoo, Chinstrap, C…
#> $ island            <fct> Biscoe, Biscoe, Biscoe, Biscoe, Biscoe, Dream, Dream…
#> $ sex               <fct> male, male, female, male, female, female, female, ma…
#> $ bill_length_mm    <dbl> 51.3, 48.5, 46.0, 50.0, 48.1, 45.2, 46.4, 41.4, 45.7…
#> $ flipper_length_mm <dbl> 220, 225, 210, 230, 220, 200, 192, 189, 200, 195, 17…
#> $ body_mass_g       <dbl> 5800, 5050, 4950, 5600, 3950, 3000, 3500, 4100, 2700…
#> $ bill_depth_mm     <dbl> 15.6, 16.3, 13.1, 17.0, 14.0, 18.9, 17.3, 17.5, 16.9…
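
The idea behind replicates can be shown with a small base R sketch: the fitted model and the starting data are held fixed, and only the random draws change between replicates. This is an illustration under simplifying assumptions (a single linear model on mtcars with made-up starting values), not the package's replicate machinery.

```r
# Sketch of replicates: same model and inputs, different random draws.
set.seed(1)
fit <- lm(mpg ~ wt, data = mtcars)
start_values <- data.frame(wt = c(2.5, 3.0, 3.5))  # made-up starting data

replicates_sketch <- lapply(seq_len(5), function(i) {
  out <- start_values
  out$mpg <- rnorm(
    n = nrow(out),
    mean = predict(fit, newdata = out),
    sd = sigma(fit)
  )
  out
})

# The predictors agree across replicates while the sampled values differ.
replicates_sketch[[1]]
replicates_sketch[[2]]
```

Comparing a statistic across replicates gives a rough sense of how much of its variability comes from the synthesis step itself.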

Getting help

Please share bugs and feature requests on GitHub. Minimal reproducible examples are appreciated.

Version: 0.1.2

License: GPL-3

Maintainer: Aaron Williams

Last Published: November 11th, 2025

Functions in tidysynthesis (0.1.2)

  • .name_to_inspect: constant mapping between component names and inspections
  • example_na: A df with different types of missingness
  • invert.step_BoxCox: Invert a Box-Cox transformation
  • print.presynth: print method for presynth objects
  • enforce_custom_na: Redefine NA value for a dataset.
  • print.replicates: Print the replicates object to the console with formatting
  • invert: An S3 method for inverting a step
  • print.constraints: Print the constraints object to the console with formatting
  • print.visit_sequence: Print method for visit_sequence objects
  • print.synth_spec: Print the synth_spec object to the console with formatting
  • enforce_schema: Enforce a roadmap's schema on its existing data
  • enforce_na: Add missing values where values should be missing according to _NA variables
  • presynth: Create a presynth object
  • construct_tuners: Construct a list of tuning grids for hyperparameter tuning predictive models
  • ks_distance: Kolmogorov-Smirnov distance
  • construct_samplers: Construct a list of samplers for synthesis
  • sample_glm: Sample the conditional distribution created by a generalized linear model
  • sample_rpart: Sample the conditional distribution created by a CART model
  • roadmap: Create a roadmap
  • print.schema: Print the schema object to the console with formatting
  • noise: Create a noise object
  • print.start_method: Print the start_method object to the console with formatting
  • synth_spec: Create a synth_spec object
  • print.noise: Print the noise object to the console with formatting
  • add_sequence_manual: Add to visit sequence using a manual method
  • convert_na_to_level: Convert NA values to "NA" for categorical variables
  • convert_level_to_na: Convert "NA" values to NA for categorical variables
  • invert.step_log: Invert a log transformation
  • expand_na: Add new variables that indicate if a value is "missing" or "not missing" for original variables that contain NA
  • invert.step_YeoJohnson: Invert a Yeo-Johnson transformation
  • replicates: Create a replicates object
  • replicates_api: Add, update, or reset a replicates object within an existing roadmap.
  • sample_lm: Sample the conditional distribution created by a linear model
  • sample_ranger: Sample the conditional distribution created by a ranger rf model
  • print.postsynth: Print the postsynth object to the console with formatting
  • synthesize: Synthesize a data set
  • synth_spec_extractor_api: Add, update, or remove extractors from a synth_spec object
  • schema: Generate a schema object.
  • synth_spec_tuner_api: Add, update, or remove tuners from a synth_spec object
  • synth_spec_sampler_api: Add, update, or remove samplers from a synth_spec object
  • tune_synthesis: Generate syntheses from multiple presynth objects.
  • synth_spec_noise_api: Add, update, or remove noise from a synth_spec object
  • synth_spec_recipes_api: Add, update, or remove recipes from a synth_spec object
  • synth_spec_model_api: Add, update, or remove custom models from a synth_spec object
  • synth_spec_is_component: Inspections for synth_spec components
  • start_method: Create a start_method object.
  • schema_api: Add, update, or reset a schema object within an existing roadmap.
  • visit_sequence: Generate a visit sequence.
  • update_presynth: Update presynth object
  • start_method_api: Add, update, or reset a start method within an existing roadmap.
  • update_synth_spec: Update a synth_spec object
  • start_resample: Specify a resampling scheme for start_data
  • visit_sequence_api: Add or reset a visit_sequence object within an existing roadmap.
  • add_noise_kde: Add normal noise to predicted values with variances calculated for ntiles using Gaussian kernel density estimators
  • add_noise_disc_gaussian: Add discrete normal noise with mean 0 to predicted values with constant variance
  • add_noise_disc_laplace: Add discrete Laplace noise with mean 0 to predicted values with constant variance
  • acs_conf_nw: American Community Survey confidential microdata (without weights)
  • add_noise_gaussian: Add normal noise with mean 0 to predicted values with constant variance
  • add_noise_laplace: Add Laplace noise with mean 0 to predicted values with constant variance
  • add_noise_cat_unif: Inject noise into a categorical random variable by mixing a sample of uniform records into the predictions.
  • add_sequence_factor: Add to visit sequence for factor variables
  • construct_models: Construct a list of models for synthesis
  • construct_extractors: Construct a list of extractors for parsnip models
  • constraints: Create a constraints object
  • add_sequence_numeric: Add to visit sequence for numeric variables
  • collapse_na: Collapse data frames with _NA variables to coerce related variables to include NA
  • acs_start: American Community Survey starting microdata (with weights)
  • acs_start_nw: American Community Survey starting microdata (without weights)
  • acs_conf: American Community Survey confidential microdata (with weights)
  • constraints_api: Add, update, or reset a constraints object within an existing roadmap.
  • construct_recipes: Construct a sequence of model recipes for sequential synthesis
  • construct_noise: Construct a list of noise objects for synthesis