rsdv — The R Synthetic Data Vault

Synthetic data generation in R (Gaussian Copula based, extensible to deep generative models)

rsdv is an R implementation of Python’s Synthetic Data Vault (SDV) framework (Patki, Wedge, and Veeramachaneni 2016). It generates synthetic tabular data using Gaussian copula models, with built-in quality and privacy evaluation.
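
To make the Gaussian copula idea concrete, here is a minimal base-R sketch of the general technique, not rsdv's internal code. It assumes numeric columns only, uses the built-in mtcars data as a stand-in for real data, and models marginals with empirical quantiles; all variable names are illustrative.

```r
# Illustrative Gaussian copula sketch in base R (an assumption-laden toy,
# not rsdv's implementation).
set.seed(1)
real <- mtcars[, c("mpg", "wt", "hp")]  # numeric columns from a built-in dataset
n    <- nrow(real)

# 1. Map each column to normal scores through its empirical CDF.
#    Dividing ranks by (n + 1) keeps probabilities strictly inside (0, 1).
z <- as.data.frame(lapply(real, function(x) qnorm(rank(x) / (n + 1))))

# 2. Estimate the copula correlation matrix on the normal-score scale.
rho <- cor(z)

# 3. Draw correlated standard normals. chol() returns the upper-triangular
#    factor R with t(R) %*% R == rho, so the rows of `draws` have correlation rho.
n_new <- 200
draws <- matrix(rnorm(n_new * ncol(real)), nrow = n_new) %*% chol(rho)

# 4. Map back to the original scale with each column's empirical quantiles.
u     <- pnorm(draws)
synth <- as.data.frame(setNames(
  lapply(seq_along(real), function(j) {
    unname(quantile(real[[j]], probs = u[, j], type = 8))
  }),
  names(real)
))

head(synth)
```

Because the back-transform uses empirical quantiles, synthetic values in this sketch never leave the observed range of each column. A parametric marginal would be needed to extrapolate beyond it.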

Installation

# Development version
remotes::install_github("kvenkita/rsdv")

Quick start

library(rsdv)
#> 
#> Attaching package: 'rsdv'
#> The following object is masked from 'package:base':
#> 
#>     sample

set.seed(42)

# Describe column types
meta <- metadata(adult_income) |>
  set_column_type("id",         "id") |>
  set_column_type("age",        "numerical") |>
  set_column_type("occupation", "categorical") |>
  set_column_type("income",     "categorical") |>
  set_primary_key("id")

# Fit a GaussianCopula synthesizer
syn       <- gaussian_copula_synthesizer(meta)
syn       <- fit(syn, adult_income)

# Generate 500 synthetic rows
synth_data <- sample(syn, n = 500)

# Evaluate quality
qr <- quality_report(real = adult_income, synthetic = synth_data,
                     metadata = meta)
print(qr)
#> == rsdv Quality Report ==
#> 
#> Column Similarity (KS, numerical):
#>   age                  0.960
#>   fnlwgt               0.936
#>   education_num        0.776
#>   capital_gain         0.468
#>   capital_loss         0.484
#>   hours_per_week       0.724
#> 
#> Column Similarity (TVD, categorical):
#>   workclass            0.973
#>   education            0.942
#>   marital_status       0.988
#>   occupation           0.935
#>   relationship         0.970
#>   race                 0.988
#>   sex                  1.000
#>   native_country       0.956
#>   income               0.972
#> 
#> Property scores:
#>   Column Shapes        0.871
#>   Column Pair Trends   0.893
#>     (correlation 0.965, contingency 0.864)
#> 
#> Overall Score:               0.882

quality_report() aggregates metrics into the two-property hierarchy used by SDMetrics: Column Shapes (per-column marginal fidelity) and Column Pair Trends (correlation similarity for numerical pairs, contingency similarity for categorical pairs). The overall score is the mean of the two property scores.
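
The aggregation can be checked by hand against the printed report. The sketch below copies the per-column scores from the output above and assumes that Column Shapes is the unweighted mean over all columns, which is consistent with the printed value but is not confirmed by the package documentation. Column Pair Trends is taken as printed because its pair-level inputs are not shown.

```r
# Per-column scores copied from the printed quality report.
ks  <- c(age = 0.960, fnlwgt = 0.936, education_num = 0.776,
         capital_gain = 0.468, capital_loss = 0.484, hours_per_week = 0.724)
tvd <- c(workclass = 0.973, education = 0.942, marital_status = 0.988,
         occupation = 0.935, relationship = 0.970, race = 0.988,
         sex = 1.000, native_country = 0.956, income = 0.972)

# Assumption: Column Shapes is the unweighted mean over all 15 columns.
column_shapes <- mean(c(ks, tvd))

# Column Pair Trends as printed; its pair-level inputs are not in the report.
column_pair_trends <- 0.893

# Overall score: mean of the two property scores.
overall <- mean(c(column_shapes, column_pair_trends))

round(c(column_shapes = column_shapes, overall = overall), 3)
#> column_shapes       overall 
#>         0.871         0.882
```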

diagnostic_report() complements quality_report() with structural-validity checks (value ranges, category adherence, key uniqueness). sample_conditions() generates rows in which the specified categorical columns are fixed to given values:

# Validity checks
diagnostic_report(adult_income, synth_data, meta)

# Conditional generation
sample_conditions(syn, data.frame(income = ">50K", .n = 20))
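
As a rough illustration of what the three structural-validity checks mean, here is a base-R sketch. The helper check_validity() and the toy data frames are hypothetical and are not part of rsdv; diagnostic_report() may compute and present these checks differently.

```r
# Hypothetical helper, not part of rsdv: share of synthetic values passing each check.
check_validity <- function(real, synthetic, key) {
  cols <- setdiff(names(real), key)
  per_column <- vapply(cols, function(col) {
    r <- real[[col]]
    s <- synthetic[[col]]
    if (is.numeric(r)) {
      # Range check: synthetic values stay within the observed min and max.
      mean(s >= min(r, na.rm = TRUE) & s <= max(r, na.rm = TRUE), na.rm = TRUE)
    } else {
      # Category adherence: only categories seen in the real data are allowed.
      mean(as.character(s) %in% unique(as.character(r)))
    }
  }, numeric(1))
  # Key uniqueness: 1 if the primary key never repeats, 0 otherwise.
  c(per_column, key_unique = as.numeric(!anyDuplicated(synthetic[[key]])))
}

# Toy data: one out-of-range age, one unseen category, one duplicated key.
real  <- data.frame(id = 1:4, age = c(25, 40, 31, 58),
                    income = c("<=50K", ">50K", "<=50K", ">50K"))
synth <- data.frame(id = c(1, 2, 2), age = c(30, 61, 44),
                    income = c(">50K", "<=50K", "unknown"))

result <- check_validity(real, synth, key = "id")
result
```

In this toy case two of three synthetic rows pass the age range check and the category check, and the duplicated id makes the key check fail.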

Related work

  • Python SDV: sdv-dev/SDV
  • Synthetic Data Vault paper: Patki et al., IEEE DSAA 2016
  • CTGAN: Xu et al., NeurIPS 2019 (implemented in companion package rsdv.torch)

Install

install.packages('rsdv')

Version: 0.2.0
License: MIT + file LICENSE

Maintainer: Kailas Venkitasubramanian
Last Published: June 9th, 2026

Functions in rsdv (0.2.0)

  • nndr: Nearest-Neighbor Distance Ratio privacy score
  • ml_efficacy: ML efficacy: train-on-synthetic / test-on-real accuracy ratio (TSTR)
  • is_fitted: Check whether a synthesizer has been fitted
  • custom_constraint: Constraint: arbitrary row-wise predicate
  • metadata_to_json: Serialize metadata to a JSON string
  • set_primary_key: Set the primary key column of the metadata
  • diagnostic_report: Generate a diagnostic (validity) report for synthetic data
  • metadata_from_json: Deserialize metadata from a JSON string
  • reexports: Objects exported from other packages
  • rsdv-package: rsdv: Synthetic Tabular Data Generation with Gaussian Copulas
  • print.custom_constraint: Print method for a custom_constraint
  • ks_similarity: Kolmogorov-Smirnov similarity score per numerical column
  • quality_report: Generate a quality report comparing real and synthetic data
  • print.rsdv_metadata: Print method for rsdv_metadata
  • print.rsdv_diagnostic_report: Print method for rsdv_diagnostic_report
  • privacy_report: Generate a privacy report comparing real and synthetic data
  • inequality_constraint: Constraint: col_a must be less than / greater than col_b
  • validate_data: Validate that a data frame is compatible with metadata
  • print.rsdv_quality_report: Print method for rsdv_quality_report
  • print.rsdv_privacy_report: Print method for rsdv_privacy_report
  • save_metadata: Save metadata to a JSON file
  • set_column_type: Set the type of a column in metadata
  • print.inequality_constraint: Print method for an inequality_constraint
  • print.fixed_combinations_constraint: Print method for a fixed_combinations_constraint
  • sample_conditions: Sample synthetic rows that match fixed column values (conditional sampling)
  • tvd_similarity: Total variation distance similarity score per categorical column
  • sample: Sample synthetic rows from a fitted synthesizer
  • print.equality_constraint: Print method for an equality_constraint
  • adult_income: Adult Income dataset (500-row sample)
  • contingency_similarity: Contingency similarity between real and synthetic categorical column pairs
  • check_constraint: Check a single constraint against each row of a data frame
  • autoplot.rsdv_privacy_report: Plot a privacy report
  • attribute_disclosure_risk: Attribute disclosure risk
  • add_constraint: Add a constraint to metadata
  • correlation_similarity: Correlation similarity between real and synthetic numerical column pairs
  • check_constraints: Check all constraints in metadata against a data frame
  • autoplot.rsdv_diagnostic_report: Plot a diagnostic report
  • autoplot.rsdv_quality_report: Plot a quality report
  • fixed_combinations_constraint: Constraint: only observed column combinations are valid
  • equality_constraint: Constraint: two columns must be equal row-wise
  • gaussian_copula_synthesizer: Create a Gaussian Copula synthesizer
  • load_metadata: Load metadata from a JSON file
  • metadata: Create a metadata object describing a dataset's column types
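
The per-column scores behind ks_similarity and tvd_similarity can be illustrated in base R. The sketch below follows the complement-style definitions that SDMetrics uses (KSComplement and TVComplement), where 1 means identical distributions. Whether rsdv computes its scores in exactly this way is an assumption, and the helper names ks_complement() and tvd_complement() are hypothetical.

```r
# Hypothetical helpers, not rsdv's code: complement-style similarity scores in [0, 1].

# 1 minus the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap
# between the two empirical CDFs, evaluated at every pooled data point.
ks_complement <- function(real, synthetic) {
  grid <- sort(unique(c(real, synthetic)))
  1 - max(abs(ecdf(real)(grid) - ecdf(synthetic)(grid)))
}

# 1 minus the total variation distance between category frequencies.
tvd_complement <- function(real, synthetic) {
  lv <- union(unique(real), unique(synthetic))
  p  <- table(factor(real, levels = lv)) / length(real)
  q  <- table(factor(synthetic, levels = lv)) / length(synthetic)
  1 - 0.5 * sum(abs(p - q))
}

ks_complement(c(1, 2, 3, 4), c(1, 2, 3, 4))                   # identical samples
tvd_complement(c("a", "a", "b", "b"), c("a", "b", "b", "b"))  # shifted frequencies
```

Identical samples score 1 under both helpers, and samples with no overlap score 0, which makes the two scores comparable when they are averaged into a single property score.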