contingency_similarity

For each pair of categorical columns, compares the joint distribution (the
normalized contingency table) in the real data with that in the synthetic
data using total variation distance (TVD), and scores the pair as
<code>1 - TVD</code> (the SDMetrics <code>ContingencySimilarity</code> score).
Scores range from 0 to 1, with 1 indicating identical joint distributions.
This is the categorical analogue of correlation similarity and captures
categorical-vs-categorical dependence.
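To make the per-pair score concrete, here is a minimal base R sketch of the computation for a single pair of columns: build both normalized contingency tables over a shared set of levels, take half the sum of absolute cell differences as the TVD, and return one minus that value. The helper name <code>pair_contingency_score</code> and the toy data frames are hypothetical and are not part of rsdv; the package's own implementation may treat missing values, unseen levels, and aggregation across pairs differently.

```r
# Illustrative helper (hypothetical name, not part of rsdv):
# 1 - TVD between the joint distributions of one pair of categorical columns.
pair_contingency_score <- function(real, synthetic, col_a, col_b) {
  # Use the union of observed levels so both tables have the same cells.
  levels_a <- union(as.character(real[[col_a]]), as.character(synthetic[[col_a]]))
  levels_b <- union(as.character(real[[col_b]]), as.character(synthetic[[col_b]]))

  # Normalized contingency table (joint relative frequencies).
  # Assumption: rows with NA in either column are dropped, as table() does by default.
  joint <- function(df) {
    counts <- table(
      factor(as.character(df[[col_a]]), levels = levels_a),
      factor(as.character(df[[col_b]]), levels = levels_b)
    )
    counts / sum(counts)
  }

  # Total variation distance: half the L1 distance between the two tables.
  tvd <- 0.5 * sum(abs(joint(real) - joint(synthetic)))
  1 - tvd
}

# Toy data: identical marginals, different dependence between the columns.
real <- data.frame(
  colour = c("red", "red", "blue", "blue"),
  size   = c("S", "L", "S", "L")
)
synthetic <- data.frame(
  colour = c("red", "red", "blue", "blue"),
  size   = c("S", "S", "L", "L")
)

score <- pair_contingency_score(real, synthetic, "colour", "size")
# score is 0.5: every cell differs by 0.25, so TVD = 0.5 * 1 = 0.5
```

Both toy data frames have the same marginal distribution for each column, so a purely column-wise comparison would not tell them apart. The pairwise score drops to 0.5 because only the joint distribution differs, which is the dependence this metric is meant to capture.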

Generates synthetic tabular data from real datasets using
Gaussian copula models, with parametric marginal selection for
numerical columns and a cumulative-frequency embedding that brings
categorical and boolean columns into the same joint copula. Includes
a metadata system with column types and primary keys, declarative
constraints enforced via rejection sampling, conditional sampling,
and quality, validity and privacy reports modeled on those of the
'SDMetrics' library. Inspired by the Python 'SDV' (Synthetic Data
Vault) library by 'DataCebo'; see Patki, Wedge and Veeramachaneni
(2016) "The Synthetic Data Vault" <doi:10.1109/DSAA.2016.49>.

Kailas Venkitasubramanian

rsdv

Synthetic Tabular Data Generation with Gaussian Copulas

contingency_similarity function

Arguments

<dl><dt>real</dt>
<dd>A data frame of real data.</dd>
<dt>synthetic</dt>
<dd>A data frame of synthetic data.</dd>
<dt>meta</dt>
<dd>An <code>rsdv_metadata</code> object.</dd></dl>

Contingency similarity between real and synthetic categorical column pairs — contingency_similarity

