Produces tables from observed and synthesised data and calculates utility measures that compare them with their expected values under the assumption that the synthesising model is correct.
utility.tab(object, data, vars = NULL, ngroups = 5, useNA = TRUE,
  print.tables = length(vars) < 4, print.stats = 'VW',
  print.zdiff = FALSE, digits = 2, ...)

# S3 method for utility.tab
print(x, print.tables = x$print.tables,
  print.zdiff = x$print.zdiff, print.stats = x$print.stats,
  digits = x$digits, ...)
an object of class synds, which stands for 'synthesised data set'. It is typically created by the function syn() or syn.strata() and includes object$m, the number of synthesised data sets, as well as object$syn, which is the synthesised data set if m = 1, or a list of m such data sets.
the original (observed) data set.
a single string or a vector of strings with the names of variables to be used to form the table.
if numerical (non-factor) variables are included they will be classified into this number of groups to form tables. Classification is performed using the classIntervals() function with n = ngroups. By default, to avoid problems for variables with a small number of unique values, style = "fisher". Other arguments of classIntervals() may, however, be specified in the call to utility.tab().
determines if NA values are to be included in tables.
a logical value that determines if tables of observed and synthesised data are to be printed.
determines which chi-squared statistics to print to compare the observed and synthetic tables: 'VW' for Voas-Williamson, 'FT' for Freeman-Tukey or c('VW','FT') for both.
a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.
an integer indicating the number of decimal places for printing statistics, tab.zdiff and mean results for m > 1.
additional parameters; can be passed to classIntervals() function.
an object of class utility.tab.
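The grouping of numeric variables described for ngroups can be sketched as follows. This is only an illustration and not necessarily how utility.tab() does it internally: it assumes the classInt package is installed, and the variable `age` and its values are made up for the example.

```r
library(classInt)  # assumed to be installed; provides classIntervals()

set.seed(1)
age <- round(runif(200, 18, 90))   # hypothetical numeric variable

ngroups <- 5
# Fisher-style class breaks, matching the default described above
ci <- classIntervals(age, n = ngroups, style = "fisher")

# Turn the break points into a factor that can be cross-tabulated
age_grp <- cut(age, breaks = ci$brks, include.lowest = TRUE)
table(age_grp, useNA = "ifany")
```

Passing a different style (for example "pretty") changes only how the break points are chosen; the tabulation step stays the same.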
An object of class utility.tab which is a list with the following components:
number of synthetic data sets in object, i.e. object$m.
a table from the observed data.
a vector with object$m values for the Freeman-Tukey utility measure.
a vector with object$m values for the Voas-Williamson utility measure.
a vector of degrees of freedom for the chi-square tests, equal to the number of cells in the table with any observed or synthesised counts minus one.
a vector with ratios of UtabFT to df.
a vector with ratios of UtabVW to df.
a vector with object$m p-values for the chi-square tests for the Freeman-Tukey utility measure.
a vector with object$m p-values for the chi-square tests for the Voas-Williamson utility measure.
a vector of length object$m with the number of cells not contributing to the statistics.
a table from the observed data.
a table or a list of m tables from the synthetic data.
a table or a list of m tables of Z statistics for differences between observed and synthesised cells of the tables. Large absolute values indicate a large contribution to lack-of-fit.
number of observations in the original data set.
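The relationship between a utility statistic, its degrees of freedom, the ratio and the p-value can be sketched in base R. The inputs `U` and `n_cells` are hypothetical, and the use of the upper tail of the chi-square distribution is an assumption based on the usual goodness-of-fit convention rather than something taken from the package source.

```r
# Hypothetical inputs: one utility statistic per synthesis (m = 3) and the
# number of cells with any observed or synthesised counts in each table.
U       <- c(5.8, 12.4, 3.1)
n_cells <- c(4, 4, 4)

df    <- n_cells - 1                              # degrees of freedom
ratio <- U / df                                   # statistic relative to its expectation
pval  <- pchisq(U, df = df, lower.tail = FALSE)   # upper-tail chi-square p-value

data.frame(U, df, ratio, pval)
```

Because a chi-square variable has expectation equal to its degrees of freedom, ratios close to 1 are what a correct synthesising model would be expected to give.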
Forms tables of observed and synthesised values for the variables specified in vars. Two utility measures are calculated from the cells of the tables: a measure of fit proposed by Voas and Williamson, sum((observed - synthesised)^2 / ((observed + synthesised)/2)), and one proposed by Freeman and Tukey, 4*sum((observed^(0.5) - synthesised^(0.5))^2).
In both cases those cells where observed and synthesised are both zero do not contribute to the sum. If the synthesising model is correct, both of these measures should have chi-square distributions for large samples.
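The two formulas can be evaluated directly on a pair of count vectors. The sketch below uses made-up counts with equal totals and is a plain transcription of the expressions above, not the package's own code, which may handle details such as unequal sample sizes differently.

```r
# Toy cell counts standing in for one observed and one synthesised table
# (hypothetical values, for illustration only).
observed    <- c(30, 45, 0, 25, 0)
synthesised <- c(28, 50, 2, 20, 0)

# Cells that are zero in both tables do not contribute to either sum
keep <- (observed + synthesised) > 0
obs  <- observed[keep]
syn  <- synthesised[keep]

# Voas-Williamson measure
U_vw <- sum((obs - syn)^2 / ((obs + syn) / 2))

# Freeman-Tukey measure
U_ft <- 4 * sum((sqrt(obs) - sqrt(syn))^2)

c(VW = U_vw, FT = U_ft)
```

Dropping the empty cells first also avoids a 0/0 division in the Voas-Williamson term.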
Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.
Read, T.R.C. and Cressie, N.A.C. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.
Voas, D. and Williamson, P. (2001). Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.
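library(synthpop)

# First 1000 rows and four variables of the SD2011 data shipped with synthpop
ods <- SD2011[1:1000, c("sex", "age", "edu", "marital")]

# Ten synthetic data sets: the statistics are returned for each synthesis
s1 <- syn(ods, m = 10)
utility.tab(s1, ods, vars = c("marital", "sex"))

# One synthetic data set; age is grouped into 3 classes before tabulating
s2 <- syn(ods, m = 1)
utility.tab(s2, ods, vars = c("marital", "age"), ngroups = 3, print.tables = TRUE)

# style = "pretty" is passed on to classIntervals()
u2 <- utility.tab(s2, ods, vars = c("marital", "age"), style = "pretty")
print(u2, print.tables = TRUE, print.zdiff = TRUE)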