Produces tables from observed and synthesised data and calculates utility measures to compare them with their expectation if the synthesising model is correct.
utility.tab(object, data, vars = NULL, ngroups = 5, useNA = TRUE,
  print.tables = length(vars) < 4, print.stats = 'VW',
  print.zdiff = FALSE, digits = 2, ...)

# S3 method for utility.tab
print(x, print.tables = x$print.tables,
  print.zdiff = x$print.zdiff, print.stats = x$print.stats,
  digits = x$digits, ...)
an object of class synds, which stands for 'synthesised
data set'. It is typically created by the function syn() or
syn.strata() and includes object$m, the number of synthesised
data sets, and object$syn, which is the synthesised data set
if m = 1, or a list of m such data sets otherwise.
the original (observed) data set.
a single string or a vector of strings with the names of variables to be used to form the table.
if numerical (non-factor) variables are included, they will be
classified into this number of groups to form tables. Classification is
performed using the classIntervals() function with n = ngroups.
By default, to avoid problems for variables with a small number of unique
values, style = "fisher". Arguments of classIntervals() may be,
however, specified in the call to utility.tab().
determines if NA values are to be included in tables.
a logical value that determines if tables of observed and synthesised data are to be printed.
determines which chi-squared statistics are printed to compare the observed and synthetic tables: 'VW' for Voas Williamson, 'FT' for Freeman Tukey, or c('VW','FT') for both.
a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.
an integer indicating the number of decimal places
for printing statistics, tab.zdiff and mean results for m > 1.
additional parameters; can be passed to classIntervals() function.
an object of class utility.tab.
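The grouping of numeric variables described for ngroups can be pictured with the rough sketch below. It assumes the classInt package is installed and uses a made-up age vector; the cut() step is an assumption about how class breaks might be turned into a factor, not a description of what utility.tab() does internally.

```r
library(classInt)

set.seed(1)
age <- round(runif(200, 18, 90))   # made-up numeric variable, illustration only
ngroups <- 5

# Class breaks using the style that the text above names as the default
ci <- classIntervals(age, n = ngroups, style = "fisher")

# Turn the breaks into a factor so the variable can be cross-tabulated
# (assumed step; left-closed intervals, largest value kept in the last group)
age_grp <- cut(age, breaks = ci$brks, right = FALSE, include.lowest = TRUE)
table(age_grp, useNA = "ifany")
```

Passing a different style (for example "pretty") changes only the classIntervals() call; the tabulation step stays the same.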
An object of class utility.tab which is a list with the following
components:
number of synthetic data sets in object, i.e. object$m.
a table from the observed data.
a vector with object$m values for the Freeman Tukey
utility measure.
a vector with object$m values for the Voas Williamson
utility measure.
a vector of degrees of freedom for the chi-square tests, each equal to the number of cells in the table with any observed or synthesised counts, minus one.
a vector with ratios of UtabFT to df.
a vector with ratios of UtabVW to df.
a vector with object$m p-values for the chi-square
tests for the Freeman Tukey utility measure.
a vector with object$m p-values for the chi-square
tests for the Voas Williamson utility measure.
a vector of length object$m with number of cells
not contributing to the statistics.
a table from the observed data.
a table or a list of m tables from the synthetic data.
a table or a list of m tables of Z statistics for
differences between observed and synthesised cells of the tables. Large
absolute values indicate a large contribution to lack-of-fit.
number of observations in the original data set.
Forms tables of observed and synthesised values for the variables
specified in vars. Two utility measures are calculated from the cells
of the tables: a measure of fit proposed by Voas and Williamson,
sum((observed - synthesised)^2 / ((observed + synthesised)/2)), and one
proposed by Freeman and Tukey, 4 * sum((observed^(0.5) - synthesised^(0.5))^2).
In both cases, cells where observed and synthesised are both zero do not
contribute to the sum. If the synthesising model is correct, both of these
measures should have chi-square distributions for large samples.
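As a minimal sketch of the two measures, the base R code below computes them for a single pair of tables. The helper tab_utility() and its component names are hypothetical and not part of synthpop. The sketch assumes both tables hold counts on the same scale and that the degrees of freedom are the number of cells with any counts minus one. The counts are invented for illustration.

```r
# Hypothetical helper, not part of synthpop: both measures for one pair of
# tables with identical dimensions.
tab_utility <- function(tab_obs, tab_syn) {
  obs <- as.vector(tab_obs)
  syn <- as.vector(tab_syn)
  stopifnot(length(obs) == length(syn))

  # Cells that are zero in both tables do not contribute and are dropped
  keep <- (obs + syn) > 0
  obs <- obs[keep]
  syn <- syn[keep]

  vw <- sum((obs - syn)^2 / ((obs + syn) / 2))  # Voas-Williamson measure
  ft <- 4 * sum((sqrt(obs) - sqrt(syn))^2)      # Freeman-Tukey measure
  df <- length(obs) - 1                         # assumed: non-empty cells minus one

  list(vw = vw, ft = ft, df = df, nempty = sum(!keep),
       p_vw = pchisq(vw, df, lower.tail = FALSE),
       p_ft = pchisq(ft, df, lower.tail = FALSE))
}

# Invented 2 x 2 counts; one cell is empty in both tables
obs <- matrix(c(16, 9, 0, 25), nrow = 2)
syn <- matrix(c(9, 16, 0, 25), nrow = 2)
res <- tab_utility(obs, syn)
str(res)
```

Dividing either statistic by df gives a ratio that should be close to 1 on average when the chi-square approximation holds, since a chi-square variable has mean equal to its degrees of freedom.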
Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.
Read, T.R.C. and Cressie, N.A.C. (1988) Goodness-of-Fit Statistics for Discrete Multivariate Data, Springer-Verlag, New York.
Voas, D. and Williamson, P. (2001) Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.
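library(synthpop)  # provides syn(), utility.tab() and the SD2011 data set

# Observed data: first 1000 rows and four variables of SD2011
ods <- SD2011[1:1000, c("sex", "age", "edu", "marital")]

# m = 10 synthetic data sets; two-way table of marital status by sex
s1 <- syn(ods, m = 10)
utility.tab(s1, ods, vars = c("marital", "sex"))

# Single synthesis; age is grouped into 3 classes before tabulation
s2 <- syn(ods, m = 1)
utility.tab(s2, ods, vars = c("marital", "age"), ngroups = 3, print.tables = TRUE)

# style = "pretty" is passed on to classIntervals()
u2 <- utility.tab(s2, ods, vars = c("marital", "age"), style = "pretty")
print(u2, print.tables = TRUE, print.zdiff = TRUE)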