utility.tab: Tabular utility

Description

Produces tables from observed and synthesised data and calculates utility measures to compare them with their expectation if the synthesising model is correct.

Usage

utility.tab(object, data, vars = NULL, ngroups = 5, useNA = TRUE,
            print.tables = length(vars) < 4, print.stats = 'VW',
            print.zdiff = FALSE, digits = 2, ...)
# S3 method for utility.tab
print(x, print.tables = x$print.tables, 
  print.zdiff = x$print.zdiff, print.stats = x$print.stats, 
  digits = x$digits, ...)

Arguments

object

an object of class synds, which stands for 'synthesised data set'. It is typically created by function syn() or syn.strata() and includes object$m, the number of synthesised data sets, as well as object$syn, which is the synthesised data set if m = 1, or a list of m such data sets otherwise.

data

the original (observed) data set.

vars

a single string or a vector of strings with the names of variables to be used to form the table.

ngroups

if numerical (non-factor) variables are included they will be classified into this number of groups to form tables. Classification is performed using classIntervals() function for n = ngroups. By default, to avoid problems for variables with a small number of unique values, style = "fisher". Arguments of classIntervals() may be, however, specified in the call to utility.tab().

useNA

determines if NA values are to be included in tables.

print.tables

a logical value that determines if tables of observed and synthesised data are to be printed.

print.stats

determines which chi-squared statistics are printed to compare the observed and synthetic tables: 'VW' for Voas Williamson, 'FT' for Freeman Tukey, or c('VW', 'FT') for both.

print.zdiff

a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.

digits

an integer indicating the number of decimal places for printing statistics, tab.zdiff and mean results for m > 1.

...

additional parameters; can be passed to classIntervals() function.

x

an object of class utility.tab.
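The grouping described for ngroups can be sketched as follows. This is a minimal illustration, assuming the classInt package is installed; the variable age and its values are hypothetical, and how utility.tab() applies the breaks internally is not shown here.

```r
library(classInt)  # assumed to be installed; provides classIntervals()

set.seed(1)
# Hypothetical numeric variable (not taken from SD2011)
age <- round(runif(200, min = 16, max = 90))

# Find break points for 3 groups using the "fisher" style
ci <- classIntervals(age, n = 3, style = "fisher")

# Turn the numeric variable into a factor using those break points
age_grp <- cut(age, breaks = ci$brks, include.lowest = TRUE)

table(age_grp)
```

Passing a different style (for example style = "pretty") changes only how the break points are chosen; the tabulation step stays the same.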

Value

An object of class utility.tab which is a list with the following components:

m

number of synthetic data sets in object, i.e. object$m.

tab.obs

a table from the observed data.

UtabFT

a vector with object$m values for the Freeman Tukey utility measure.

UtabVW

a vector with object$m values for the Voas Williamson utility measure.

df

a vector of degrees of freedom for the chi-square tests, equal to the number of cells in the table with any observed or synthesised counts minus one.

ratioFT

a vector with ratios of UtabFT to df.

ratioVW

a vector with ratios of UtabVW to df.

pvalFT

a vector with object$m p-values for the chi-square tests for the Freeman Tukey utility measure.

pvalVW

a vector with object$m p-values for the chi-square tests for the Voas Williamson utility measure.

nempty

a vector of length object$m with number of cells not contributing to the statistics.

tab.syn

a table or a list of m tables from the synthetic data.

tab.zdiff

a table or a list of m tables of Z statistics for differences between observed and synthesised cells of the tables. Large absolute values indicate a large contribution to lack-of-fit.

n

number of observations in the original dataset.
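How the summary components relate to one another can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: the cell counts are made up, the degrees of freedom are taken as the number of cells with any counts minus one, and the p-value is assumed to be the upper tail of the chi-square distribution.

```r
# Hypothetical cell counts for one observed and one synthesised table
obs <- c(12, 8, 0, 20, 0, 10)
syn <- c(10, 9, 1, 22, 0, 8)

# Cells that have a count in at least one of the two tables
used <- (obs + syn) > 0

nempty <- sum(!used)    # cells not contributing to the statistics
df     <- sum(used) - 1 # assumed: contributing cells minus one

# Voas Williamson statistic over the contributing cells
UtabVW <- sum((obs[used] - syn[used])^2 / ((obs[used] + syn[used]) / 2))

ratioVW <- UtabVW / df
# Assumption: p-value is the upper-tail chi-square probability
pvalVW  <- pchisq(UtabVW, df = df, lower.tail = FALSE)

c(UtabVW = UtabVW, df = df, ratioVW = ratioVW, pvalVW = pvalVW, nempty = nempty)
```

A ratio close to 1 is what would be expected if the statistic follows a chi-square distribution with df degrees of freedom, since the mean of that distribution equals df.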

Details

Forms tables of observed and synthesised values for the variables specified in vars. Two utility measures are calculated from the cells of the tables: a measure of fit proposed by Voas and Williamson, sum((observed - synthesised)^2 / ((observed + synthesised)/2)), and one proposed by Freeman and Tukey, 4 * sum((observed^(0.5) - synthesised^(0.5))^2). In both cases cells where the observed and synthesised counts are both zero do not contribute to the sum. If the synthesising model is correct, both measures should have chi-square distributions for large samples.
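The two formulas above can be written out directly in base R. The sketch below uses made-up cell counts and a hypothetical helper name, utility_measures(); it is not the function's actual implementation.

```r
# Hypothetical helper: computes both measures from two vectors of cell counts
utility_measures <- function(observed, synthesised) {
  # Drop cells that are zero in both tables; they do not contribute
  keep <- (observed + synthesised) > 0
  o <- observed[keep]
  s <- synthesised[keep]

  vw <- sum((o - s)^2 / ((o + s) / 2))  # Voas Williamson
  ft <- 4 * sum((sqrt(o) - sqrt(s))^2)  # Freeman Tukey

  c(VW = vw, FT = ft)
}

# Made-up counts; the last cell is empty in both tables
obs <- c(20, 30, 0, 50, 0)
syn <- c(25, 25, 5, 45, 0)

u <- utility_measures(obs, syn)
u
```

Dropping the doubly empty cells before dividing also avoids the 0/0 term that the Voas Williamson formula would otherwise produce.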

References

Nowok, B., Raab, G.M. and Dibben, C. (2016) synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.

Read, T.R.C. and Cressie, N.A.C. (1988) Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.

Voas, D. and Williamson, P. (2001) Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.

Examples

# NOT RUN {
library(synthpop)  # provides syn(), utility.tab() and the SD2011 data

ods <- SD2011[1:1000, c("sex", "age", "edu", "marital")]

s1 <- syn(ods, m = 10)
utility.tab(s1, ods, vars = c("marital", "sex"))

s2 <- syn(ods, m = 1)
utility.tab(s2, ods, vars = c("marital", "age"), ngroups = 3, print.tables = TRUE)
u2 <- utility.tab(s2, ods, vars = c("marital", "age"), style = "pretty")
print(u2, print.tables = TRUE, print.zdiff = TRUE)
# }