compare.fit.synds: Compare model estimates based on synthesised and observed data

Description

The same model that was used for the synthesised data set is fitted to the observed data set. The coefficients with confidence intervals for the observed data are plotted together with their estimates from the synthetic data. When more than one synthetic data set has been generated (object$m > 1), combining rules are applied. Analysis-specific utility measures are used to evaluate differences between synthetic and observed data.
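As a rough illustration of what combining over m synthetic data sets can look like, the sketch below averages hypothetical estimates of a single coefficient. The numbers are invented, and the rules shown (mean of the point estimates, root mean variance, division by sqrt(m)) are assumptions made for illustration rather than a statement of the package's exact implementation.

```r
# Hypothetical estimates of one coefficient from m = 5 synthetic data sets
est <- c(0.42, 0.47, 0.39, 0.45, 0.44)   # illustrative point estimates
se  <- c(0.10, 0.11, 0.10, 0.09, 0.10)   # illustrative standard errors
m   <- length(est)

# Assumed combining rule: average the m point estimates
b_syn <- mean(est)

# Assumed estimate of the observed-data standard error:
# square root of the mean of the m variances
se_beta_syn <- sqrt(mean(se^2))

# Assumed standard error of the mean of the m syntheses;
# it shrinks as m grows
se_b_syn <- se_beta_syn / sqrt(m)

print(c(B.syn = b_syn, se.Beta.syn = se_beta_syn, se.B.syn = se_b_syn))
```

Under these assumptions, increasing m reduces se_b_syn but leaves se_beta_syn roughly unchanged.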

Usage

# S3 method for fit.synds
compare(object, data, plot = "Z", 
  return.result = TRUE, return.plot = TRUE, plot.intercept = FALSE, 
  lwd = 1, lty = 1, lcol = c("#1A3C5A","#4187BF"), 
  dodge.height = .5, point.size = 2.5, partly = FALSE, ...)
# S3 method for compare.fit.synds
print(x, ...)

Arguments

object

an object of type fit.synds created by fitting a model to a synthesised data set using the function glm.synds or lm.synds.

data

an original observed data set.

plot

values to be plotted: "Z" (Z scores) or "coef" (coefficients).

return.result

a logical value indicating whether a table of estimates should be returned.

return.plot

a logical value indicating whether a confidence interval plot should be returned.

plot.intercept

a logical value indicating whether estimates for intercept should be plotted.

lwd

the line width.

lty

the line type.

lcol

line colours.

dodge.height

size of vertical shifts for confidence intervals to prevent overlapping.

point.size

size of plotting symbols used to plot point estimates of coefficients.

partly

a logical value indicating whether data are partly synthesised.

...

additional parameters passed to ggplot.

x

an object of class compare.fit.synds.

Value

An object of class compare.fit.synds which is a list with the following components:

call

the original call to fit the model to the synthesised data set.

coef.obs

a data frame including estimates based on the observed data: coefficients (Beta), their standard errors (se(Beta)) and Z scores (Z).

coef.syn

a data frame including (combined) estimates based on the synthesised data: point estimates of observed data coefficients (B.syn), standard errors of those estimates (se(B.syn)), estimates of the observed standard errors (se(Beta).syn), Z score estimates (Z.syn) and their standard errors (se(Z.syn)). Note that se(B.syn) and se(Z.syn) give the standard errors of the mean of the m syntheses and can be made very small by increasing m.

coef.diff

a data frame containing standardized differences in coefficient estimates and corresponding p-values.

ci.overlap

a data frame containing the percentage of overlap between the estimated synthetic confidence intervals and the original sample confidence intervals for each parameter, calculated as suggested by Karr et al. (2006).

ci.plot

ggplot of the coefficients with confidence intervals for models based on observed and synthetic data.

If return.result was set to FALSE then coef.obs, coef.syn, coef.diff and ci.overlap are all NULL. If return.plot was set to FALSE, ci.plot is NULL.
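The confidence interval overlap reported in ci.overlap can be pictured with a small base R sketch. It follows the general form of the measure in Karr et al. (2006), that is, the length of the intersection of the two intervals averaged relative to each interval's length. The helper ci_overlap is hypothetical and is not taken from the package source; it returns a proportion, which would be multiplied by 100 to give a percentage.

```r
# Hypothetical helper: overlap of an observed and a synthetic confidence
# interval, averaged over both interval lengths (a sketch, not package code)
ci_overlap <- function(lo_obs, up_obs, lo_syn, up_syn) {
  lo_int  <- pmax(lo_obs, lo_syn)   # lower end of the intersection
  up_int  <- pmin(up_obs, up_syn)   # upper end of the intersection
  len_int <- up_int - lo_int        # negative when the intervals are disjoint
  0.5 * (len_int / (up_obs - lo_obs) + len_int / (up_syn - lo_syn))
}

ci_overlap(0, 2, 1, 3)   # half of each interval is shared
ci_overlap(0, 2, 0, 2)   # identical intervals
ci_overlap(0, 1, 2, 3)   # disjoint intervals give a negative value
```

In this sketch identical intervals give 1, and intervals that do not intersect give a negative value that grows in magnitude with the gap between them.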

Details

This function can be used to evaluate whether the model used for synthesis is appropriate for the fitted model. If this is the case the estimates from the synthetic data (B.syn and Z.syn) should not differ from the estimates from the observed data (Beta and Z) by more than would be expected from the standard errors (se(B.syn) and se(Z.syn)).
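The check described above can be sketched for a single coefficient as follows. All values are hypothetical, and the threshold of 2 is only a rule of thumb assumed here; the sketch does not reproduce how coef.diff is computed by the package.

```r
# Hypothetical values for a single coefficient (illustrative only)
beta     <- 0.50   # estimate from the observed data (Beta)
b_syn    <- 0.46   # combined estimate from the synthetic data (B.syn)
se_b_syn <- 0.03   # standard error of the combined estimate (se(B.syn))

# Difference between the two estimates in units of se(B.syn)
z_diff <- (b_syn - beta) / se_b_syn

# Assumed rule of thumb: |z_diff| > 2 suggests a larger difference
# than the standard error would lead one to expect
flagged <- abs(z_diff) > 2

print(z_diff)
print(flagged)
```

Because se(B.syn) shrinks as m increases, a fixed difference between B.syn and Beta becomes easier to flag with more syntheses.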

References

Karr, A., Kohnen, C.N., Oganian, A., Reiter, J.P. and Sanil, A.P. (2006). A framework for evaluating the utility of data altered to protect confidentiality. The American Statistician, 60(3), 224-232.

Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke Creation of Synthetic Data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.

Examples

# NOT RUN {
ods <- SD2011[,c("sex","age","edu","smoke")]
s1 <- syn(ods, m = 5)
f1 <- glm.synds(smoke ~ sex + age + edu, data = s1, family = "binomial")
compare(f1, ods) 
compare(f1, ods, plot = "coef")
# }