syn: Generating synthetic data sets

Description

Generates synthetic version(s) of a data set.

Usage

syn(data, method = vector("character", length = ncol(data)),
    visit.sequence = (1:ncol(data)), predictor.matrix = NULL,
    m = 1, k = nrow(data), proper = FALSE, minnumlevels = 5,
    maxfaclevels = 60, rules = NULL, rvalues = NULL,
    cont.na = NULL, semicont = NULL, smoothing = NULL,
    event = NULL, denom = NULL, drop.not.used = FALSE,
    drop.pred.only = FALSE,
    default.method = c("normrank", "logreg", "polyreg", "polr"),
    diagnostics = FALSE, print.flag = TRUE, seed = "sample", ...)

## S3 method for class 'synds'
print(x, ...)

Arguments

data

a data frame or a matrix (n x p) containing the original data. Observations are in rows and variables are in columns.

method

a single string or a vector of strings of length ncol(data) specifying the synthesising method to be used for each variable in the data. The order of variables is exactly the same as in data. If specified as a single string, the same method is used for all variables in the visit sequence unless a data type or a position in the visit sequence requires a different method. If method is set to "parametric", the default synthesising methods specified by the default.method argument are applied. Variables that are transformations of other variables can be synthesised using a passive method, specified as a string starting with ~. Variables that need not be synthesised have the empty method "". By default all variables are synthesised using the ctree implementation of a CART model. See details for more information.

visit.sequence

a character vector of names of variables or an integer vector of their column indices specifying the order of synthesis. The default sequence 1:ncol(data) implies that column variables are synthesised from left to right. See details for more information.

predictor.matrix

a square matrix of size ncol(data) specifying the set of column predictors to be used for each target variable in the row. Each entry has value 0 or 1. A value of 1 means that the column variable is used as a predictor for the row variable. Order of variables is exactly the same as in data. By default all variables that are earlier in the visit sequence are used as predictors. For the default visit sequence (1:ncol(data)) the default predictor.matrix will have values of 1 in the lower triangle. See details for more information.
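
The default rule described above can be sketched in base R. This only illustrates the rule (each variable is predicted by the variables visited before it) and is not the package's internal code; default_pm is a hypothetical helper name.

```r
# Hypothetical helper: build the predictor matrix implied by a visit
# sequence, where each variable is predicted by those visited before it.
default_pm <- function(var_names, visit_sequence = seq_along(var_names)) {
  p  <- length(var_names)
  pm <- matrix(0, nrow = p, ncol = p, dimnames = list(var_names, var_names))
  for (i in seq_along(visit_sequence)) {
    earlier <- visit_sequence[seq_len(i - 1)]   # empty for the first variable
    pm[visit_sequence[i], earlier] <- 1
  }
  pm
}

# Default visit sequence: 1s fill the lower triangle
pm_default <- default_pm(c("sex", "age", "income"))

# Custom visit sequence (age first): sex is now predicted by age
pm_custom <- default_pm(c("sex", "age", "income"), visit_sequence = c(2, 1, 3))
```

Rows are targets and columns are predictors, so pm_custom["sex", "age"] is 1 while pm_custom["age", "sex"] is 0.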

m

number of synthetic copies of the original (observed) data to be generated. The default is m = 1.

k

size of the synthetic data set (k x p), which can be smaller or greater than the size of the original data set (n x p). The default is nrow(data), which means that the number of individuals in the synthesised data is the same as in the original (observed) data (k = n).

proper

a logical value with default set to FALSE. If TRUE proper synthesis is conducted.

minnumlevels

a minimum number of values a numeric variable should have to be treated as numeric. Numeric variables with fewer levels than minnumlevels are changed into factors.
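
As a rough illustration of this rule in base R (not the package's own code), numeric columns with fewer distinct values than the threshold can be converted to factors. The toy data frame is made up, and excluding missing values from the count is an assumption.

```r
# Illustrative rule only: convert numeric columns with fewer than
# `minnumlevels` distinct non-missing values into factors.
minnumlevels <- 5
df <- data.frame(
  children = c(0, 1, 2, 1, 0, 2),                   # 3 distinct values
  income   = c(1200, 1500, 900, 2000, 1750, 1100)   # 6 distinct values
)

few_levels <- vapply(
  df,
  function(x) is.numeric(x) && length(unique(x[!is.na(x)])) < minnumlevels,
  logical(1)
)
df[few_levels] <- lapply(df[few_levels], factor)
str(df)
```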

maxfaclevels

a maximum number of factor levels that can be handled. It can be increased but it may cause computational problems, especially for parametric methods.

rules

a named list of rules for restricted values. Restricted values are those that are determined explicitly by values of other variables. The names of the list elements must correspond to the names of the variables for which the rules need to be specified.

rvalues

a named list of the values corresponding to the rules specified by rules.
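
The pairing of rules and rvalues can be pictured with a small base R sketch. It shows only what a rule/value pair means (rows matching the rule take the restricted value), not how syn() applies them during synthesis; the toy data are made up.

```r
# Toy data; variable names mirror the Examples section, values are invented.
toy <- data.frame(
  sex     = c("MALE", "MALE", "FEMALE", "MALE"),
  age     = c(16, 30, 17, 12),
  marital = c("MARRIED", "MARRIED", "SINGLE", "WIDOWED"),
  stringsAsFactors = FALSE
)

rules   <- list(marital = "age < 18 & sex == 'MALE'")
rvalues <- list(marital = "SINGLE")

# Evaluate each rule within the data and overwrite the matching rows
for (v in names(rules)) {
  hit <- eval(parse(text = rules[[v]]), envir = toy)
  toy[[v]][hit] <- rvalues[[v]]
}
toy$marital
```

Because each rule is an expression over other variables, those variables need to be available before the restricted variable is filled in.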

cont.na

a named list of codes for missing values for continuous variables if different from the R missing data code NA. The names of the list elements must correspond to the names of the variables for which the missing data codes need to be specified.

semicont

a named list of values at which semi-continuous variables have spikes. The names of the list elements must correspond to the names of the semi-continuous variables.

smoothing

a named list specifying the smoothing method ("density" or "") to be used for selected variables. Smoothing can only be applied to continuous variables synthesised using the sample, ctree, cart or normrank methods. The names of the list elements must correspond to the names of the variables whose values are to be smoothed. Smoothing is applied to the synthesised values. For "density" smoothing, a Gaussian kernel density estimator is applied with bandwidth selected using the Sheather-Jones 'solve-the-equation' method (see bw.SJ).
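
One way to picture Gaussian kernel smoothing with a Sheather-Jones bandwidth is the base R sketch below. It is an illustration under stated assumptions, not the package's implementation: the data are simulated, and estimating the bandwidth from the observed values (rather than the synthesised ones) is an assumption.

```r
set.seed(1)
y_obs <- rnorm(200, mean = 50, sd = 10)             # stand-in for observed values
y_syn <- sample(y_obs, size = 200, replace = TRUE)  # stand-in for synthesised values

# Sheather-Jones bandwidth from stats::bw.SJ (assumed: estimated on y_obs)
bw <- bw.SJ(y_obs)

# Drawing from a Gaussian kernel density estimate is equivalent to adding
# N(0, bw^2) noise to each resampled value
y_smooth <- rnorm(length(y_syn), mean = y_syn, sd = bw)
```

After smoothing, the synthetic values no longer coincide exactly with observed values, which is the usual motivation for smoothing continuous variables.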

event

a named list specifying for survival data the names of corresponding event indicators. The names of the list elements must correspond to the names of the survival variables.

denom

a named list specifying, for variables to be modelled using binomial regression, the names of the corresponding denominator variables. The names of the list elements must correspond to the names of the variables to be modelled using binomial regression.

drop.not.used

a logical value. If TRUE (default) variables not used in synthesis are not saved in the synthesised data and are not included in the corresponding synthesis parameters.

drop.pred.only

a logical value. If TRUE (default) variables not synthesised and used as predictors only are not saved in the synthesised data.

default.method

a vector of four strings containing the default parametric synthesising methods for numerical variables, factors with two levels, unordered factors with more than two levels and ordered factors with more than two levels respectively. They are used when method is set to "parametric" or when there is an inconsistency between variable type and provided method.

diagnostics

a logical value. If TRUE, diagnostic information is appended to the value of the function. If FALSE (default), only the synthesised data are saved.

print.flag

if TRUE (default) synthesising history and information messages will be printed at the console. For silent computation use print.flag = FALSE.

seed

an integer to be used as an argument for set.seed(). If no integer is provided, the default "sample" will generate one and it will be stored. To prevent generating an integer, set seed to NA.

...

additional arguments to be passed to synthesising functions. See section 'Details' below for more information.

x

an object of class synds; a result of a call to syn.

Value

call: an original call to syn.
m: number of synthetic versions of the original (observed) data.
syn: a data frame (for m = 1) or a list of m data frames (for m > 1) with synthetic data set(s).
method: a vector of synthesising methods applied to each variable in the saved synthesised data.
visit.sequence: a vector of column indices of the visiting sequence. The indices refer to the columns in the saved synthesised data.
predictor.matrix: a matrix specifying the set of predictors used for each variable in the saved synthesised data.
smoothing: a vector specifying smoothing methods applied to each variable in the saved synthesised data.
event: a vector of integers specifying for survival data the column indices for corresponding event indicators. The indices refer to the columns in the saved synthesised data.
denom: a vector of integers specifying for variables modelled using binomial regression the column indices for corresponding denominator variables. The indices refer to the columns in the saved synthesised data.
proper: a logical value indicating whether proper synthesis was conducted.
n: the number of cases in the original data.
k: the number of cases in the synthesised data.
rules: a list of rules for restricted values applied to the synthetic data.
rvalues: a list of the values corresponding to the rules specified by rules.
cont.na: a list of codes for missing values for continuous variables.
semicont: a list of values for semi-continuous variables at which they have spikes.
drop.not.used: a logical value indicating whether variables not used in synthesis are saved in the synthesised data and corresponding synthesis parameters.
drop.pred.only: a logical value indicating whether variables not synthesised and used as predictors only are saved in the synthesised data.
seed: an integer used as a set.seed() argument.
var.lab: a vector of variable labels for data imported from SPSS using read.obs().
val.lab: a list of value labels for factors for data imported from SPSS using read.obs().
obs.vars: a vector of all variable names in the observed data set.

Details

Only variables that are in visit.sequence with a corresponding non-empty method are synthesised. The only exceptions are event indicators: they are synthesised along with the corresponding time-to-event variables and should not be included in visit.sequence. All other variables (not in visit.sequence, or in visit.sequence with a corresponding blank method) can be used as predictors. Including them in visit.sequence generates a default predictor.matrix reflecting the order of variables in visit.sequence; otherwise predictor.matrix has to be adjusted accordingly. All predictors of the variables that are not in visit.sequence, or are in visit.sequence but with a blank method, are removed from predictor.matrix.

Variables to be synthesised that are not synthesised yet cannot be used as predictors. Also all variables used in passive synthesis or in restricted values rules (rules) have to be synthesised before the variables they apply to.

A mismatch between data type and synthesising method stops execution and prints an error message. However, numeric variables with fewer than minnumlevels levels are changed into factors, and their methods are changed automatically, if necessary, to methods for categorical variables. Methods for variables not in the visit sequence are changed to blank.

The built-in elementary synthesising methods include:

ctree, cart

classification and regression trees (CART)

survctree

classification and regression trees (CART) for duration time data (parametric methods for survival data are not implemented yet)

norm

normal linear regression

normrank

normal linear regression preserving the marginal distribution

lognorm, sqrtnorm, cubertnorm

normal linear regression after natural logarithmic, square root and cube root transformation of a dependent variable respectively

logreg

logistic regression

polyreg

unordered polytomous regression

polr

ordered polytomous regression

pmm

predictive mean matching

sample

random sample from the observed data

passive

function of other synthesised data

The functions corresponding to these methods are called syn.method, where method is a string with the name of a synthesising method. For instance, the function corresponding to the ctree method is called syn.ctree. A new synthesising method can be introduced by writing a function named syn.newmethod and then specifying the method parameter of the syn function as "newmethod". Additional parameters can be passed to synthesising methods as part of the dots argument. They have to be named using the period-separated method and parameter names (method.parameter). For instance, in order to set minbucket (the minimum number of observations in any terminal node of a CART model) for the ctree synthesising method, ctree.minbucket has to be specified. The parameters are method-specific and will be used for all variables to be synthesised using that method. See help for syn.method for further details about the allowed parameters for a specific method.
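
A user-defined method might look like the sketch below. The name mynorm and its scale parameter are hypothetical. The argument names (y, x, xp) and the list(res, fit) return shape are assumptions modelled on the built-in syn.* functions of recent synthpop releases; check the help for a built-in function such as syn.ctree in the installed version before relying on them.

```r
# Hypothetical user-defined method "mynorm": normal linear regression with
# noise drawn using the residual standard deviation.
# ASSUMED interface: y  = observed values of the target variable,
#                    x  = observed predictors,
#                    xp = predictors for the rows to be synthesised;
#                    returns list(res = synthetic values, fit = model).
syn.mynorm <- function(y, x, xp, scale = 1, ...) {
  x   <- as.data.frame(x)
  xp  <- as.data.frame(xp)
  fit <- lm(y ~ ., data = cbind(y = y, x))
  mu  <- predict(fit, newdata = xp)
  res <- rnorm(nrow(xp), mean = mu, sd = scale * summary(fit)$sigma)
  list(res = res, fit = fit)
}

# Stand-alone check on simulated data (synthpop itself is not needed here)
set.seed(42)
x_obs <- data.frame(age = runif(100, 20, 60))
y_obs <- 1000 + 20 * x_obs$age + rnorm(100, sd = 50)
out   <- syn.mynorm(y_obs, x_obs, xp = x_obs)
length(out$res)
```

Following the naming convention above, such a function would be selected with method = "mynorm", and its scale parameter would be passed through the dots argument as mynorm.scale.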

Examples

### load the package (provides syn() and the SD2011 data set)
library(synthpop)

### selection of variables
vars <- c("sex","age","marital","income","ls","smoke")
ods  <- SD2011[1:2000,vars]
 
### default synthesis
s1 <- syn(ods)
s1
  
### synthesis with default parametric methods
s2 <- syn(ods, method = "parametric", seed = 1)
s2$method
  
### multiple synthesis of selected variables with customised methods
s3 <- syn(ods, visit.sequence = c(2, 1, 4, 5), m = 2,
          method = c("logreg","sample","","normrank", "ctree",""),
          ctree.minbucket = 10)
summary(s3)
summary(s3, msel = 1:2)
  
### adjustment to the default predictor matrix 
s4.ini <- syn(data = ods, visit.sequence = c(1, 2, 5, 3),
              m = 0, drop.not.used = FALSE)
pM.cor <- s4.ini$predictor.matrix
pM.cor["marital","ls"] <- 0
s4 <- syn(data = ods, visit.sequence = c(1, 2, 5, 3),
          predictor.matrix = pM.cor)
  
### handling missing values in continuous variables
s5 <- syn(ods, cont.na = list(income = c(NA, -8)))
  
### rules for restricted values - marital status of males under 18 should be 'single'
s6 <- syn(ods, rules = list(marital = "age < 18 & sex == 'MALE'"),
          rvalues = list(marital = 'SINGLE'), method = "parametric", seed = 1)
with(s6$syn, table(marital[age < 18 & sex == 'MALE']))
### results for default parametric synthesis without the rule  
with(s2$syn, table(marital[age < 18 & sex == 'MALE']))