syn(data, method = vector("character", length = ncol(data)),
    visit.sequence = (1:ncol(data)), predictor.matrix = NULL,
    m = 1, k = nrow(data), proper = FALSE,
    minnumlevels = 5, maxfaclevels = 60,
    rules = NULL, rvalues = NULL, cont.na = NULL, semicont = NULL,
    smoothing = NULL, event = NULL, denom = NULL,
    drop.not.used = FALSE, drop.pred.only = FALSE,
    default.method = c("normrank", "logreg", "polyreg", "polr"),
    diagnostics = FALSE, print.flag = TRUE, seed = "sample", ...)

## S3 method for class 'synds'
print(x, ...)

data: a data frame or a matrix (n x p) containing the original data.
Observations are in rows and variables are in columns.

method: a single string or a character vector of length ncol(data)
specifying the synthesising method to be used for each variable in the
data. The order of variables is exactly the same as in data. If
specified as a single string, the same method is used for all variables
in a visit sequence unless a data type or a position in a visit sequence
requires a different method. If method is set to "parametric", the
default synthesising methods specified by the default.method argument
are applied. Variables that are transformations of other variables can
be synthesised using a passive method, specified as a string starting
with ~. Variables that need not be synthesised have the empty method "".
By default all variables are synthesised using the ctree implementation
of a CART model. See details for more information.

visit.sequence: a vector specifying the order in which variables are
synthesised. The default 1:ncol(data) implies that column variables are
synthesised from left to right. See details for more information.

predictor.matrix: a square matrix of size ncol(data)
specifying the set of column predictors to be used for each target
variable in the row. Each entry has value 0 or 1. A value of 1 means
that the column variable is used as a predictor for the row variable.
The order of variables is exactly the same as in data. By default all
variables that are earlier in the visit sequence are used as predictors.
For the default visit sequence (1:ncol(data)) the default
predictor.matrix will have values of 1 in the lower triangle. See
details for more information.

m: number of synthetic data sets to generate. The default is m = 1.

k: size of the synthetic data set (k x p), which can be smaller or
greater than the size of the original data set (n x p). The default is
nrow(data), which means that the number of individuals in the
synthesised data is the same as in the original (observed) data (k = n).

proper: a logical value with the default set to FALSE. If TRUE, proper
synthesis is conducted.

minnumlevels: numeric variables with fewer than minnumlevels levels are
changed into factors.

rvalues: a named list of the values corresponding to rules.

cont.na: a named list of codes for missing values of continuous
variables, if different from the R missing data code NA. The names of
the list elements must correspond to the names of the variables for
which the missing data codes need to be specified.

smoothing: a named list of smoothing methods ("density"
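or "") to be used for selected variables. Smoothing can only be applied
to continuous variables synthesised using the sample, ctree, cart or
normrank method. The names of the list elements must correspond to the
names of the variables whose values are to be smoothed. Smoothing is
applied to the synthesised values. For "density" smoothing a Gaussian
kernel density estimator is applied with bandwidth selected using the
Sheather-Jones 'solve-the-equation' method (see bw.SJ).

drop.not.used: a logical value. If TRUE, variables not used in synthesis
are not saved in the synthesised data and are not included in the
corresponding synthesis parameters. The default given in the usage is
FALSE.

drop.pred.only: a logical value. If TRUE, variables not synthesised and
used as predictors only are not saved in the synthesised data. The
default given in the usage is FALSE.

default.method: a vector of four strings giving the default parametric
synthesising methods for numerical variables, factors with two levels,
unordered factors with more than two levels and ordered factors with
more than two levels, respectively. They are used when method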
is set to "parametric" or when there is an inconsistency between
variable type and provided method.

diagnostics: a logical value. If TRUE, diagnostic information is
appended to the value of the function. If FALSE (default), only the
synthesised data are saved.

print.flag: a logical value. If TRUE (default), the synthesising history
and information messages are printed at the console. For silent
computation use print.flag = FALSE.

seed: an integer to be used as an argument for set.seed(). If no integer
is provided, the default "sample" will generate one and it will be
stored. To prevent generating an integer, set seed to NA.

x: an object of class synds; a result of a call to syn.

syn returns an object of class synds
, which stands for 'synthesised data set'. It is a list with the
following components:

[...] syn.
syn: a data frame (for m = 1) or a list of m data frames (for m > 1)
with synthetic data set(s).
[...] rules.
[...] set.seed() argument.
[...] read.obs().
[...] read.obs().

Only variables that are in visit.sequence
with a corresponding non-empty method are synthesised. The only
exceptions are event indicators. They are synthesised along with the
corresponding time-to-event variables and should not be included in
visit.sequence. All other variables (not in visit.sequence, or in
visit.sequence with a corresponding blank method) can be used as
predictors. Including them in visit.sequence generates a default
predictor.matrix reflecting the order of variables in the
visit.sequence; otherwise predictor.matrix has to be adjusted
accordingly. All predictors of the variables that are not in
visit.sequence, or are in visit.sequence but with a blank method, are
removed from predictor.matrix. Variables to be synthesised that are not
synthesised yet cannot be used as predictors. Also, all variables used
in passive synthesis or in restricted values rules (rules) have to be
synthesised before the variables they apply to.
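
The rule that variables earlier in the visit sequence act as predictors for later ones can be sketched in base R. The helper default_predictor_matrix below is a hypothetical illustration and not part of synthpop; it is a simplification that only fills rows for variables in the visit sequence and ignores predictor-only variables.

```r
# Sketch only: build a 0/1 predictor matrix from a visit sequence.
# `default_predictor_matrix` is a hypothetical helper, not a synthpop function.
default_predictor_matrix <- function(var_names, visit_sequence) {
  p <- length(var_names)
  pm <- matrix(0L, nrow = p, ncol = p,
               dimnames = list(var_names, var_names))
  for (pos in seq_along(visit_sequence)) {
    target  <- visit_sequence[pos]
    # Variables visited before `target` (none for the first position).
    earlier <- visit_sequence[seq_len(pos - 1)]
    pm[target, earlier] <- 1L
  }
  pm
}

vars <- c("sex", "age", "marital", "income", "ls", "smoke")

# Default visit sequence: 1s fill the lower triangle.
pm_default <- default_predictor_matrix(vars, seq_along(vars))

# Custom visit sequence: sex, age, ls, marital (columns 1, 2, 5, 3).
pm_custom <- default_predictor_matrix(vars, c(1, 2, 5, 3))
print(pm_custom)
```

With the custom sequence, the row for marital has 1s in the sex, age and ls columns, while the rows for income and smoke stay at 0 because those variables are never visited.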
A mismatch between data type and synthesising method stops execution and
prints an error message, but numeric variables with fewer than
minnumlevels levels are changed into factors and their methods are
changed automatically, if necessary, to methods for categorical
variables. Methods for variables not in a visit sequence are changed to
blank.
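
The conversion of sparse numeric variables into factors can be illustrated with a small base R sketch. The helper factorise_sparse_numerics is hypothetical and only mimics the rule described above (fewer distinct values than minnumlevels); it does not reproduce what syn does internally.

```r
# Sketch only: turn numeric columns with few distinct values into factors.
# `factorise_sparse_numerics` is a hypothetical helper, not a synthpop function.
factorise_sparse_numerics <- function(data, minnumlevels = 5) {
  for (v in names(data)) {
    col <- data[[v]]
    n_levels <- length(unique(col[!is.na(col)]))
    if (is.numeric(col) && n_levels < minnumlevels) {
      data[[v]] <- factor(col)
    }
  }
  data
}

df <- data.frame(
  score  = c(1, 2, 3, 1, 2, 3, 1, 2),                         # 3 distinct values
  income = c(10.5, 20.1, 30.7, 41.2, 55.0, 62.3, 70.9, 81.4)  # 8 distinct values
)

out <- factorise_sparse_numerics(df, minnumlevels = 5)
str(out)
```

Here score becomes a factor because it has only three distinct values, while income stays numeric.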
The built-in elementary synthesising methods include:
The functions corresponding to these methods are called syn.method,
where method is a string with the name of a synthesising method. For
instance, the function corresponding to the ctree method is called
syn.ctree. A new synthesising method can be introduced by writing a
function named syn.newmethod and then specifying the method parameter of
the syn function as "newmethod".
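
The naming convention can be illustrated with a toy lookup in base R. Everything in this sketch is an assumption made for illustration: the argument list of the toy syn.newmethod (y, n) and the dispatcher call_syn_method are hypothetical and do not reflect the interface that synthpop expects from a user-written method, which should be checked against the help pages of the existing syn.* functions.

```r
# Toy method with a hypothetical signature: draw n values from y with replacement.
# Real synthpop methods have their own argument lists.
syn.newmethod <- function(y, n, ...) {
  sample(y, size = n, replace = TRUE)
}

# Hypothetical dispatcher: resolve the function "syn.<method>" from a method string.
call_syn_method <- function(method, ...) {
  f <- get(paste0("syn.", method), mode = "function")
  f(...)
}

set.seed(1)
draws <- call_syn_method("newmethod", y = c(10, 20, 30), n = 5)
print(draws)
```

The point of the sketch is only that the string "newmethod" is enough to locate a function called syn.newmethod.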
Additional parameters can be passed to synthesising methods as part of
the dots argument. They have to be named using the period-separated
method and parameter name (method.parameter). For instance, to set
minbucket (the minimum number of observations in any terminal node of a
CART model) for the ctree synthesising method, ctree.minbucket has to be
specified. The parameters are method-specific and will be used for all
variables to be synthesised using that method. See the help for
syn.method for further details about the allowed parameters for a
specific method.
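
The method.parameter naming scheme amounts to filtering the dots arguments by prefix. The base R sketch below shows one way this could be done; method_args is a hypothetical helper and is not how synthpop is known to implement it.

```r
# Sketch only: collect the arguments in `...` addressed to one method.
# `method_args` is a hypothetical helper, not a synthpop function.
method_args <- function(method, ...) {
  dots <- list(...)
  nms <- names(dots)
  if (is.null(nms)) return(list())
  prefix <- paste0(method, ".")
  keep <- startsWith(nms, prefix)
  args <- dots[keep]
  # Strip the "method." prefix so only the parameter name remains.
  names(args) <- substring(names(args), nchar(prefix) + 1)
  args
}

ctree_args <- method_args("ctree", ctree.minbucket = 10, cart.minbucket = 5)
print(ctree_args)
```

Only ctree.minbucket is picked up for the ctree method and is returned under the name minbucket; cart.minbucket is left for the cart method.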
See also: compare.synds, summary.synds

library(synthpop)

### selection of variables
vars <- c("sex", "age", "marital", "income", "ls", "smoke")
ods <- SD2011[1:2000, vars]
### default synthesis
s1 <- syn(ods)
s1
### synthesis with default parametric methods
s2 <- syn(ods, method = "parametric", seed = 1)
s2$method
### multiple synthesis of selected variables with customised methods
s3 <- syn(ods, visit.sequence = c(2, 1, 4, 5), m = 2,
          method = c("logreg", "sample", "", "normrank", "ctree", ""),
          ctree.minbucket = 10)
summary(s3)
summary(s3, msel = 1:2)
### adjustment to the default predictor matrix
s4.ini <- syn(data = ods, visit.sequence = c(1, 2, 5, 3),
              m = 0, drop.not.used = FALSE)
pM.cor <- s4.ini$predictor.matrix
pM.cor["marital", "ls"] <- 0
s4 <- syn(data = ods, visit.sequence = c(1, 2, 5, 3),
          predictor.matrix = pM.cor)
### handling missing values in continuous variables
s5 <- syn(ods, cont.na = list(income = c(NA, -8)))
### rules for restricted values - marital status of males under 18 should be 'single'
s6 <- syn(ods, rules = list(marital = "age < 18 & sex == 'MALE'"),
          rvalues = list(marital = 'SINGLE'), method = "parametric", seed = 1)
with(s6$syn, table(marital[age < 18 & sex == 'MALE']))
### results for default parametric synthesis without the rule
with(s2$syn, table(marital[age < 18 & sex == 'MALE']))
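
The rule used for s6 is a logical condition written as a string. As a rough illustration of how such a condition selects rows and fixes their value, here is a base R sketch on a toy data frame. The helper apply_rule and the toy data are hypothetical; the sketch does not claim to show how syn evaluates rules internally.

```r
# Sketch only: evaluate a rule string on a data frame and set the restricted value.
# `apply_rule` is a hypothetical helper, not a synthpop function.
apply_rule <- function(data, variable, rule, rvalue) {
  hit <- eval(parse(text = rule), envir = data)
  hit <- !is.na(hit) & hit          # treat NA conditions as "rule does not apply"
  data[[variable]][hit] <- rvalue
  data
}

toy <- data.frame(
  sex     = c("MALE", "FEMALE", "MALE", "MALE"),
  age     = c(16, 17, 30, 12),
  marital = c("MARRIED", "MARRIED", "MARRIED", "SINGLE"),
  stringsAsFactors = FALSE
)

fixed <- apply_rule(toy, "marital", "age < 18 & sex == 'MALE'", "SINGLE")
print(fixed)
```

Only the rows for males under 18 are changed; the 17-year-old female and the 30-year-old male keep their original marital status.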