Generates a synthetic version of a data.frame, with
similar characteristics to the original. See Details for the algorithm used.
synthetic(
data,
model_expression = ranger(x = x, y = y),
predict_expression = predict(model, data = xsynth)$predictions,
missingness_expression = NULL,
verbose = TRUE
)
data: A data.frame of which to make a synthetic version.
model_expression: An R-expression to estimate a model. Defaults to
ranger(x = x, y = y), which uses the fast implementation of random
forests in ranger. The expression is evaluated in an
environment containing objects x and y, where x is a
data.frame with the predictor variables, and y is a
vector of outcome values (see Details).
predict_expression: An R-expression to generate predicted values based
on the model estimated by model_expression. Defaults to
predict(model, data = xsynth)$predictions. This expression must return
a vector of predicted values. The expression is evaluated in an
environment containing objects model and xsynth, where
model is the model estimated by model_expression, and
xsynth is the data.frame of synthetic data used to predict the
next column (see Details).
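As a rough illustration of how the two expressions fit together, the sketch below evaluates a linear-model pair by hand in base R. Evaluating with eval() and a list as the environment is an assumption about how such expressions might be run, not a description of the function's internals. The lm-based expressions are illustrative alternatives to the ranger defaults and only suit numeric outcomes.

```r
# Hypothetical custom expressions, written against the documented object
# names x, y, model and xsynth. lm() is used only because it ships with R.
model_expr   <- quote(lm(y ~ ., data = cbind(x, y = y)))
predict_expr <- quote(unname(predict(model, newdata = xsynth)))

# Stand-ins for the objects the expressions expect
x      <- iris[, 1:2]                       # predictor columns
y      <- iris[[3]]                         # outcome column
xsynth <- x[sample.int(nrow(x), replace = TRUE), , drop = FALSE]

# Assumption: each expression is evaluated in an environment holding
# exactly the objects named in the argument descriptions
model <- eval(model_expr, list(x = x, y = y))
preds <- eval(predict_expr, list(model = model, xsynth = xsynth))

# The prediction expression must return one value per synthetic row
length(preds) == nrow(xsynth)
```

If the expressions behave as intended here, the same pair could presumably be passed unquoted, as in model_expression = lm(y ~ ., data = cbind(x, y = y)) and predict_expression = predict(model, newdata = xsynth), on data with numeric columns only.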
missingness_expression: Optional. An R-expression to impute missing
values. Defaults to NULL, which means listwise deletion is used. The
expression is evaluated in an environment containing the object data,
as specified in the call to synthetic. It must return a
data.frame with the same dimensions and column names as the original
data. For example, use missingness_expression =
missRanger::missRanger(data = data) for a fast implementation of the
excellent 'missForest' single imputation technique.
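For cases where no imputation package is available, the sketch below shows what a minimal custom imputation function could look like and checks that it meets the requirement stated above (same dimensions and column names as the input). The helper impute_simple is hypothetical and deliberately crude; it is not part of any package.

```r
# Hypothetical helper: single imputation with the column median for numeric
# columns and the most frequent value for all other columns.
impute_simple <- function(data) {
  for (col in names(data)) {
    miss <- is.na(data[[col]])
    if (!any(miss)) next
    if (is.numeric(data[[col]])) {
      data[[col]][miss] <- median(data[[col]], na.rm = TRUE)
    } else {
      counts <- table(data[[col]])
      data[[col]][miss] <- names(counts)[which.max(counts)]
    }
  }
  data
}

# Introduce a few missing values in a numeric and a factor column
iris_missings <- iris
iris_missings[c(3, 40), "Sepal.Width"] <- NA
iris_missings[c(7, 120), "Species"] <- NA

imputed <- impute_simple(iris_missings)

# Checks: same dimensions, same column names, no NAs left
identical(dim(imputed), dim(iris_missings))
identical(names(imputed), names(iris_missings))
anyNA(imputed)
```

A helper like this would then be supplied as missingness_expression = impute_simple(data), where data refers to the object available in the evaluation environment.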
verbose: Logical. Whether to show a progress bar and informative messages while the algorithm runs. Defaults to TRUE.
Returns a data.frame with synthetic data, based on data.
This function uses a simple algorithm to generate a synthetic dataset with similar characteristics to the original. The algorithm is as follows:
1. Let x be the original data.frame, with columns 1:j.
2. Let xsynth be a synthetic data.frame, with columns 1:j.
3. Column 1 of xsynth is a bootstrapped version of column 1 of x.
4. Using model_expression, a predictive model is built for column c, for c along 2:j, with c predicted from columns 1:(c-1) of the original data.
5. Using predict_expression, columns 1:(c-1) of the synthetic data are used to predict synthetic values for column c.
Variables are thus imputed in order of occurrence in the data.frame.
To impute in a different order, reorder the data.
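The procedure described above can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the function's actual implementation: lm() stands in for the default ranger model, all columns are assumed numeric and complete, and no column may be named y. The function name synth_sketch is hypothetical.

```r
# Simplified sketch of the sequential synthesis procedure.
# Assumptions: numeric columns only, no missing values, lm() instead of ranger.
synth_sketch <- function(x) {
  n <- nrow(x)
  j <- ncol(x)
  # Column 1 of xsynth is a bootstrap sample of column 1 of x
  xsynth <- x[sample.int(n, n, replace = TRUE), 1, drop = FALSE]
  rownames(xsynth) <- NULL
  # k plays the role of the column index c in the description
  for (k in seq_len(j)[-1]) {
    # Model column k from columns 1:(k-1) of the original data
    train <- data.frame(x[, seq_len(k - 1), drop = FALSE], y = x[[k]])
    model <- lm(y ~ ., data = train)
    # Predict column k from columns 1:(k-1) of the synthetic data
    xsynth[[names(x)[k]]] <- unname(predict(model, newdata = xsynth))
  }
  xsynth
}

set.seed(1)
syn <- synth_sketch(iris[, 1:4])
dim(syn)      # same shape as the input
names(syn)    # same column names, in the same order
```

Because each column is predicted only from the columns before it, the column order of the input determines which relationships the models can capture.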
Note that, for data synthesis to work properly, it is essential that the
class of variables is defined correctly. The default algorithm
ranger supports numeric, integer, and factor types.
Other types of variables should be converted to one of these types, or users
can use a custom model_expression when calling synthetic.
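As an example of preparing column types before synthesis, the base R sketch below converts character and logical columns to factor and a Date column to numeric. The example data are made up, and which conversion is appropriate for a given variable is a judgment call; representing dates as numeric day counts is an assumption made here for illustration.

```r
# Made-up example data with column types other than numeric, integer, factor
df <- data.frame(
  id_date = as.Date("2020-01-01") + 0:4,
  group   = c("a", "b", "a", "c", "b"),
  passed  = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  score   = c(1.5, 2.0, 3.5, 4.0, 2.5),
  stringsAsFactors = FALSE
)

# Convert character and logical columns to factor
is_chr_lgl <- vapply(df, function(v) is.character(v) || is.logical(v), logical(1))
df[is_chr_lgl] <- lapply(df[is_chr_lgl], factor)

# Convert Date columns to numeric (days since 1970-01-01)
is_date <- vapply(df, inherits, logical(1), what = "Date")
df[is_date] <- lapply(df[is_date], as.numeric)

# Inspect the resulting class of each column
vapply(df, function(v) class(v)[1], character(1))
```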
# NOT RUN {
# Synthesize a complete dataset
iris_syn <- synthetic(iris)

# Set up to 10 randomly chosen cells to NA
iris_missings <- iris
for (i in 1:10) {
  iris_missings[sample.int(nrow(iris_missings), 1),
                sample.int(ncol(iris_missings), 1)] <- NA
}

# With the default missingness_expression = NULL, listwise deletion is used
iris_miss_syn <- synthetic(iris_missings)
# }