synthetic: Generate synthetic data

Description

Generates a synthetic version of a data.frame, with similar characteristics to the original. See Details for the algorithm used.

Usage

synthetic(
  data,
  model_expression = ranger(x = x, y = y),
  predict_expression = predict(model, data = xsynth)$predictions,
  missingness_expression = NULL,
  verbose = TRUE
)

Arguments

data

The data.frame to make a synthetic version of.

model_expression

An R-expression to estimate a model. Defaults to ranger(x = x, y = y), which uses the fast implementation of random forests in ranger. The expression is evaluated in an environment containing objects x and y, where x is a data.frame with the predictor variables, and y is a vector of outcome values (see Details).

predict_expression

An R-expression to generate predicted values based on the model estimated by model_expression. Defaults to predict(model, data = xsynth)$predictions. This expression must return a vector of predicted values. The expression is evaluated in an environment containing objects model and xsynth, where model is the model estimated by model_expression, and xsynth is the data.frame of synthetic data used to predict the next column (see Details).
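
To illustrate how a custom pair of expressions fits together, here is a minimal sketch in base R. It does not call synthetic itself. Instead it mimics the documented evaluation environments with eval(), which is an assumption about the mechanism rather than the package's actual internals. lm() is swapped in for ranger purely for illustration, and the objects x, y, and xsynth are stand-ins built from iris.

```r
# Stand-ins for the objects that the argument descriptions say are available.
x <- iris[, c("Sepal.Length", "Sepal.Width")]        # predictor columns
y <- iris$Petal.Length                               # outcome column
xsynth <- x[sample.int(nrow(x), replace = TRUE), ]   # placeholder synthetic predictors

# A custom model/predict pair based on lm() instead of ranger (illustrative swap).
model_expression   <- quote(lm(y ~ ., data = data.frame(x, y = y)))
predict_expression <- quote(unname(predict(model, newdata = xsynth)))

# Evaluate each expression in an environment holding only the documented objects.
model <- eval(model_expression, list(x = x, y = y))
preds <- eval(predict_expression, list(model = model, xsynth = xsynth))

# predict_expression must yield a plain vector with one value per synthetic row.
length(preds) == nrow(xsynth)
```

Note that lm() only suits numeric outcome columns, so a pair like this would not be appropriate for a data.frame that contains factors.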

missingness_expression

Optional. An R-expression to impute missing values. Defaults to NULL, which means listwise deletion is used. The expression is evaluated in an environment containing the object data, as specified in the call to synthetic. It must return a data.frame with the same dimensions and column names as the original data. For example, use missingness_expression = missRanger::missRanger(data = data) for a fast implementation of the excellent 'missForest' single imputation technique.
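
As a rough sketch of the contract described above, the following base R code contrasts listwise deletion with a very simple imputation function. The helper impute_simple is hypothetical and not part of any package; it only demonstrates a function that returns a data.frame with the same dimensions and column names as its input.

```r
# Hypothetical helper: median imputation for numeric columns,
# most frequent observed value for all other columns.
impute_simple <- function(data) {
  for (col in names(data)) {
    miss <- is.na(data[[col]])
    if (!any(miss)) next
    if (is.numeric(data[[col]])) {
      data[[col]][miss] <- median(data[[col]], na.rm = TRUE)
    } else {
      tab <- table(data[[col]])               # table() ignores NA by default
      data[[col]][miss] <- names(tab)[which.max(tab)]
    }
  }
  data
}

# Toy data with three incomplete rows.
data <- iris
data[c(3, 50), "Sepal.Length"] <- NA
data[10, "Species"] <- NA

deleted <- na.omit(data)         # listwise deletion drops the incomplete rows
imputed <- impute_simple(data)   # imputation keeps every row

# The imputed result has the same shape and names as the input.
identical(dim(imputed), dim(data))
identical(names(imputed), names(data))
```

Listwise deletion shrinks the data, whereas an imputation expression must preserve its shape, which is why the return value is checked against the original dimensions and names.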

verbose

Logical, Default: TRUE. Whether to show a progress bar while running the algorithm and provide informative messages.

Value

A data.frame with synthetic data, based on data.

Details

This function uses a simple algorithm to generate a synthetic dataset with similar characteristics to the original. The algorithm is as follows:

1. Let x be the original data.frame, with columns 1:j.
2. Let xsynth be a synthetic data.frame, with columns 1:j.
3. Column 1 of xsynth is a bootstrapped version of column 1 of x.
4. For each column c in 2:j, model_expression is used to build a predictive model for column c of x from columns 1:(c-1) of x.
5. predict_expression is then used to predict synthetic values for column c of xsynth from columns 1:(c-1) of xsynth.

Variables are thus synthesized in their order of occurrence in the data.frame. To synthesize them in a different order, reorder the columns of the data.
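
The procedure described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's implementation: lm() replaces ranger, only the numeric columns of iris are used, and the loop index is called k rather than c to avoid masking base::c.

```r
set.seed(1)

x <- iris[, 1:4]   # numeric columns only, because lm() is used as the model
n <- nrow(x)
j <- ncol(x)

# Column 1 of xsynth: bootstrap (sample with replacement) from column 1 of x.
xsynth <- data.frame(sample(x[[1]], n, replace = TRUE))
names(xsynth) <- names(x)[1]

for (k in 2:j) {
  # Model column k of the original data from columns 1:(k-1) of the original data.
  f <- reformulate(names(x)[1:(k - 1)], response = names(x)[k])
  model <- lm(f, data = x)
  # Predict column k of the synthetic data from columns 1:(k-1) of the synthetic data.
  xsynth[[names(x)[k]]] <- unname(predict(model, newdata = xsynth))
}

str(xsynth)
```

Because each column is predicted only from the columns before it, the column order of the input determines which relationships the synthetic data can reproduce.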

Note that, for data synthesis to work properly, every variable must have the correct class. The default model, ranger, supports numeric, integer, and factor types. Variables of other types should be converted to one of these, or users can supply a custom model_expression when calling synthetic.
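
One way to prepare a data.frame along these lines is to convert character and logical columns to factors before synthesis. The toy data.frame df and its column names below are made up for illustration.

```r
# Toy data with a character, a logical, and a numeric column.
df <- data.frame(
  group  = c("a", "b", "a"),
  passed = c(TRUE, FALSE, TRUE),
  score  = c(1.5, 2.0, 3.2),
  stringsAsFactors = FALSE
)

# Identify columns whose class is not numeric, integer, or factor.
to_factor <- vapply(df, function(col) is.character(col) || is.logical(col), logical(1))

# Convert those columns to factor and leave the others untouched.
df[to_factor] <- lapply(df[to_factor], factor)

sapply(df, class)
```

Other types, such as dates, need a case-by-case decision (for example, conversion to numeric), since a factor with one level per unique value is rarely useful for prediction.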

Examples

# NOT RUN {
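# Synthesize a complete data set
iris_syn <- synthetic(iris)

# Set 10 randomly drawn cells to NA (the same cell may be drawn more than once)
iris_missings <- iris
for (i in 1:10) {
  iris_missings[sample.int(nrow(iris_missings), 1),
                sample.int(ncol(iris_missings), 1)] <- NA
}

# With the default missingness_expression = NULL, listwise deletion is used
iris_miss_syn <- synthetic(iris_missings)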
# }