The adversarial random forest (ARF) algorithm partitions data into fully
factorized leaves where features are jointly independent. ARFs are trained
iteratively, with alternating rounds of generation and discrimination. In
the first instance, synthetic data is generated via independent bootstraps of
each feature, and a RF classifier is trained to distinguish between real and
fake samples. In subsequent rounds, synthetic data is generated separately in
each leaf, using splits from the previous forest. This creates increasingly
realistic data that satisfies local independence by construction. The
algorithm converges when a RF cannot reliably distinguish between the two
classes, i.e. when OOB accuracy falls below 0.5 + delta
.
ARFs are useful for several unsupervised learning tasks, such as density
estimation (see forde
) and data synthesis (see
forge
). For the former, we recommend increasing the number of
trees for improved performance (typically on the order of 100-1000 depending
on sample size).
Integer variables are recoded with a warning (set verbose = FALSE
to
silence these). Default behavior is to convert integer variables with six or
more unique values to numeric, while those with up to five unique values are
treated as ordered factors. To override this behavior, explicitly recode
integer variables to the target type prior to training.
Note: convergence is not guaranteed in finite samples. The max_iters
argument sets an upper bound on the number of training rounds. Similar
results may be attained by increasing delta
. Even a single round can
often give good performance, but data with strong or complex dependencies may
require more iterations. With the default early_stop = TRUE
, the
adversarial loop terminates if performance does not improve from one round
to the next, in which case further training may be pointless.