train_rf: Train a Random Forest

Description

Train a random forest with ranger from a dataframe of writer profiles estimated with get_cluster_fill_rates. train_rf calculates the distance between all pairs of writer profiles using one or more distance measures. Currently, the available distance measures are absolute, Manhattan, Euclidean, maximum, and cosine.

Usage

train_rf(
  df,
  ntrees,
  distance_measures,
  output_dir = NULL,
  run_number = 1,
  downsample_diff_pairs = TRUE
)

Value

A random forest

Arguments

df: A dataframe of writer profiles created with get_cluster_fill_rates
ntrees: An integer number of decision trees to use
distance_measures: A vector of distance measures. Any combination of 'abs', 'euc', 'man', 'max', and 'cos' may be used.
output_dir: A path to a directory where the random forest will be saved.
run_number: An integer used both as the seed passed to set.seed and to distinguish between different runs on the same input dataframe.
downsample_diff_pairs: Whether to downsample the number of different writer distances before training the random forest. If TRUE, the different writer distances will be randomly sampled, resulting in the same number of different writer and same writer pairs.
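
The downsampling described for downsample_diff_pairs can be sketched in base R. This is a minimal illustration, not the package's own code: the data frame dists and its columns euc and match are hypothetical names, and the seed stands in for run_number.

```r
# Hypothetical table of pairwise distances; column names are illustrative only.
set.seed(1)  # plays the role of run_number
dists <- data.frame(
  euc   = runif(12),
  match = c(rep("same", 3), rep("different", 9))
)

same_rows <- which(dists$match == "same")
diff_rows <- which(dists$match == "different")

# Keep every same-writer pair plus an equally sized random sample
# of different-writer pairs.
keep_diff <- diff_rows[sample.int(length(diff_rows), length(same_rows))]
balanced  <- dists[c(same_rows, keep_diff), ]

table(balanced$match)
```

After this step the two classes have the same number of rows, which keeps the much larger set of different-writer pairs from dominating training.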

Details

The absolute distance between two n-length vectors of cluster fill rates, a and b, is a vector of the same length as a and b. It can be calculated as abs(a-b) where subtraction is performed element-wise, then the absolute value of each element is returned. More specifically, element i of the vector is \(|a_i - b_i|\) for \(i=1,2,...,n\).

The Manhattan distance between two n-length vectors of cluster fill rates, a and b, is \(\sum_{i=1}^n |a_i - b_i|\). In other words, it is the sum of the absolute distance vector.

The Euclidean distance between two n-length vectors of cluster fill rates, a and b, is \(\sqrt{\sum_{i=1}^n (a_i - b_i)^2}\). In other words, it is the square root of the sum of the squared elements of the absolute distance vector.

The maximum distance between two n-length vectors of cluster fill rates, a and b, is \(\max_{1 \leq i \leq n}{\{|a_i - b_i|\}}\). In other words, it is the largest element of the absolute distance vector.

The cosine distance between two n-length vectors of cluster fill rates, a and b, is \(\sum_{i=1}^n (a_i - b_i)^2 / (\sqrt{\sum_{i=1}^n a_i^2}\sqrt{\sum_{i=1}^n b_i^2})\).
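
The definitions above can be checked by hand on a small example. The following base R sketch uses two made-up fill rate vectors and transcribes each formula directly; it is an illustration only and may differ from how train_rf computes the distances internally. The cosine line follows the formula exactly as written on this page.

```r
# Two made-up cluster fill rate vectors (illustrative values only).
a <- c(0.2, 0.5, 0.3)
b <- c(0.1, 0.6, 0.3)

abs_dist <- abs(a - b)             # absolute: a vector of length n
man_dist <- sum(abs_dist)          # Manhattan: sum of absolute differences
euc_dist <- sqrt(sum(abs_dist^2))  # Euclidean: root of the summed squares
max_dist <- max(abs_dist)          # maximum: largest absolute difference

# Cosine distance, transcribed from the formula as written above.
cos_dist <- sum((a - b)^2) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
```

Note that the absolute distance contributes n values per pair of writer profiles, while each of the other measures contributes a single value.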

Examples

rforest <- train_rf(
  df = train,
  ntrees = 200,
  distance_measures = c("euc"),
  run_number = 1,
  downsample_diff_pairs = TRUE
)
