Train a random forest with ranger from a dataframe of writer profiles estimated with get_cluster_fill_rates. train_rf calculates the distance between all pairs of writer profiles using one or more distance measures. Currently, the available distance measures are absolute, Manhattan, Euclidean, maximum, and cosine.
train_rf(
df,
ntrees,
distance_measures,
output_dir = NULL,
run_number = 1,
downsample_diff_pairs = TRUE
)
A random forest
df: A dataframe of writer profiles created with get_cluster_fill_rates.
ntrees: An integer number of decision trees to use.
distance_measures: A vector of distance measures. Any combination of 'abs', 'euc', 'man', 'max', and 'cos' may be used.
output_dir: A path to a directory where the random forest will be saved.
run_number: An integer used both as the seed passed to set.seed and to distinguish between different runs on the same input dataframe.
downsample_diff_pairs: Whether to downsample the number of different writer distances before training the random forest. If TRUE, the different writer distances will be randomly sampled, resulting in the same number of different writer and same writer pairs.
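The downsampling described above can be illustrated with a minimal base R sketch. This is not the package's internal code: the data frame `dists`, its column names `euc` and `match`, and the values in it are hypothetical and exist only for illustration.

```r
# Fix the seed so the random sample is reproducible
# (train_rf uses run_number for this purpose; 1 is its default).
set.seed(1)

# Hypothetical table of pairwise distances, labelled by whether the
# two writer profiles in each pair come from the same writer.
dists <- data.frame(
  euc = runif(12),
  match = c(rep("same", 3), rep("different", 9))
)

same <- dists[dists$match == "same", ]
diff <- dists[dists$match == "different", ]

# Randomly keep as many different-writer rows as there are same-writer rows.
diff_down <- diff[sample(nrow(diff), size = nrow(same)), ]

# Balanced training set: equal numbers of same and different writer pairs.
balanced <- rbind(same, diff_down)
table(balanced$match)
```

Balancing the two classes in this way keeps the far more numerous different writer pairs from dominating training.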
The absolute distance between two n-length vectors of cluster fill rates, a and b, is a vector of the same length as a and b. It can be calculated as abs(a-b) where subtraction is performed element-wise, then the absolute value of each element is returned. More specifically, element i of the vector is \(|a_i - b_i|\) for \(i=1,2,...,n\).
The Manhattan distance between two n-length vectors of cluster fill rates, a and b, is \(\sum_{i=1}^n |a_i - b_i|\). In other words, it is the sum of the elements of the absolute distance vector.
The Euclidean distance between two n-length vectors of cluster fill rates, a and b, is \(\sqrt{\sum_{i=1}^n (a_i - b_i)^2}\). In other words, it is the square root of the sum of the squared elements of the absolute distance vector.
The maximum distance between two n-length vectors of cluster fill rates, a and b, is \(\max_{1 \leq i \leq n}{\{|a_i - b_i|\}}\). In other words, it is the largest element of the absolute distance vector.
The cosine distance between two n-length vectors of cluster fill rates, a and b, is \(\sum_{i=1}^n (a_i - b_i)^2 / (\sqrt{\sum_{i=1}^n a_i^2}\sqrt{\sum_{i=1}^n b_i^2})\).
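The five definitions above can be checked by hand with a short base R sketch. The vectors `a` and `b` are made-up fill rates, and the code transcribes the formulas as written in this documentation rather than reproducing the package's internal implementation.

```r
# Two hypothetical vectors of cluster fill rates (illustrative values only)
a <- c(0.5, 0.3, 0.2)
b <- c(0.2, 0.3, 0.5)

# Absolute distance: a vector with one element per cluster
abs_dist <- abs(a - b)

# Manhattan distance: sum of the elements of the absolute distance vector
man_dist <- sum(abs_dist)

# Euclidean distance: square root of the sum of squared differences
euc_dist <- sqrt(sum((a - b)^2))

# Maximum distance: largest element of the absolute distance vector
max_dist <- max(abs_dist)

# Cosine distance, following the formula as stated in this documentation
cos_dist <- sum((a - b)^2) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
```

Note that the absolute distance yields n values per pair of writer profiles, while the other four measures each yield a single number.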
rforest <- train_rf(
df = train,
ntrees = 200,
distance_measures = c("euc"),
run_number = 1,
downsample_diff_pairs = TRUE
)