get_distances: Get Distances

Description

Calculate distances between all pairs of cluster fill rates in a data frame using one or more distance measures. The available distance measures are absolute distance, Manhattan distance, Euclidean distance, maximum distance, and cosine distance.

Usage

get_distances(df, distance_measures)

Value

A dataframe of distances

Arguments

df: A dataframe of cluster fill rates created with get_cluster_fill_rates, with an added column that contains a writer ID.
distance_measures: A vector of distance measures. Use 'abs' to calculate the absolute difference, 'man' for the Manhattan distance, 'euc' for the Euclidean distance, 'max' for the maximum absolute distance, and 'cos' for the cosine distance. The vector can be a single distance, or any combination of these five distance measures.

Details

The absolute distance between two n-length vectors of cluster fill rates, a and b, is a vector of the same length as a and b. It can be calculated as abs(a-b) where subtraction is performed element-wise, then the absolute value of each element is returned. More specifically, element i of the vector is \(|a_i - b_i|\) for \(i=1,2,...,n\).

The Manhattan distance between two n-length vectors of cluster fill rates, a and b, is \(\sum_{i=1}^n |a_i - b_i|\). In other words, it is the sum of the absolute distance vector.

The Euclidean distance between two n-length vectors of cluster fill rates, a and b, is \(\sqrt{\sum_{i=1}^n (a_i - b_i)^2}\). In other words, it is the square root of the sum of the squared elements of the absolute distance vector.

The maximum distance between two n-length vectors of cluster fill rates, a and b, is \(\max_{1 \leq i \leq n}{\{|a_i - b_i|\}}\). In other words, it is the largest element of the absolute distance vector.

The cosine distance between two n-length vectors of cluster fill rates, a and b, is \(\sum_{i=1}^n (a_i - b_i)^2 / (\sqrt{\sum_{i=1}^n a_i^2}\sqrt{\sum_{i=1}^n b_i^2})\).
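The five formulas above can be checked by hand in base R. The following is a minimal sketch, not the internals of get_distances: the vectors a and b are made-up fill rates chosen for illustration, and the cosine line follows the formula exactly as it is written on this page.

```r
# Two hypothetical vectors of cluster fill rates (illustrative values only,
# not taken from any dataset).
a <- c(0.5, 0.25, 0.25, 0)
b <- c(0.25, 0.25, 0, 0.5)

# Absolute distance: element-wise |a_i - b_i|, a vector the same length as a and b.
abs_dist <- abs(a - b)

# Manhattan distance: sum of the absolute distance vector.
man_dist <- sum(abs_dist)

# Euclidean distance: square root of the sum of squared differences.
euc_dist <- sqrt(sum(abs_dist^2))

# Maximum distance: largest element of the absolute distance vector.
max_dist <- max(abs_dist)

# Cosine distance, following the formula as stated in Details above:
# sum of squared differences divided by the product of the vector norms.
cos_dist <- sum((a - b)^2) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

print(list(abs = abs_dist, man = man_dist, euc = euc_dist,
           max = max_dist, cos = cos_dist))
```

Only the absolute distance returns a vector; the other four measures each reduce a pair of fill-rate vectors to a single number.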

Examples

rates <- test[1:3, ]
# calculate maximum and Euclidean distances between the first 3 documents in test.
distances <- get_distances(df = rates, distance_measures = c("max", "euc"))

# calculate Manhattan distances between all documents in test.
distances <- get_distances(df = test, distance_measures = c("man"))
