lingdist (version 1.0)

edit_dist_df: Compute edit distance between all row pairs of a dataframe

Description

Compute the average edit distance between all row pairs of a dataframe. Empty or NA cells are ignored. If no value in a row is a valid string, every average distance involving that row is set to -1.
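The following base-R sketch illustrates the idea of an average edit distance between two rows. It is not the package's implementation: the helper names `lev_tokens` and `avg_row_dist` are hypothetical, unit costs are assumed, and the "average" is assumed to be the plain mean over the columns in which both cells are non-empty and non-NA. For simplicity the sketch returns -1 whenever a pair has no comparable column.

```r
# Token-level Levenshtein distance with unit costs (illustrative helper).
lev_tokens <- function(x, y) {
  d <- matrix(0, length(x) + 1, length(y) + 1)
  d[, 1] <- 0:length(x)  # cost of deleting every token of x
  d[1, ] <- 0:length(y)  # cost of inserting every token of y
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      sub_cost <- if (x[i] == y[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,     # deletion
                             d[i + 1, j] + 1,     # insertion
                             d[i, j] + sub_cost)  # substitution or match
    }
  }
  d[length(x) + 1, length(y) + 1]
}

# Mean distance over the columns where both rows hold a usable string.
avg_row_dist <- function(r1, r2, delim = "") {
  ok <- !is.na(r1) & !is.na(r2) & r1 != "" & r2 != ""
  if (!any(ok)) return(-1)  # assumption: no comparable cell gives -1
  dists <- mapply(function(a, b) {
    lev_tokens(strsplit(a, delim, fixed = TRUE)[[1]],
               strsplit(b, delim, fixed = TRUE)[[1]])
  }, r1[ok], r2[ok])
  mean(dists)
}

row_a <- c("a_bc_d", "d_bc_a")
row_b <- c("b_bc_d", "d_bc_a")
avg_ab <- avg_row_dist(row_a, row_b, delim = "_")  # 0.5
```

Here the first word pair differs by one substitution and the second pair is identical, so the mean is 0.5.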

Usage

edit_dist_df(
  data,
  cost_mat = NULL,
  delim = "",
  squareform = FALSE,
  symmetric = TRUE,
  parallel = FALSE,
  n_threads = 2L
)

Value

A dataframe in long-table form if `squareform` is FALSE, otherwise in squareform. If `symmetric` is TRUE, the long table has \(C_n^2 = n(n-1)/2\) rows, one per unordered pair of rows of `data`; otherwise it has \(n^2\) rows.
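As a quick check of these row counts, the sketch below enumerates pairs of language ids in base R. The ids and the column names `lang1` and `lang2` are made up for illustration; the actual column names and row order of the package's output are not shown here.

```r
ids <- c("a", "b", "c", "d")
n <- length(ids)

# Unordered pairs: what a symmetric long table needs to store.
pairs_sym <- t(combn(ids, 2))

# All ordered pairs, including each id paired with itself.
pairs_all <- expand.grid(lang1 = ids, lang2 = ids, stringsAsFactors = FALSE)

n_sym <- nrow(pairs_sym)  # choose(4, 2) = 6
n_all <- nrow(pairs_all)  # 4^2 = 16
```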

Arguments

data

Dataframe with n rows and m columns. Each of the n rows is a language or dialect to include in the calculation, and each of the m columns is a word, so at most m words are compared for any pair. The rownames are the language ids.

cost_mat

Dataframe in squareform giving the cost of deleting a symbol, inserting a symbol, or substituting one symbol for another. Rownames and colnames are symbols. `cost_mat[char1,"_NULL_"]` is the cost of deleting char1 and `cost_mat["_NULL_",char1]` is the cost of inserting it. When an operation is not defined in `cost_mat`, its cost defaults to 0 if the two symbols are the same and 1 otherwise.
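A small cost matrix of this shape can be built in base R as sketched below. The symbols and cost values are invented for illustration, and `lookup_cost` is a hypothetical helper that mimics the fallback rule described above; it assumes "not defined" means the symbol is missing from the rownames or colnames, which may not match how the package treats every case.

```r
symbols <- c("a", "b", "_NULL_")
cost_mat <- as.data.frame(
  matrix(c(0,   0.5, 2,     # row "a":      a->a, a->b, delete a
           0.5, 0,   2,     # row "b":      b->a, b->b, delete b
           1.5, 1.5, 0),    # row "_NULL_": insert a, insert b, unused
         nrow = 3, byrow = TRUE)
)
rownames(cost_mat) <- symbols
colnames(cost_mat) <- symbols

# Hypothetical helper: use the table when both symbols are present,
# otherwise fall back to 0 for identical symbols and 1 for different ones.
lookup_cost <- function(cost_mat, from, to) {
  if (from %in% rownames(cost_mat) && to %in% colnames(cost_mat)) {
    return(cost_mat[from, to])
  }
  if (from == to) 0 else 1
}

delete_a <- lookup_cost(cost_mat, "a", "_NULL_")  # 2
insert_a <- lookup_cost(cost_mat, "_NULL_", "a")  # 1.5
undefined_sub <- lookup_cost(cost_mat, "c", "a")  # 1, "c" is not in the table
```

Because deleting a symbol costs 2 and inserting it costs 1.5 in this example, the matrix is not symmetric, which matters for the `symmetric` argument.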

delim

The delimiter separating atomic symbols.
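To see what "atomic symbols" means in practice, the sketch below tokenizes strings with base R's `strsplit`. This is only an illustration of the concept; the package's internal tokenizer may behave differently in edge cases.

```r
# With delim = "_", multi-character units such as "bc" stay together.
tokens_delim <- strsplit("a_bc_d", "_", fixed = TRUE)[[1]]  # "a" "bc" "d"

# With the default delim = "", every character is its own symbol.
tokens_chars <- strsplit("abc", "", fixed = TRUE)[[1]]      # "a" "b" "c"
```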

squareform

Whether to return a dataframe in squareform.

symmetric

Whether the result matrix is symmetric. This depends on whether `cost_mat` is symmetric.
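One way to decide on a value for `symmetric` is to test the cost matrix before calling the function. The helper `is_cost_symmetric` below is a hypothetical base-R sketch, not part of the package, and the example matrices are invented.

```r
# TRUE when rows and columns list the same symbols in the same order
# and every cost equals its mirror entry.
is_cost_symmetric <- function(cost_mat) {
  m <- as.matrix(cost_mat)
  identical(rownames(m), colnames(m)) && all(m == t(m))
}

syms <- c("a", "b", "_NULL_")
sym_costs <- matrix(c(0, 0.5, 1,
                      0.5, 0, 1,
                      1, 1, 0),
                    nrow = 3, byrow = TRUE, dimnames = list(syms, syms))
asym_costs <- sym_costs
asym_costs["a", "_NULL_"] <- 2  # deleting "a" now costs more than inserting it

sym_ok <- is_cost_symmetric(as.data.frame(sym_costs))    # TRUE
asym_ok <- is_cost_symmetric(as.data.frame(asym_costs))  # FALSE
```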

parallel

Whether to parallelize the computation.

n_threads

The number of threads used to parallelize the computation. Only meaningful if `parallel` is TRUE.

Examples

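library(lingdist)
# Two languages (rows "a" and "b") with two words each; symbols are separated by "_"
df <- as.data.frame(rbind(a=c("a_bc_d","d_bc_a"),b=c("b_bc_d","d_bc_a")))
# Empty cost matrix: every operation uses the default cost (0 if the symbols match, 1 otherwise)
cost.mat <- data.frame()
# Long-table result (the default), then the same distances in squareform
result <- edit_dist_df(df, cost_mat=cost.mat, delim="_")
result <- edit_dist_df(df, cost_mat=cost.mat, delim="_", squareform=TRUE)
# Parallel computation with 4 threads
result <- edit_dist_df(df, cost_mat=cost.mat, delim="_", parallel=TRUE, n_threads=4L)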
