Jaccard or Dice similarity for text documents
JACCARD_DICE(
token_list1 = NULL,
token_list2 = NULL,
method = "jaccard",
threads = 1
)
Value: a numeric vector
token_list1: a list of tokenized text documents (it should have the same length as token_list2)
token_list2: a list of tokenized text documents (it should have the same length as token_list1)
method: a character string specifying the similarity metric. One of 'jaccard' or 'dice'
threads: a numeric value specifying the number of cores to run in parallel
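The function calculates either the Jaccard or the Dice distance between corresponding pairs of tokenized texts from the two lists, i.e. the first element of token_list1 is compared with the first element of token_list2, and so on.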
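To make the two metrics concrete, here is a minimal base-R sketch of the textbook set-based definitions: Jaccard similarity is |A ∩ B| / |A ∪ B| and Dice similarity is 2|A ∩ B| / (|A| + |B|), computed on the unique tokens of each document. The helpers `jaccard_sim` and `dice_sim` and the sample tokens are hypothetical and not part of textTinyR. This is not the package's implementation, and it is an assumption that JACCARD_DICE follows these definitions; it may instead return the complement (1 minus the similarity), so compare results before relying on the sketch.

```r
# Hypothetical helpers (not part of textTinyR): textbook set-based
# similarities computed on the unique tokens of each document.
jaccard_sim <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

dice_sim <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  2 * length(intersect(a, b)) / (length(a) + length(b))
}

# Illustrative token lists of equal length (made-up sample data).
docs1 <- list(c('the', 'jaccard', 'index'), c('use', 'this'))
docs2 <- list(c('the', 'dice', 'index'), c('for', 'two', 'lists'))

# Compare documents pairwise by position: docs1[[i]] with docs2[[i]].
jac <- mapply(jaccard_sim, docs1, docs2)
dic <- mapply(dice_sim, docs1, docs2)

print(jac)  # 0.5 0.0       (2 shared tokens out of 4 distinct; no overlap)
print(dic)  # 0.6666667 0.0 (2 * 2 / (3 + 3); no overlap)
```

Both helpers return NaN when both documents are empty (0 / 0), so callers may want to handle that case explicitly.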
library(textTinyR)
lst1 = list(c('use', 'this', 'function', 'to'), c('either', 'compute', 'the', 'jaccard'))
lst2 = list(c('or', 'the', 'dice', 'distance'), c('for', 'two', 'same', 'sized', 'lists'))
out = JACCARD_DICE(token_list1 = lst1, token_list2 = lst2, method = 'jaccard', threads = 1)