twitterreport (version 0.15.11)

jaccard_coef: Jaccard coefficient

Description

Calculate the Jaccard (similarity) coefficient between words.

Usage

jaccard_coef(x, ...)
## S3 method for class 'list'
jaccard_coef(x, max.size = 1000, dist = FALSE)
## S3 method for class 'character'
jaccard_coef(x, max.size = 1000, stopwds = unique(c(tm::stopwords(), letters)), ignore.case = TRUE, dist = FALSE)

Arguments

x
Character vector with the phrases (tweets) to be analyzed
max.size
Maximum number of words to analyze
dist
When TRUE, computes one minus the Jaccard coefficient (a dissimilarity instead of a similarity)
stopwds
Character vector of stopwords
ignore.case
When TRUE, converts all text to lowercase

Value

A list that includes a sparse matrix of class dgCMatrix

Methods (by class)

  • list: Processes a list of character vectors, such as the one obtained from tw_extract
  • character: Computes the coef from a vector of characters (splits the text)
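
To illustrate how the two methods relate, the following base R sketch turns a character vector of phrases into a list of word vectors, roughly the shape of input the list method works with. It is not the package's code: the helper name `split_words`, the splitting rule, and the short stopword vector are assumptions made for this example (the documented default is `unique(c(tm::stopwords(), letters))`).

```r
# Hypothetical helper, NOT part of twitterreport: split phrases into words,
# optionally lowercasing first and dropping stopwords afterwards.
split_words <- function(x,
                        stopwds = c("the", "a", "is", letters),
                        ignore.case = TRUE) {
  if (ignore.case) x <- tolower(x)
  # Assumed rule: split on runs of characters that are not
  # alphanumeric, '#' or '@' (so hashtags and mentions stay intact)
  words <- strsplit(x, "[^[:alnum:]#@]+")
  lapply(words, function(w) {
    w <- unique(w[nzchar(w)])   # drop empty strings and repeated words
    w[!(w %in% stopwds)]        # drop stopwords
  })
}

tweets <- c("The cat is on the mat", "A dog chased the cat")
word_list <- split_words(tweets)
# word_list[[1]] is c("cat", "on", "mat")
# word_list[[2]] is c("dog", "chased", "cat")
```

Each element of `word_list` is the set of words found in one phrase, which is the unit over which word co-occurrence is counted.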

Details

The Jaccard index measures the similarity between two elements. For a given pair of elements $x,y$ it is calculated as $$J(S,T) = \frac{|S\cap T|}{|S\cup T|}$$ where $S$ is the set of groups in which $x$ is present and $T$ is the set of groups in which $y$ is present. The resulting value lies between 0 and 1, where 0 corresponds to no similarity at all (the elements have no group in common) and 1 represents perfect similarity (both elements are present in exactly the same groups).
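
The formula can be worked through by hand in base R. The sketch below assumes that the elements are words and the groups are phrases, and it builds a small dense matrix for readability. It is an illustration of the definition only, not the package's implementation, which returns a sparse dgCMatrix; all object names here are made up for the example.

```r
# Each list element is the set of words in one phrase (one "group").
groups <- list(
  c("cat", "mat"),
  c("cat", "dog"),
  c("dog", "bone")
)

words <- sort(unique(unlist(groups)))

# Incidence matrix: rows are groups, columns are words (1 = word present)
inc <- t(sapply(groups, function(g) as.integer(words %in% g)))
colnames(inc) <- words

# Size of the intersection: number of groups containing both words
both <- crossprod(inc)

# Size of the union: |S| + |T| - |S and T|
n <- diag(both)
either <- outer(n, n, "+") - both

jaccard <- both / either
distance <- 1 - jaccard   # analogous to what dist = TRUE describes

# "cat" and "dog" share 1 of the 3 groups in which either appears:
# jaccard["cat", "dog"] is 1/3
# "bone" and "mat" never appear together:
# jaccard["bone", "mat"] is 0
```

Every word appears in at least one group, so the union is never empty and the division is always defined; the diagonal is 1 because each word is perfectly similar to itself.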

References

Conover, M., Ratkiewicz, J., & Francisco, M. (2011). "Political polarization on Twitter". ICWSM, 133(26), 89–96. http://doi.org/10.1021/ja202932e