These functions compute matrices of distances and similarities between documents or features from a dfm and return a dist object (or a matrix if specific targets are selected). They are fast and robust because they operate directly on the sparse dfm objects.
textstat_dist(x, selection = NULL, margin = c("documents", "features"),
  method = "euclidean", upper = FALSE, diag = FALSE, p = 2)

textstat_simil(x, selection = NULL, margin = c("documents", "features"),
  method = "correlation", upper = FALSE, diag = FALSE)
x: a dfm object
selection: character vector of document names or feature labels from x. A "dist" object is returned if selection is NULL; otherwise, a matrix is returned.
margin: identifies the margin of the dfm on which similarity or distance will be computed: documents for documents or features for word/term features.
method: the similarity or distance measure to be used; see Details
upper: whether the upper triangle of the symmetric matrix should be recorded
diag: whether the diagonal of the distance matrix should be recorded
p: the power of the Minkowski distance
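As a rough illustration of what p controls, the sketch below uses base R's dist() on a small dense matrix rather than textstat_dist() on a dfm. The matrix, its row names, and the helper minkowski() are hypothetical and exist only for this example. With p = 2 the Minkowski distance coincides with the Euclidean distance, and with p = 1 it coincides with the Manhattan distance.

```r
# Toy document-by-feature counts (hypothetical data, dense for simplicity)
counts <- matrix(c(2, 0, 1,
                   0, 3, 1,
                   1, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("tax", "health", "war")))

# Minkowski distance of order p between two vectors, written out by hand
# (illustrative helper, not part of any package)
minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

# The same distance for all document pairs via base R
mink_p2 <- dist(counts, method = "minkowski", p = 2)
mink_p1 <- dist(counts, method = "minkowski", p = 1)

# p = 2 reproduces Euclidean, p = 1 reproduces Manhattan
all.equal(as.vector(mink_p2), as.vector(dist(counts, method = "euclidean")))
all.equal(as.vector(mink_p1), as.vector(dist(counts, method = "manhattan")))

# The hand-written helper agrees with dist() for a single pair
all.equal(minkowski(counts["doc1", ], counts["doc2", ], 2),
          as.matrix(mink_p2)["doc1", "doc2"])
```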
textstat_simil and textstat_dist return dist class objects, or a matrix when selection is specified.
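The difference between the two return types can be seen with base R alone. This sketch uses stats::dist() on hypothetical toy data, not textstat_dist() itself: a "dist" object stores only one triangle of the pairwise distances, and as.matrix() expands it to a full symmetric matrix from which the column for a chosen document can be extracted.

```r
# Toy document-by-feature counts (hypothetical data)
counts <- matrix(c(2, 0, 1,
                   0, 3, 1,
                   1, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("tax", "health", "war")))

d <- dist(counts)   # "dist" object holding the pairwise distances
class(d)
length(d)           # n * (n - 1) / 2 values for n = 3 documents

# Expand to a full symmetric matrix with a zero diagonal
full <- as.matrix(d)

# Distances from every document to doc1, kept as a one-column matrix
selected <- full[, "doc1", drop = FALSE]
selected
```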
textstat_dist options are: "euclidean" (default), "Chisquared", "Chisquared2", "hamming", "kullback", "manhattan", "maximum", "canberra", and "minkowski".
textstat_simil options are: "correlation" (default), "cosine", "jaccard", "eJaccard", "dice", "eDice", "simple matching", "hamann", and "faith".
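For orientation, two of these options can be written out in a few lines of base R. This is a sketch of the textbook definitions on hypothetical dense vectors, assuming that "correlation" refers to the Pearson correlation; it is not a description of how textstat_simil() computes them on a sparse dfm.

```r
# Two hypothetical feature-count vectors for a pair of documents
x <- c(2, 0, 1, 4)
y <- c(1, 1, 0, 3)

# Cosine similarity: dot product divided by the product of the vector norms
cosine_sim <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Pearson correlation is the cosine similarity of the mean-centred vectors
xc <- x - mean(x)
yc <- y - mean(y)
correlation_sim <- sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))

# Check the hand-written correlation against base R's cor()
all.equal(correlation_sim, cor(x, y))
```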
The "Chisquared" metric is from Legendre, P., & Gallagher, E. D. (2001). "Ecologically meaningful transformations for ordination of species data". Oecologia, 129(2), 271–280. doi.org/10.1007/s004420100716
The "Chisquared2" metric is the "Quadratic-Chi" measure from Pele, O., & Werman, M. (2010). "The Quadratic-Chi Histogram Distance Family". In Computer Vision – ECCV 2010 (Vol. 6312, pp. 749–762). Berlin, Heidelberg: Springer. doi.org/10.1007/978-3-642-15552-9_54.
"hamming" is sum(x != y), the number of positions at which the two vectors differ.
"kullback" is the Kullback-Leibler distance, which assumes that
All other measures are described in the proxy package.
# NOT RUN {
# create a dfm from inaugural addresses delivered after 1990
presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1990),
               remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
# distances for documents
(d1 <- textstat_dist(presDfm, margin = "documents"))
as.matrix(d1)
# distances for specific documents
textstat_dist(presDfm, "2017-Trump", margin = "documents")
textstat_dist(presDfm, "2005-Bush", margin = "documents", method = "manhattan")
(d2 <- textstat_dist(presDfm, c("2009-Obama" , "2013-Obama"), margin = "documents"))
as.list(d1)
# similarities for documents
(s1 <- textstat_simil(presDfm, method = "cosine", margin = "documents"))
as.matrix(s1)
as.list(s1)
# similarities for specific documents
textstat_simil(presDfm, "2017-Trump", margin = "documents")
textstat_simil(presDfm, "2017-Trump", method = "cosine", margin = "documents")
textstat_simil(presDfm, c("2009-Obama" , "2013-Obama"), margin = "documents")
# compute some term similarities
s2 <- textstat_simil(presDfm, c("fair", "health", "terror"), method = "cosine",
                     margin = "features")
head(as.matrix(s2), 10)
as.list(s2, n = 8)
# }