dfm. Uses the similarity measures defined in simil. See pr_DB for available distance measures, or how to create your own.

similarity(x, selection, n = 10, margin = c("features", "documents"),
  method = "correlation", sort = TRUE, normalize = TRUE, digits = 4)

## S3 method for class 'dfm,index':
similarity(x, selection, n = 10,
  margin = c("features", "documents"), method = "correlation",
  sort = TRUE, normalize = TRUE, digits = 4)
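
The signature lists margin = c("features", "documents") as a vector of choices rather than a single value. In R this usually means the first choice is the default, resolved inside the function with match.arg(). The sketch below illustrates that idiom only. The helper pick_margin() is hypothetical, and it is an assumption (not confirmed by this page) that similarity() resolves margin this way.

```r
# Hypothetical helper showing the common match.arg() idiom for choice arguments.
# Assumption: similarity() treats `margin` like this; the page does not say so.
pick_margin <- function(margin = c("features", "documents")) {
  # With no argument supplied, match.arg() returns the first listed choice.
  margin <- match.arg(margin)
  margin
}

pick_margin()             # resolves to the first choice, "features"
pick_margin("documents")  # an explicit choice is returned unchanged
```

Under that assumption, calls that omit margin compare features (terms), which matches the term-similarity examples that pass no margin argument.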

n: the n most similar items will be returned, sorted in descending order. If n is NULL, return all items.
margin: features for word/term features or documents for documents.
method: a similarity measure from pr_DB.
sort: if TRUE, sort the results in descending order.
normalize: if TRUE, normalize the dfm by term frequency within document (so that the dfm values will be relative term frequency within each document).

# create a dfm from inaugural addresses from Reagan onwards
presDfm <- dfm(subset(inaugCorpus, Year>1980), ignoredFeatures=stopwords("english"),
stem=TRUE)
# compute some document similarities
similarity(presDfm, "1985-Reagan", n=5, margin="documents")
similarity(presDfm, c("2009-Obama" , "2013-Obama"), n=5, margin="documents")
similarity(presDfm, c("2009-Obama" , "2013-Obama"), n=NULL, margin="documents")
similarity(presDfm, c("2009-Obama" , "2013-Obama"), n=NULL, margin="documents", method="cosine")
similarity(presDfm, "2005-Bush", n=NULL, margin="documents", method="eJaccard", sort=FALSE)
# compute some term similarities
similarity(presDfm, c("fair", "health", "terror"), method="cosine")
# compare to tm
require(tm)
data("crude")
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
tdm <- TermDocumentMatrix(crude)
findAssocs(tdm, c("oil", "opec", "xyz"), c(0.75, 0.82, 0.1))
# in quanteda
crudeDfm <- dfm(corpus(crude))
similarity(crudeDfm, c("oil", "opec", "xyz"), normalize=FALSE, digits=2)
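
To make the roles of normalize, sort, n and digits concrete, here is a minimal base-R sketch of a correlation similarity over the feature margin of a toy count matrix. It is an illustration under stated assumptions, not quanteda's implementation: the matrix toy and the helper toy_similarity() are hypothetical, and the real function relies on the measures registered in pr_DB rather than on cor().

```r
# Hypothetical toy document-feature counts (rows = documents, columns = features).
toy <- matrix(c(4, 2, 0, 1,
                2, 1, 1, 0,
                0, 1, 5, 3,
                1, 0, 3, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("oil", "opec", "tax", "vote")))

# Hypothetical helper: correlation between one selected feature and all others.
toy_similarity <- function(m, selection, n = 10, sort = TRUE,
                           normalize = TRUE, digits = 4) {
  # Optionally convert counts to relative term frequencies within each document.
  if (normalize) m <- m / rowSums(m)
  # Correlate the selected feature column with every other feature column.
  others <- setdiff(colnames(m), selection)
  sims <- cor(m[, others, drop = FALSE], m[, selection])[, 1]
  # Order from most to least similar, then keep the top n (all if n is NULL).
  if (sort) sims <- sort(sims, decreasing = TRUE)
  if (!is.null(n)) sims <- head(sims, n)
  round(sims, digits)
}

# The two features most correlated with "oil" across the toy documents.
toy_similarity(toy, "oil", n = 2)
```

Setting normalize = FALSE in the sketch correlates raw counts instead of within-document proportions, which is the setting the tm comparison above uses.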