udpipe (version 0.8.11)

document_term_frequencies: Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document

Description

Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document

Usage

document_term_frequencies(x, document, ...)

# S3 method for data.frame
document_term_frequencies(
  x,
  document = colnames(x)[1],
  term = colnames(x)[2],
  ...
)

# S3 method for character
document_term_frequencies(
  x,
  document = paste("doc", seq_along(x), sep = ""),
  split = "[[:space:][:punct:][:digit:]]+",
  ...
)

Value

a data.table with columns doc_id, term, freq indicating how many times a term occurred in each document. If a column called freq was present in the input dataset, the resulting freq is the sum of that column per document/term combination. If freq is not in the dataset, freq is assumed to be 1 for each row in the input dataset x.
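
As a minimal illustration of this behaviour (toy data, not part of the package), counting rows versus summing an existing freq column:

library(udpipe)
## no freq column: rows are counted per doc_id/term combination
toy <- data.frame(doc_id = c("doc1", "doc1", "doc2"),
                  term   = c("hotel", "hotel", "room"))
document_term_frequencies(toy)

## a freq column is present: its values are summed per doc_id/term combination
toy$freq <- c(3, 2, 5)
document_term_frequencies(toy)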

Arguments

x

a data.frame or data.table containing a field which can be considered as a document (defaults to the first column in x) and a field which can be considered as a term (defaults to the second column in x). If the dataset also contains a column called 'freq', this column will be summed instead of counting the number of rows per document/term combination.
If x is a character vector containing several terms, the text will be split by the argument split before doing the aggregation at the document/term level.

document

If x is a data.frame, the column in x which identifies a document. If x is a character vector then document is a vector of the same length as x where document[i] is the document id which corresponds to the text in x[i].
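
For example (illustrative values only), each element of document labels the corresponding element of x:

txt <- c("nice room nice view", "friendly staff")
document_term_frequencies(x = txt, document = c("review_1", "review_2"))
## review_1: nice (freq 2), room, view / review_2: friendly, staff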

...

further arguments passed on to the methods

term

If x is a data.frame, the column in x which identifies a term. Defaults to the second column in x.

split

The regular expression to be used if x is a character vector. This splits the character vector x into pieces according to the provided split argument. Defaults to splitting according to spaces/punctuation/digits.
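
To see which pieces the default pattern produces before aggregation, you can apply it directly with strsplit (illustrative sentence only):

strsplit("Great stay, room 12 was clean!", split = "[[:space:][:punct:][:digit:]]+")
## pieces: "Great", "stay", "room", "was", "clean" - runs of spaces, punctuation and digits act as separators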

Methods (by class)

  • data.frame: Create a data.frame with one row per document/term combination indicating the frequency of the term in the document

  • character: Split the text according to the split argument and create a data.frame with one row per document/term combination indicating the frequency of the term in the document

Examples

data.table::setDTthreads(1)
##
## Calculate document_term_frequencies on a data.frame
##
data(brussels_reviews_anno)
brussels_reviews_anno <- subset(brussels_reviews_anno, language %in% "nl")
x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "token")])
x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "lemma")])
str(x)

brussels_reviews_anno$my_doc_id <- paste(brussels_reviews_anno$doc_id, 
                                         brussels_reviews_anno$sentence_id)
x <- document_term_frequencies(brussels_reviews_anno[, c("my_doc_id", "lemma")])

##
## Calculate document_term_frequencies on a character vector
##
data(brussels_reviews)
x <- document_term_frequencies(x = brussels_reviews$feedback, document = brussels_reviews$id, 
                               split = " ")
x <- document_term_frequencies(x = brussels_reviews$feedback, document = brussels_reviews$id, 
                               split = "[[:space:][:punct:][:digit:]]+")
                               
##
## document-term-frequencies on several fields to easily include bigram and trigrams
##
library(data.table)
x <- as.data.table(brussels_reviews_anno)
x <- x[, token_bigram  := txt_nextgram(token, n = 2), by = list(doc_id, sentence_id)]
x <- x[, token_trigram := txt_nextgram(token, n = 3), by = list(doc_id, sentence_id)]
x <- document_term_frequencies(x = x, 
                               document = "doc_id", 
                               term = c("token", "token_bigram", "token_trigram"))
head(x)
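
## As a follow-up sketch: the resulting doc_id/term/freq data is the typical
## input for document_term_matrix (also provided by udpipe) in case a sparse
## document/term matrix is needed.
dtm <- document_term_matrix(x)
dim(dtm)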
