udpipe (version 0.3)

document_term_frequencies: Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document

Description

Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document

Usage

document_term_frequencies(x, document, ...)

# S3 method for data.frame
document_term_frequencies(x, document = colnames(x)[1], term = colnames(x)[2], ...)

# S3 method for character
document_term_frequencies(x, document = paste("doc", seq_along(x), sep = ""), split = "[[:space:][:punct:][:digit:]]+", ...)

Arguments

x

a data.frame or data.table containing a field which can be considered as a document (defaults to the first column in x) and a field which can be considered as a term (defaults to the second column in x). If the dataset also contains a column called 'freq', this column will be summed instead of counting the number of rows per document/term combination. If x is a character vector containing several terms, the text will be split by the argument split before doing the aggregation at the document/term level.

document

If x is a data.frame, the column in x which identifies a document. If x is a character vector then document is a vector of the same length as x where document[i] is the document id which corresponds to the text in x[i].

...

further arguments passed on to the methods

term

If x is a data.frame, the column in x which identifies a term. Defaults to the second column in x.

split

The regular expression to be used if x is a character vector. The character vector x is split into pieces by the provided split argument. Defaults to splitting on runs of spaces, punctuation and digits.
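To make the effect of the default split pattern concrete, here is a minimal base-R sketch of splitting and counting. It is not udpipe's implementation: the toy texts, the variable names and the use of strsplit/aggregate are assumptions made for illustration only.

```r
# Illustrative sketch only; udpipe itself is not used here.
# Hypothetical toy input, not taken from the package data.
texts   <- c("the cat saw the dog.", "Room 12 was noisy!")
doc_ids <- paste("doc", seq_along(texts), sep = "")

# Same regular expression as the documented default of `split`
split_pattern <- "[[:space:][:punct:][:digit:]]+"

# Split each text into pieces and drop any empty strings
pieces <- strsplit(texts, split = split_pattern)
pieces <- lapply(pieces, function(p) p[nzchar(p)])

# One row per occurrence, then sum to one row per document/term pair
long <- data.frame(document = rep(doc_ids, lengths(pieces)),
                   term     = unlist(pieces),
                   freq     = 1L,
                   stringsAsFactors = FALSE)
counts <- aggregate(freq ~ document + term, data = long, FUN = sum)
print(counts)
```

In this sketch the digits in "Room 12" are consumed by the separator, so "12" never appears as a term, and matching is case sensitive.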

Value

a data.table with columns document, term and the summed freq. If x has no column called freq, freq is assumed to be 1 for each row in x.
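The rule about freq can be sketched in base R as follows. The helper term_frequencies_sketch is hypothetical and exists only to illustrate the described behaviour; it is not part of udpipe and returns a plain data.frame rather than a data.table.

```r
# Hypothetical helper illustrating the documented rule:
# sum an existing 'freq' column, otherwise count each row as 1.
term_frequencies_sketch <- function(x) {
  if (!"freq" %in% colnames(x)) {
    x$freq <- 1L
  }
  # First column is treated as the document, second as the term
  aggregate(x["freq"],
            by  = list(document = x[[1]], term = x[[2]]),
            FUN = sum)
}

# Toy data without a freq column: rows are counted
raw <- data.frame(doc_id = c("a", "a", "b"),
                  lemma  = c("hotel", "hotel", "room"),
                  stringsAsFactors = FALSE)
res_raw <- term_frequencies_sketch(raw)

# Toy data with a freq column: the values are summed
pre <- data.frame(doc_id = c("a", "a"),
                  lemma  = c("hotel", "hotel"),
                  freq   = c(3, 4),
                  stringsAsFactors = FALSE)
res_pre <- term_frequencies_sketch(pre)
```

Pre-aggregated counts can therefore be combined further without expanding them back to one row per occurrence.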

Methods (by class)

  • data.frame: Create a data.frame with one row per document/term combination indicating the frequency of the term in the document

  • character: Split each text in the character vector into terms using split, then create a data.frame with one row per document/term combination indicating the frequency of the term in the document

Examples

# NOT RUN {
##
## Calculate document_term_frequencies on a data.frame
##
library(udpipe)
data(brussels_reviews_anno)
x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "token")])
x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "lemma")])
str(x)

brussels_reviews_anno$my_doc_id <- paste(brussels_reviews_anno$doc_id, 
                                         brussels_reviews_anno$sentence_id)
x <- document_term_frequencies(brussels_reviews_anno[, c("my_doc_id", "lemma")])

##
## Calculate document_term_frequencies on a character vector
##
data(brussels_reviews)
x <- document_term_frequencies(x = brussels_reviews$feedback, document = brussels_reviews$id, 
                               split = " ")
x <- document_term_frequencies(x = brussels_reviews$feedback, document = brussels_reviews$id, 
                               split = "[[:space:][:punct:][:digit:]]+")
                               
##
## document-term-frequencies on several fields to easily include bigrams and trigrams
##
library(data.table)
x <- as.data.table(brussels_reviews_anno)
x <- x[, token_bigram := txt_nextgram(token, n = 2), by = list(doc_id, sentence_id)]
x <- x[, token_trigram := txt_nextgram(token, n = 3), by = list(doc_id, sentence_id)]
x <- document_term_frequencies(x = x, 
                               document = "doc_id", 
                               term = c("token", "token_bigram", "token_trigram"))
head(x)
# }
