udpipe (version 0.3)

document_term_matrix: Create a document/term matrix from a data.frame with 1 row per document/term

Description

Create a document/term matrix from a data.frame with 1 row per document/term as returned by document_term_frequencies

Usage

document_term_matrix(x, vocabulary, ...)

# S3 method for data.frame
document_term_matrix(x, vocabulary, ...)

# S3 method for DocumentTermMatrix
document_term_matrix(x, ...)

# S3 method for TermDocumentMatrix
document_term_matrix(x, ...)

# S3 method for simple_triplet_matrix
document_term_matrix(x, ...)

Arguments

x

a data.frame with columns doc_id, term and freq indicating how many times a term occurred in that specific document. This is what document_term_frequencies returns.

vocabulary

a character vector of terms which should be present in the document term matrix even if they did not occur in x

...

further arguments currently not used

Value

a sparse object of class dgCMatrix with the documents in the rows and the terms in the columns, containing the frequencies provided in x, extended with terms which were not in x but were provided in vocabulary. The rownames of the resulting object contain the doc_id from x.
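
To make the shape of the return value concrete, the following is a minimal sketch of how such a matrix could be assembled by hand with Matrix::sparseMatrix. It is not the package's actual implementation; the toy data, the names docs, terms and dtm_sketch, and the column order (terms from x first, then any extra vocabulary terms) are assumptions made for illustration.

```r
library(Matrix)

# Toy input: one row per document/term pair with its frequency
x <- data.frame(doc_id = c("d1", "d1", "d2"),
                term   = c("A", "C", "A"),
                freq   = c(1, 5, 2),
                stringsAsFactors = FALSE)
vocabulary <- c("A", "B", "C")  # hypothetical vocabulary; "B" never occurs in x

docs  <- unique(x$doc_id)
terms <- union(unique(x$term), vocabulary)  # assumed column order

# Map each row of x to a (row, column) position and store its frequency there
dtm_sketch <- sparseMatrix(i = match(x$doc_id, docs),
                           j = match(x$term, terms),
                           x = x$freq,
                           dims = c(length(docs), length(terms)),
                           dimnames = list(docs, terms))
class(dtm_sketch)
dim(dtm_sketch)
```

Terms that appear only in vocabulary get a column that holds no stored entries, which is why extending the vocabulary costs almost nothing in a sparse representation.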

Methods (by class)

  • data.frame: Construct a document term matrix from a data.frame with columns doc_id, term, freq

  • DocumentTermMatrix: Convert an object of class DocumentTermMatrix from the tm package to a sparseMatrix

  • TermDocumentMatrix: Convert an object of class TermDocumentMatrix from the tm package to a sparseMatrix with the documents in the rows and the terms in the columns

  • simple_triplet_matrix: Convert an object of class simple_triplet_matrix from the slam package to a sparseMatrix
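
The conversions listed above can be pictured with a small sketch. A simple_triplet_matrix stores its non-zero cells as the fields i, j and v together with nrow, ncol and dimnames; the example below uses a plain list with those fields as a stand-in (so that only the Matrix package is needed) and is an illustration under that assumption, not the code the methods actually run. The names stm, m and tdm are hypothetical.

```r
library(Matrix)

# Stand-in for a slam simple_triplet_matrix: a plain list with the same fields
stm <- list(i = c(1L, 1L, 2L),          # row index of each non-zero cell
            j = c(1L, 3L, 2L),          # column index of each non-zero cell
            v = c(2, 1, 4),             # value of each non-zero cell
            nrow = 2L, ncol = 3L,
            dimnames = list(c("d1", "d2"), c("A", "B", "C")))

# Triplet form maps directly onto the arguments of sparseMatrix
m <- sparseMatrix(i = stm$i, j = stm$j, x = stm$v,
                  dims = c(stm$nrow, stm$ncol),
                  dimnames = stm$dimnames)

# A term/document layout only needs a transpose to put documents in the rows
tdm <- t(m)
dim(t(tdm))
```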

See Also

sparseMatrix, document_term_frequencies

Examples

# NOT RUN {
x <- data.frame(doc_id = c(1, 1, 2, 3, 4), 
 term = c("A", "C", "Z", "X", "G"), 
 freq = c(1, 5, 7, 10, 0))
document_term_matrix(x)
document_term_matrix(x, vocabulary = LETTERS)

## Example on a larger dataset
data(brussels_reviews_anno)
x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "lemma")])
dtm <- document_term_matrix(x)
dim(dtm)

## Example showing the vocabulary argument,
## which makes sure that terms which are not in the data are still present in the resulting dtm
allterms <- unique(x$term)
dtm <- document_term_matrix(head(x, 1000), vocabulary = allterms)

##
## Example adding bigrams/trigrams to the document term matrix
## Note that this can also be done using ?dtm_cbind
##
library(data.table)
x <- as.data.table(brussels_reviews_anno)
x <- x[, token_bigram := txt_nextgram(token, n = 2), by = list(doc_id, sentence_id)]
x <- x[, token_trigram := txt_nextgram(token, n = 3), by = list(doc_id, sentence_id)]
x <- document_term_frequencies(x = x, 
                               document = "doc_id", 
                               term = c("token", "token_bigram", "token_trigram"))
dtm <- document_term_matrix(x)
# }
