polmineR (version 0.7.9)

as.TermDocumentMatrix: Generate TermDocumentMatrix / DocumentTermMatrix.

Description

Methods to generate the classes TermDocumentMatrix or DocumentTermMatrix as defined in the tm package. These classes inherit from the simple_triplet_matrix-class defined in the slam-package. There are many text mining applications for document-term matrices. A DocumentTermMatrix is required as input by the topicmodels package, for instance.
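The class relationship described above can be checked on a toy matrix without any corpus. The following is a minimal sketch that uses only the slam and tm packages. The terms, document names and counts are made up for illustration, and the call goes to tm's own `as.TermDocumentMatrix()` (addressed as `tm::`), not to the polmineR methods documented on this page.

```r
library(slam)
library(tm)

# Toy counts (made up): 3 terms (rows) x 2 documents (columns), stored as
# triplets, i.e. one (i, j, v) entry per non-zero cell.
stm <- simple_triplet_matrix(
  i = c(1L, 2L, 2L, 3L),  # row (term) index of each non-zero cell
  j = c(1L, 1L, 2L, 2L),  # column (document) index of each non-zero cell
  v = c(2, 1, 4, 3),      # the counts themselves
  nrow = 3L, ncol = 2L,
  dimnames = list(c("arbeit", "bund", "land"), c("doc_a", "doc_b"))
)

# Turn the triplet matrix into a tm TermDocumentMatrix (term frequency weighting).
tdm <- tm::as.TermDocumentMatrix(stm, weighting = weightTf)

# Transposing a TermDocumentMatrix yields a DocumentTermMatrix.
dtm <- t(tdm)

class(tdm)                              # includes "simple_triplet_matrix"
inherits(dtm, "simple_triplet_matrix")  # the inheritance holds for both classes
```

Because both classes are triplet matrices underneath, a `TermDocumentMatrix` and a `DocumentTermMatrix` hold the same data and differ only in orientation (terms in rows versus documents in rows).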

Usage

as.TermDocumentMatrix(x, ...)

# S4 method for character
as.TermDocumentMatrix(x, p_attribute, s_attribute, verbose = TRUE, ...)

# S4 method for character
as.DocumentTermMatrix(x, p_attribute, s_attribute, verbose = TRUE, ...)

# S4 method for bundle
as.TermDocumentMatrix(x, col, p_attribute = NULL, verbose = TRUE, ...)

# S4 method for bundle
as.DocumentTermMatrix(x, col, p_attribute = NULL, verbose = TRUE, ...)

# S4 method for partition_bundle
as.TermDocumentMatrix(x, p_attribute = NULL, col = NULL, verbose = TRUE, ...)

# S4 method for partition_bundle
as.DocumentTermMatrix(x, p_attribute = NULL, col = NULL, verbose = TRUE, ...)

# S4 method for context
as.DocumentTermMatrix(x, p_attribute, verbose = TRUE, ...)

# S4 method for context
as.TermDocumentMatrix(x, p_attribute, verbose = TRUE, ...)

Arguments

x

a character vector indicating a corpus, or an object of class bundle, or inheriting from class bundle (e.g. partition_bundle)

...

s-attribute definitions used for subsetting the corpus, compare partition-method

p_attribute

the p-attribute that counting is based on

s_attribute

s-attribute that defines content of columns, or rows

verbose

logical, whether to output progress messages

col

the column of the data.table in slot stat (if x is a bundle) to use for assembling the matrix

Value

a TermDocumentMatrix, or a DocumentTermMatrix if as.DocumentTermMatrix is called

Details

The method can be applied to objects of the class character, bundle, or classes inheriting from the bundle class.

If x refers to a corpus (i.e. is a length-one character vector), a TermDocumentMatrix or DocumentTermMatrix will be generated for subsets of the corpus based on the s_attribute provided. Counts are performed for the p_attribute. Further parameters (passed in as ...) are interpreted as s-attributes that define a subset of the corpus, which is then split according to s_attribute. If struc values for s_attribute are not unique, the necessary aggregation is performed, which slows things down somewhat.

If x is a bundle or a class inheriting from it, the counts or whatever measure is present in the stat slots (in the column indicated by col) will be turned into the values of the sparse matrix that is generated. A special case is the generation of the sparse matrix based on a partition_bundle that does not yet include counts. In this case, a p_attribute needs to be provided. Then counting will be performed, too.
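How per-document count tables become the values of a sparse matrix can be sketched with slam alone. The snippet below is an illustration of the idea, not polmineR's actual implementation: the list `stats` is a hypothetical stand-in for the stat tables of a bundle, with made-up tokens and counts, and `col` plays the role of the col argument.

```r
library(slam)

# Hypothetical stand-in for the 'stat' tables of a bundle: one table per
# document, with a token column ("word") and a measure column ("count").
stats <- list(
  "2009-10-27" = data.frame(word = c("haus", "land"),
                            count = c(3L, 1L), stringsAsFactors = FALSE),
  "2009-10-28" = data.frame(word = c("land", "bund", "haus"),
                            count = c(2L, 5L, 1L), stringsAsFactors = FALSE)
)
col <- "count"  # which column supplies the cell values

# The vocabulary across all documents defines the rows.
terms <- unique(unlist(lapply(stats, function(df) df$word)))

# One triplet per table row: term index, document index, value.
i <- unlist(lapply(stats, function(df) match(df$word, terms)), use.names = FALSE)
j <- rep(seq_along(stats), times = vapply(stats, nrow, integer(1)))
v <- unlist(lapply(stats, function(df) df[[col]]), use.names = FALSE)

m <- simple_triplet_matrix(
  i = i, j = j, v = v,
  nrow = length(terms), ncol = length(stats),
  dimnames = list(terms, names(stats))
)

as.matrix(m)  # dense view; cells without a triplet are zero
```

Only non-zero cells are stored, which is why a term that is missing from a document's table ("bund" on the first date here) costs nothing in the sparse representation.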

Examples

# NOT RUN {
library(polmineR)
use("polmineR")

# do it yourself: count first, then build the matrix from the count column
p <- partition("GERMAPARLMINI", date = ".*", regex = TRUE)
pB <- partition_bundle(p, s_attribute = "date")
pB <- enrich(pB, p_attribute = "word")
tdm <- as.TermDocumentMatrix(pB, col = "count")

# leave the counting to the as.TermDocumentMatrix method
pB2 <- partition_bundle(p, s_attribute = "date")
tdm <- as.TermDocumentMatrix(pB2, p_attribute = "word", verbose = TRUE)

# the direct route: start from the corpus name
tdm <- as.TermDocumentMatrix("GERMAPARLMINI", p_attribute = "word", s_attribute = "date")
# }