polmineR (version 0.7.8)

as.TermDocumentMatrix: Generate TermDocumentMatrix / DocumentTermMatrix.

Description

Method to generate the classes TermDocumentMatrix or DocumentTermMatrix as defined in the tm package. These classes inherit from the simple_triplet_matrix-class defined in the slam-package. There are many text mining applications for document-term matrices. A DocumentTermMatrix is required as input by the topicmodels package, for instance.
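The relationship between these classes can be illustrated with a small sketch that does not need a corpus. It builds a toy simple_triplet_matrix by hand, coerces it with the tm package, and transposes it into a DocumentTermMatrix. The terms, dates, and counts are invented for illustration, and this is not necessarily how polmineR constructs its result internally.

```r
library(slam)
library(tm)

# Toy counts (invented): 3 terms x 2 documents, stored as (row, column, value) triplets.
stm <- simple_triplet_matrix(
  i = c(1L, 2L, 2L, 3L),   # term index of each non-zero cell
  j = c(1L, 1L, 2L, 2L),   # document index of each non-zero cell
  v = c(2, 1, 4, 3),       # counts
  nrow = 3L, ncol = 2L,
  dimnames = list(
    Terms = c("haus", "land", "recht"),
    Docs = c("2009-10-27", "2009-10-28")
  )
)

# Coerce to the tm class, using plain term frequency as weighting.
tdm <- tm::as.TermDocumentMatrix(stm, weighting = weightTf)

# Transposing a TermDocumentMatrix yields a DocumentTermMatrix,
# the orientation expected by packages such as topicmodels.
dtm <- t(tdm)
```

The tm:: prefix is used because polmineR defines its own generic of the same name.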

Usage

as.TermDocumentMatrix(x, ...)

# S4 method for character
as.TermDocumentMatrix(x, pAttribute, sAttribute, verbose = TRUE, ...)

# S4 method for character
as.DocumentTermMatrix(x, pAttribute, sAttribute, verbose = TRUE, ...)

# S4 method for bundle
as.TermDocumentMatrix(x, col, pAttribute = NULL, verbose = TRUE)

# S4 method for bundle
as.DocumentTermMatrix(x, col)

# S4 method for partitionBundle
as.TermDocumentMatrix(x, pAttribute = NULL, col = NULL, verbose = TRUE)

# S4 method for partitionBundle
as.DocumentTermMatrix(x, pAttribute = NULL, col = NULL, verbose = TRUE)

# S4 method for context
as.DocumentTermMatrix(x, pAttribute, verbose = TRUE)

# S4 method for context
as.TermDocumentMatrix(x, pAttribute, verbose = TRUE)

Arguments

x

a character vector indicating a corpus, or an object of class bundle, or inheriting from class bundle (e.g. partitionBundle)

...

s-attribute definitions used for subsetting the corpus, compare partition-method

pAttribute

the p-attribute that counting is based on

sAttribute

s-attribute that defines content of columns, or rows

verbose

logical, whether to output progress messages

col

the column of the data.table in slot stat (if x is a bundle) to use for assembling the matrix

Value

a TermDocumentMatrix, or a DocumentTermMatrix if as.DocumentTermMatrix is called

Details

The method can be applied on objects of the class character, bundle, or classes inheriting from the bundle class.

If x refers to a corpus (i.e. is a length 1 character vector), a TermDocumentMatrix or DocumentTermMatrix will be generated for subsets of the corpus based on the sAttribute provided. Counts are performed for the pAttribute. Further parameters (passed in as ...) are interpreted as s-attributes that define a subset of the corpus, which is then split according to sAttribute. If struc values for sAttribute are not unique, the necessary aggregation is performed, which slows things down somewhat.
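The idea of counting a p-attribute within groups defined by an s-attribute can be sketched in base R. The token table below is invented and merely stands in for a corpus; polmineR works on CWB-indexed data and does not necessarily proceed this way. Note that a date occurring in several places is aggregated into one column, which corresponds to the aggregation mentioned above.

```r
# Invented token stream: one row per token, with its word form (p-attribute)
# and the date of the speech it belongs to (s-attribute).
tokens <- data.frame(
  word = c("haus", "land", "land", "recht", "land"),
  date = c("2009-10-27", "2009-10-27", "2009-10-28", "2009-10-28", "2009-10-28"),
  stringsAsFactors = FALSE
)

# Cross-tabulate words by date: rows are terms, columns are documents.
# Tokens sharing a date end up in the same column.
counts <- table(Terms = tokens$word, Docs = tokens$date)
```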

If x is a bundle or a class inheriting from it, the counts or whatever measure is present in the stat slots (in the column indicated by col) will be turned into the values of the sparse matrix that is generated. A special case is the generation of the sparse matrix based on a partitionBundle that does not yet include counts. In this case, a pAttribute needs to be provided. Then counting will be performed, too.
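As a rough sketch of what turning the stat tables into a sparse matrix involves, the following uses a plain list of data frames as a hypothetical stand-in for the stat slots of a bundle. The object names and data are assumptions made for illustration; the actual method operates on data.table objects inside polmineR classes.

```r
library(slam)

# Hypothetical stand-in for the stat slots of a bundle:
# one table per document, with a term column and a measure column.
stats <- list(
  "2009-10-27" = data.frame(word = c("haus", "land"), count = c(2L, 1L),
                            stringsAsFactors = FALSE),
  "2009-10-28" = data.frame(word = c("land", "recht"), count = c(4L, 3L),
                            stringsAsFactors = FALSE)
)
col <- "count"  # the column whose values fill the matrix

# Shared vocabulary across all documents.
terms <- unique(unlist(lapply(stats, function(d) d$word), use.names = FALSE))

# Triplets: row index (term), column index (document), value (measure).
i <- unlist(lapply(stats, function(d) match(d$word, terms)), use.names = FALSE)
j <- rep(seq_along(stats), times = vapply(stats, nrow, integer(1)))
v <- unlist(lapply(stats, function(d) d[[col]]), use.names = FALSE)

m <- simple_triplet_matrix(
  i = i, j = j, v = v,
  nrow = length(terms), ncol = length(stats),
  dimnames = list(Terms = terms, Docs = names(stats))
)

# Dense view, only sensible for toy data.
mm <- as.matrix(m)
```

Any numeric column could be selected through col, which is why the resulting matrix may hold counts or another measure.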

Examples

# NOT RUN {
use("polmineR")
 
# do-it-yourself 
p <- partition("GERMAPARLMINI", date=".*", regex=TRUE)
pB <- partitionBundle(p, sAttribute = "date")
pB <- enrich(pB, pAttribute="word")
tdm <- as.TermDocumentMatrix(pB, col = "count")
   
# leave the counting to the as.TermDocumentMatrix-method
pB2 <- partitionBundle(p, sAttribute = "date")
tdm <- as.TermDocumentMatrix(pB2, pAttribute = "word", verbose = TRUE)
   
# direttissima (the most direct route)
tdm <- as.TermDocumentMatrix("GERMAPARLMINI", pAttribute = "word", sAttribute = "date")
# }