Corpus

Corpora

Representing and computing on corpora.

Details

Corpora are collections of documents containing (natural language) text. In packages that employ the infrastructure provided by package tm, corpora are represented via the virtual S3 class Corpus; packages then provide S3 corpus classes extending this virtual base class (such as VCorpus, provided by package tm itself).

All extension classes must provide accessors to extract subsets ([), individual documents ([[), and metadata (meta). The function length must return the number of documents, and as.list must construct a list holding the documents.
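The required accessors can be exercised with a minimal sketch, assuming package tm is installed and using a small in-memory character vector as the document source:

```r
library(tm)

docs <- c("The first document.", "The second document.")
corpus <- VCorpus(VectorSource(docs))

length(corpus)      # number of documents in the corpus
sub <- corpus[1]    # `[` extracts a subset -- the result is still a corpus
doc <- corpus[[2]]  # `[[` extracts a single document
as.list(corpus)     # a plain list holding the documents
```

Any class extending Corpus (e.g., PCorpus or a third-party corpus class) is expected to support these same operations.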

A corpus can have two types of metadata (accessible via meta). Corpus metadata contains corpus-specific metadata in the form of tag-value pairs. Document-level metadata contains document-specific metadata but is stored in the corpus as a data frame. Document-level metadata is typically used for semantic reasons (e.g., classifications of documents form their own entity because of high-level information such as the range of possible values) or for performance reasons (a single access instead of extracting the metadata of each document).
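The two metadata levels can be illustrated with a short sketch, assuming package tm is installed; the tag names ("creator", "topic") are arbitrary examples:

```r
library(tm)
corpus <- VCorpus(VectorSource(c("Doc one.", "Doc two.")))

# Corpus metadata: tag-value pairs attached to the corpus itself
meta(corpus, tag = "creator", type = "corpus") <- "A. Author"

# Document-level metadata, stored in the corpus as a data frame
# (one value per document, held in a single structure)
meta(corpus, tag = "topic") <- c("sports", "politics")
meta(corpus)       # the data frame of document-level tags

# Metadata of an individual document, accessed via `[[`
meta(corpus[[1]])
```

Storing the per-document tags in one data frame is what enables the single-access performance benefit mentioned above.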

See Also

VCorpus and PCorpus for the corpus classes provided by package tm.

DCorpus for a distributed corpus class provided by package tm.plugin.dc.

Aliases
  • Corpus
Documentation reproduced from package tm, version 0.6-2, License: GPL-3

Community examples

sprasha6 at Nov 25, 2018 tm v0.7-5

## WORD CLOUD example
# install.packages('tm')
# install.packages('SnowballC')
library(tm)
library(SnowballC)

# Load the dataset
dataset_original = read.csv(file.choose(), stringsAsFactors = FALSE)

# Build and clean the corpus
corpus = VCorpus(VectorSource(dataset_original$Review))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)

# Creating the Bag of Words model
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
dataset = as.data.frame(as.matrix(dtm))
dataset$Liked = dataset_original$Liked

# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

# Word cloud
library(wordcloud)
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
dataset = as.matrix(dtm)
v = sort(colSums(dataset), decreasing = TRUE)
d = data.frame(word = names(v), freq = v)
wordcloud(d$word, d$freq, min.freq = 80, random.color = FALSE, colors = c(3, 4))