tm_map
Transformations on Corpora
Interface to apply transformation functions (also denoted as mappings) to corpora.
Usage
tm_map(x, FUN, ...)
tm_map(x, FUN, ..., lazy = FALSE)
Arguments
- x
- A corpus.
- FUN
- A transformation function taking a text document as input and returning a text document. The function content_transformer can be used to create a wrapper to get and set the content of text documents.
- ...
- Arguments to FUN.
- lazy
- A logical. Lazy mappings are mappings which are delayed until the content is accessed. This is useful for large corpora when only a few documents will be accessed, because it avoids the computationally expensive application of the mapping to all elements in the corpus.
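As a rough illustration of how FUN and ... fit together, the sketch below wraps a character function with content_transformer and forwards one extra argument through tm_map. The corpus text and the helper name strip_pattern are made up for this example and are not part of the package.

```r
library(tm)

## Small in-memory corpus; the text is invented for illustration
corp <- VCorpus(VectorSource(c("Oil prices ROSE 3 percent",
                               "OPEC met on 12 May")))

## strip_pattern is a hypothetical helper: content_transformer turns a
## character function into one that reads and writes document content
strip_pattern <- content_transformer(function(x, pattern) gsub(pattern, "", x))

## "[[:digit:]]+" is passed on to the wrapped function through ...
corp <- tm_map(corp, strip_pattern, "[[:digit:]]+")
corp <- tm_map(corp, content_transformer(tolower))

content(corp[[1]])
```

Wrapping with content_transformer keeps each element a text document, which matches the requirement on FUN stated above.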
Value
A corpus with FUN applied to each document in x. For lazy mappings, only internal flags are set; accessing an individual document triggers execution of the corresponding transformation function.
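The difference between eager and lazy mappings can be sketched as follows. This is a minimal example on an invented three-document corpus, assuming a VCorpus (the usage above lists lazy only for the second method); the names upper, eager, and lazy_corp are illustrative.

```r
library(tm)

## Invented three-document corpus for illustration
corp <- VCorpus(VectorSource(c("first document",
                               "second document",
                               "third document")))

upper <- content_transformer(toupper)

## Eager: toupper is applied to all three documents right away
eager <- tm_map(corp, upper)

## Lazy: the mapping is only recorded; no document is transformed yet
lazy_corp <- tm_map(corp, upper, lazy = TRUE)

## Accessing a document triggers the pending mapping for that document
content(lazy_corp[[1]])
content(eager[[1]])
```

Both accesses should yield the same transformed text; what differs is when the work is done.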
Note
Lazy transformations change R's standard evaluation semantics.
See Also
getTransformations for available transformations.
Examples
data("crude")
## Document access triggers the stemming function
## (i.e., all other documents are not stemmed yet)
tm_map(crude, stemDocument, lazy = TRUE)[[1]]
## Use wrapper to apply character processing function
tm_map(crude, content_transformer(tolower))
## Generate a custom transformation function which takes the heading as new content
headings <- function(x)
    PlainTextDocument(meta(x, "heading"),
                      id = meta(x, "id"),
                      language = meta(x, "language"))
inspect(tm_map(crude, headings))
Community examples
## docs is assumed to be an existing corpus
docs <- tm_map(docs, removeWords,
               c("tá", "ta", "pra", "tô", "to", "bem", "mete", "frente", "chegou",
                 "joga", "vai", "vem", "assim", "pro", "vou", "desde", "fiz", "vim",
                 "não", "nao", "logo", "entra", "hora", "muito", "cima", "sim",
                 "ligado", "tchau", "música", "musica", "vários", "varios", "vão",
                 "vao", "todas", "chora", "toma", "lá", "tomar", "som", "vamo",
                 "ponta", "tomo", "sabe", "todo", "chama", "pura", "ver", "fazer",
                 "pega", "falar", "fim", "passa", "tirando", "nada", "pois", "faz",
                 "mim", "sei", "tambem", "jeito", "deu", "cada", "mó", "sao", "são",
                 "nova", "moleque", "muleque", "gente", "pesado", "porque", "pouco",
                 "forte", "problema", "lado", "entao", "daqui"))
library(tm)
Sample_data <- read.csv("D:/Projects/Machine Learning/WorkSpace/ML_20170719/ML_New/LOB.csv")
data_frame <- do.call('rbind', lapply(Sample_data, as.data.frame))
myCorpus <- Corpus(VectorSource(data_frame))
## tolower is a plain character function, so it is wrapped with
## content_transformer; the PlainTextDocument conversion is then not needed
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
#myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
myCorpus <- tm_map(myCorpus, stripWhitespace)