build_dtm

corpus

sparsity

A value between 0 and 1. Terms that are absent from more than this
proportion of documents are dropped. By default
all terms are kept (<code>sparsity=1</code>).

dictionary

A vector of terms to which the matrix should be restricted.
By default, all words with at least <code>min_length</code> characters are considered.

remove_stopwords

Whether to remove stopwords appearing in a language-specific list
(see <code>tm::stopwords</code>).

tolower

Whether to convert all text to lower case.

remove_punctuation

Whether to remove all punctuation from text before
tokenizing terms.

remove_numbers

Whether to remove all numbers from text before
tokenizing terms.

min_length

The minimum number of characters for a word to be retained.
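
The effect of the <code>sparsity</code> argument can be illustrated without the package, using base R only. This is a minimal sketch, not the package's implementation: the toy count matrix and the <code>filter_sparse</code> helper are hypothetical, and treating a term exactly at the threshold as kept is an assumption.

```r
# Toy document-term counts: rows are documents, columns are terms.
counts <- matrix(
  c(2, 0, 1,
    1, 0, 0,
    3, 1, 1,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("market", "oil", "bank"))
)

# Proportion of documents in which each term never occurs.
absent_share <- colMeans(counts == 0)

# Hypothetical helper: drop terms whose absence share exceeds `sparsity`.
# With sparsity = 1 nothing can exceed the threshold, so all terms are kept.
filter_sparse <- function(m, sparsity = 1) {
  m[, colMeans(m == 0) > sparsity == FALSE, drop = FALSE]
}

kept <- filter_sparse(counts, sparsity = 0.6)
print(absent_share)
print(colnames(kept))  # "market" "bank"
```

Here "oil" is missing from three of the four documents (0.75), which is above 0.6, so it is dropped, while "bank" (0.5) and "market" (0) are retained.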

Compute a document-term matrix from a corpus.

An integrated solution to perform
a series of text mining tasks, such as importing and cleaning a corpus, and
analyses such as term and document counts, lexical summaries, term
co-occurrence and document similarity measures, graphs of terms,
correspondence analysis and hierarchical clustering. Corpora can be imported
from spreadsheet-like files and directories of raw text files,
as well as from 'Dow Jones Factiva', 'LexisNexis', 'Europresse' and 'Alceste' files.

Milan Bouchet-Valat

R.temis

Integrated Text Mining Solution

Gilles Bastin

Antoine Chollet

build_dtm function

Description

Usage

Arguments

Value

Examples
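
A possible call sequence is sketched below, hedged throughout. It assumes that R.temis and tm.plugin.factiva are installed, that tm.plugin.factiva ships a sample file at the path shown, and that R.temis provides an <code>import_corpus</code> function accepting a <code>"factiva"</code> format name; none of this is stated on this page. Only the <code>build_dtm</code> argument names are taken from the argument list above, and the values chosen for them are arbitrary.

```r
# Assumption: both packages are installed; the block is skipped otherwise.
if (requireNamespace("R.temis", quietly = TRUE) &&
    requireNamespace("tm.plugin.factiva", quietly = TRUE)) {
  library(R.temis)

  # Assumed location of a sample Factiva export inside tm.plugin.factiva.
  file <- system.file("texts", "reut21578-factiva.xml",
                      package = "tm.plugin.factiva")

  # Assumed importer and format name (not documented on this page).
  corpus <- import_corpus(file, "factiva", language = "en")

  # Default settings: per the argument list, sparsity = 1 keeps all terms.
  dtm_all <- build_dtm(corpus)

  # Stricter settings: drop very rare terms, stopwords and short words.
  dtm_small <- build_dtm(corpus, sparsity = 0.9,
                         remove_stopwords = TRUE, min_length = 3)

  print(dim(dtm_all))
  print(dim(dtm_small))
}
```

Comparing the two sets of dimensions shows how many terms the stricter options remove for a given corpus.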