lma_dtm: Document-Term Matrix Creation

Description

Creates a document-term matrix (dtm) from a set of texts.

Usage

lma_dtm(text, exclude = NULL, context = NULL, replace.special = FALSE,
  numbers = FALSE, punct = FALSE, urls = TRUE, emojis = FALSE,
  to.lower = TRUE, word.break = " +", dc.min = 0, dc.max = Inf,
  sparse = TRUE, tokens.only = FALSE)

Value

A sparse matrix (or regular matrix if sparse = FALSE), with a row per text, and column per term, or a list if tokens.only = TRUE. Includes an attribute with options (opts), and attributes with word count (WC) and column sums (colsums) if tokens.only = FALSE.

Arguments

text

Texts to be processed. This can be a vector (such as a column in a data frame) or list. When a list, these can be in the form returned with tokens.only = TRUE, or a list with named vectors, where names are tokens and values are frequencies or the like.

exclude

A character vector of words to be excluded. If exclude is a single string matching 'function', lma_dict(1:9) will be used.

context

A character vector used to reformat text based on look-ahead/behind. For example, you might attempt to disambiguate "like" by reformatting certain uses of it (e.g., context = c('(i) like*', '(you) like*', '(do) like'), where words in parentheses are the context for the target word, and asterisks denote partial matching). Each entry would be converted to a regular expression (e.g., '(?<=i) like\\b') which, if matched, would be replaced with a coded version of the word (e.g., "Hey, i like that!" would become "Hey, i i-like that!"). This would probably only be useful for categorization, where a dictionary would only include one or another version of a word (e.g., the LIWC 2015 dictionary does something like this with "like", and LIWC 2007 did something like this with "kind (of)", both to try to clean up the posemo category).
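The look-behind replacement described above can be sketched in base R. This is only an illustration of the kind of substitution involved, not the package's internal code: the pattern and replacement strings are written by hand here, and perl = TRUE is needed because look-behind is a Perl-style regular expression feature.

```r
# Hand-written pattern for the '(i) like' context entry. Its exact form is an
# assumption for illustration, not necessarily what lma_dtm generates.
pattern <- "(?<=i) like\\b"

# Look-behind requires Perl-compatible regular expressions in base R.
# The matched " like" is replaced with a coded version of the word.
recoded <- gsub(pattern, " i-like", "Hey, i like that!", perl = TRUE)
print(recoded)
```

Uses of "like" that do not follow the context word are left untouched, which is what lets a dictionary target only one sense of the word.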

replace.special

Logical: if TRUE, special characters are replaced with regular equivalents using the lma_dict special function.

numbers

Logical: if TRUE, numbers are preserved.

punct

Logical: if TRUE, punctuation is preserved.

urls

Logical: if FALSE, attempts to replace all urls with "repurl".

emojis

Logical: if TRUE, attempts to replace emojis (e.g., ":(" would be replaced with "repfrown").

to.lower

Logical: if FALSE, words with different capitalization are treated as different terms.

word.break

A regular expression string determining the way words are split. Default is ' +', which breaks words at one or more blank spaces. You may also want to break at dashes or slashes ('[ /-]+'), depending on the text.
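To see what the two patterns mentioned above do to a string, here is a minimal base R sketch using strsplit. It only demonstrates the regular expressions themselves; lma_dtm applies other cleaning steps (such as punctuation handling) around splitting, so its tokens for the same text may differ. The example string is made up.

```r
# Made-up example text with a dash, a slash, and repeated spaces
text <- "a well-known either/or   choice"

# Default pattern: split at one or more spaces
default_tokens <- strsplit(text, " +")[[1]]

# Alternative pattern: also split at dashes and slashes
custom_tokens <- strsplit(text, "[ /-]+")[[1]]

print(default_tokens)
print(custom_tokens)
```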

dc.min

Numeric: excludes terms appearing in the set number or fewer documents. Default is 0 (no limit).

dc.max

Numeric: excludes terms appearing in the set number or more documents. Default is Inf (no limit).
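The document-count thresholds can be pictured with a small base R sketch. It assumes the filtering rule is exactly as worded above (drop terms found in dc.min or fewer documents, or in dc.max or more documents); the matrix and the variable names dc_min and dc_max are made up for illustration and this is not the package's implementation.

```r
# Toy document-term matrix: 3 documents (rows) by 3 terms (columns)
dtm <- matrix(
  c(1, 0, 2,   # term "a": present in 2 documents
    1, 1, 1,   # term "b": present in 3 documents
    0, 0, 3),  # term "c": present in 1 document
  nrow = 3, dimnames = list(NULL, c("a", "b", "c"))
)

# Number of documents each term appears in
doc_counts <- colSums(dtm > 0)

# Hypothetical thresholds standing in for dc.min and dc.max
dc_min <- 1
dc_max <- 3

# Keep terms in more than dc_min and fewer than dc_max documents
keep <- doc_counts > dc_min & doc_counts < dc_max
filtered <- dtm[, keep, drop = FALSE]
print(filtered)
```

With these thresholds only the term found in exactly two documents remains, since the other two sit on the excluded boundaries.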

sparse

Logical: if FALSE, a regular dense matrix is returned.

tokens.only

Logical: if TRUE, returns a list rather than a matrix, with these entries:

`tokens`	A vector of indices with terms as names.
`frequencies`	A vector of counts with terms as names.
`WC`	A vector of term counts for each document.
`indices`	A list with a vector of token indices for each document.

Examples

text <- c(
  "Why, hello there! How are you this evening?",
  "I am well, thank you for your inquiry!",
  "You are a most good at social interactions person!",
  "Why, thank you! You're not all bad yourself!"
)

lma_dtm(text)

# return tokens only
(tokens <- lma_dtm(text, tokens.only = TRUE))

## convert those to a regular DTM
lma_dtm(tokens)

# convert a list-representation to a sparse matrix
lma_dtm(list(
  doc1 = c(why = 1, hello = 1, there = 1),
  doc2 = c(i = 1, am = 1, well = 1)
))
