corporaexplorer (version 0.6.2)

matrix_via_r: Create document term matrix for fast search of single words

Description

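Creates a document term matrix for fast search of single words. Punctuation and digits can optionally be removed from the text before the matrix is constructed.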

Usage

matrix_via_r(df, matrix_without_punctuation = TRUE)

Arguments

df

A "data_dok" tibble

matrix_without_punctuation

Should punctuation and digits be stripped from the text before constructing the document term matrix? If TRUE (the default):

  • The corporaexplorer object will be lighter and most searches in the corpus exploration app will be faster.

  • Searches including punctuation and digits will be carried out in the full text documents.

  • The only "risk" with this strategy is that the corpus exploration app may in some cases produce false positives. E.g. searching for the term "donkey" will also find the term "don%key". This should not be a problem for the vast majority of use cases, but if one so desires, there are three different solutions: set this parameter to FALSE, create a corporaexplorerobject without a matrix by setting the use_matrix parameter to FALSE, or run run_corpus_explorer with the use_matrix parameter set to FALSE.

If FALSE, the corporaexplorer object will be larger, and most simple searches will be slower.
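
The false-positive case described above can be illustrated with a minimal base R sketch. It is not the package's actual implementation: the example documents, the regular expression, and the whitespace tokenisation are all assumptions made for illustration only.

```r
# Hypothetical documents; the second contains "don%key" rather than "donkey".
docs <- c("The donkey slept.", "A typo: don%key, 2 times.")

# Assumed preprocessing: drop punctuation and digits, then lower-case.
stripped <- tolower(gsub("[[:punct:][:digit:]]", "", docs))

# Tokenise both versions on whitespace (an assumption, not the package's tokeniser).
tokens_stripped <- strsplit(trimws(stripped), "\\s+")
tokens_raw <- strsplit(tolower(docs), "\\s+")

# Which documents contain the exact token "donkey"?
has_donkey <- function(tokens) {
  vapply(tokens, function(x) "donkey" %in% x, logical(1))
}

hits_stripped <- has_donkey(tokens_stripped)  # "don%key" has become "donkey"
hits_raw <- has_donkey(tokens_raw)            # "don%key," stays a separate token

print(hits_stripped)
print(hits_raw)
```

With stripping, both documents are reported as containing "donkey"; without it, only the first is. This is the trade-off that matrix_without_punctuation controls.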

Value

A list with two elements: 1) the document term matrix (a data.table), and 2) the word vector (a character vector).