The characters removed
matrix_via_r(df, matrix_without_punctuation = TRUE)
df: A "data_dok" tibble.
matrix_without_punctuation: Should punctuation and digits be stripped from the text before constructing the document term matrix? If TRUE, the default:
The corporaexplorer object will be lighter and most searches in the corpus exploration app will be faster.
Searches including punctuation and digits will be carried out in the full text documents.
The only "risk" with this strategy is that the corpus exploration app can in some cases produce false positives. E.g. searching for the term "donkey" will also find the term "don%key".
This should not be a problem for the vast majority of use cases, but if one so desires, there are three different solutions: set this parameter to FALSE, create a corporaexplorer object without a matrix by setting the use_matrix parameter to FALSE, or run run_corpus_explorer with the use_matrix parameter set to FALSE.
If FALSE, the corporaexplorer object will be larger, and most simple searches will be slower.
List: 1) Document term matrix (data.table), 2) word vector (character vector).
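
The false-positive case described for matrix_without_punctuation can be sketched in base R. This is only an illustration: the helper strip_punct_digits and the sample documents are hypothetical, and the regular expression is an assumption about what "stripping punctuation and digits" means, not the package's internal code.

```r
# Hypothetical helper: remove punctuation and digits from each string.
# An assumption for illustration, not corporaexplorer's implementation.
strip_punct_digits <- function(x) {
  gsub("[[:punct:][:digit:]]", "", x)
}

# Made-up sample documents.
docs <- c(
  "A donkey stood here.",
  "The string don%key appears here.",
  "No match in 2019."
)

stripped <- strip_punct_digits(docs)

# Searching the stripped text: "don%key" has become "donkey",
# so the second document is reported as a match as well.
hits_stripped <- grepl("donkey", stripped, fixed = TRUE)

# Searching the full text: only the first document matches.
hits_full <- grepl("donkey", docs, fixed = TRUE)

print(hits_stripped)
print(hits_full)
```

The second document matches only in the stripped text, which is the kind of false positive that setting matrix_without_punctuation to FALSE avoids, at the cost of a larger object and slower simple searches.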