This function prepares raw texts for use with TextEmbeddingModel.
bow_pp_create_basic_text_rep(
data,
vocab_draft,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
remove_separators = TRUE,
split_hyphens = FALSE,
split_tags = FALSE,
language_stopwords = "de",
use_lemmata = FALSE,
to_lower = FALSE,
min_termfreq = NULL,
min_docfreq = NULL,
max_docfreq = NULL,
window = 5,
weights = 1/(1:5),
trace = TRUE
)
Returns a list of class basic_text_rep with the following components.
dfm:
Document-Feature-Matrix. Rows correspond to the documents, columns to the
tokens. Each cell holds the number of times the token occurs in the document.
fcm:
Feature-Co-Occurrence-Matrix.
information:
list containing information about the used vocabulary. These are:
n_sentence:
Number of sentences
n_document_segments:
Number of document segments/raw texts
n_token_init:
Number of initial tokens
n_token_final:
Number of final tokens
n_lemmata:
Number of lemmas
configuration:
list containing information on whether the vocabulary was created in lower
case and whether it uses original tokens or lemmas.
language_model:
list containing information about the applied language model. These are:
model:
the udpipe language model
label:
the label of the udpipe language model
upos:
the applied universal part-of-speech tags
language:
the language
vocab:
a data.frame with the original vocabulary
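To make the dfm and fcm components more concrete, here is a toy illustration in base R. It is only a sketch of the two concepts and not the package's implementation: the documents, the window of 2, and the way co-occurrences are counted (each pair within the window once, weighted by distance, mirrored to keep the matrix symmetric) are assumptions chosen for the example.

```r
# Toy data: two already tokenized documents (hypothetical example values)
docs <- list(
  doc1 = c("a", "b", "a", "c"),
  doc2 = c("b", "c", "c")
)
vocab <- sort(unique(unlist(docs)))

# Document-feature matrix: one row per document, one column per token,
# each cell counts how often the token occurs in the document.
dfm <- t(vapply(
  docs,
  function(tokens) tabulate(match(tokens, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dfm) <- vocab

# Feature-co-occurrence matrix: weighted counts of token pairs that appear
# within `window` positions of each other in the same document.
window <- 2
weights <- 1 / seq_len(window)  # closer pairs get a larger weight
fcm <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))

for (tokens in docs) {
  n <- length(tokens)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      j <- i + d
      if (j <= n) {
        first <- tokens[i]
        second <- tokens[j]
        fcm[first, second] <- fcm[first, second] + weights[d]
        # Mirror the entry so the matrix stays symmetric
        if (first != second) {
          fcm[second, first] <- fcm[second, first] + weights[d]
        }
      }
    }
  }
}

print(dfm)
print(fcm)
```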
Arguments:
data:
vector containing the raw texts.
vocab_draft:
Object created with bow_pp_create_vocab_draft.
remove_punct:
bool TRUE if punctuation should be removed.
remove_symbols:
bool TRUE if symbols should be removed.
remove_numbers:
bool TRUE if numbers should be removed.
remove_url:
bool TRUE if URLs should be removed.
remove_separators:
bool TRUE if separators should be removed.
split_hyphens:
bool TRUE if hyphenated words should be split into several tokens.
split_tags:
bool TRUE if tags should be split.
language_stopwords:
string Abbreviation for the language for which stopwords should be removed.
use_lemmata:
bool TRUE if lemmas should be used instead of the original tokens.
to_lower:
bool TRUE if tokens or lemmas should be converted to lower case.
min_termfreq:
int Minimum frequency of a token to be part of the vocabulary.
min_docfreq:
int Minimum number of documents in which a token must appear to be part of the vocabulary.
max_docfreq:
int Maximum number of documents in which a token may appear to be part of the vocabulary.
window:
int Size of the window for creating the feature-co-occurrence matrix.
weights:
vector Weights for the corresponding window. The vector length must be equal to the window size.
trace:
bool TRUE if information about the progress should be printed to the console.
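Because the length of the weights vector must equal the window size, it can help to derive the weights from the window rather than typing both by hand. The following base R sketch shows one way to do this. The helper check_window_weights is hypothetical and not part of the package.

```r
window <- 5
weights <- 1 / seq_len(window)  # same values as the default 1/(1:5)

# Hypothetical helper (not part of the package): stops with an error if the
# weights vector does not match the window size.
check_window_weights <- function(window, weights) {
  stopifnot(
    is.numeric(window), length(window) == 1, window >= 1,
    is.numeric(weights), length(weights) == window
  )
  invisible(TRUE)
}

check_window_weights(window, weights)
```

Changing window to another value then updates the weights automatically, so the two arguments cannot drift apart.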
Other Preparation:
bow_pp_create_vocab_draft()
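A call might look like the following sketch, which uses only the arguments listed in the usage above. It assumes that a character vector raw_texts and an object vocab_draft (created beforehand with bow_pp_create_vocab_draft) already exist; both names are placeholders. The guard is an addition for this example so that nothing is executed when the package, the function, or the placeholder objects are not available.

```r
# Sketch only. `raw_texts` and `vocab_draft` are placeholders that must
# already exist in the calling environment.
can_run <- requireNamespace("aifeducation", quietly = TRUE) &&
  exists("bow_pp_create_basic_text_rep",
         envir = asNamespace("aifeducation"), inherits = FALSE) &&
  exists("raw_texts") && exists("vocab_draft")

if (can_run) {
  basic_text_rep <- aifeducation::bow_pp_create_basic_text_rep(
    data = raw_texts,
    vocab_draft = vocab_draft,
    language_stopwords = "de",
    use_lemmata = FALSE,
    to_lower = FALSE,
    trace = FALSE
  )

  # Components named in the description of the return value
  print(dim(basic_text_rep$dfm))
  print(dim(basic_text_rep$fcm))
  str(basic_text_rep$information)
}
```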