This function prepares raw texts for use with TextEmbeddingModel.
bow_pp_create_basic_text_rep(
data,
vocab_draft,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
remove_separators = TRUE,
split_hyphens = FALSE,
split_tags = FALSE,
language_stopwords = "de",
use_lemmata = FALSE,
to_lower = FALSE,
min_termfreq = NULL,
min_docfreq = NULL,
max_docfreq = NULL,
window = 5,
weights = 1/(1:5),
trace = TRUE
)
Returns a list of class basic_text_rep with the following components.
dfm: Document-Feature-Matrix. Rows correspond to the documents, columns to the
tokens. Each cell holds the number of times the token occurs in the document.
fcm: Feature-Co-Occurrence-Matrix.
information: list containing information about the used vocabulary. These are:
n_sentence: Number of sentences
n_document_segments: Number of document segments/raw texts
n_token_init: Number of initial tokens
n_token_final: Number of final tokens
n_lemmata: Number of lemmas
configuration: list recording whether the vocabulary was converted to
lower case and whether it uses original tokens or lemmas.
language_model: list containing information about the applied
language model. These are:
model: the udpipe language model
label: the label of the udpipe language model
upos: the applied universal part-of-speech tags
language: the language
vocab: a data.frame with the original vocabulary
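To make the dfm and fcm components more concrete, the following base R sketch builds both matrices by hand for a toy corpus. It is only an illustration under stated assumptions: the documents, the window size of 2, and the symmetric counting scheme are made up for this example, and the code is not the implementation used by bow_pp_create_basic_text_rep.

```r
# Toy corpus: two already tokenized documents (hypothetical data).
docs <- list(
  doc1 = c("model", "learns", "text", "model"),
  doc2 = c("text", "model")
)

# Vocabulary: every distinct token, sorted for a stable column order.
vocab <- sort(unique(unlist(docs)))

# Document-feature matrix: one row per document, one column per token,
# each cell holding how often the token occurs in the document.
dfm <- t(vapply(
  docs,
  function(tokens) as.integer(table(factor(tokens, levels = vocab))),
  integer(length(vocab))
))
colnames(dfm) <- vocab

# Feature co-occurrence matrix with a weighted window (assumed scheme):
# a neighbor at distance d contributes weights[d] to the pair's cell.
window <- 2
weights <- 1 / (1:window)

fcm <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
for (tokens in docs) {
  n <- length(tokens)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      j <- i + d
      if (j <= n) {
        a <- tokens[i]
        b <- tokens[j]
        fcm[a, b] <- fcm[a, b] + weights[d]
        # Mirror the entry so that the matrix stays symmetric.
        if (a != b) fcm[b, a] <- fcm[b, a] + weights[d]
      }
    }
  }
}

print(dfm)
print(fcm)
```

In this sketch, tokens that appear next to each other contribute a full count, while tokens two positions apart contribute only half, which is the idea behind a decreasing weights vector.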
Arguments:
data: vector containing the raw texts.
vocab_draft: Object created with bow_pp_create_vocab_draft.
remove_punct: bool TRUE if punctuation should be removed.
remove_symbols: bool TRUE if symbols should be removed.
remove_numbers: bool TRUE if numbers should be removed.
remove_url: bool TRUE if URLs should be removed.
remove_separators: bool TRUE if separators should be removed.
split_hyphens: bool TRUE if hyphenated words should be split into several tokens.
split_tags: bool TRUE if tags should be split.
language_stopwords: string Abbreviation for the language for which stopwords should be
removed.
use_lemmata: bool TRUE if lemmas should be used instead of original tokens.
to_lower: bool TRUE if tokens or lemmas should be converted to lower case.
min_termfreq: int Minimum frequency of a token to be part of the vocabulary.
min_docfreq: int Minimum number of documents a token must appear in to be part of the vocabulary.
max_docfreq: int Maximum number of documents a token may appear in to be part of the vocabulary.
window: int Size of the window for creating the feature-co-occurrence matrix.
weights: vector Weights for the corresponding window. The vector length must be equal to the window size.
trace: bool TRUE if information about the progress should be
printed to the console.
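The effect of min_termfreq, min_docfreq, max_docfreq and the relation between window and weights can be sketched in base R. This is a minimal illustration, assuming the document frequency thresholds are absolute document counts. The count matrix and the threshold values are hypothetical, and the code does not call the package itself.

```r
# Hypothetical document-feature counts: 3 documents, 4 tokens.
counts <- matrix(
  c(2, 0, 1, 0,
    1, 0, 1, 0,
    3, 1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("model", "rare", "text", "unused"))
)

term_freq <- colSums(counts)      # total occurrences of each token
doc_freq  <- colSums(counts > 0)  # number of documents containing each token

# Assumed thresholds, mirroring the arguments of the same name.
min_termfreq <- 2
min_docfreq  <- 2
max_docfreq  <- 3

# A token stays in the vocabulary only if it passes all three filters.
keep <- term_freq >= min_termfreq &
  doc_freq >= min_docfreq &
  doc_freq <= max_docfreq
trimmed <- counts[, keep, drop = FALSE]
print(colnames(trimmed))

# The default weights = 1/(1:5) matches the default window = 5:
# one weight per distance, decreasing as tokens lie further apart.
window <- 5
weights <- 1 / seq_len(window)
stopifnot(length(weights) == window)
print(weights)
```

When changing window, the weights vector has to be regenerated with the same length, for example with 1 / seq_len(window) as above.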
Other Preparation:
bow_pp_create_vocab_draft()