Function for creating a first draft of a vocabulary
This function creates a list of tokens that carry specific universal part-of-speech (UPOS) tags and provides the corresponding lemmas.
bow_pp_create_vocab_draft(
  path_language_model,
  data,
  upos = c("NOUN", "ADJ", "VERB"),
  label_language_model = NULL,
  language = NULL,
  chunk_size = 100,
  trace = TRUE
)
list
with the following components.
vocab
data.frame
containing the tokens, the lemmas, the tokens in lower case, and the lemmas in lower case.
ud_language_model
udpipe language model that is used for tagging.
label_language_model
Label of the udpipe language model.
language
Language of the raw texts.
upos
Universal part-of-speech tags used to build the vocabulary.
n_sentence
int
Estimated number of sentences in the raw texts.
n_token
int
Estimated number of tokens in the raw texts.
n_document_segments
int
Estimated number of document segments/raw texts.
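As an illustration of the vocab component, the following base R sketch builds a table with the four described columns from a hand-written annotation. The annotation data frame, its values, and the column names token_lower and lemma_lower are assumptions made for this example; the function itself derives the annotation from the udpipe language model and may name its columns differently.

```r
# Hypothetical annotation, shaped like tagger output: one row per token.
annotation <- data.frame(
  token = c("The", "Teachers", "explain", "difficult", "Tasks", "teachers"),
  lemma = c("the", "teacher", "explain", "difficult", "task", "teacher"),
  upos  = c("DET", "NOUN", "VERB", "ADJ", "NOUN", "NOUN"),
  stringsAsFactors = FALSE
)

# Keep only tokens whose tag is in the requested set (the default of `upos`).
upos <- c("NOUN", "ADJ", "VERB")
selected <- annotation[annotation$upos %in% upos, c("token", "lemma")]

# Drop duplicate token/lemma pairs and add the lower-case variants.
vocab <- unique(selected)
vocab$token_lower <- tolower(vocab$token)
vocab$lemma_lower <- tolower(vocab$lemma)
rownames(vocab) <- NULL

print(vocab)
```

The determiner is dropped because DET is not in upos, while "Teachers" and "teachers" remain separate rows that share the same lower-case token and lemma.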
Arguments:
path_language_model
string
Path to the udpipe language model that should be used for tagging and lemmatization.
data
vector
containing the raw texts.
upos
vector
containing the universal part-of-speech tags that should be used to build the vocabulary.
label_language_model
string
Label for the udpipe language model used.
language
string
Name of the language (e.g., English, German).
chunk_size
int
Number of raw texts that should be processed at once.
trace
bool
TRUE if information about the progress should be printed to the console.
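The effect of chunk_size can be pictured with a short base R sketch that splits the raw texts into batches of at most chunk_size elements. This only illustrates the batching idea; how the function forms its chunks internally is not stated here, and the texts vector is invented for the example.

```r
# Hypothetical raw texts; in practice this would be the `data` argument.
texts <- paste("Raw text number", 1:7)
chunk_size <- 3

# Assign each text to a batch index: 1, 1, 1, 2, 2, 2, 3.
batch_id <- ceiling(seq_along(texts) / chunk_size)
chunks <- split(texts, batch_id)

# Each batch would then be tagged and lemmatized in one pass.
for (i in seq_along(chunks)) {
  message("Chunk ", i, ": ", length(chunks[[i]]), " texts")
}
```

A smaller chunk_size lowers the amount of text handled at once at the cost of more passes; the last chunk may hold fewer texts than the others.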
Other Preparation:
bow_pp_create_basic_text_rep()