- x
a (quanteda) corpus object
- target
(character) vector of words
- first_vec
(character) vector of words
- second_vec
(character) vector of words
- pre_trained
(numeric) an F x D matrix of pre-trained embeddings, usually trained on the same corpus as that used for x.
F = number of features and D = embedding dimensions.
rownames(pre_trained) = set of features for which there is a pre-trained embedding.
- transform_matrix
(numeric) a D x D 'a la carte' transformation matrix.
D = dimensions of pretrained embeddings.
- group_var
(character) variable name in corpus object defining grouping variable
- window
(numeric) - defines the size of a context: the number of tokens taken on each side of the target
- norm
(character) - "l2" for cosine similarity (dot product of l2-normalized vectors) and "none" for the raw dot product
- remove_punct
(logical) - if TRUE, remove all characters in the Unicode "Punctuation" [P] class
- remove_symbols
(logical) - if TRUE, remove all characters in the Unicode "Symbol" [S] class
- remove_numbers
(logical) - if TRUE, remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day
- remove_separators
(logical) - if TRUE, remove separators and separator characters (Unicode "Separator" [Z] and "Control" [C] categories)
- valuetype
the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching
- hard_cut
(logical) - if TRUE then a context must have window x 2 tokens; if FALSE it can have window x 2 or fewer (e.g. if a doc begins with a target word, then context will have window tokens rather than window x 2)
- case_insensitive
(logical) - if TRUE, ignore case when matching a target pattern
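
The interaction of window, hard_cut and norm may be easier to see in a small, language-neutral sketch. The Python below is an illustration only, not the package's implementation: the helper names `contexts` and `similarity` are hypothetical, and it assumes that hard_cut = TRUE drops any context with fewer than window x 2 tokens.

```python
import math


def contexts(tokens, target, window, hard_cut=False):
    """Collect up to `window` tokens on each side of every occurrence of `target`."""
    found = []
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        ctx = left + right
        # Assumption: with hard_cut, a context shorter than 2 * window is discarded.
        if hard_cut and len(ctx) < 2 * window:
            continue
        found.append(ctx)
    return found


def similarity(u, v, norm="l2"):
    """Dot product of u and v, optionally divided by the product of their l2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    if norm == "none":
        return dot
    if norm == "l2":
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)
    raise ValueError("norm must be 'l2' or 'none'")


toks = "immigration reform stalled in the senate again".split()

# The document begins with the target, so only `window` tokens are available.
print(contexts(toks, "immigration", window=2))                 # [['reform', 'stalled']]
print(contexts(toks, "immigration", window=2, hard_cut=True))  # []

# Parallel vectors: the dot product grows with length, cosine similarity does not.
print(similarity([3.0, 4.0], [6.0, 8.0], norm="none"))  # 50.0
print(similarity([3.0, 4.0], [6.0, 8.0], norm="l2"))    # 1.0
```

Under these assumptions, hard_cut = TRUE trades coverage for uniform context length, while norm = "l2" removes the effect of vector length from the comparison.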