
cleanNLP (version 1.5.2)

get_tfidf: Construct the TF-IDF Matrix from Annotation or Data Frame

Description

Given an annotation object, this function returns the term frequency-inverse document frequency (tf-idf) matrix computed from the extracted lemmas. Alternatively, a data frame with a document id column and a token column can be given, which allows the user to preprocess and filter the tokens to include.

Usage

get_tfidf(object, type = c("tfidf", "tf", "idf", "vocab", "all"),
  tf_weight = c("lognorm", "binary", "raw", "dnorm"), idf_weight = c("idf",
  "smooth", "prob"), min_df = 0.1, max_df = 0.9, max_features = 10000,
  doc_var = "id", token_var = "lemma", vocabulary = NULL)

Arguments

object
either an annotation object or a data frame with columns equal to the inputs given to doc_var and token_var
type
the desired return type. The options tfidf, tf, and idf each return a list containing the corresponding matrix together with the document ids and the vocabulary set. The option all returns a list with all three matrices as well as the ids and vocabulary. For consistency, the option vocab also returns a list, but it contains only the document ids and the vocabulary set.
tf_weight
the weighting scheme for the term frequency matrix. The selection lognorm takes one plus the logarithm of the raw frequency (or zero if the raw count is zero), binary encodes a zero-one matrix indicating whether the token occurs in the document at all, raw returns the raw counts, and dnorm uses double normalization. Both the tf and idf weighting schemes are written out in the sketch following this argument list.
idf_weight
the weighting scheme for the inverse document frequency matrix. The selection idf gives the logarithm of the simple inverse document frequency, smooth gives the logarithm of one plus the simple inverse document frequency, and prob gives the log odds of the token occurring in a randomly selected document.
min_df
the minimum proportion of documents a token should be in to be included in the vocabulary
max_df
the maximum proportion of documents a token should be in to be included in the vocabulary
max_features
the maximum number of tokens in the vocabulary
doc_var
character vector. The name of the column in object that contains the document ids, unless object is an annotation object, in which case it's the column of the token matrix to use as the document id.
token_var
character vector. The name of the column in object that contains the tokens, unless object is an annotation object, in which case it's the column of the token matrix to use as the tokens (generally either lemma or word).
vocabulary
character vector. The vocabulary set to use in constructing the matrices. Will be computed within the function if set to NULL. When supplied, the options min_df, max_df, and max_features are ignored.
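
The tf_weight and idf_weight options described above correspond to standard weighting formulas. The helpers below are a minimal illustrative sketch of those formulas, written for a single document's raw count vector (counts) and a vector of document frequencies (df) out of n_docs documents; they are not the package's internal code, and the 0.5 constant in the double-normalization variant is an assumed conventional choice.

# Term frequency weights (illustrative sketch only)
tf_lognorm <- function(counts) ifelse(counts > 0, 1 + log(counts), 0)
tf_binary  <- function(counts) as.numeric(counts > 0)
tf_raw     <- function(counts) counts
tf_dnorm   <- function(counts, k = 0.5) k + (1 - k) * counts / max(counts)

# Inverse document frequency weights (df = number of documents containing
# each token, n_docs = total number of documents)
idf_plain  <- function(df, n_docs) log(n_docs / df)
idf_smooth <- function(df, n_docs) log(1 + n_docs / df)
idf_prob   <- function(df, n_docs) log((n_docs - df) / df)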

Value

a named list, including some of the following:
  • tf the term frequency matrix
  • idf the inverse document frequency matrix
  • tfidf the product of the tf and idf matrices
  • vocab a character vector giving the vocabulary used in the function, corresponding to the columns of the matrices
  • id a vector of the doc ids, corresponding to the rows of the matrices
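
For example, when called with type = "all" the components listed above can be inspected directly. This is a hedged sketch using the obama data set from the Examples section; the element order suggested in the comments is not guaranteed.

res <- get_tfidf(obama, type = "all")
names(res)        # expected to include "tfidf", "tf", "idf", "vocab", "id"
dim(res$tfidf)    # number of documents by number of vocabulary terms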

Examples

library(cleanNLP)
library(dplyr)   # provides %>% and filter(), used in the second example

data(obama)

# Top words in the first Obama S.O.T.U., using all tokens
tfidf <- get_tfidf(obama)
vids <- order(tfidf$tfidf[1,], decreasing = TRUE)[1:10]
tfidf$vocab[vids]

# Top words, only using non-proper nouns
tfidf <- get_token(obama) %>%
  filter(pos %in% c("NN", "NNS")) %>%
  get_tfidf()
vids <- order(tfidf$tfidf[1,], decreasing = TRUE)[1:10]
tfidf$vocab[vids]
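
# A further hedged illustration: the vocabulary-pruning arguments can be
# combined with type = "all". The thresholds below are arbitrary values
# chosen for the example, and the resulting vocabulary depends on the
# obama corpus.
res <- get_tfidf(obama, type = "all", min_df = 0.05, max_df = 0.95,
                 max_features = 500)
length(res$vocab)    # at most 500 terms after pruning
res$tfidf[1, 1:5]    # tf-idf scores for the first document, first five terms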
