stylest2_select_vocab: Cross-validation based term selection

Description

Performs K-fold cross-validation to determine the optimal cutoff on the term frequency distribution; terms falling below the cutoff are dropped.
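
To make the idea of a frequency cutoff concrete, here is a minimal base R sketch. It assumes the cutoff is read as a percentile of the term frequency distribution and that terms below that percentile are dropped. The term counts are invented for illustration, and this is not the package's internal code.

```r
# Hypothetical term frequencies (illustrative values only)
term_freq <- c(the = 120, and = 95, whale = 12, estate = 7, harpoon = 3)

# Assumed interpretation: the cutoff is a percentile of the frequency distribution
cutoff <- 50
threshold <- quantile(term_freq, probs = cutoff / 100)

# Keep terms at or above the threshold; drop the rest
kept <- names(term_freq)[term_freq >= threshold]
kept
```

Cross-validation would then repeat this kind of filtering for each candidate cutoff and compare how well speakers are classified with the surviving terms.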

Usage

stylest2_select_vocab(
  dfm,
  smoothing = 0.5,
  cutoffs = c(50, 60, 70, 80, 90, 99),
  nfold = 5,
  terms = NULL,
  term_weights = NULL,
  fill = FALSE,
  fill_weight = NULL,
  suppress_warning = TRUE
)

Value

A list containing: the cutoff percentage with the best speaker classification rate; the cutoff percentages that were tested; a matrix of the mean percentage of incorrectly identified speakers for each cutoff percentage and fold; and the number of folds used for cross-validation.

Arguments

dfm: A quanteda dfm object.
smoothing: The smoothing parameter value for smoothing the dfm. Should be a numeric scalar; defaults to 0.5.
cutoffs: A numeric vector of candidate cutoff percentages.
nfold: Number of folds for the cross-validation. Defaults to 5.
terms: If not NULL, terms to be used in the model. If NULL, use all terms.
term_weights: Named vector of distances (or any weights) per term in the vocabulary. Names should correspond to the terms.
fill: Should missing values in term weights be filled? Defaults to FALSE.
fill_weight: Numeric value to fill in as the weight for any term that does not have a weight specified in term_weights.
suppress_warning: TRUE/FALSE, indicating whether to suppress warnings from stylest2_fit(). Defaults to TRUE.
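
The relationship between term_weights, fill, and fill_weight can be sketched in base R as follows. This only illustrates the behaviour described above (terms without a weight receive fill_weight); the vocabulary and weights are hypothetical, and the package's own implementation may differ.

```r
# Hypothetical vocabulary and a partial set of named term weights
vocab <- c("the", "and", "whale", "estate")
term_weights <- c(the = 0.2, whale = 1.5)
fill_weight <- 1

# Start every term at fill_weight, then overwrite the terms that have a weight
filled <- setNames(rep(fill_weight, length(vocab)), vocab)
filled[names(term_weights)] <- term_weights
filled
```

Here "and" and "estate" end up with the fill weight, while "the" and "whale" keep the weights supplied in term_weights.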

Examples

data(novels_dfm)
stylest2_select_vocab(dfm=novels_dfm)
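
A slightly fuller call, using only the arguments listed under Usage, might look like the sketch below. It assumes the stylest2 package is installed; the particular cutoffs and the seed are arbitrary choices for illustration. Because the element names of the returned list are not documented here, the result is inspected with str() rather than by name.

```r
library(stylest2)
data(novels_dfm)

# Fold assignment is assumed to be random, so fix the seed for repeatable results
set.seed(1)

# Try a smaller, custom set of candidate cutoffs with the default number of folds
cv <- stylest2_select_vocab(
  dfm = novels_dfm,
  smoothing = 0.5,
  cutoffs = c(50, 75, 90),
  nfold = 5
)

# Inspect the structure of the returned list
str(cv)
```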