Selects optimal vocabulary quantile(s) for model fitting, based on performance in predicting out-of-sample texts.
stylest_select_vocab(
x,
speaker,
filter = NULL,
smooth = 0.5,
nfold = 5,
cutoff_pcts = c(50, 60, 70, 80, 90, 99),
cutoffs_term_weights = NULL,
fill_method = "value",
fill_weight = 1,
weight_varname = "mean_distance"
)
x: Corpus as a text vector. May be a corpus_frame object.
speaker: Vector of speaker labels. Should be the same length as x.
filter: If not NULL, a corpus text_filter.
smooth: Value for smoothing. Defaults to 0.5.
nfold: Number of folds for cross-validation. Defaults to 5.
cutoff_pcts: Vector of cutoff percentages to test. Defaults to c(50, 60, 70, 80, 90, 99).
cutoffs_term_weights: Named list of dataframes of term weights, where the names correspond to the cutoff_pcts. Each dataframe should have one column $word and a second column $weight_varname containing the weight for the word. See the vignette for details.
fill_method: If "value" (default), fill_weight is used to fill any terms with NA weight. If "mean", the mean term weight is used as the fill value.
fill_weight: Numeric value to fill in as the weight for any term that does not have a weight specified in term_weights. Defaults to 1.0.
weight_varname: Name of the column in each term_weights dataframe containing the weights. Defaults to "mean_distance".
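As a rough illustration of the cutoffs_term_weights structure and the two fill_method options, the following base-R sketch builds a named list of weight dataframes and fills a missing weight both ways. The list names ("50", "90"), the toy words and weights, and the helper fill_weights() are assumptions made for illustration; the helper is not part of stylest and only mimics the behaviour described for the arguments.

```r
# Toy term weights, one dataframe per cutoff percentage.
# Assumption: list names are the cutoff percentages written as strings.
cutoffs_term_weights <- list(
  "50" = data.frame(word = c("the", "whale", "sea"),
                    mean_distance = c(0.2, 0.8, NA)),
  "90" = data.frame(word = c("whale", "sea"),
                    mean_distance = c(0.9, 0.5))
)

# Hypothetical helper (not a stylest function): fill NA weights either with
# a fixed value or with the mean of the weights that are present.
fill_weights <- function(weights, fill_method = "value", fill_weight = 1) {
  fill <- if (fill_method == "mean") mean(weights, na.rm = TRUE) else fill_weight
  weights[is.na(weights)] <- fill
  weights
}

w50 <- cutoffs_term_weights[["50"]]$mean_distance
filled_value <- fill_weights(w50, fill_method = "value", fill_weight = 1)
filled_mean  <- fill_weights(w50, fill_method = "mean")
print(filled_value)
print(filled_mean)
```

With fill_method = "value" the missing weight takes the fixed fill_weight; with "mean" it takes the average of the weights that were supplied for that cutoff.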
A list containing: the cutoff percentage with the best speaker classification rate; the cutoff percentages that were tested; a matrix of the mean percentage of incorrectly identified speakers for each cutoff percentage and fold; and the number of folds used for cross-validation.
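To make the returned pieces concrete, here is a small sketch of how a best cutoff could be read off a matrix of miss percentages. The matrix layout (one row per fold, one column per cutoff), the numbers, and the variable names are invented for illustration and are not output from stylest_select_vocab.

```r
cutoff_pcts <- c(50, 90)
nfold <- 3

# Made-up miss percentages: rows are folds, columns are cutoffs (assumed layout).
miss_pct <- matrix(c(40, 35, 45,    # cutoff 50, folds 1-3
                     30, 25, 35),   # cutoff 90, folds 1-3
                   nrow = nfold, ncol = length(cutoff_pcts),
                   dimnames = list(paste0("fold", 1:nfold),
                                   as.character(cutoff_pcts)))

# Average over folds, then pick the cutoff with the lowest mean miss rate.
mean_miss <- colMeans(miss_pct)
best_cutoff <- cutoff_pcts[which.min(mean_miss)]
print(mean_miss)
print(best_cutoff)
```

A lower mean miss percentage means more speakers were identified correctly, so the cutoff with the smallest column mean is the one a selection step of this kind would prefer.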
# NOT RUN {
library(stylest)
data(novels_excerpts)
# Test only the 50% and 90% cutoffs instead of the full default set
stylest_select_vocab(novels_excerpts$text, novels_excerpts$author,
                     cutoff_pcts = c(50, 90))
# }