
Last chance! 50% off unlimited learning
Sale ends in
textTrainN() computes cross-validated correlations for different sample-sizes of a data set. The cross-validation process can be repeated several times to enhance the reliability of the evaluation.
textTrainN(
x,
y,
sample_percents = c(25, 50, 75, 100),
handle_word_embeddings = "individually",
n_cross_val = 1,
sampling_strategy = "subsets",
use_same_penalty_mixture = TRUE,
model = "regression",
penalty = 10^seq(-16, 16),
mixture = c(0),
seed = 2024,
...
)
A tibble containing correlations for each sample. If n_cross_val > 1, correlations for each new cross-validation, along with standard-deviation, mean and standard error of correlation is included in the tibble. The information in the tibble is visualised via the textTrainNPlot function.
Word embeddings from textEmbed (or textEmbedLayerAggregation). If several word embedding are provided in a list they will be concatenated.
Numeric variable to predict.
(numeric) Numeric vector that specifies the percentages of the total number of data points to include in each sample (default = c(25,50,75,100), i.e., correlations are evaluated for 25 each new sample.
Determine whether to use a list of word embeddings or an individual word_embedding (default = "individually", also "concatenate"). If a list of word embeddings are provided, then they will be concatenated.
(numeric) Value that determines the number of times to repeat the cross-validation (i.e., number of tests). (default = 1, i.e., cross-validation is only performed once). Warning: The training process gets proportionately slower to the number of cross-validations, resulting in a time complexity that increases with a factor of n (n cross-validations).
Sample a "random" sample for each subset from all data or sample a "subset" from the larger subsets (i.e., each subset contain the same data).
If TRUE it only searches the penalty and mixture search grid once, and then use the same thereafter; if FALSE, it searches the grid every time.
Type of model. Default is "regression"; see also "logistic" and "multinomial" for classification.
(numeric) Hyper parameter that is tuned (default = 10^seq(-16,16)).
A number between 0 and 1 (inclusive) that reflects the proportion of L1 regularization (i.e. lasso) in the model (for more information see the linear_reg-function in the parsnip-package). When mixture = 1, it is a pure lasso model while mixture = 0 indicates that ridge regression is being used (specific engines only).
(numeric) Set different seed (default = 2024).
Additional parameters from textTrainRegression.
See textTrainNPlot
.
# Compute correlations for 25%, 50%, 75% and 100% of the data in word_embeddings and perform
# cross-validation thrice.
if (FALSE) {
tibble_to_plot <- textTrainN(
x = word_embeddings_4$texts$harmonytext,
y = Language_based_assessment_data_8$hilstotal,
sample_percents = c(25, 50, 75, 100),
n_cross_val = 3
)
# tibble_to_plot contains correlation-coefficients for each cross_validation and
# standard deviation and mean value for each sample. The tibble can be plotted
# using the testTrainNPlot function.
# Examine tibble
tibble_to_plot
}
Run the code above in your browser using DataLab