Abstract class for neural nets with 'keras'/'tensorflow' and 'pytorch'.
Objects of this class are used for assigning texts to classes/categories. For
the creation and training of a classifier, an object of class EmbeddedText and a factor
are necessary. The object of class EmbeddedText contains the numerical text
representations (text embeddings) of the raw texts generated by an object of class
TextEmbeddingModel. The factor contains the classes/categories for every
text. Missing values (unlabeled cases) are supported. For predictions, an object of class
EmbeddedText has to be used that was created with the same text embedding model as
used for training.
model('tensorflow_model()')
Field for storing the tensorflow model after loading.
model_config('list()')
List for storing information about the configuration of the model. This
information is used to predict new data.
model_config$n_rec: Number of recurrent layers.
model_config$n_hidden: Number of dense layers.
model_config$target_levels: Levels of the target variable. Do not change this manually.
model_config$input_variables: Order and name of the input variables. Do not change this manually.
model_config$init_config: List storing all parameters passed to method new().
last_training('list()')
List for storing the history and the results of the last training. This
information will be overwritten if a new training is started.
last_training$learning_time: Duration of the training process.
last_training$history: History of the last training.
last_training$data: Object of class table storing the initial frequencies of the passed data.
last_training$data_pb: Matrix storing the number of additional cases (test and training) added
during balanced pseudo-labeling. The rows refer to folds and final training.
The columns refer to the steps during pseudo-labeling.
last_training$data_bsc_test: Matrix storing the number of cases for each category used for testing
during the phase of balanced synthetic cases. Please note that the
frequencies include original and synthetic cases. In case the number
of original and synthetic cases exceeds the limit for the majority classes,
the frequency represents the number of cases created by cluster analysis.
last_training$date: Time when the last training finished.
last_training$config: List storing which kind of estimation was requested during the last training.
last_training$config$use_bsc: TRUE if balanced synthetic cases were requested. FALSE
if not.
last_training$config$use_baseline: TRUE if baseline estimation was requested. FALSE
if not.
last_training$config$use_bpl: TRUE if balanced pseudo-labeling was requested. FALSE
if not.
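A brief, illustrative sketch of how these entries can be inspected after training (the variable name classifier is an assumption, standing for a trained object of this class):
# Illustrative access to the training log of a trained classifier.
classifier$last_training$learning_time    # duration of the training process
classifier$last_training$history          # history of the last training
classifier$last_training$config$use_bsc   # TRUE if balanced synthetic cases were used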
reliability('list()')
List for storing central reliability measures of the last training.
reliability$test_metric: Array containing the reliability measures for the validation data for
every fold, method, and step (in case of pseudo-labeling).
reliability$test_metric_mean: Array containing the reliability measures for the validation data for
every method and step (in case of pseudo-labeling). The values are the means
across all folds.
reliability$raw_iota_objects: List containing all iota objects generated with the package iotarelr
for every fold at the start and the end of the last training.
reliability$raw_iota_objects$iota_objects_start: List of objects with class iotarelr_iota2 containing the
estimated iota reliability of the second generation for the baseline model
for every fold.
If the estimation of the baseline model is not requested, the list is
set to NULL.
reliability$raw_iota_objects$iota_objects_end: List of objects with class iotarelr_iota2 containing the
estimated iota reliability of the second generation for the final model
for every fold. Depending on the requested training method, these values
refer to the baseline model, a model trained on the basis of balanced
synthetic cases, balanced pseudo-labeling, or a combination of balanced
synthetic cases and pseudo-labeling.
reliability$raw_iota_objects$iota_objects_start_free: List of objects with class iotarelr_iota2 containing the
estimated iota reliability of the second generation for the baseline model
for every fold.
If the estimation of the baseline model is not requested, the list is
set to NULL. Please note that the model is estimated without
forcing the Assignment Error Matrix to be in line with the assumption of weak superiority.
reliability$raw_iota_objects$iota_objects_end_free: List of objects with class iotarelr_iota2 containing the
estimated iota reliability of the second generation for the final model
for every fold. Depending on the requested training method, these values
refer to the baseline model, a trained model on the basis of balanced
synthetic cases, balanced pseudo-labeling or a combination of balanced
synthetic cases and pseudo-labeling.
Please note that the model is estimated without
forcing the Assignment Error Matrix to be in line with the assumption of weak superiority.
reliability$iota_object_start: Object of class iotarelr_iota2 as a mean of the individual objects
for every fold. If the estimation of the baseline model is not requested, the list is
set to NULL.
reliability$iota_object_start_free: Object of class iotarelr_iota2 as a mean of the individual objects
for every fold. If the estimation of the baseline model is not requested, the list is
set to NULL.
Please note that the model is estimated without
forcing the Assignment Error Matrix to be in line with the assumption of weak superiority.
reliability$iota_object_end: Object of class iotarelr_iota2 as a mean of the individual objects
for every fold.
Depending on the requested training method, this object
refers to the baseline model, a trained model on the basis of balanced
synthetic cases, balanced pseudo-labeling or a combination of balanced
synthetic cases and pseudo-labeling.
reliability$iota_object_end_free: Object of class iotarelr_iota2 as a mean of the individual objects
for every fold.
Depending on the requested training method, this object
refers to the baseline model, a trained model on the basis of balanced
synthetic cases, balanced pseudo-labeling or a combination of balanced
synthetic cases and pseudo-labeling.
Please note that the model is estimated without
forcing the Assignment Error Matrix to be in line with the assumption of weak superiority.
reliability$standard_measures_end: Object of class list containing the final
measures for precision, recall, and f1 for every fold.
Depending on the requested training method, these values
refer to the baseline model, a trained model on the basis of balanced
synthetic cases, balanced pseudo-labeling or a combination of balanced
synthetic cases and pseudo-labeling.
reliability$standard_measures_mean: Matrix containing the mean
measures for precision, recall, and f1 at the end of every fold.
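An illustrative sketch of retrieving the stored reliability measures (again assuming a trained classifier in the variable classifier):
# Illustrative access to the reliability measures described above.
classifier$reliability$test_metric_mean        # mean reliability measures across folds
classifier$reliability$iota_object_end         # mean iota object for the final model
classifier$reliability$standard_measures_mean  # mean precision, recall, and f1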
new()Creating a new instance of this class.
TextEmbeddingClassifierNeuralNet$new(
ml_framework = aifeducation_config$get_framework(),
name = NULL,
label = NULL,
text_embeddings = NULL,
targets = NULL,
hidden = c(128),
rec = c(128),
self_attention_heads = 0,
intermediate_size = NULL,
attention_type = "fourier",
add_pos_embedding = TRUE,
rec_dropout = 0.1,
repeat_encoder = 1,
dense_dropout = 0.4,
recurrent_dropout = 0.4,
encoder_dropout = 0.1,
optimizer = "adam"
)ml_frameworkstring Framework to use for training and inference.
ml_framework="tensorflow" for 'tensorflow' and ml_framework="pytorch"
for 'pytorch'
nameCharacter Name of the new classifier. Please refer to
common naming conventions. Free text can be used with the parameter label.
labelCharacter Label for the new classifier. Here you can use
free text.
text_embeddingsAn object of class EmbeddedText.
targetsfactor containing the target values of the classifier.
hiddenvector containing the number of neurons for each dense layer.
The length of the vector determines the number of dense layers. If you want no dense layer,
set this parameter to NULL.
recvector containing the number of neurons for each recurrent layer.
The length of the vector determines the number of recurrent layers. If you want no recurrent layer,
set this parameter to NULL.
self_attention_headsinteger determining the number of attention heads
for a self-attention layer. Only relevant if attention_type="multihead".
intermediate_sizeint determining the size of the projection layer within
each transformer encoder.
attention_typestring Choose the relevant attention type. Possible values
are "fourier" and "multihead".
add_pos_embeddingbool TRUE if positional embedding should be used.
rec_dropoutdouble between 0 and less than 1, determining the
dropout between the bidirectional GRU layers.
repeat_encoderint determining how many times the encoder should be
added to the network.
dense_dropoutdouble between 0 and less than 1, determining the
dropout between dense layers.
recurrent_dropoutdouble between 0 and less than 1, determining the
recurrent dropout for each recurrent layer. Only relevant for keras models.
encoder_dropoutdouble between 0 and less than 1, determining the
dropout for the dense projection within the encoder layers.
optimizerObject of class keras.optimizers.
Returns an object of class TextEmbeddingClassifierNeuralNet which is ready for training.
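A hedged example of creating a classifier with the arguments listed above (the objects text_embeddings and labels are illustrative assumptions and not part of this documentation):
# 'text_embeddings' is assumed to be an object of class EmbeddedText created
# with a TextEmbeddingModel; 'labels' is assumed to be a named factor.
classifier <- TextEmbeddingClassifierNeuralNet$new(
  ml_framework = "tensorflow",
  name = "classifier_reviews",
  label = "Classifier for Product Reviews",
  text_embeddings = text_embeddings,
  targets = labels,
  hidden = c(128),
  rec = c(128),
  attention_type = "fourier",
  optimizer = "adam"
)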
train()Method for training a neural net.
TextEmbeddingClassifierNeuralNet$train(
data_embeddings,
data_targets,
data_n_test_samples = 5,
balance_class_weights = TRUE,
use_baseline = TRUE,
bsl_val_size = 0.25,
use_bsc = TRUE,
bsc_methods = c("dbsmote"),
bsc_max_k = 10,
bsc_val_size = 0.25,
bsc_add_all = FALSE,
use_bpl = TRUE,
bpl_max_steps = 3,
bpl_epochs_per_step = 1,
bpl_dynamic_inc = FALSE,
bpl_balance = FALSE,
bpl_max = 1,
bpl_anchor = 1,
bpl_min = 0,
bpl_weight_inc = 0.02,
bpl_weight_start = 0,
bpl_model_reset = FALSE,
sustain_track = TRUE,
sustain_iso_code = NULL,
sustain_region = NULL,
sustain_interval = 15,
epochs = 40,
batch_size = 32,
dir_checkpoint,
trace = TRUE,
keras_trace = 2,
pytorch_trace = 2,
n_cores = 2
)data_embeddingsObject of class EmbeddedText.
data_targetsFactor containing the labels for cases
stored in data_embeddings. Factor must be named and has to use the
same names used in data_embeddings.
data_n_test_samplesint determining the number of cross-fold
samples.
balance_class_weightsbool If TRUE, class weights are
generated based on the frequencies of the training data with the method
'Inverse Class Frequency'. If FALSE, each class has the weight 1.
use_baselinebool TRUE if the calculation of a baseline
model is requested. This option is only relevant for use_bsc=TRUE or
use_bpl=TRUE. If both are FALSE, a baseline model is always calculated.
bsl_val_sizedouble between 0 and 1, indicating the proportion of cases of each class
which should be used for the validation sample during the estimation of the baseline model.
The remaining cases are part of the training data.
use_bscbool TRUE if the estimation should integrate
balanced synthetic cases. FALSE if not.
bsc_methodsvector containing the methods for generating
synthetic cases via 'smotefamily'. Multiple methods can
be passed. Currently bsc_methods=c("adas"), bsc_methods=c("smote")
and bsc_methods=c("dbsmote") are possible.
bsc_max_kint determining the maximal value of k used
for creating synthetic units.
bsc_val_sizedouble between 0 and 1, indicating the proportion of cases of each class
which should be used for the validation sample during the estimation with synthetic cases.
bsc_add_allbool If FALSE, only the synthetic cases necessary to fill
the gap between a class and the major class are added to the data. If TRUE, all
generated synthetic cases are added to the data.
use_bplbool TRUE if the estimation should integrate
balanced pseudo-labeling. FALSE if not.
bpl_max_stepsint determining the maximum number of steps during
pseudo-labeling.
bpl_epochs_per_stepint Number of training epochs within every step.
bpl_dynamic_incbool If TRUE, only a specific percentage
of cases is included during each step. The percentage is determined by
step/bpl_max_steps. If FALSE, all cases are used.
bpl_balancebool If TRUE, the same number of cases for
every category/class of the pseudo-labeled data is used during training. That
is, the number of cases is determined by the smallest class/category.
bpl_maxdouble between 0 and 1, setting the maximal level of
confidence for considering a case for pseudo-labeling.
bpl_anchordouble between 0 and 1 indicating the reference
point for sorting the new cases of every label. See notes for more details.
bpl_mindouble between 0 and 1, setting the minimal level of
confidence for considering a case for pseudo-labeling.
bpl_weight_incdouble determining how much the sample weights
should be increased for the cases with pseudo-labels in every step.
bpl_weight_startdouble Starting value for the weights of the
unlabeled cases.
bpl_model_resetbool If TRUE, model is re-initialized at every
step.
sustain_trackbool If TRUE, energy consumption is tracked
during training via the python library 'codecarbon'.
sustain_iso_codestring ISO code (Alpha-3-Code) for the country. This variable
must be set if sustainability should be tracked. A list can be found on
Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes.
sustain_regionRegion within a country. Only available for the USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html
sustain_intervalinteger Interval in seconds for measuring power
usage.
epochsint Number of training epochs.
batch_sizeint Size of batches.
dir_checkpointstring Path to the directory where
the checkpoint during training should be saved. If the directory does not
exist, it is created.
tracebool TRUE, if information about the estimation
phase should be printed to the console.
keras_traceint keras_trace=0 does not print any
information about the training process from keras on the console.
keras_trace=1 prints a progress bar. keras_trace=2 prints
one line of information for every epoch.
pytorch_traceint pytorch_trace=0 does not print any
information about the training process from pytorch on the console.
pytorch_trace=1 prints a progress bar. pytorch_trace=2 prints
one line of information for every epoch.
n_coresint Number of cores used for creating synthetic units.
bsc_max_k: All values from 2 up to bsc_max_k are successively used. If
bsc_max_k is too high, the value is reduced to a number that
allows the calculation of synthetic units.
bpl_anchor: With the help of this value, the new cases are sorted. For
this purpose, the distance from the anchor is calculated and all cases are arranged
in ascending order.
Function does not return a value. It changes the object into a trained classifier.
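A hedged sketch of a training call using the arguments listed above (the objects text_embeddings and labels as well as the checkpoint directory are illustrative assumptions):
# 'labels' may contain NA values for unlabeled cases; the checkpoint path is a placeholder.
classifier$train(
  data_embeddings = text_embeddings,
  data_targets = labels,
  data_n_test_samples = 5,
  use_baseline = TRUE,
  use_bsc = TRUE,
  bsc_methods = c("dbsmote"),
  use_bpl = TRUE,
  bpl_max_steps = 3,
  epochs = 40,
  batch_size = 32,
  dir_checkpoint = "checkpoints/classifier_reviews",
  trace = TRUE
)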
predict()Method for predicting new data with a trained neural net.
TextEmbeddingClassifierNeuralNet$predict(newdata, batch_size = 32, verbose = 1)newdataObject of class EmbeddedText or
data.frame for which predictions should be made.
batch_sizeint Size of batches.
verboseint verbose=0 does not print any
information about the prediction process on the console.
verbose=1 prints a progress bar. verbose=2 prints
one line of information for every epoch.
Returns a data.frame containing the predictions and
the probabilities of the different labels for each case.
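An illustrative prediction call (new_embeddings is an assumed object of class EmbeddedText created with the same text embedding model as used for training):
predictions <- classifier$predict(
  newdata = new_embeddings,
  batch_size = 32,
  verbose = 0
)
head(predictions)  # data.frame with the predicted labels and class probabilities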
check_embedding_model()Method for checking if the provided text embeddings are created with the same TextEmbeddingModel as the classifier.
TextEmbeddingClassifierNeuralNet$check_embedding_model(text_embeddings)text_embeddingsObject of class EmbeddedText.
TRUE if the underlying TextEmbeddingModel is the same.
FALSE if the models differ.
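For example (new_embeddings as above, illustrative):
# Returns TRUE if 'new_embeddings' was created with the same TextEmbeddingModel
# as the classifier, and FALSE otherwise.
classifier$check_embedding_model(text_embeddings = new_embeddings)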
get_model_info()Method for requesting the model information.
TextEmbeddingClassifierNeuralNet$get_model_info()Returns a list of all relevant model information.
get_text_embedding_model()Method for requesting the text embedding model information.
TextEmbeddingClassifierNeuralNet$get_text_embedding_model()Returns a list of all relevant model information on the text embedding model
underlying the classifier.
set_publication_info()Method for setting publication information of the classifier
TextEmbeddingClassifierNeuralNet$set_publication_info(
authors,
citation,
url = NULL
)authorsList of authors.
citationFree text citation.
urlURL of a corresponding homepage.
Function does not return a value. It is used for setting the private members for publication information.
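A minimal sketch with placeholder values (the use of person objects created with utils::personList() is an assumption about the expected input format):
classifier$set_publication_info(
  authors = personList(person(given = "Jane", family = "Doe")),  # placeholder author
  citation = "Doe, J. (2024). Example classifier.",              # placeholder citation
  url = "https://example.org/classifier"                         # placeholder URL
)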
get_publication_info()Method for requesting the bibliographic information of the classifier.
TextEmbeddingClassifierNeuralNet$get_publication_info()list with all saved bibliographic information.
set_software_license()Method for setting the license of the classifier.
TextEmbeddingClassifierNeuralNet$set_software_license(license = "GPL-3")licensestring containing the abbreviation of the license or
the license text.
Function does not return a value. It is used for setting the private member for the software license of the model.
get_software_license()Method for getting the license of the classifier.
TextEmbeddingClassifierNeuralNet$get_software_license()Returns a string representing the software license,
i.e. the abbreviation of the license or the license text.
set_documentation_license()Method for setting the license of the classifier's documentation.
TextEmbeddingClassifierNeuralNet$set_documentation_license(
license = "CC BY-SA"
)licensestring containing the abbreviation of the license or
the license text.
Function does not return a value. It is used for setting the private member for the documentation license of the model.
get_documentation_license()Method for getting the license of the classifier's documentation.
TextEmbeddingClassifierNeuralNet$get_documentation_license()Returns the license as a string,
i.e. the abbreviation of the license or the license text.
set_model_description()Method for setting a description of the classifier.
TextEmbeddingClassifierNeuralNet$set_model_description(
eng = NULL,
native = NULL,
abstract_eng = NULL,
abstract_native = NULL,
keywords_eng = NULL,
keywords_native = NULL
)engstring A text describing the training of the learner,
its theoretical and empirical background, and the different output labels
in English.
nativestring A text describing the training of the learner,
its theoretical and empirical background, and the different output labels
in the native language of the classifier.
abstract_engstring A text providing a summary of the description
in English.
abstract_nativestring A text providing a summary of the description
in the native language of the classifier.
keywords_engvector of keywords in English.
keywords_nativevector of keywords in the native language of the classifier.
Function does not return a value. It is used for setting the private members for the description of the model.
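A hedged example with placeholder texts:
classifier$set_model_description(
  eng = "Description of the training, theoretical background, and output labels in English.",
  native = "<description in the native language of the classifier>",
  abstract_eng = "Short English summary of the classifier.",
  abstract_native = "<summary in the native language>",
  keywords_eng = c("text classification", "neural net"),
  keywords_native = c("<keyword 1>", "<keyword 2>")
)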
get_model_description()Method for requesting the model description.
TextEmbeddingClassifierNeuralNet$get_model_description()list with the description of the classifier in English
and the native language.
save_model()Method for saving a model to 'Keras v3 format', 'tensorflow' SavedModel format or h5 format.
TextEmbeddingClassifierNeuralNet$save_model(dir_path, save_format = "default")dir_pathstring() Path of the directory where the model should be
saved.
save_formatFormat for saving the model. For 'tensorflow'/'keras' models
use "keras" for 'Keras v3 format',
"tf" for the SavedModel format,
or "h5" for HDF5.
For 'pytorch' models use "safetensors" for 'safetensors' or
"pt" for 'pytorch' via pickle.
Use "default" for the standard format. This is keras for
'tensorflow'/'keras' models and safetensors for 'pytorch' models.
Function does not return a value. It saves the model to disk.
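A brief sketch (the directory path is a placeholder):
# Saves the model in the default format: 'keras' for tensorflow/keras models,
# 'safetensors' for pytorch models.
classifier$save_model(dir_path = "models/classifier_reviews", save_format = "default")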
load_model()Method for importing a model from 'Keras v3 format', 'tensorflow' SavedModel format or h5 format.
TextEmbeddingClassifierNeuralNet$load_model(dir_path, ml_framework = "auto")dir_pathstring() Path of the directory where the model is
saved.
ml_frameworkstring Determines the machine learning framework
for using the model. Possible are ml_framework="pytorch" for 'pytorch',
ml_framework="tensorflow" for 'tensorflow', and ml_framework="auto".
Function does not return a value. It is used to load the weights of a model.
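A matching sketch for loading the saved weights (path as above, a placeholder):
# With ml_framework = "auto" the framework is determined automatically.
classifier$load_model(dir_path = "models/classifier_reviews", ml_framework = "auto")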
get_package_versions()Method for requesting a summary of the R and python packages' versions used for creating the classifier.
TextEmbeddingClassifierNeuralNet$get_package_versions()Returns a list containing the versions of the relevant
R and python packages.
get_sustainability_data()Method for requesting a summary of tracked energy consumption during training and an estimate of the resulting CO2 equivalents in kg.
TextEmbeddingClassifierNeuralNet$get_sustainability_data()Returns a list containing the tracked energy consumption,
CO2 equivalents in kg, information on the tracker used, and technical
information on the training infrastructure.
get_ml_framework()Method for requesting the machine learning framework used for the classifier.
TextEmbeddingClassifierNeuralNet$get_ml_framework()Returns a string describing the machine learning framework used
for the classifier.
clone()The objects of this class are cloneable with this method.
TextEmbeddingClassifierNeuralNet$clone(deep = FALSE)deepWhether to make a deep clone.