DataManagerClassifier: Data manager for classification tasks

Description

Abstract class for managing the data and samples during the training of a classifier. DataManagerClassifier is used with all classifiers based on text embeddings.

Arguments

Value

Objects of this class are used for ensuring the correct data management for training different types of classifiers. They are also used for data augmentation by creating synthetic cases with different techniques.

Public fields

config

('list')
Field for storing configuration of the DataManagerClassifier.

state

('list')
Field for storing the current state of the DataManagerClassifier.

datasets

('list')
Field for storing the data sets used during training. All elements of the list are data sets of class datasets.arrow_dataset.Dataset. The following data sets are available:

data_labeled: all cases which have a label.
data_unlabeled: all cases which have no label.
data_labeled_synthetic: all synthetic cases with their corresponding labels.
data_labeled_pseudo: subset of data_unlabeled if pseudo labels were estimated by a classifier.

name_idx

('named vector')
Field for storing the pairs of indexes and names of every case. The pairs for labeled and unlabeled data are separated.

samples

('list')
Field for storing the assignment of every case to a train, validation, or test data set depending on the concrete fold. Only the indexes and not the names are stored. In addition, the list contains the assignment for the final training, which excludes a test data set. If the DataManagerClassifier uses i folds, the sample for the final training can be requested with i+1.
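The structure described for samples can be sketched in base R. This is a minimal illustration only: the helper make_samples, the element names train, val, and test, and the random (non-stratified) assignment are assumptions and may differ from what the package does internally. The sketch assumes at least two folds.

```r
# Hypothetical helper: builds fold assignments similar in spirit to the
# `samples` field. Names and structure are assumptions for illustration.
make_samples <- function(n_cases, folds = 5, val_size = 0.25, seed = 1) {
  stopifnot(folds >= 2, n_cases >= 2 * folds)
  set.seed(seed)
  # Randomly assign every case index to exactly one fold
  fold_id <- sample(rep_len(seq_len(folds), n_cases))
  # Split a set of indexes into a training and a validation part
  split_train_val <- function(idx) {
    n_val <- max(1, floor(length(idx) * val_size))
    val <- idx[sample.int(length(idx), n_val)]
    list(train = setdiff(idx, val), val = val)
  }
  samples <- lapply(seq_len(folds), function(i) {
    test <- which(fold_id == i)
    rest <- which(fold_id != i)
    c(split_train_val(rest), list(test = test))
  })
  # Entry folds + 1: final training, which has no test sample
  samples[[folds + 1]] <- split_train_val(seq_len(n_cases))
  samples
}

samples <- make_samples(n_cases = 20, folds = 4)
length(samples)      # 5, that is folds + 1
names(samples[[1]])  # "train" "val" "test"
names(samples[[5]])  # "train" "val"
```

Only indexes are stored in this sketch, which mirrors the statement that the field keeps indexes rather than case names.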

Methods

Public methods

Method `new()`

Creating a new instance of this class.

Usage

DataManagerClassifier$new(
  data_embeddings,
  data_targets,
  folds = 5,
  val_size = 0.25,
  pad_value = -100,
  class_levels,
  one_hot_encoding = TRUE,
  add_matrix_map = TRUE,
  sc_methods = "knnor",
  sc_min_k = 1,
  sc_max_k = 10,
  trace = TRUE,
  n_cores = auto_n_cores()
)

Arguments

data_embeddings: EmbeddedText, LargeDataSetForTextEmbeddings Object of class EmbeddedText or LargeDataSetForTextEmbeddings.

data_targets

factor containing the labels for cases stored in the embeddings. The factor must be named and has to use the same names as used in the embeddings.

folds

int determining the number of cross-fold samples. Allowed values: 1 <= x

val_size

double between 0 and 1, indicating the proportion of cases which should be used for the validation sample during the estimation of the model. The remaining cases are part of the training data. Allowed values: 0 < x < 1

pad_value

int Value indicating padding. This value should not be in the range of regular values for computations. Thus it is not recommended to change this value. Default is -100. Allowed values: x <= -100

class_levels

vector containing the levels (categories or classes) within the target data. Please note that order matters. For ordinal data please ensure that the levels are sorted correctly with later levels indicating a higher category/class. For nominal data the order does not matter.

one_hot_encoding

bool If TRUE all labels are converted to one hot encoding.

add_matrix_map

bool If TRUE all embeddings are transformed into a two-dimensional matrix. The number of rows equals the number of cases. The number of columns equals times*features.

sc_methods

string containing the method for generating synthetic cases. Allowed values: 'knnor'

sc_min_k

int determining the minimal number of k which is used for creating synthetic units. Allowed values: 1 <= x

sc_max_k

int determining the maximal number of k which is used for creating synthetic units. Allowed values: 1 <= x

trace

bool TRUE if information about the estimation phase should be printed to the console.

n_cores
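The transformations described for class_levels, one_hot_encoding, and add_matrix_map can be illustrated in base R. This is a sketch under stated assumptions, not the package's implementation: the example labels, the array layout (cases x times x features), and the column order of the flattened matrix are all assumptions made for illustration.

```r
# Labels as a named factor; the order of class_levels defines the column order
class_levels <- c("low", "medium", "high")
targets <- factor(c(a = "medium", b = "high", c = "low"), levels = class_levels)

# One-hot encoding: one row per case, one column per level
one_hot <- diag(length(class_levels))[as.integer(targets), , drop = FALSE]
dimnames(one_hot) <- list(names(targets), class_levels)
one_hot

# Embeddings as a 3D array: cases x times x features (assumed layout)
embeddings <- array(
  seq_len(3 * 2 * 4),
  dim = c(3, 2, 4),
  dimnames = list(names(targets), NULL, NULL)
)

# Matrix map: one row per case and times * features columns.
# The column order used by the package is not known here; this sketch
# simply keeps R's column-major order.
matrix_map <- matrix(as.vector(embeddings), nrow = dim(embeddings)[1])
dim(matrix_map)  # 3 8
```

Because the level order determines the column order of the one-hot matrix, sorting class_levels correctly matters for ordinal data.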

Method `set_state()`

Usage

DataManagerClassifier$set_state(iteration, step = NULL)

Arguments

iteration: int determining the current iteration of the training. That is, iteration determines the fold to use for training, validation, and testing. If i is the number of folds, i+1 requests the sample for the final training. Alternatively, iteration can be the string "final" to request the sample for the final training.

step
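The two ways of addressing the final training through iteration (the index i+1 or the string "final") can be sketched as follows. The helper resolve_iteration is hypothetical and not part of the package; it only shows how both forms could map to the same position in a list of folds + 1 samples.

```r
# Hypothetical helper, not part of the package: maps the iteration argument
# to an index into a list holding folds + 1 samples.
resolve_iteration <- function(iteration, n_folds) {
  if (identical(iteration, "final")) {
    return(n_folds + 1L)
  }
  stopifnot(is.numeric(iteration), iteration >= 1, iteration <= n_folds + 1)
  as.integer(iteration)
}

n_folds <- 5L
resolve_iteration(2, n_folds)        # 2
resolve_iteration(6, n_folds)        # 6, the final training
resolve_iteration("final", n_folds)  # 6
```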

Method `get_dataset()`

Usage

DataManagerClassifier$get_dataset(
  inc_labeled = TRUE,
  inc_unlabeled = FALSE,
  inc_synthetic = FALSE,
  inc_pseudo_data = FALSE
)

Arguments

inc_labeled: bool If TRUE the data set includes all cases which have labels.

inc_unlabeled

bool If TRUE the data set includes all cases which have no labels.

inc_synthetic

bool If TRUE the data set includes all synthetic cases with their corresponding labels.

inc_pseudo_data

bool If TRUE the data set includes all cases which have pseudo labels.

Returns

Returns an object of class datasets.arrow_dataset.Dataset containing the requested kind of data along with all requested transformations for training. Please note that this method returns a data set that is designed for training only. The corresponding validation data set is requested with get_val_dataset and the corresponding test data set with get_test_dataset.
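How the inc_* flags select among the four data sets can be sketched in base R. Plain data frames stand in for datasets.arrow_dataset.Dataset objects, and the helper combine_datasets and the example rows are hypothetical. The sketch also ignores the restriction to the training cases of the current fold, which the real method applies according to the current state.

```r
# Hypothetical stand-ins for the four data sets; plain data frames are used
# here instead of datasets.arrow_dataset.Dataset objects.
datasets <- list(
  data_labeled = data.frame(id = c("a", "b"), label = c("x", "y")),
  data_unlabeled = data.frame(id = c("c", "d"), label = NA_character_),
  data_labeled_synthetic = data.frame(id = "s1", label = "x"),
  data_labeled_pseudo = data.frame(id = "c", label = "y")
)

# Hypothetical helper mirroring the inc_* flags of get_dataset()
combine_datasets <- function(datasets,
                             inc_labeled = TRUE,
                             inc_unlabeled = FALSE,
                             inc_synthetic = FALSE,
                             inc_pseudo_data = FALSE) {
  include <- c(
    data_labeled = inc_labeled,
    data_unlabeled = inc_unlabeled,
    data_labeled_synthetic = inc_synthetic,
    data_labeled_pseudo = inc_pseudo_data
  )
  # Keep only the requested data sets and stack them row-wise
  selected <- datasets[names(include)[include]]
  do.call(rbind, unname(selected))
}

# Labeled cases plus synthetic cases
combine_datasets(datasets, inc_synthetic = TRUE)
```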

Method `get_val_dataset()`

Method for requesting a data set for validation depending on the current state of the DataManagerClassifier.

Usage

DataManagerClassifier$get_val_dataset()

Returns

Returns an object of class datasets.arrow_dataset.Dataset containing the requested kind of data along with all requested transformations for validation. The corresponding data set for training can be requested with get_dataset and the corresponding data set for testing with get_test_dataset.

Method `get_test_dataset()`

Method for requesting a data set for testing depending on the current state of the DataManagerClassifier.

Usage

DataManagerClassifier$get_test_dataset()

Returns

Method `create_synthetic()`

Method for generating synthetic data used during training. The process uses all labeled data belonging to the current state of the DataManagerClassifier.

Usage

DataManagerClassifier$create_synthetic(trace = TRUE, inc_pseudo_data = FALSE)

Arguments

trace: bool If TRUE information on the process is printed to the console.

inc_pseudo_data

bool If TRUE data with pseudo labels are used in addition to the labeled data for generating synthetic cases.

Returns

This method returns nothing. It generates a new data set of synthetic cases, which is stored as an object of class datasets.arrow_dataset.Dataset in the field datasets$data_labeled_synthetic. Please note that a call of this method will override an existing data set in the corresponding field.
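To give an intuition for how synthetic cases can be derived from the k nearest neighbours of existing cases, the following base R sketch interpolates between a case and one of its neighbours. It is deliberately simplified and is not the KNNOR algorithm used by sc_methods = "knnor"; the helper interpolate_cases and its arguments are hypothetical.

```r
# Simplified neighbour interpolation for the cases of one class. This is NOT
# the KNNOR algorithm; it only illustrates how k neighbours can yield new cases.
interpolate_cases <- function(x, k = 2, n_new = 3, seed = 1) {
  stopifnot(is.matrix(x), nrow(x) > k)
  set.seed(seed)
  distances <- as.matrix(dist(x))
  new_cases <- matrix(NA_real_, nrow = n_new, ncol = ncol(x))
  for (i in seq_len(n_new)) {
    origin <- sample.int(nrow(x), 1)
    # The k nearest neighbours of the origin, excluding the origin itself
    neighbours <- setdiff(order(distances[origin, ]), origin)[seq_len(k)]
    partner <- neighbours[sample.int(k, 1)]
    # Place the new case at a random position between origin and partner
    weight <- runif(1)
    new_cases[i, ] <- x[origin, ] + weight * (x[partner, ] - x[origin, ])
  }
  new_cases
}

# Four cases of one class located at the corners of the unit square
minority <- matrix(c(0, 0, 1, 0, 0, 1, 1, 1), ncol = 2, byrow = TRUE)
synthetic <- interpolate_cases(minority, k = 2, n_new = 3)
dim(synthetic)  # 3 2
```

Each new case lies on the line segment between two existing cases of the same class, so it inherits their label.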

Method `add_replace_pseudo_data()`

Method for adding data with pseudo labels generated by a classifier.

Usage

DataManagerClassifier$add_replace_pseudo_data(inputs, labels)

Arguments

inputs: array or matrix representing the input data.

labels

factor containing the corresponding pseudo labels.

Returns

This method returns nothing. It generates a new data set of cases with pseudo labels, which is stored as an object of class datasets.arrow_dataset.Dataset in the field datasets$data_labeled_pseudo. Please note that a call of this method will override an existing data set in the corresponding field.

Method `clone()`

The objects of this class are cloneable with this method.

Usage

DataManagerClassifier$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.
