xgb.DataBatch: Structure for Data Batches

Description

Helper function to supply the data for a single batch of a data iterator when constructing a DMatrix from external memory through xgb.ExtMemDMatrix() or through xgb.QuantileDMatrix.from_iterator().

This function is only meant to be called inside a callback function (which is passed as an argument to xgb.DataIter() to construct a data iterator) when constructing a DMatrix through external memory. Otherwise, one should call xgb.DMatrix() or xgb.QuantileDMatrix().

The object that results from calling this function directly is not like an xgb.DMatrix: it cannot be used to train a model or to get predictions. Its only use is to supply data to an iterator, from which a DMatrix is then constructed.

For more information and for example usage, see the documentation for xgb.ExtMemDMatrix().
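As a minimal sketch of where xgb.DataBatch() sits in the workflow described above, the following splits an in-memory dataset into two batches. The names iterator_env, iterator_next and iterator_reset are illustrative, and the argument names f_next, f_reset and cache_prefix are assumed from the documentation of xgb.DataIter() and xgb.ExtMemDMatrix(). In real use, each batch would be loaded from disk rather than subset from memory.

```r
library(xgboost)

data(mtcars)

# Environment holding the iterator state; 'iter' counts the batches served so far.
iterator_env <- as.environment(
  list(iter = 0, x = as.matrix(mtcars[, -1]), y = mtcars[, 1])
)

# Callback returning the next batch, or NULL once all batches have been served.
iterator_next <- function(iterator_env) {
  curr_iter <- iterator_env[["iter"]]
  if (curr_iter >= 2) {
    return(NULL)
  }
  rows <- if (curr_iter == 0) 1:16 else 17:32
  iterator_env[["iter"]] <- curr_iter + 1
  # xgb.DataBatch() is called once per batch, inside the callback.
  xgb.DataBatch(
    data = iterator_env[["x"]][rows, , drop = FALSE],
    label = iterator_env[["y"]][rows]
  )
}

# Callback that rewinds the iterator to the first batch.
iterator_reset <- function(iterator_env) {
  iterator_env[["iter"]] <- 0
}

data_iterator <- xgb.DataIter(
  env = iterator_env,
  f_next = iterator_next,
  f_reset = iterator_reset
)

# The DMatrix is assembled from the batches produced by the iterator.
dm <- xgb.ExtMemDMatrix(data_iterator, cache_prefix = tempdir(), nthread = 1)
```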

Usage

xgb.DataBatch(
  data,
  label = NULL,
  weight = NULL,
  base_margin = NULL,
  feature_names = colnames(data),
  feature_types = NULL,
  group = NULL,
  qid = NULL,
  label_lower_bound = NULL,
  label_upper_bound = NULL,
  feature_weights = NULL
)

Value

An object of class xgb.DataBatch, which is just a list containing the data and parameters passed here. It does not inherit from xgb.DMatrix.

Arguments

data

Data belonging to this batch.

Note that not all of the input types supported by xgb.DMatrix() are possible to pass here. Supported types are:

matrix, with types numeric, integer, and logical. Note that for types integer and logical, missing values might not be automatically recognized as such - see the documentation for parameter missing in xgb.ExtMemDMatrix() for details on this.
data.frame, with the same types as supported by xgb.DMatrix() and the same conversions applied to it. See the documentation for parameter data in xgb.DMatrix() for details on it.
CSR matrices, as class dgRMatrix from package "Matrix".
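Sparse matrices created with package "Matrix" are column-compressed by default, so they may need converting before being passed as a batch. A minimal sketch of one way to obtain a dgRMatrix (the variable names are illustrative):

```r
library(Matrix)

# sparseMatrix() returns column-compressed storage (class dgCMatrix) by default.
x_csc <- sparseMatrix(
  i = c(1, 2, 3, 3),
  j = c(1, 3, 2, 3),
  x = c(1.5, 2.0, 3.5, 4.0),
  dims = c(3, 3)
)

# Convert to row-compressed (CSR) storage.
x_csr <- as(x_csc, "RsparseMatrix")
class(x_csr)
```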

label

Label of the training data. For classification problems, should be passed encoded as integers with numeration starting at zero.
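For example, a factor of class names can be turned into zero-based integer labels as follows (a small sketch; the variable names are illustrative):

```r
# Factor levels are sorted alphabetically: setosa, versicolor, virginica.
species <- factor(c("setosa", "virginica", "versicolor", "setosa"))

# as.integer() gives one-based codes, so subtract one to start at zero.
label <- as.integer(species) - 1L
```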

weight

Weight for each instance.

Note that, for ranking tasks, weights are per-group: one weight is assigned to each group rather than to each data point. This is because only the relative ordering of data points within each group matters, so it doesn't make sense to assign weights to individual data points.

base_margin

Base margin used for boosting from an existing model.

In the case of multi-output models, one can also pass multi-dimensional base_margin.

feature_names

Set names for features. Overrides column names in data frame and matrix.

Note: columns are not referenced by name when calling predict, so the column order there must be the same as in the DMatrix construction, regardless of the column names.
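Since columns are matched by position, one way to guard against a mismatch is to reorder new data by the training column names before predicting. A minimal sketch with illustrative matrices:

```r
train_x <- matrix(
  1:6, nrow = 2,
  dimnames = list(NULL, c("age", "height", "weight"))
)

# New data arrives with the same columns in a different order.
new_x <- matrix(
  7:12, nrow = 2,
  dimnames = list(NULL, c("weight", "age", "height"))
)

# Reorder the columns to match the order used at construction time.
new_x_aligned <- new_x[, colnames(train_x), drop = FALSE]
```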

feature_types

Set types for features.

If data is a data.frame and feature_types is not supplied, feature types will be deduced automatically from the column types.

Otherwise, one can pass a character vector with the same length as number of columns in data, with the following possible values:

"c", which represents categorical columns.
"q", which represents numeric columns.
"int", which represents integer columns.
"i", which represents logical (boolean) columns.

Note that, while categorical types are treated differently from the rest for model fitting purposes, the other types do not influence the generated model, but have effects in other functionalities such as feature importances.

Important: Categorical features, if specified manually through feature_types, must be encoded as integers with numeration starting at zero, and the same encoding needs to be applied when passing data to predict(). Even if passing factor types, the encoding will not be saved, so make sure that factor columns passed to predict have the same levels.
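One way to keep the categorical encoding consistent between training and prediction is to fix the levels once and reuse them. The sketch below uses a hypothetical helper, encode(), and illustrative data. It also builds a feature_types vector with one entry per column.

```r
# Levels are fixed up front so that every dataset uses the same mapping.
color_levels <- c("blue", "green", "red")

# Hypothetical helper: map values to zero-based integer codes.
encode <- function(x, levels) {
  as.integer(factor(x, levels = levels)) - 1L
}

train_codes <- encode(c("red", "blue", "green", "red"), color_levels)
new_codes <- encode(c("green", "red"), color_levels)

# Numeric matrix with one numeric column and one categorical column.
x <- cbind(size = c(1.2, 3.4, 0.5, 2.2), color = train_codes)

# One entry per column: "q" for numeric, "c" for categorical.
feature_types <- c("q", "c")
```

Without the fixed levels, factor(c("green", "red")) alone would assign code 0 to "green", which would disagree with the training encoding.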

group

Group sizes for all ranking groups.

qid

Query ID for data samples, used for ranking.
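The two ranking arguments, group and qid, describe the same grouping in different ways. As a sketch, assuming the rows of each query are contiguous, group sizes can be derived from query IDs with base R, and per-group weights then need one entry per group (the values are illustrative):

```r
# One query ID per row; rows belonging to the same query are adjacent.
qid <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)

# Run lengths of consecutive equal IDs give the size of each group.
group <- rle(qid)$lengths

# For ranking tasks, weights are per group rather than per row.
group_weight <- c(1.0, 0.5, 2.0)
```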

label_lower_bound

Lower bound for survival training.

label_upper_bound

Upper bound for survival training.
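As a hedged sketch, the bounds for right-censored data could be built as shown below. It assumes the interval convention of XGBoost's accelerated failure time objective, which is not stated on this page: both bounds equal the observed time for uncensored rows, and the upper bound is infinite for right-censored rows. The variable names time and event_observed are illustrative.

```r
# Observed or censoring time for each row.
time <- c(5, 8, 3, 10)

# TRUE when the event was observed, FALSE when the row is right-censored.
event_observed <- c(TRUE, FALSE, TRUE, FALSE)

# Assumed convention: uncensored rows get [time, time],
# right-censored rows get [time, Inf).
label_lower_bound <- time
label_upper_bound <- ifelse(event_observed, time, Inf)
```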

feature_weights

Set feature weights for column sampling.
