oml_data: Interface to OpenML Data Sets

Description

This is the class for data sets served on OpenML. This object can also be constructed using the sugar function odt().

Arguments

mlr3 Integration

An mlr3::Task can be obtained by calling mlr3::as_task(). The target column must be either the default target (the default behaviour) or one of $feature_names. If the target is one of $feature_names, the default target is added to the features of the task.
An mlr3::DataBackend can be obtained by calling mlr3::as_data_backend(). Depending on the selected file type, the returned backend is a mlr3::DataBackendDataTable (arff) or a mlr3db::DataBackendDuckDB (parquet). Note that a converted backend can contain columns beyond the target and the features (id column or ignored columns).
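
The two conversions can be sketched as follows. This is a minimal sketch, assuming the mlr3 and mlr3oml packages are installed and the OpenML server is reachable. The helper name oml_to_mlr3() is hypothetical and not part of the package.

```r
# Hypothetical helper (not part of mlr3oml): converts an OpenML data set
# into an mlr3 task and a data backend. Needs network access when called.
oml_to_mlr3 = function(id) {
  odata = mlr3oml::odt(id)                   # sugar constructor for OMLData
  list(
    task = mlr3::as_task(odata),             # uses the default target
    backend = mlr3::as_data_backend(odata)   # may also hold id / ignored columns
  )
}
```

Calling the helper with a valid OpenML data id, for example res = oml_to_mlr3(31) (the id is an assumption chosen for illustration), should give access to res$task and res$backend.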

Name conversion

Column names that don't comply with R's naming scheme are renamed (see base::make.names()). This means that the names can differ from those on OpenML.
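
The effect of the renaming can be previewed with base R alone. The column names below are made up for illustration, and the exact arguments the package passes to make.names() are not stated here, so treat this as a sketch of the general behaviour.

```r
# Column names as they might appear on OpenML (made-up examples).
openml_names = c("sepal length", "class", "2nd_feature", "a-b")

# base::make.names() replaces invalid characters with "." and prepends "X"
# where needed (for example when a name starts with a digit).
r_names = make.names(openml_names)
print(r_names)
# yields: "sepal.length" "class" "X2nd_feature" "a.b"

# Lookup from the converted name back to the original OpenML name.
name_map = setNames(openml_names, r_names)
```

A lookup such as name_map["a.b"] then recovers the original name "a-b".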

File Format

Data sets on OpenML are stored either as (sparse) ARFF or as parquet. When creating a new OMLData object, the constructor argument parquet switches between arff and parquet. Note that not all data files are necessarily available as parquet. The option mlr3oml.parquet can be used to set a default. If parquet is TRUE but no parquet file is available, "arff" is used as a fallback.
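
Both ways of selecting the file format can be sketched as below, assuming the mlr3oml package is installed. The helper name new_oml_data() is hypothetical. The constructor call only contacts OpenML when the helper is actually invoked.

```r
# Hypothetical helper (not part of mlr3oml): requests parquet for one data set
# through the documented constructor argument.
new_oml_data = function(id, parquet = TRUE) {
  mlr3oml::OMLData$new(id = id, parquet = parquet)
}

# Alternatively, set a session-wide default through the documented option.
old_opts = options(mlr3oml.parquet = TRUE)
getOption("mlr3oml.parquet")
options(old_opts)  # restore the previous setting
```

Setting the option changes the default for objects created afterwards, while the constructor argument applies to a single object.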

ARFF Files

This package comes with its own reader for ARFF files, based on data.table::fread(). For sparse ARFF files, and if the RWeka package is installed, the reader automatically falls back to the implementation in RWeka::read.arff().

Parquet Files

For the handling of parquet files, we rely on duckdb and DBI.

Super class

mlr3oml::OMLObject -> OMLData

Active bindings

qualities

(data.table())
Data set qualities (performance values), downloaded from the JSON API response and converted to a data.table::data.table() with columns "name" and "value".

tags

(character())
Returns all tags of the object.

parquet

(logical(1))
Whether to use parquet.

data

(data.table())
Returns the data (without the row identifier and ignored columns).

features

(data.table())
Information about data set features (including target), downloaded from the JSON API response and converted to a data.table::data.table() with columns:

"index" (integer()): Column position.
"name" (character()): Name of the feature.
"data_type" (factor()): Type of the feature: "nominal" or "numeric".
"nominal_value" (list()): Levels of the feature, or NULL for numeric features.
"is_target" (logical()): TRUE for target column, FALSE otherwise.
"is_ignore" (logical()): TRUE if this feature should be ignored. Ignored features are removed automatically from the data set.
"is_row_identifier" (logical()): TRUE if the column encodes a row identifier. Row identifiers are removed automatically from the data set.
"number_of_missing_values" (integer()): Number of missing values in the column.
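
The is_ignore and is_row_identifier flags can be used to see which columns are dropped from the data. The sketch below runs offline on a toy table with made-up values that mimics a subset of the documented columns. The helper dropped_columns() is hypothetical and should work the same way on the table returned by $features.

```r
# Hypothetical helper: names of the columns that are removed from the data,
# given a table with the documented "name", "is_ignore" and
# "is_row_identifier" columns.
dropped_columns = function(features) {
  drop = features$is_ignore | features$is_row_identifier
  features$name[drop]
}

# Toy stand-in for the features table (made-up values, subset of the columns).
toy_features = data.frame(
  name = c("row_id", "age", "notes", "class"),
  is_target = c(FALSE, FALSE, FALSE, TRUE),
  is_ignore = c(FALSE, FALSE, TRUE, FALSE),
  is_row_identifier = c(TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

dropped_columns(toy_features)
# yields: "row_id" "notes"
```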

target_names

(character())
Name of the default target, as extracted from the OpenML data set description.

feature_names

(character())
Name of the features, as extracted from the OpenML data set description.

nrow

(integer())
Number of observations, as extracted from the OpenML data set qualities.

ncol

(integer())
Number of features (including targets), as extracted from the table of data set features. This excludes row identifiers and ignored columns.

license

(character())
Returns the license of the data set.

parquet_path

(character())
Downloads the parquet file (or loads it from the cache) and returns the path of the parquet file. Note that this also normalizes the column names in the parquet file.

Methods

Public methods

Inherited methods

mlr3oml::OMLObject$help()

Method `new()`

Creates a new instance of this R6 class.

Usage

OMLData$new(
  id,
  parquet = parquet_default(),
  test_server = test_server_default()
)

Arguments

id: (integer(1))
OpenML id for the object.

parquet

(logical(1))
Whether to use parquet instead of arff. If parquet is not available, it will fall back to arff. Defaults to value of option "mlr3oml.parquet" or FALSE if not set.

test_server

(logical(1))
Whether to use the OpenML test server or public server. Defaults to value of option "mlr3oml.test_server", or FALSE if not set.

Method `print()`

Prints the object. For a more detailed printer, convert to an mlr3::Task via as_task().

Usage

OMLData$print()

Method `download()`

Downloads the whole object for offline usage.

Usage

OMLData$download()

Method `quality()`

Returns the value of a single OpenML data set quality.

Usage

OMLData$quality(name)

Arguments

name: (character(1))
Name of the quality to extract.
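
A lookup of several qualities can be sketched as follows, assuming the mlr3oml package is installed and OpenML is reachable. The helper oml_qualities() is hypothetical, and the default quality names are assumptions. The names actually available for a data set are listed in the "name" column of $qualities.

```r
# Hypothetical helper (not part of mlr3oml): looks up several qualities by name.
# Needs network access when called.
oml_qualities = function(id, names = c("NumberOfInstances", "NumberOfFeatures")) {
  odata = mlr3oml::odt(id)
  values = lapply(names, function(name) odata$quality(name))
  setNames(values, names)  # named list, one entry per requested quality
}
```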

Method `clone()`

The objects of this class are cloneable with this method.

Usage

OMLData$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

References

Vanschoren J, van Rijn JN, Bischl B, Torgo L (2014). “OpenML.” ACM SIGKDD Explorations Newsletter, 15(2), 49-60. doi:10.1145/2641190.2641198.

Examples


# For technical reasons, examples cannot be included in this R package.
# Instead, these are some relevant resources:
#
# Large-Scale Benchmarking chapter in the mlr3book:
# https://mlr3book.mlr-org.com/chapters/chapter11/large-scale_benchmarking.html
#
# Package Article:
# https://mlr3oml.mlr-org.com/articles/tutorial.html
