This is the class for data sets served on OpenML.
This object can also be constructed using the sugar function odt().
A mlr3::Task can be obtained by calling mlr3::as_task().
The target column must be either the default target (the default behaviour) or one of $feature_names.
If the target is chosen from $feature_names, the default target is added to the features of the task.
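As a rough sketch of the two target choices described above, the helper below builds one task with the default target and one with a feature as target. It assumes network access to OpenML, installed mlr3 and mlr3oml packages, and that as_task() accepts a target_names argument for OMLData objects. The helper name make_tasks is hypothetical and not part of the package.

```r
# Hypothetical helper (not part of mlr3oml): build tasks from an OpenML data set.
# Assumes network access and that `as_task()` accepts `target_names` for OMLData.
make_tasks = function(id, alt_target) {
  odata = mlr3oml::odt(id)
  list(
    # default target, taken from the OpenML data set description
    default = mlr3::as_task(odata),
    # one of odata$feature_names as target; the default target becomes a feature
    alternative = mlr3::as_task(odata, target_names = alt_target)
  )
}
```

A call such as make_tasks(31, "age") would be one way to use it, assuming data set 31 has a feature named "age".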
A mlr3::DataBackend can be obtained by calling mlr3::as_data_backend(). Depending on the selected file type, the returned backend is a mlr3::DataBackendDataTable (arff) or mlr3db::DataBackendDuckDB (parquet).
Note that a converted backend can contain columns beyond the target and the features (id column or ignore columns).
Column names that don't comply with R's naming scheme are renamed (see base::make.names()).
This means that the names can differ from those on OpenML.
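The effect of this renaming can be illustrated with base::make.names() directly. The column names below are made up for illustration, and the sketch only shows what make.names() does on its own. Any further steps the package may apply, such as enforcing uniqueness, are not covered here.

```r
# Illustrative column names; not taken from a real OpenML data set.
openml_names = c("sepal length", "class", "1st_value", "a-b")

# Invalid characters become ".", and names starting with a digit get an "X" prefix
r_names = make.names(openml_names)
print(r_names)
```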
The datasets on OpenML are stored either as (sparse) ARFF or as parquet.
When creating a new OMLData object, the constructor argument parquet switches between arff and parquet. Note that not all data files are necessarily available as parquet.
The option mlr3oml.parquet can be used to set a default.
If parquet is TRUE but no parquet file is available, "arff" is used as a fallback.
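A minimal sketch of setting the default through the option, using only base R. It assumes that objects created afterwards without an explicit parquet argument pick up the option, as the constructor documentation suggests.

```r
# Set the session-wide default and remember the previous setting
old = options(mlr3oml.parquet = TRUE)

# Read the option the same way a default lookup would, with FALSE as fallback
parquet_default_value = getOption("mlr3oml.parquet", default = FALSE)
print(parquet_default_value)

# Restore the previous setting
options(old)
```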
This package comes with its own reader for ARFF files, based on data.table::fread().
For sparse ARFF files, and if the RWeka package is installed, the reader automatically falls back to the implementation in RWeka::read.arff().
mlr3oml::OMLObject -> OMLData
qualities
(data.table())
Data set qualities (performance values), downloaded from the JSON API response and converted to a data.table::data.table() with columns "name" and "value".
tags
(character())
Returns all tags of the object.
parquet
(logical(1))
Whether to use parquet.
data
(data.table())
Returns the data (without the row identifier and ignored columns).
features
(data.table())
Information about data set features (including target), downloaded from the JSON API response and converted to a data.table::data.table() with columns:
"index" (integer()): Column position.
"name" (character()): Name of the feature.
"data_type" (factor()): Type of the feature: "nominal" or "numeric".
"nominal_value" (list()): Levels of the feature, or NULL for numeric features.
"is_target" (logical()): TRUE for target column, FALSE otherwise.
"is_ignore" (logical()): TRUE if this feature should be ignored. Ignored features are removed automatically from the data set.
"is_row_identifier" (logical()): TRUE if the column encodes a row identifier. Row identifiers are removed automatically from the data set.
"number_of_missing_values" (integer()): Number of missing values in the column.
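To make the roles of the flag columns concrete, the sketch below works on a toy stand-in for the features table. It is a plain data.frame with made-up values, not a real API response, and it omits the "nominal_value" list column for brevity. It shows how the flags separate the columns kept in the data from the ones that are dropped.

```r
# Toy stand-in for `$features`; all values are made up for illustration.
features = data.frame(
  index = 0:3,
  name = c("id", "age", "notes", "class"),
  data_type = factor(c("numeric", "numeric", "nominal", "nominal")),
  is_target = c(FALSE, FALSE, FALSE, TRUE),
  is_ignore = c(FALSE, FALSE, TRUE, FALSE),
  is_row_identifier = c(TRUE, FALSE, FALSE, FALSE),
  number_of_missing_values = c(0L, 2L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Columns that remain in the data: neither ignored nor row identifiers
kept = features[!features$is_ignore & !features$is_row_identifier, ]

# Split the remaining columns into features and target
kept_features = kept$name[!kept$is_target]
kept_target = kept$name[kept$is_target]
print(kept_features)
print(kept_target)
```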
target_names
(character())
Name of the default target, as extracted from the OpenML data set description.
feature_names
(character())
Names of the features, as extracted from the OpenML data set description.
nrow
(integer())
Number of observations, as extracted from the OpenML data set qualities.
ncol
(integer())
Number of features (including targets), as extracted from the table of data set features.
This excludes row identifiers and ignored columns.
license
(character())
Returns the license of the data set.
parquet_path
(character())
Downloads the parquet file (or loads it from the cache) and returns the path of the parquet file.
Note that this also normalizes the names of the parquet file.
Inherited methods
new()
Creates a new instance of this R6 class.
OMLData$new(
id,
parquet = parquet_default(),
test_server = test_server_default()
)
id
(integer(1))
OpenML id for the object.
parquet
(logical(1))
Whether to use parquet instead of arff.
If parquet is not available, it will fall back to arff.
Defaults to the value of option "mlr3oml.parquet", or FALSE if not set.
test_server
(logical(1))
Whether to use the OpenML test server or public server.
Defaults to the value of option "mlr3oml.test_server", or FALSE if not set.
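A hedged sketch of how the constructor and a few of the fields documented above might be used together. The helper summarize_oml_data is hypothetical and not part of mlr3oml. Calling it requires the mlr3oml package and network access to OpenML.

```r
# Hypothetical helper (not part of mlr3oml): construct an OMLData object and
# collect a few of its documented fields. Requires network access to OpenML.
summarize_oml_data = function(id, parquet = FALSE, test_server = FALSE) {
  odata = mlr3oml::OMLData$new(id = id, parquet = parquet, test_server = test_server)
  list(
    nrow = odata$nrow,
    ncol = odata$ncol,
    target_names = odata$target_names,
    feature_names = odata$feature_names
  )
}
```

Calling summarize_oml_data(31) would then return these values for the data set with id 31, assuming that id exists on the selected server.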
print()
Prints the object.
For a more detailed printer, convert to a mlr3::Task via as_task().
OMLData$print()
download()
Downloads the whole object for offline usage.
OMLData$download()
quality()
Returns the value of a single OpenML data set quality.
OMLData$quality(name)
name
(character(1))
Name of the quality to extract.
clone()
The objects of this class are cloneable with this method.
OMLData$clone(deep = FALSE)
deep
Whether to make a deep clone.
Vanschoren J, van Rijn JN, Bischl B, Torgo L (2014). “OpenML.” ACM SIGKDD Explorations Newsletter, 15(2), 49-60. doi:10.1145/2641190.2641198.
# For technical reasons, examples cannot be included in this R package.
# Instead, these are some relevant resources:
#
# Large-Scale Benchmarking chapter in the mlr3book:
# https://mlr3book.mlr-org.com/chapters/chapter11/large-scale_benchmarking.html
#
# Package Article:
# https://mlr3oml.mlr-org.com/articles/tutorial.html