This is the abstract base class for task objects like TaskClassif and TaskRegr.
Tasks serve two purposes:
Tasks wrap a DataBackend, an object that transparently interfaces with different data storage types.
Tasks store meta-information, such as the role of the individual columns in the DataBackend. For example, for a classification task a single column must be marked as target column, and others as features.
Predefined (toy) tasks are stored in the mlr3misc::Dictionary mlr_tasks, e.g. iris or boston_housing.
R6::R6Class object.
Note: This object is typically constructed via a derived class, e.g. TaskClassif or TaskRegr.
t = Task$new(id, task_type, backend)
id
:: character(1)
Identifier for the task.
task_type
:: character(1)
Set in the classes which inherit from this class.
Must be an element of mlr_reflections$task_types$type.
backend
:: DataBackend
Either a DataBackend, or any object that is convertible to a DataBackend with as_data_backend().
E.g., a data.frame() will be converted to a DataBackendDataTable.
backend
:: DataBackend.
col_info
:: data.table::data.table()
Table with 3 columns:
"id" stores the name of the column.
"type" holds the storage type of the variable, e.g. integer, numeric or character.
"levels" stores a vector of distinct values (levels) for factor and character variables.
col_roles
:: named list()
Each column (feature) can have an arbitrary number of the following roles:
"feature": Regular feature used in the model fitting process.
"target": Target variable.
"name": Row names / observation labels. To be used in plots.
"order": Data returned by $data() is ordered by this column (or these columns).
"group": During resampling, observations with the same value of the variable with role "group" are marked as "belonging together". They will be exclusively assigned to be either in the training set or in the test set for each resampling iteration. Only up to one column may have this role.
"stratum": Stratification variables. Multiple discrete columns may have this role.
"weight": Observation weights. Only up to one column (assumed to be discrete) may have this role.
col_roles keeps track of the roles with a named list. The elements are named by column role, and each element is a character vector of column names.
To alter the roles, just modify the list, e.g. with R's set functions (intersect(), setdiff(), union(), …).
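The following is a minimal sketch of altering column roles with set functions. It assumes the mlr3 package is installed and reuses the constructor call from the examples on this page; the helper variables without and restored are introduced only for illustration.

```r
library(mlr3)

# Construct a classification task from the built-in iris data.
task = TaskClassif$new("iris", iris, target = "Species")

# col_roles is a named list of character vectors, one element per role.
str(task$col_roles)

# Remove "Petal.Width" from the "feature" role with setdiff().
task$col_roles$feature = setdiff(task$col_roles$feature, "Petal.Width")
without = task$feature_names

# Add it back with union().
task$col_roles$feature = union(task$col_roles$feature, "Petal.Width")
restored = task$feature_names
```

Because only the role assignment changes, the underlying data is not touched; the column is simply hidden from, or exposed to, the model fitting process.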
row_roles
:: named list()
Each row (observation) can have an arbitrary number of roles in the learning task:
"use": Use in train / predict / resampling.
"validation": Hold the observations back unless explicitly requested.
Validation sets are not yet completely integrated into the package.
row_roles keeps track of the roles with a named list. The elements are named by row role, and each element is an integer() or character() vector of row ids.
To alter the roles, just modify the list, e.g. with R's set functions (intersect(), setdiff(), union(), …).
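Row roles can be altered in the same way. The sketch below assumes the mlr3 package is installed and that the iris data is wrapped with integer row ids 1 to 150 (the default when a data.frame() is converted); n_before and n_after are illustrative names.

```r
library(mlr3)

task = TaskClassif$new("iris", iris, target = "Species")
n_before = task$nrow

# Drop the first 10 row ids from the "use" role.
# The rows remain in the DataBackend; they are just no longer active.
task$row_roles$use = setdiff(task$row_roles$use, 1:10)
n_after = task$nrow

c(before = n_before, after = n_after)
```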
feature_names
:: character()
Returns all column names with role "feature".
feature_types
:: data.table::data.table()
Returns a table with columns id and type, where id holds the column names of "active" features of the task and type is the storage type.
hash
:: character(1)
Hash (unique identifier) for this object.
id
:: character(1)
Identifier of the Task.
ncol
:: integer(1)
Returns the total number of columns with role "target" or "feature".
nrow
:: integer(1)
Returns the total number of rows with role "use".
row_ids
:: (integer() | character())
Returns the row ids of the DataBackend for observations with role "use".
target_names
:: character()
Returns all column names with role "target".
task_type
:: character(1)
Stores the type of the Task.
properties
:: character()
Set of task properties. Possible properties are stored in mlr_reflections$task_properties.
The following properties are currently standardized and understood by tasks in mlr3:
"strata": The task is resampled using one or more stratification variables (role "stratum").
"groups": The task comes with grouping/blocking information (role "group").
"weights": The task comes with observation weights (role "weight").
Note that the properties listed above are calculated from $col_roles and must not be set explicitly.
strata
:: data.table::data.table()
If the task has columns designated with role "stratum", returns a table with one subpopulation per row and two columns: N (integer()) with the number of observations in the subpopulation and row_id (list of integer() | list of character()) as list column with the row ids in the respective subpopulation.
Returns NULL if there is no stratification variable.
See Resampling for more information on stratification.
groups
:: data.table::data.table()
If the task has a column with designated role "group", returns a table with two columns: row_id (integer() | character()) and the grouping variable group (vector()).
Returns NULL if there is no grouping column.
See Resampling for more information on grouping.
weights
:: data.table::data.table()
If the task has a column with designated role "weight", returns a table with two columns: row_id (integer() | character()) and the observation weights weight (numeric()).
Returns NULL if there is no weight column.
man
:: character(1)
String in the format [pkg]::[topic] pointing to a manual page for this object.
data(rows = NULL, cols = NULL, data_format = NULL)
(integer() | character(), character(1), character(1)) -> any
Returns a slice of the data from the DataBackend in the data format specified by data_format (depending on the DataBackend, but usually a data.table::data.table()).
Rows are additionally subsetted to only contain observations with role "use", and columns are filtered to only contain columns with roles "target" and "feature".
If invalid rows or cols are specified, an exception is raised.
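A short sketch of requesting a slice with $data(), assuming the mlr3 package is installed and the default backend for iris (integer row ids, data.table output); the names slice and full are illustrative.

```r
library(mlr3)

task = TaskClassif$new("iris", iris, target = "Species")

# Rows are referenced by row id, columns by name.
slice = task$data(rows = 1:3, cols = c("Species", "Sepal.Length"))
slice

# Without arguments, all rows with role "use" and all
# columns with role "target" or "feature" are returned.
full = task$data()
dim(full)
```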
formula(rhs = ".")
character() -> stats::formula()
Constructs a stats::formula(), e.g. [target] ~ [feature_1] + [feature_2] + ... + [feature_k], using the features provided in argument rhs (defaults to all columns with role "feature", symbolized by ".").
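As a small illustration, assuming the mlr3 package is installed, the formula can be built for all features or for a chosen subset; f_all and f_sub are illustrative names, and the exact printed form of the formula may differ between package versions.

```r
library(mlr3)

task = TaskClassif$new("iris", iris, target = "Species")

# Default: all columns with role "feature" on the right-hand side.
f_all = task$formula()

# Restrict the right-hand side to two features.
f_sub = task$formula(rhs = c("Sepal.Length", "Sepal.Width"))

# List the variables referenced by the restricted formula.
all.vars(f_sub)
```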
levels(cols = NULL)
character() -> named list()
Returns the distinct values for columns referenced in cols with storage type "character", "factor" or "ordered".
Argument cols defaults to all such columns with role "target" or "feature".
Note that this function ignores the row roles; it returns all levels available in the DataBackend.
To update the stored level information, e.g. after filtering a task, call $droplevels().
droplevels(cols = NULL)
character() -> self
Updates the cache of stored factor levels, removing all levels not present in the current set of active rows.
cols defaults to all columns with storage type "character", "factor", or "ordered".
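The interplay of $levels() and $droplevels() can be sketched as follows, assuming the mlr3 package is installed. The first 100 rows of iris contain only two of the three species, so filtering to them leaves one stored level unused until $droplevels() is called; before and after are illustrative names.

```r
library(mlr3)

task = TaskClassif$new("iris", iris, target = "Species")

# Keep only the first 100 rows (two of the three species).
task$filter(1:100)

# The stored level information still reflects the full backend.
before = task$levels("Species")$Species

# Refresh the stored levels for the active rows.
task$droplevels()
after = task$levels("Species")$Species

list(before = before, after = after)
```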
missings(cols = NULL)
character() -> named integer()
Returns the number of missing values for columns referenced in cols.
Considers only active rows with row role "use".
Argument cols defaults to all columns with role "target" or "feature".
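A minimal sketch of counting missing values, assuming the mlr3 package is installed. The data set dat and the task id "iris_na" are made up for illustration: three values are set to NA by hand, and the count drops to zero once the affected rows are filtered out.

```r
library(mlr3)

# Copy iris and introduce three missing values (illustrative data).
dat = iris
dat$Sepal.Length[1:3] = NA

task = TaskClassif$new("iris_na", dat, target = "Species")

# Count missing values per column over all active rows.
counts = task$missings()

# Only rows with role "use" are considered, so removing
# the affected rows changes the result.
task$filter(4:150)
counts_filtered = task$missings()
```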
head(n = 6)
integer() -> data.table::data.table()
Gets the first n observations with role "use".
set_row_role(rows, new_roles, exclusive = TRUE)
(character(), character(), logical(1)) -> self
Adds the roles new_roles to rows referred to by rows.
If exclusive is TRUE, the referenced rows will be removed from all other roles.
This function is deprecated and will be removed in the next version in favor of directly modifying $row_roles.
set_col_role(cols, new_roles, exclusive = TRUE)
(character(), character(), logical(1)) -> self
Adds the roles new_roles to columns referred to by cols.
If exclusive is TRUE, the referenced columns will be removed from all other roles.
This function is deprecated and will be removed in the next version in favor of directly modifying $col_roles.
filter(rows)
(integer() | character()) -> self
Subsets the task, reducing it to only keep the rows specified in rows.
This operation mutates the task in-place. See the section on task mutators for more information.
select(cols)
character() -> self
Subsets the task, reducing it to only keep the features specified in cols.
Note that the target column cannot be deselected.
This operation mutates the task in-place. See the section on task mutators for more information.
cbind(data)
data.frame() -> self
Adds additional columns to the DataBackend.
The row ids must be provided as column in data (with column name matching the primary key name of the DataBackend).
If this column is missing, it is assumed that the rows are exactly in the order of t$row_ids.
In case of name clashes of column names in data and DataBackend, columns in data have higher precedence and virtually overwrite the columns in the DataBackend.
This operation mutates the task in-place. See the section on task mutators for more information.
rbind(data)
data.frame() -> self
Adds additional rows to the DataBackend.
The new row ids must be provided as column in data.
If this column is missing, new row ids are constructed automatically.
If row ids in data clash with row ids in the DataBackend, rows in data have higher precedence and virtually overwrite the rows in the DataBackend.
This operation mutates the task in-place. See the section on task mutators for more information.
rename(from, to)
(character(), character()) -> self
Renames columns by mapping the existing column names given in the first argument to the new column names given in the second argument.
This operation mutates the task in-place. See the section on task mutators for more information.
help()
() -> NULL
Opens the corresponding help page referenced by $man.
as.data.table(t)
Task -> data.table::data.table()
Returns the complete data as data.table::data.table().
The following methods change the task in-place:
Any modification to $col_roles and $row_roles.
This provides a different "view" on the data without altering the data itself.
filter() and select() subset the set of active rows or features in $row_roles or $col_roles, respectively.
This provides a different "view" on the data without altering the data itself.
rbind() and cbind() change the task in-place by binding rows or columns to the data, but without modifying the original DataBackend.
Instead, the methods first create a new DataBackendDataTable from the provided new data, and then merge both backends into an abstract DataBackend which combines the results on-demand.
rename() wraps the DataBackend of the Task in an additional DataBackend which deals with the renaming. It also updates $col_roles and $col_info.
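Since mutators change the task in-place, one way to keep the original task intact is to work on a copy. The sketch below assumes the mlr3 package is installed and relies on the $clone() method that R6 generates for classes by default; it is not described on this page, so treat its use here as an assumption. The name view is illustrative.

```r
library(mlr3)

task = TaskClassif$new("iris", iris, target = "Species")

# Work on a copy so that the original task is not mutated.
view = task$clone()
view$filter(1:10)
view$select("Sepal.Length")

# The original task still has all rows and features.
c(original = task$nrow, copy = view$nrow)
task$feature_names
view$feature_names
```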
Other Task: TaskClassif, TaskRegr, TaskSupervised, mlr_tasks
library(mlr3)
# We use the derived class TaskClassif here;
# class Task is not intended for direct use.
task = TaskClassif$new("iris", iris, target = "Species")
task$nrow
task$ncol
task$feature_names
task$formula()
# De-select "Petal.Width"
task$select(setdiff(task$feature_names, "Petal.Width"))
task$feature_names
# Add new column "foo"
task$cbind(data.frame(foo = 1:150))
task$head()