mlr3db (version 0.1.3)

DataBackendDplyr: DataBackend for dplyr/dbplyr

Description

An mlr3::DataBackend built on dplyr::tbl() objects from the packages dplyr/dbplyr, including tibbles. It lets an mlr3::Task work with data held in an out-of-memory database.

Format

R6::R6Class object inheriting from mlr3::DataBackend.

Construction

DataBackendDplyr$new(data, primary_key = NULL, strings_as_factors = TRUE)
  • data :: dplyr::tbl() The data object.

  • primary_key :: character(1) Name of the primary key column.

  • strings_as_factors :: logical(1) || character() Either a single logical flag or a character vector of column names to convert to factors. If FALSE, no column is converted; if TRUE, all string columns except the primary key are converted. The backend is queried for the distinct values of the selected columns, and the resulting levels are stored in $levels.

Alternatively, use mlr3::as_data_backend() on a dplyr::tbl() which will construct a DataBackend for you.
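
The two construction routes can be compared in a minimal sketch. It assumes the packages mlr3, mlr3db, dplyr, dbplyr, DBI and RSQLite are installed; the table name "iris_tbl" and the object names b1, b2 and n_rows are illustrative only and do not come from the package.

```r
library(mlr3)
library(mlr3db)

# Illustrative data: iris with a character target and an integer key column
data = iris
data$Species = as.character(data$Species)
data$row_id = seq_len(nrow(data))

# Copy the data into a temporary in-memory SQLite database
con = DBI::dbConnect(RSQLite::SQLite(), ":memory:")
dplyr::copy_to(con, data, name = "iris_tbl")
tbl = dplyr::tbl(con, "iris_tbl")

# Route 1: call the constructor directly and switch off factor conversion
b1 = DataBackendDplyr$new(tbl, primary_key = "row_id", strings_as_factors = FALSE)
n_rows = b1$nrow
class(b1$head()$Species)

# Route 2: let as_data_backend() construct the backend
b2 = as_data_backend(tbl, primary_key = "row_id")
class(b2)

# Cleanup
DBI::dbDisconnect(con)
```

Passing primary_key explicitly in both calls keeps the sketch independent of whatever default the installed version uses.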

Fields

All fields from mlr3::DataBackend, and additionally:

  • levels :: named list() List of factor levels, named with column names. These columns are automatically converted to factors in $data() and $head().
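
As a rough illustration of the $levels field, the sketch below builds a backend on an in-memory tibble and requests factor conversion for a single column. It assumes mlr3db and tibble are installed; the object names are illustrative.

```r
library(mlr3db)

# Illustrative data: store Species as plain strings and add an integer key
data = tibble::as_tibble(iris)
data$Species = as.character(data$Species)
data$row_id = seq_len(nrow(data))

# Request factor conversion for the "Species" column only
b = DataBackendDplyr$new(data, primary_key = "row_id", strings_as_factors = "Species")

# Named list with the distinct values found for each converted column
b$levels

# Data returned by the backend carries Species as a factor
species = b$data(rows = 1:3, cols = "Species")$Species
is.factor(species)
```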

Methods

All methods from mlr3::DataBackend.

Examples

# NOT RUN {
# Backend using an in-memory tibble
data = tibble::as_tibble(iris)
data$Sepal.Length[1:30] = NA
data$row_id = 1:150
b = DataBackendDplyr$new(data, primary_key = "row_id")

# Object supports all accessors of DataBackend
print(b)
b$nrow
b$ncol
b$colnames
b$data(rows = 100:101, cols = "Species")
b$distinct(b$rownames, "Species")

# Classification task using this backend
task = mlr3::TaskClassif$new(id = "iris_tibble", backend = b, target = "Species")
print(task)
task$head()

# Create a temporary SQLite database
con = DBI::dbConnect(RSQLite::SQLite(), ":memory:")
dplyr::copy_to(con, data)
tbl = dplyr::tbl(con, "data")

# Define a backend on a subset of the database
tbl = dplyr::select_at(tbl, setdiff(colnames(tbl), "Sepal.Width")) # do not use column "Sepal.Width"
tbl = dplyr::filter(tbl, row_id %in% 1:120) # Use only first 120 rows
b = DataBackendDplyr$new(tbl, primary_key = "row_id")
print(b)

# Query distinct values
b$distinct(b$rownames, "Species")

# Query number of missing values
b$missings(b$rownames, b$colnames)

# Note that SQLite does not support factors, column Species has been converted to character
lapply(b$head(), class)

# Cleanup
rm(tbl)
DBI::dbDisconnect(con)
# }
