score_info_gain: Scoring via entropy-based filters

Description

Three entropy-based (information theory) scores can be computed: information gain, gain ratio, and symmetrical uncertainty.

Usage

score_info_gain
score_gain_ratio
score_sym_uncert

Arguments

Value

An S7 object. The primary property of interest is results, a data frame that is populated by the fit() method and has columns:

name: The name of the score (e.g., info_gain).
score: The estimates for each predictor.
outcome: The name of the outcome column.
predictor: The names of the predictor inputs.

These data are accessed using object@results (see examples below).
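Because larger scores indicate more important predictors, a common next step is to sort the results by score. The sketch below uses a hand-made stand-in for object@results with the four columns listed above; the score values are invented for illustration and do not come from a real fit.

```r
# Stand-in for object@results; the score values are made up for illustration.
results <- data.frame(
  name = "info_gain",
  score = c(0.02, 0.15, NA, 0.08),
  outcome = "class",
  predictor = c("angle_ch_1", "area_ch_1", "avg_inten_ch_1", "avg_inten_ch_2")
)

# Sort from largest to smallest score, keeping failed (NA) scores at the end.
ranked <- results[order(results$score, decreasing = TRUE, na.last = TRUE), ]
ranked$predictor
```

With a real fitted object, the same ordering would be applied to object@results directly.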

Format

An object of class filtro::class_score_info_gain (inherits from filtro::class_score, S7_object) of length 1.

Details

These objects are used when either:

The predictors are numeric and the outcome is a factor/category, or
The predictors are factors and the outcome is numeric.

In either case, an entropy-based filter (via FSelectorRcpp::information_gain()) is applied with the proper variable roles. Depending on the chosen method, information gain, gain ratio, or symmetrical uncertainty is computed. Larger values are associated with more important predictors.
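To make the three quantities concrete, here is a minimal base R sketch for two categorical vectors, using the usual textbook definitions: information gain is H(x) + H(y) - H(x, y), gain ratio divides that by H(x), and symmetrical uncertainty is 2 * gain / (H(x) + H(y)). The helpers entropy() and entropy_scores() are hypothetical names written for this illustration, not part of filtro or FSelectorRcpp, and the sketch skips the discretization that numeric columns would need.

```r
# Shannon entropy (natural log) of a table of counts; empty cells are dropped.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Hypothetical helper: the three entropy-based scores for a categorical
# predictor x and a categorical outcome y.
entropy_scores <- function(x, y) {
  h_x <- entropy(table(x))      # entropy of the predictor
  h_y <- entropy(table(y))      # entropy of the outcome
  h_xy <- entropy(table(x, y))  # joint entropy
  info_gain <- h_x + h_y - h_xy
  c(
    info_gain = info_gain,
    gain_ratio = info_gain / h_x,
    sym_uncert = 2 * info_gain / (h_x + h_y)
  )
}

# Toy data: x determines y exactly in the first case and is
# unrelated to y in the second.
perfect <- entropy_scores(
  x = c("a", "a", "b", "b", "a", "b"),
  y = c("yes", "yes", "no", "no", "yes", "no")
)
unrelated <- entropy_scores(
  x = c("a", "a", "b", "b"),
  y = c("yes", "no", "yes", "no")
)
```

In the first case the gain ratio and symmetrical uncertainty both reach their maximum of 1; in the second all three scores are 0.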

Estimating the scores

In filtro, the score_* objects define a scoring method (e.g., data input requirements and package dependencies). To compute the scores for a specific data set, the fit() method is used. The main arguments for fit() are:

object: A score class object (e.g., score_info_gain).
formula: A standard R formula with a single outcome on the left-hand side and one or more predictors (or .) on the right-hand side. The data are processed via stats::model.frame().
data: A data frame containing the relevant columns defined by the formula.
...: Further arguments passed to or from other methods.
case_weights: A quantitative vector of case weights that is the same length as the number of rows in data. The default of NULL indicates that there are no case weights.

Missing values are removed for each predictor/outcome combination being scored.

In cases where the underlying computations fail, the scoring proceeds silently, and a missing value is given for the score.
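The two behaviors above (pairwise removal of missing values and a missing score on failure) can be pictured with a small base R sketch. This is only an illustration of the idea, not filtro's implementation; safe_score() and toy_score() are hypothetical helpers invented for the example.

```r
# Hypothetical helper: score one predictor/outcome pair, returning NA on failure.
safe_score <- function(x, y, score_fn) {
  # Drop rows that are missing in this particular pair only.
  keep <- stats::complete.cases(x, y)
  tryCatch(
    score_fn(x[keep], y[keep]),
    error = function(e) NA_real_  # swallow the error and report a missing score
  )
}

# Toy scorer (an assumption for illustration): absolute difference in group
# means. It errors when the outcome has fewer than two distinct values.
toy_score <- function(x, y) {
  if (length(unique(y)) < 2) stop("need at least two outcome levels")
  abs(diff(tapply(x, y, mean)))[[1]]
}

x <- c(1, 2, NA, 4, 5, 6)
y <- factor(c("a", "a", "a", "b", "b", NA))

ok <- safe_score(x, y, toy_score)             # rows 3 and 6 are dropped
bad <- safe_score(x, rep("a", 6), toy_score)  # scorer fails, so the result is NA
```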

Examples

library(filtro) # provides the score_* objects and the fit() method
library(dplyr)

# Entropy-based filter for classification tasks

cells_subset <- modeldata::cells |>
  dplyr::select(
    class,
    angle_ch_1,
    area_ch_1,
    avg_inten_ch_1,
    avg_inten_ch_2,
    avg_inten_ch_3
  )

# Information gain
cells_info_gain_res <- score_info_gain |>
  fit(class ~ ., data = cells_subset)
cells_info_gain_res@results

# Gain ratio
cells_gain_ratio_res <- score_gain_ratio |>
  fit(class ~ ., data = cells_subset)
cells_gain_ratio_res@results

# Symmetrical uncertainty
cells_sym_uncert_res <- score_sym_uncert |>
  fit(class ~ ., data = cells_subset)
cells_sym_uncert_res@results

# ----------------------------------------------------------------------------

# Entropy-based filter for regression tasks

ames_subset <- modeldata::ames |>
  dplyr::select(
    Sale_Price,
    MS_SubClass,
    MS_Zoning,
    Lot_Frontage,
    Lot_Area,
    Street
  )
ames_subset <- ames_subset |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

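# Copy the score object and switch its mode to regression, since the
# outcome here is numeric and the predictors include factors
regression_task <- score_info_gain
regression_task@mode <- "regression"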

ames_info_gain_regression_task_res <-
  regression_task |>
  fit(Sale_Price ~ ., data = ames_subset)
ames_info_gain_regression_task_res@results
