nsga3fs: NSGA III for Multi-Objective Feature Selection

Description

An adaptation of the Non-dominated Sorting Genetic Algorithm III (NSGA-III) for multi-objective feature selection tasks. NSGA-III is a genetic algorithm that optimizes several objectives simultaneously by applying a non-dominated sorting technique. It uses a reference-point-based selection operator to explore the solution space and preserve diversity. See the paper by K. Deb and H. Jain (2014) <DOI:10.1109/TEVC.2013.2281534> for a detailed description of the algorithm.
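
The idea of non-dominated sorting mentioned above can be illustrated with a minimal base R sketch. It only extracts the first Pareto front from a small, made-up fitness table (the values and the helper name is_dominated are assumptions for illustration). It is not the package's implementation and it omits the reference-point selection step of NSGA-III.

```r
# Hypothetical fitness table: one row per candidate feature subset.
fitness <- data.frame(
  AUC = c(0.80, 0.75, 0.80, 0.70),  # objective to maximize
  nf  = c(5,    3,    7,    3)      # number of features, objective to minimize
)

# TRUE if row i is Pareto-dominated: some other row is at least as good
# on both objectives and strictly better on at least one of them.
is_dominated <- function(i, fit) {
  any(
    fit$AUC >= fit$AUC[i] & fit$nf <= fit$nf[i] &
      (fit$AUC > fit$AUC[i] | fit$nf < fit$nf[i])
  )
}

dominated <- vapply(seq_len(nrow(fitness)), is_dominated, logical(1), fit = fitness)
pareto_front <- fitness[!dominated, ]
print(pareto_front)
```

In this toy table, rows 3 and 4 are dominated (by rows 1 and 2 respectively), so only the first two rows remain on the front.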

Usage

nsga3fs(df, target, obj_list, obj_names, pareto, pop_size, max_gen, model,
  resampling = FALSE, num_features = TRUE, mutation_rate = 0.1,
  threshold = 0.5, feature_cost = FALSE,
  r_measures = list(mlr::mmce), cpus = 1)

Arguments

df

An original dataset.

target

Name of the column (a string) that contains the classification target variable.

obj_list

A list of objective functions to be optimized. Must be a list of objects of type closure.

obj_names

A vector of the names of the objective functions. Must match the arguments passed to pareto.

pareto

A Pareto criterion for non-dominated sorting. Should be passed in the form low(objective_1) * high(objective_2). See the description of low for more details.

pop_size

Size of the population.

max_gen

Number of generations.

model

A makeLearner object: the model to be used for the classification task.

resampling

A makeResampleDesc object.

num_features

TRUE if the algorithm should minimize the number of features as one of the objectives. You must pass a respective object to pareto as well as to obj_names.

mutation_rate

Probability of switching the value of a certain gene to its opposite. Default value 0.1.

threshold

Threshold applied during majority vote when calculating final output. Default value 0.5.

feature_cost

A vector of feature costs. Its length must equal ncol(df) - 1. You must pass a respective object to pareto as well as to obj_names.

r_measures

A list of performance metrics for the makeResampleDesc task. Default is list(mlr::mmce).

cpus

Number of sockets to be used for parallelisation. Default value is 1.
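
The mutation_rate argument above describes flipping a gene to its opposite value with a given probability. A minimal sketch of such a bit-flip mutation, assuming an individual is encoded as a 0/1 vector with one gene per feature, might look as follows. The helper flip_genes is hypothetical and is not a function exported by this package.

```r
# Hypothetical helper, not part of nsga3fs:
# flip each gene (0 <-> 1) independently with probability `rate`.
flip_genes <- function(genes, rate = 0.1) {
  flip <- stats::runif(length(genes)) < rate
  genes[flip] <- 1L - genes[flip]
  genes
}

set.seed(42)
# Assumed encoding: 1 = feature selected, 0 = feature dropped.
individual <- c(1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L)
mutated <- flip_genes(individual, rate = 0.1)
print(rbind(individual, mutated))
```

With rate = 0 the individual is returned unchanged, and with rate = 1 every gene is flipped.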

Value

A list with the final Pareto Front:

Raw

A list containing two items:

A list with final Pareto Front individuals
A data.frame containing respective fitness values

Per individual

Same content, structured per individual

Majority vote

Pareto Front majority vote for dataset features

Stat

Runtime, dataset details, model
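
The majority vote in the returned value, together with the threshold argument, can be pictured with the following sketch. It assumes that each Pareto front individual is a 0/1 vector over the features and that a feature is kept when its share of votes is at least the threshold. Whether the package compares with >= or > is not stated here, so treat the comparison as an assumption. The matrix and feature names are made up.

```r
# Hypothetical Pareto front: rows are individuals, columns are features
# (1 = feature selected by that individual).
front <- rbind(
  c(1, 0, 1, 0),
  c(1, 1, 0, 0),
  c(1, 0, 1, 0)
)
colnames(front) <- c("f1", "f2", "f3", "f4")

threshold <- 0.5

# Share of front individuals that selected each feature.
vote_share <- colMeans(front)

# Assumption: a feature passes the vote when its share is >= threshold.
selected <- names(vote_share)[vote_share >= threshold]
print(selected)
```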

References

K. Deb, H. Jain (2014) <DOI:10.1109/TEVC.2013.2281534>

Examples

# NOT RUN {
xgb_learner <- mlr::makeLearner("classif.xgboost", predict.type = "prob",
                            par.vals = list(
                            objective = "binary:logistic",
                            eval_metric = "error",nrounds = 2))

rsmp <- mlr::makeResampleDesc("CV", iters = 2)
measures <- list(mlr::mmce)

# Objective function: AUC computed from an mlr prediction object
f_auc <- function(pred) {
  as.numeric(mlr::performance(pred, mlr::auc))
}
objective <- list(f_auc)
o_names <- c("AUC", "nf")
par <- rPref::high(AUC)*rPref::low(nf)

nsga3fs(df = german_credit, target = "BAD", obj_list = objective,
        obj_names = o_names, pareto = par, pop_size = 1, max_gen = 1,
        model = xgb_learner, resampling = rsmp,
        num_features = TRUE, r_measures = measures, cpus = 2)





# }
