boost: Boost an Estimation Procedure with a Reweighter and an Aggregator.

Description

Boost an estimation procedure and analyze individual estimator performance using a reweighter, aggregator, and some performance analyzer.

Usage

boost(x, B, reweighter, aggregator, data, .procArgs = NULL, metadata = NULL, initialWeights = rep.int(1, nrow(data))/nrow(data), analyzePerformance = defaultOOBPerformanceAnalysis, .boostBackendArgs = NULL)

Arguments

B

number of iterations of boost to perform.

x

a list with entries 'train' and 'predict', or a function that satisfies the definition of an estimation procedure given below. List input invokes a call to buildEstimationProcedure. Function input invokes a call to wrapProcedure, unless the function inherits from 'estimationProcedure'. In either case, metadata may be required to properly wrap x. See the appropriate help documentation.

reweighter

A reweighter, as defined below. If the function does not inherit from 'reweighter', a call to wrapReweighter will be made. See wrapReweighter to determine what metadata, if any, you may need to pass for the wrapper to be boostr compatible.

aggregator

An aggregator, as defined below. If the function does not inherit from 'aggregator', a call to wrapAggregator will be made to build a boostr compatible wrapper. See wrapAggregator to determine whether any metadata needs to be passed in for this to succeed.

data

a data.frame or matrix to act as the learning set. The columns are assumed to be ordered such that the response variable is in the first column and the remaining columns are the predictors. As a convenience, boostBackend comes with a switch, .formatData (defaulted to TRUE), which will look for an argument named formula inside .procArgs and use the value of formula to format data. If you don't want this to happen, or if the data is already properly formatted, include .formatData=FALSE in metadata.

.procArgs

a named list of arguments to pass to the estimation procedure. If x is a list, .procArgs is a named list of lists with entries .trainArgs and .predictArgs, each of which is a named list of arguments to pass to x$train and x$predict, respectively. If x is a function, .procArgs is a named list of arguments to pass to x, in addition to data and weights. See 'Examples' below.

initialWeights

a vector of weights used for the first iteration of the ensemble building phase of Boost.

analyzePerformance

a function which accepts an estimator's predictions and the true responses to said predictions (among other arguments) and returns a list of values. If no function is provided, defaultOOBPerformanceAnalysis is used. See wrapPerformanceAnalyzer for metadata that may need to be passed to make analyzePerformance compatible with the boostr framework.

metadata

a named list of arguments to be passed to wrapProcedure, buildEstimationProcedure, wrapReweighter, wrapAggregator, and/or wrapPerformanceAnalyzer.

.boostBackendArgs

a named list of additional arguments to pass to boostBackend.
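
The argument conventions above can be sketched in base R. This is only an illustration under stated assumptions, not boostr's internal code: it uses the built-in iris data, uses model.frame merely to show what a response-first layout looks like, and the names formatted, w, listProcArgs and funProcArgs are placeholders chosen for this sketch.

```r
# Response-first layout: model.frame() puts the response in column 1,
# followed by the predictors.
formatted <- model.frame(Species ~ ., data = iris)
responseColumn <- names(formatted)[1]

# Default initial weights from the usage line: uniform, summing to 1.
w <- rep.int(1, nrow(iris)) / nrow(iris)

# Shape of .procArgs when x is a list with 'train' and 'predict' entries.
# The empty .predictArgs is illustrative only.
listProcArgs <- list(.trainArgs   = list(formula = Species ~ ., cost = 100),
                     .predictArgs = list())

# Shape of .procArgs when x is a single function: a flat named list.
funProcArgs <- list(formula = Species ~ ., k = 5)
```

With this layout, data passed to boost would already have the response in its first column, so .formatData=FALSE could be placed in metadata as described under data.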

Value

newdata: a data.frame or matrix whose columns should be in the same order as the columns of the data on which each constituent estimator was trained.

Details

This function is designed to be an interface between the user and boostBackend when x, reweighter, aggregator and/or analyzePerformance are valid input to the Boost algorithm but do not have boostr compatible signatures. Hence, boost calls the appropriate wrapper function (with the relevant information from metadata) to convert user-supplied functions into boostr compatible functions.
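
The idea of wrapping a function so that it answers to a different set of argument names can be sketched as follows. This is a hypothetical illustration of the general technique, not how boostr's wrappers are implemented: adaptSignature, userReweighter, and the caller-side names prediction, response and weights are all invented for this sketch.

```r
# Hypothetical helper (not part of boostr): build a wrapper around f that
# accepts the caller's argument names and renames them to the names f expects.
adaptSignature <- function(f, nameMap) {
  # nameMap: named character vector; names are the caller's argument names,
  # values are the corresponding argument names of f.
  function(...) {
    args <- list(...)
    known <- names(args) %in% names(nameMap)
    names(args)[known] <- unname(nameMap[names(args)[known]])
    do.call(f, args)
  }
}

# A user-written function with its own argument names:
# it doubles the weight of every misclassified observation.
userReweighter <- function(preds, truth, w) {
  wrong <- preds != truth
  w * ifelse(wrong, 2, 1)
}

adapted <- adaptSignature(userReweighter,
                          c(prediction = "preds",
                            response   = "truth",
                            weights    = "w"))

newW <- adapted(prediction = c("a", "b"),
                response   = c("a", "a"),
                weights    = c(0.5, 0.5))
```

The name map plays a role loosely analogous to the metadata argument: it tells the wrapper which of the user's argument names correspond to the names the caller will use.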

Examples

### Demonstrate simple call with just list(train=svm)

library(foreach)
library(iterators)
library(e1071)

svmArgs <- list(formula=Species~., cost=100)
boost(x=list(train=svm),
      reweighter=arcfsReweighter,
      aggregator=arcfsAggregator,
      data=iris,
      .procArgs=list(.trainArgs=svmArgs),
      B=2)

### Demonstrate call with train and predict and custom 
### reweighters and aggregators

df <- within(iris, {
  Setosa <- as.factor(2*as.numeric(Species == "setosa")-1)
  Species <- NULL
})

# custom predict function
newPred <- function(obj, new) {
  predict(obj, new)
}

predMetadata <- c(modelName="obj", predictionSet="new")

# custom reweighter
testReweighterMetadata <- list(
                            reweighterInputWts="w",
                            reweighterInputResponse="truth",
                            reweighterInputPreds="preds",
                            reweighterOutputWts="w")

testReweighter <- function(preds, truth, w) {
  
  wrongPreds <- (preds != truth)
  err <- mean(wrongPreds)
  if (err != 0) {
    new_w <- w / err^(!wrongPreds)
  } else {
    new_w <- runif(n=length(w), min=0, max=1)
  }
  
  
  list(w=new_w, alpha=rnorm(1))
}

# custom aggregator
testAggregatorMetadata <- c(.inputEnsemble="ensemble")

testAggregator <- function(ensemble) {
  weights <- runif(min=0, max=1, n=length(ensemble))
  function(x) {
    preds <- foreach(estimator = iter(ensemble),
                     .combine = rbind) %do% {
                       matrix(as.character(estimator(x)), nrow=1)
                     }
    
    as.factor(predictClassFromWeightedVote(preds, weights))
  }
}

# collect all the relevant metadata
metadata <- c(predMetadata, testReweighterMetadata, testAggregatorMetadata)

# set additional procedure arguments
procArgs <- list(
              .trainArgs=list(
                formula=Setosa ~ .,
                cost=100)
              )

# test boost when irrelevant metadata is passed in
boostedSVM <- boost(list(train=svm, predict=newPred),
                    B=3,
                    reweighter=testReweighter,
                    aggregator=testAggregator,
                    data=df,
                    metadata=metadata,
                    .procArgs=procArgs,
                    .boostBackendArgs=list(
                      .reweighterArgs=list(fakeStuff=77))
                    )

### Demonstrate customizing 'metadata' for estimation procedure
library(class)

testkNNProcMetadata <- list(learningSet="traindata", predictionSet="testdata")

testkNNProc <- function(formula, traindata, k) {  
  df <- model.frame(formula=formula, data=traindata)
  function(testdata, prob=FALSE) {
    df2 <- tryCatch(model.frame(formula=formula, data=testdata)[, -1],
                    error = function(e) testdata 
    )
    knn(train=df[, -1], test=df2, cl=df[, 1], prob=prob, k=k) 
  }
}

testKNNProcArgs <- list(formula=Setosa ~ ., k = 5)

metadata <- testkNNProcMetadata
boostBackendArgs <- list(.reweighterArgs=list(m=0))

boostedKNN <- boost(x=testkNNProc, B=3,
      reweighter=arcx4Reweighter,
      aggregator=arcx4Aggregator,
      data=df, 
      metadata=metadata,
      .boostBackendArgs=boostBackendArgs,
      .procArgs=testKNNProcArgs)

### Demonstrate using an alternative performance analyzer

testPerfAnalyzer2 <- function(pred, truth, oob, zeta) {
  list(e=mean(pred != truth), z=zeta)
}

testPerfAnalyzer2Metadata <- list(analyzerInputPreds="pred",
                                  analyzerInputResponse="truth",
                                  analyzerInputOObObs="oob")

metadata <- c(metadata, testPerfAnalyzer2Metadata)

boostedkNN <- boost(testkNNProc,
                    B=3,
                    reweighter=vanillaBagger,
                    aggregator=vanillaAggregator,
                    data=df,
                    .procArgs=testKNNProcArgs,
                    metadata=metadata,
                    .boostBackendArgs = list(
                      .analyzePerformanceArgs = list(zeta="77"),
                      .reweighterArgs=list(fakeStuff=77)),
                    analyzePerformance=testPerfAnalyzer2)