This is the workhorse function for the FFTrees package. It creates one or more fast-and-frugal decision trees (FFTs) trained on a training dataset and tested on an optional test dataset.
FFTrees(formula = NULL, data = NULL, data.test = NULL,
algorithm = "ifan", max.levels = NULL, sens.w = 0.5,
cost.outcomes = NULL, cost.cues = NULL, stopping.rule = "exemplars",
stopping.par = 0.1, goal = "wacc", goal.chase = "bacc",
numthresh.method = "o", decision.labels = c("False", "True"),
main = NULL, train.p = 1, rounding = NULL, progress = TRUE,
repeat.cues = TRUE, my.tree = NULL, tree.definitions = NULL,
do.comp = TRUE, do.cart = TRUE, do.lr = TRUE, do.rf = TRUE,
do.svm = TRUE, store.data = FALSE, object = NULL, rank.method = NULL,
force = FALSE, verbose = NULL, comp = NULL)
formula: formula. A formula specifying a logical criterion as a function of one or more predictors.
data: dataframe. A training dataset.
data.test: dataframe. An optional testing dataset with the same structure as data.
algorithm: character. The algorithm used to create FFTs. Can be "ifan", "dfan", "max", or "zigzag".
max.levels: integer. The maximum number of levels considered for the trees. Because all permutations of exit structures are considered, the larger max.levels is, the more trees will be created.
sens.w: numeric. A number from 0 to 1 indicating how to weight sensitivity relative to specificity. Only relevant when goal = "wacc".
cost.outcomes: numeric. A vector of length 4 specifying the costs of a hit, false alarm, miss, and correct rejection, respectively. E.g., cost.outcomes = c(0, 10, 20, 0) means that a false alarm and a miss cost 10 and 20 respectively, while correct decisions have no cost.
cost.cues: dataframe. A dataframe with two columns specifying the cost of each cue. The first column should be a vector of cue names, and the second column should be a numeric vector of costs. Cues in the dataset not present in cost.cues are assumed to have 0 cost.
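As a sketch of how the two cost arguments fit together, assuming the heartdisease dataset bundled with the package (the specific cue names and cost values below are hypothetical, chosen only for illustration):

```r
library(FFTrees)

# Penalize misses (cost 20) more heavily than false alarms (cost 10);
# correct decisions cost nothing. Also assign a (hypothetical) cost of
# 100 to measuring the "thal" cue; unlisted cues default to cost 0.
heart.costs.fft <- FFTrees(
  formula       = diagnosis ~ .,
  data          = heartdisease,
  cost.outcomes = c(0, 10, 20, 0),   # hit, false alarm, miss, correct rejection
  cost.cues     = data.frame(cue  = c("thal"),
                             cost = c(100))
)
```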
stopping.rule: character. A string indicating the method to stop growing trees. "levels" means the tree grows until a certain level; "exemplars" means the tree grows until a certain number of unclassified exemplars remain; "statdelta" means the tree grows until the change in the criterion statistic is less than a specified level.
stopping.par: numeric. A number indicating the parameter for the stopping rule. For stopping.rule == "levels", this is the number of levels. For stopping.rule == "exemplars", this is the smallest percentage of exemplars allowed in the last level.
goal: character. A string indicating the statistic to maximize when selecting final trees: "acc" = overall accuracy, "wacc" = weighted accuracy, "bacc" = balanced accuracy.
goal.chase: character. A string indicating the statistic to maximize when constructing trees: "acc" = overall accuracy, "wacc" = weighted accuracy, "bacc" = balanced accuracy.
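For reference, the weighted and balanced accuracy statistics named above can be expressed in terms of sensitivity and specificity; bacc is their unweighted mean, and wacc weights sensitivity by sens.w. A minimal sketch (these helper functions are illustrative, not exported by the package):

```r
# Balanced accuracy: unweighted mean of sensitivity and specificity
bacc <- function(sens, spec) (sens + spec) / 2

# Weighted accuracy: sensitivity weighted by sens.w (default 0.5)
wacc <- function(sens, spec, sens.w = 0.5) sens.w * sens + (1 - sens.w) * spec

wacc(0.8, 0.6)                 # 0.7 -- with sens.w = 0.5, wacc equals bacc
wacc(0.8, 0.6, sens.w = 0.7)   # 0.74 -- sensitivity weighted more heavily
```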
numthresh.method: character. How should thresholds for numeric cues be determined? "o" will optimize thresholds, while "m" will always use the median.
decision.labels: string. A vector of strings of length 2 indicating labels for negative and positive cases, e.g., decision.labels = c("Healthy", "Diseased").
main: string. An optional label for the dataset. Passed on to other functions such as plot.FFTrees() and print.FFTrees().
train.p: numeric. What percentage of the data to use for training when data.test is not specified? For example, train.p = .5 will randomly split data into a 50% training set and a 50% test set. train.p = 1, the default, uses all data for training.
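A brief sketch of a train.p split, again assuming the bundled heartdisease dataset (the seed value is arbitrary and only included so the random split is reproducible):

```r
library(FFTrees)

set.seed(100)  # make the random 50/50 split reproducible

# Hold out half the data as a test set instead of supplying data.test
heart.split.fft <- FFTrees(formula = diagnosis ~ .,
                           data    = heartdisease,
                           train.p = .5)
```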
rounding: integer. An integer indicating digit rounding for non-integer numeric cue thresholds. The default is NULL, which means no rounding. A value of 0 rounds all possible thresholds to the nearest integer, a value of 1 rounds to the nearest .1, and so on.
progress: logical. Should progress reports be printed? Can be helpful for diagnosis when the function is running slowly.
repeat.cues: logical. Can cues occur multiple times within a tree?
my.tree: string. A string representing an FFT in words. For example, my.tree = "If age > 20, predict TRUE. If sex = [m], predict FALSE. Otherwise, predict TRUE".
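The verbal definition above can be passed directly to the function; a sketch using the bundled heartdisease data (the cue names age and sex are taken from the example string in the argument description):

```r
library(FFTrees)

# Define an FFT verbally rather than letting an algorithm construct it
my.heart.fft <- FFTrees(
  formula = diagnosis ~ age + sex,
  data    = heartdisease,
  my.tree = "If age > 20, predict TRUE.
             If sex = [m], predict FALSE. Otherwise, predict TRUE"
)
```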
tree.definitions: dataframe. An optional hard-coded definition of trees (see details below). If specified, no new trees are created.
do.comp, do.cart, do.lr, do.rf, do.svm: logical. Should alternative algorithms be fitted for comparison? cart = regular (non-frugal) trees with rpart, lr = logistic regression with glm, rf = random forests with randomForest, svm = support vector machines with e1071. Setting do.comp = FALSE sets all of these arguments to FALSE.
store.data: logical. Should training / test data be stored in the object? Default is FALSE.
object: FFTrees. An optional existing FFTrees object. When specified, no new trees are fitted and the existing trees are applied to data and data.test.
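A sketch of reusing fitted trees via object, assuming the heart.train and heart.test datasets bundled with the package:

```r
library(FFTrees)

# Fit trees once on the training data
heart.fft <- FFTrees(formula = diagnosis ~ .,
                     data    = heart.train)

# Apply the same trees to a new test set without refitting
heart.fft2 <- FFTrees(object    = heart.fft,
                      data      = heart.train,
                      data.test = heart.test)
```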
rank.method, verbose, comp: deprecated arguments.
force: logical. If TRUE, forces some parameters (like goal) to be as specified by the user even when the algorithm thinks those specifications don't make sense.
An FFTrees object with the following elements:
The formula specified when creating the FFTs.
Descriptive statistics of the data.
Marginal accuracies of each cue given a decision threshold calculated with the specified algorithm.
Definitions of each tree created by FFTrees. Each row corresponds to one tree. Different levels within a tree are separated by semicolons. See above for more details.
Tree definitions and classification statistics. Training and test data are stored separately.
A list of cost information for each case in each tree.
Cumulative classification statistics at each tree level. Training and test data are stored separately.
Final classification decisions. Each row is a case and each column is a tree. For example, row 1 in column 2 is the classification decision of tree number 2 for the first case. Training and test data are stored separately.
The level at which each case is classified in each tree. Rows correspond to cases and columns correspond to trees. Training and test data are stored separately.
The index of the 'final' tree specified by the algorithm. For algorithms that only return a single tree, this value is always 1.
A verbal definition of tree.max.
Area under the curve statistics.
A list of defined control parameters (e.g., algorithm, goal).
Models and classification statistics for competitive classification algorithms: Regularized logistic regression, CART, and random forest.
The original training and test data (only included when store.data = TRUE).
# Create FFTs for heart disease
heart.fft <- FFTrees(formula = diagnosis ~ .,
                     data = heartdisease)

# Print the result for summary info
heart.fft

# Plot the best tree
plot(heart.fft)