This is the workhorse function for the FFTrees package. It creates one or more fast-and-frugal decision trees (FFTs) trained on a training dataset and tested on an optional test dataset.
FFTrees(formula = NULL, data = NULL, data.test = NULL,
algorithm = "ifan", max.levels = NULL, sens.w = 0.5,
cost.outcomes = NULL, cost.cues = NULL, stopping.rule = "exemplars",
stopping.par = 0.1, goal = "wacc", goal.chase = "bacc",
numthresh.method = "o", decision.labels = c("False", "True"),
main = NULL, train.p = 1, rounding = NULL, progress = TRUE,
repeat.cues = TRUE, my.tree = NULL, tree.definitions = NULL,
do.comp = TRUE, do.cart = TRUE, do.lr = TRUE, do.rf = TRUE,
do.svm = TRUE, store.data = FALSE, object = NULL, rank.method = NULL,
force = FALSE, verbose = NULL, comp = NULL)
formula: formula. A formula specifying a logical criterion as a function of one or more predictors.
data: dataframe. A training dataset.
data.test: dataframe. An optional testing dataset with the same structure as data.
algorithm: character. The algorithm used to create FFTs. Can be "ifan", "dfan", "max", or "zigzag".
max.levels: integer. The maximum number of levels considered for the trees. Because all permutations of exit structures are considered, the larger max.levels is, the more trees will be created.
sens.w: numeric. A number from 0 to 1 indicating how to weight sensitivity relative to specificity. Only relevant when goal = "wacc".
cost.outcomes: numeric. A vector of length 4 specifying the costs of a hit, false alarm, miss, and correct rejection, respectively. E.g., cost.outcomes = c(0, 10, 20, 0) means that a false alarm and a miss cost 10 and 20 respectively, while correct decisions have no cost.
cost.cues: dataframe. A dataframe with two columns specifying the cost of each cue. The first column should be a vector of cue names, and the second column should be a numeric vector of costs. Cues in the dataset not present in cost.cues are assumed to have 0 cost.
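As a sketch of how the two cost arguments fit together, assuming the heartdisease dataset bundled with the package (the specific cue names and cost values below are hypothetical, chosen only for illustration):

```r
library(FFTrees)

# Penalize misses (cost 20) more heavily than false alarms (cost 10);
# correct decisions cost nothing. Also assign a (hypothetical) cost of
# 100 to measuring the "thal" cue; unlisted cues default to cost 0.
heart.costs.fft <- FFTrees(
  formula       = diagnosis ~ .,
  data          = heartdisease,
  cost.outcomes = c(0, 10, 20, 0),   # hit, false alarm, miss, correct rejection
  cost.cues     = data.frame(cue  = c("thal"),
                             cost = c(100))
)
```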
stopping.rule: character. A string indicating the method to stop growing trees. "levels" means the tree grows until a certain level; "exemplars" means the tree grows until a certain number of unclassified exemplars remain; "statdelta" means the tree grows until the change in the criterion statistic is less than a specified level.
stopping.par: numeric. A number indicating the parameter for the stopping rule. For stopping.rule == "levels", this is the number of levels. For stopping.rule == "exemplars", this is the smallest percentage of exemplars allowed in the last level.
goal: character. A string indicating the statistic to maximize when selecting final trees: "acc" = overall accuracy, "wacc" = weighted accuracy, "bacc" = balanced accuracy.
goal.chase: character. A string indicating the statistic to maximize when constructing trees: "acc" = overall accuracy, "wacc" = weighted accuracy, "bacc" = balanced accuracy.
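For reference, the weighted and balanced accuracy statistics named above can be expressed in terms of sensitivity and specificity; bacc is their unweighted mean, and wacc weights sensitivity by sens.w. A minimal sketch (these helper functions are illustrative, not exported by the package):

```r
# Balanced accuracy: unweighted mean of sensitivity and specificity
bacc <- function(sens, spec) (sens + spec) / 2

# Weighted accuracy: sensitivity weighted by sens.w (default 0.5)
wacc <- function(sens, spec, sens.w = 0.5) sens.w * sens + (1 - sens.w) * spec

wacc(0.8, 0.6)                 # 0.7 -- with sens.w = 0.5, wacc equals bacc
wacc(0.8, 0.6, sens.w = 0.7)   # 0.74 -- sensitivity weighted more heavily
```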
numthresh.method: character. How should thresholds for numeric cues be determined? "o" will optimize thresholds, while "m" will always use the median.
decision.labels: string. A vector of strings of length 2 indicating labels for negative and positive cases, e.g., decision.labels = c("Healthy", "Diseased").
main: string. An optional label for the dataset. Passed on to other functions such as plot.FFTrees() and print.FFTrees().
train.p: numeric. What percentage of the data to use for training when data.test is not specified? For example, train.p = .5 will randomly split data into a 50% training set and a 50% test set. train.p = 1, the default, uses all data for training.
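A brief sketch of a train.p split, again assuming the bundled heartdisease dataset (the seed value is arbitrary and only included so the random split is reproducible):

```r
library(FFTrees)

set.seed(100)  # make the random 50/50 split reproducible

# Hold out half the data as a test set instead of supplying data.test
heart.split.fft <- FFTrees(formula = diagnosis ~ .,
                           data    = heartdisease,
                           train.p = .5)
```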
rounding: integer. An integer indicating digit rounding for non-integer numeric cue thresholds. The default is NULL, which means no rounding. A value of 0 rounds all possible thresholds to the nearest integer, a value of 1 rounds to the nearest .1, and so on.
progress: logical. Should progress reports be printed? Can be helpful for diagnosis when the function is running slowly.
repeat.cues: logical. Can cues occur multiple times within a tree?
my.tree: string. A string representing an FFT in words. For example, my.tree = "If age > 20, predict TRUE. If sex = [m], predict FALSE. Otherwise, predict TRUE".
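The verbal definition above can be passed directly to the function; a sketch using the bundled heartdisease data (the cue names age and sex are taken from the example string in the argument description):

```r
library(FFTrees)

# Define an FFT verbally rather than letting an algorithm construct it
my.heart.fft <- FFTrees(
  formula = diagnosis ~ age + sex,
  data    = heartdisease,
  my.tree = "If age > 20, predict TRUE.
             If sex = [m], predict FALSE. Otherwise, predict TRUE"
)
```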
tree.definitions: dataframe. An optional hard-coded definition of trees (see details below). If specified, no new trees are created.
do.comp, do.cart, do.lr, do.rf, do.svm: logical. Should alternative algorithms be fitted for comparison? cart = regular (non-frugal) trees with rpart, lr = logistic regression with glm, rf = random forests with randomForest, svm = support vector machines with e1071. Setting do.comp = FALSE sets all of these arguments to FALSE.
store.data: logical. Should training / test data be stored in the object? Default is FALSE.
object: FFTrees. An optional existing FFTrees object. When specified, no new trees are fitted and the existing trees are applied to data and data.test.
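A sketch of reusing fitted trees via object, assuming the heart.train and heart.test datasets bundled with the package:

```r
library(FFTrees)

# Fit trees once on the training data
heart.fft <- FFTrees(formula = diagnosis ~ .,
                     data    = heart.train)

# Apply the same trees to a new test set without refitting
heart.fft2 <- FFTrees(object    = heart.fft,
                      data      = heart.train,
                      data.test = heart.test)
```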
rank.method, verbose, comp: deprecated arguments.
force: logical. If TRUE, forces some parameters (like goal) to be as specified by the user even when the algorithm thinks those specifications don't make sense.
An FFTrees object with the following elements:
The formula specified when creating the FFTs.
Descriptive statistics of the data.
Marginal accuracies of each cue given a decision threshold calculated with the specified algorithm.
Definitions of each tree created by FFTrees. Each row corresponds to one tree. Different levels within a tree are separated by semicolons. See above for more details.
Tree definitions and classification statistics. Training and test data are stored separately.
A list of cost information for each case in each tree.
Cumulative classification statistics at each tree level. Training and test data are stored separately.
Final classification decisions. Each row is a case and each column is a tree. For example, row 1 in column 2 is the classification decision of tree number 2 for the first case. Training and test data are stored separately.
The level at which each case is classified in each tree. Rows correspond to cases and columns correspond to trees. Training and test data are stored separately.
The index of the 'final' tree specified by the algorithm. For algorithms that only return a single tree, this value is always 1.
A verbal definition of tree.max.
Area under the curve statistics.
A list of defined control parameters (e.g., algorithm, goal).
Models and classification statistics for competitive classification algorithms: Regularized logistic regression, CART, and random forest.
The original training and test data (only included when store.data = TRUE).
# Create FFTs for heart disease
heart.fft <- FFTrees(formula = diagnosis ~ .,
                     data = heartdisease)

# Print the result for summary info
heart.fft

# Plot the best tree
plot(heart.fft)