FFTrees 1.7.0

The R package FFTrees creates, visualizes and evaluates fast-and-frugal decision trees (FFTs) for solving binary classification tasks following the methods described in Phillips, Neth, Woike & Gaissmaier (2017).

What are fast-and-frugal trees (FFTs)?

Fast-and-frugal trees (FFTs) are simple and transparent decision algorithms for solving binary classification problems. The key feature making FFTs faster and more frugal than other decision trees is that every node allows for a decision. When predicting new outcomes, the performance of FFTs competes with more complex algorithms and machine learning techniques, such as logistic regression (LR), support-vector machines (SVM), and random forests (RF). Apart from being faster and requiring less information, FFTs tend to be robust against overfitting, and easy to interpret, use, and communicate.
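
To make the idea of a decision at every node concrete, here is a minimal base-R sketch of a three-node FFT written as a chain of early exits. The three rules are copied from the FFT #1 definition printed further down in this README. The function name classify_fft and the cues_used field are illustrative only and are not part of the FFTrees API.

```r
# Hand-coded FFT with an exit at every node (illustrative, not FFTrees code).
# Cue names (thal, cp, ca) follow the heartdisease data.
classify_fft <- function(thal, cp, ca) {
  # Node 1: decide on the first cue alone if it matches
  if (thal %in% c("rd", "fd")) {
    return(list(decision = "Disease", cues_used = 1))
  }
  # Node 2: only reached when node 1 did not decide
  if (cp != "a") {
    return(list(decision = "Healthy", cues_used = 2))
  }
  # Node 3 (final node): always decides, so it has two exits
  if (ca > 0) {
    list(decision = "Disease", cues_used = 3)
  } else {
    list(decision = "Healthy", cues_used = 3)
  }
}

# Three hypothetical patients (values invented for illustration)
p1 <- classify_fft(thal = "rd",     cp = "np", ca = 0)
p2 <- classify_fft(thal = "normal", cp = "np", ca = 2)
p3 <- classify_fft(thal = "normal", cp = "a",  ca = 2)
```

In this sketch, p1 is classified after one cue, p2 after two, and only p3 needs all three. Averaging cues_used across cases gives a simple measure of how frugal a tree is.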

Installation

To install the latest release of FFTrees from CRAN, run:

install.packages("FFTrees")

The current development version of FFTrees can be installed from GitHub with:

# install.packages("devtools")
devtools::install_github("ndphillips/FFTrees", build_vignettes = TRUE)

Getting started

As an example, let’s create an FFT predicting heart disease status (Healthy vs. Diseased) based on the heartdisease dataset included in FFTrees:

library(FFTrees)  # load package

Using data

The heartdisease data provides medical information for 303 patients who were tested for heart disease. The full data were split into two subsets: a heart.train dataset for fitting decision trees, and a heart.test dataset for testing the resulting trees. Here are the first rows and columns of both subsets of the heartdisease data:

  • heart.train (the training / fitting dataset) contains the data from 150 patients:
heart.train
#> # A tibble: 150 × 14
#>    diagnosis   age   sex cp    trestbps  chol   fbs restecg     thalach exang
#>    <lgl>     <dbl> <dbl> <chr>    <dbl> <dbl> <dbl> <chr>         <dbl> <dbl>
#>  1 FALSE        44     0 np         108   141     0 normal          175     0
#>  2 FALSE        51     0 np         140   308     0 hypertrophy     142     0
#>  3 FALSE        52     1 np         138   223     0 normal          169     0
#>  4 TRUE         48     1 aa         110   229     0 normal          168     0
#>  5 FALSE        59     1 aa         140   221     0 normal          164     1
#>  6 FALSE        58     1 np         105   240     0 hypertrophy     154     1
#>  7 FALSE        41     0 aa         126   306     0 normal          163     0
#>  8 TRUE         39     1 a          118   219     0 normal          140     0
#>  9 TRUE         77     1 a          125   304     0 hypertrophy     162     1
#> 10 FALSE        41     0 aa         105   198     0 normal          168     0
#> # … with 140 more rows, and 4 more variables: oldpeak <dbl>, slope <chr>,
#> #   ca <dbl>, thal <chr>
  • heart.test (the testing / prediction dataset) contains data from a new set of 153 patients:
heart.test
#> # A tibble: 153 × 14
#>    diagnosis   age   sex cp    trestbps  chol   fbs restecg     thalach exang
#>    <lgl>     <dbl> <dbl> <chr>    <dbl> <dbl> <dbl> <chr>         <dbl> <dbl>
#>  1 FALSE        51     0 np         120   295     0 hypertrophy     157     0
#>  2 TRUE         45     1 ta         110   264     0 normal          132     0
#>  3 TRUE         53     1 a          123   282     0 normal           95     1
#>  4 TRUE         45     1 a          142   309     0 hypertrophy     147     1
#>  5 FALSE        66     1 a          120   302     0 hypertrophy     151     0
#>  6 TRUE         48     1 a          130   256     1 hypertrophy     150     1
#>  7 TRUE         55     1 a          140   217     0 normal          111     1
#>  8 FALSE        56     1 aa         130   221     0 hypertrophy     163     0
#>  9 TRUE         42     1 a          136   315     0 normal          125     1
#> 10 FALSE        45     1 a          115   260     0 hypertrophy     185     0
#> # … with 143 more rows, and 4 more variables: oldpeak <dbl>, slope <chr>,
#> #   ca <dbl>, thal <chr>

Most of the variables in our data are potential predictors. The criterion variable is diagnosis — a logical column indicating the true state for each patient (TRUE or FALSE, i.e., whether or not the patient suffers from heart disease).
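
The README does not say how the 303 patients were divided into the 150 training and 153 test cases. As a rough sketch of one common approach, assuming a simple random split, the following base-R code partitions a stand-in data frame of the same size. The object names (patients, train_set, test_set) and the seed are hypothetical.

```r
set.seed(123)  # arbitrary seed so the split is reproducible

# Stand-in data with 303 rows (the real heartdisease data has 14 columns)
n_total  <- 303
patients <- data.frame(id = seq_len(n_total))

# Draw 150 row indices for training; the remaining 153 rows form the test set
train_rows <- sample(n_total, size = 150)
train_set  <- patients[train_rows, , drop = FALSE]
test_set   <- patients[-train_rows, , drop = FALSE]

nrow(train_set)  # 150
nrow(test_set)   # 153
```

Keeping the test cases out of the fitting step is what allows the test-set statistics to be read as predictive performance.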

Creating fast-and-frugal trees (FFTs)

Now let’s use FFTrees() to create FFTs for the heart.train data and evaluate their predictive performance on the heart.test data:

  • Create an FFTrees object from the heartdisease data:
# Create an FFTrees object from the heartdisease data: 
heart.fft <- FFTrees(formula = diagnosis ~., 
                     data = heart.train,
                     data.test = heart.test, 
                     decision.labels = c("Healthy", "Disease"))
#> Setting 'goal = bacc'
#> Setting 'goal.chase = bacc'
#> Setting 'goal.threshold = bacc'
#> Setting cost.outcomes = list(hi = 0, mi = 1, fa = 1, cr = 0)
#> Growing FFTs with ifan:
#> Fitting other algorithms for comparison (disable with do.comp = FALSE) ...
  • Printing an FFTrees object shows basic information and summary statistics (on the best training tree, FFT #1):
# Print:
heart.fft
#> FFTrees 
#> - Trees: 7 fast-and-frugal trees predicting diagnosis
#> - Outcome costs: [hi = 0, mi = 1, fa = 1, cr = 0]
#> 
#> FFT #1: Definition
#> [1] If thal = {rd,fd}, decide Disease.
#> [2] If cp != {a}, decide Healthy.
#> [3] If ca > 0, decide Disease, otherwise, decide Healthy.
#> 
#> FFT #1: Training Accuracy
#> Training data: N = 150, Pos (+) = 66 (44%) 
#> 
#> |          | True + | True - | Totals:
#> |----------|--------|--------|
#> | Decide + | hi  54 | fa  18 |      72
#> | Decide - | mi  12 | cr  66 |      78
#> |----------|--------|--------|
#>   Totals:        66       84   N = 150
#> 
#> acc  = 80.0%   ppv  = 75.0%   npv  = 84.6%
#> bacc = 80.2%   sens = 81.8%   spec = 78.6%
#> 
#> FFT #1: Training Speed, Frugality, and Cost
#> mcu = 1.74,  pci = 0.87,  E(cost) = 0.200
  • To evaluate the predictive performance of an FFT, we plot an FFTrees object to visualize a tree and its performance (on the test data):
# Plot the best tree applied to the test data: 
plot(heart.fft,
     data = "test",
     main = "Heart Disease")

Figure 1: A fast-and-frugal tree (FFT) predicting heart disease for test data and its performance characteristics.

  • Additionally, we can compare the predictive performance between different machine learning algorithms on a range of metrics:
# Compare predictive performance across algorithms: 
heart.fft$competition$test
#> # A tibble: 5 × 16
#>   algorithm     n    hi    fa    mi    cr  sens  spec    far   ppv   npv   acc
#>   <chr>     <int> <int> <int> <int> <int> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
#> 1 fftrees     153    64    19     9    61 0.877 0.762 0.238  0.771 0.871 0.817
#> 2 lr          153    55    13    18    67 0.753 0.838 0.162  0.809 0.788 0.797
#> 3 cart        153    50    19    23    61 0.685 0.762 0.238  0.725 0.726 0.725
#> 4 rf          153    59     8    14    72 0.808 0.9   0.1    0.881 0.837 0.856
#> 5 svm         153    55     7    18    73 0.753 0.912 0.0875 0.887 0.802 0.837
#> # … with 4 more variables: bacc <dbl>, cost <dbl>, cost_decisions <dbl>,
#> #   cost_cues <dbl>
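
The accuracy statistics in the printed output and in the comparison table all derive from the four cells of a 2x2 confusion table (hi, fa, mi, cr). The following base-R sketch recomputes them from the counts shown above. The function confusion_stats is a hypothetical helper, not an FFTrees function, and its cost value reflects one reading of E(cost) as the average outcome cost per case, ignoring any cue costs.

```r
# Classification statistics from the four cells of a 2x2 confusion table.
# hi = hits, fa = false alarms, mi = misses, cr = correct rejections.
confusion_stats <- function(hi, fa, mi, cr,
                            costs = c(hi = 0, fa = 1, mi = 1, cr = 0)) {
  n    <- hi + fa + mi + cr
  sens <- hi / (hi + mi)   # proportion of true positives detected
  spec <- cr / (cr + fa)   # proportion of true negatives detected
  total_cost <- hi * costs[["hi"]] + fa * costs[["fa"]] +
                mi * costs[["mi"]] + cr * costs[["cr"]]
  c(
    sens = sens,
    spec = spec,
    ppv  = hi / (hi + fa),      # positive predictive value
    npv  = cr / (cr + mi),      # negative predictive value
    acc  = (hi + cr) / n,       # overall accuracy
    bacc = (sens + spec) / 2,   # balanced accuracy
    cost = total_cost / n       # average outcome cost per case
  )
}

# Counts for FFT #1 on the training data (from the printed FFTrees object)
train_stats <- confusion_stats(hi = 54, fa = 18, mi = 12, cr = 66)

# Counts for the fftrees row of the test-data comparison table
test_stats <- confusion_stats(hi = 64, fa = 19, mi = 9, cr = 61)

round(train_stats, 3)
round(test_stats, 3)
```

Because bacc averages sensitivity and specificity, it is less affected by unequal class sizes than acc, which may be why it serves as the default goal in the output above.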

Building FFTs from verbal descriptions

Because fast-and-frugal trees are so simple, we can even create FFTs ‘from words’ and apply them to data!

For example, let’s create a tree with the following three nodes and evaluate its performance on the heart.test data:

  1. If sex = 1, predict Disease.
  2. If age < 45, predict Healthy.
  3. If thal = {fd, normal}, predict Healthy,
    otherwise, predict Disease.

These conditions can directly be supplied to the my.tree argument of FFTrees():

# Create custom FFT 'in words' and apply it to test data:

# 1. Create my own FFT (from verbal description):
my.fft <- FFTrees(formula = diagnosis ~., 
                  data = heart.train,
                  data.test = heart.test, 
                  decision.labels = c("Healthy", "Disease"),
                  my.tree = "If sex = 1, predict Disease.
                             If age < 45, predict Healthy.
                             If thal = {fd, normal}, predict Healthy,  
                             Otherwise, predict Disease.")

# 2. Plot and evaluate my custom FFT (for test data):
plot(my.fft,
     data = "test",
     main = "My custom FFT")

Figure 2: An FFT predicting heart disease created from a verbal description.

As we can see, this particular tree is somewhat biased: It has nearly perfect sensitivity (i.e., is good at identifying cases of Disease) but suffers from low specificity (i.e., is not so good at identifying Healthy cases). Overall, the accuracy of our custom tree exceeds the data’s baseline by a fair amount. However, exploring FFTrees further will quickly enable you to design much better FFTs.
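
The bias toward Disease follows largely from the first node, which classifies every case with sex = 1 as Disease before any other cue is considered. As a small illustration, the base-R sketch below applies only that first node to the ten heart.test rows printed earlier in this README. The data frame first_rows is typed in by hand from that output and covers just those ten rows, so the counts are not the tree's full test performance.

```r
# First ten rows of heart.test as printed earlier (only the columns used here)
first_rows <- data.frame(
  diagnosis = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE),
  sex       = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)

# Node 1 of the custom tree: "If sex = 1, predict Disease."
exits_at_node_1 <- first_rows$sex == 1

# Among the cases that exit at node 1, count correct and incorrect Disease calls
hits         <- sum(exits_at_node_1 & first_rows$diagnosis)
false_alarms <- sum(exits_at_node_1 & !first_rows$diagnosis)

sum(exits_at_node_1)  # 9 of the 10 rows exit at the first node
hits                  # 6
false_alarms          # 3
```

Every truly diseased case that exits here is caught, which supports high sensitivity, but every healthy case with sex = 1 becomes a false alarm, which lowers specificity.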

References

We had a lot of fun creating FFTrees and hope you like it too! As a comprehensive, yet accessible introduction to FFTs, we recommend reading our article in the journal Judgment and Decision Making (2017, volume 12, issue 4), entitled FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees.

Citation (in APA format):

  • Phillips, N. D., Neth, H., Woike, J. K. & Gaissmaier, W. (2017). FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees. Judgment and Decision Making, 12 (4), 344–368. Retrieved from https://journal.sjdm.org/17/17217/jdm17217.pdf

We encourage you to read the article to learn more about the history of FFTs and how the FFTrees package creates, visualizes, and evaluates them. If you use FFTrees in your own work, please cite us and share your experiences (e.g., on GitHub) so we can continue developing the package.

Scientific publications that have used FFTrees are listed on Google Scholar.

[File README.Rmd last updated on 2022-08-31.]

Version: 1.7.0
License: CC0
Maintainer: Hansjoerg Neth
Last published: August 31st, 2022
Monthly downloads: 373

Functions in FFTrees (1.7.0)

  • contraceptive: Contraceptive use data
  • factclean: Clean factor variables in prediction data
  • creditapproval: Credit approval data
  • mushrooms: Mushrooms data
  • titanic: Titanic survival data
  • comp.pred: A wrapper for competing classification algorithms
  • breastcancer: Physiological data of patients tested for breast cancer
  • fftrees_fitcomp: Fit competitive algorithms
  • inwords: Provide a verbal description of an FFT
  • car: Car acceptability data
  • add_stats: Add decision statistics to data (containing counts of a 2x2 contingency table)
  • blood: Blood donation data
  • fertility: Fertility data
  • classtable: Compute classification statistics for binary prediction and criterion (e.g., truth) vectors
  • iris.v: Iris data
  • forestfires: Forest fires data
  • plot.FFTrees: Plot an FFTrees object
  • fftrees_ranktrees: Rank FFTs by current goal
  • fftrees_grow_fan: Grow fast-and-frugal trees (FFTs) using the fan algorithm
  • fftrees_ffttowords: Describe a fast-and-frugal tree (FFT) in words
  • fftrees_wordstofftrees: Convert a text description of an FFT into an FFTrees object
  • FFTrees: Main function to create and apply fast-and-frugal trees (FFTs)
  • fftrees_cuerank: Calculate thresholds that optimize some statistic (goal) for cues in data
  • heart.train: Heart disease training data
  • fftrees_apply: Apply an FFT to data and generate accuracy statistics
  • fftrees_define: Create FFT definitions
  • heartdisease: Heart disease data
  • fftrees_create: Create an object of class FFTrees
  • FFTrees.guide: Open the FFTrees package guide
  • summary.FFTrees: Summarize an FFTrees object
  • predict.FFTrees: Predict classification outcomes or probabilities from data
  • print.FFTrees: Print basic information of fast-and-frugal trees (FFTs)
  • sonar: Sonar data
  • fftrees_threshold_numeric_grid: Perform a grid search over thresholds and return accuracy statistics for a given numeric cue
  • fftrees_threshold_factor_grid: Perform a grid search over factor and return accuracy statistics for a given factor cue
  • voting: Voting data
  • heart.cost: Cue costs for the heartdisease data
  • heart.test: Heart disease testing data
  • showcues: Visualize cue accuracies (as points in ROC space)
  • select_best_tree: Select the best tree (from the current set)
  • wine: Wine tasting data