This function creates a decision tree from an example dataset, calculating the best possible classifier at each step. It only creates perfect divisions: if a rule does not produce a fully classified group, it is not considered. It is specifically designed for categorical values; continuous values are not recommended, as they will be treated as categorical ones.
decision_tree(
data,
classy,
m,
method = "entropy",
details = FALSE,
waiting = TRUE
)
Structure of the tree. A list with one list per tree level. Each of these contains a list per node of that level, and each node is itself a list containing the node's filtered data, the node's id, the parent node's id, the height the node is at, the variable it filters by, the value that variable is filtered by, and the information gain of the division.
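As an illustration only (the element order below follows the description above and is an assumption, not checked against the package source), a returned tree could be inspected like this:
tree <- decision_tree(db3, "VehicleType", 5, "entropy")
node <- tree[[2]][[1]]    # second level, first node
node_data   <- node[[1]]  # data filtered at this node
node_id     <- node[[2]]  # node's id
father_id   <- node[[3]]  # parent node's id
node_height <- node[[4]]  # height the node is at
split_var   <- node[[5]]  # variable the node filters by
split_value <- node[[6]]  # value that variable is filtered by
info_gain   <- node[[7]]  # information gain of the division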
A data frame with already classified observations. Each column represents an attribute of the observation; each row is a different observation. The column names in the parameter "data" must not contain the character sequence " or ". As this is intended as a binary decision rules generator rather than a binary decision tree generator, no tree structures are used, except for the information gain formulas.
Name of the column the data should be classified by. The set of rules obtained will be calculated according to this column.
Maximum number of child nodes each node can have.
The definition of Gain. It must be one of "Entropy", "Gini" or "Error".
Boolean value. If set to TRUE, multiple clarifications and explanations are printed throughout the execution.
Boolean value. If TRUE (while details = TRUE), the code will stop at each "block" of code and wait for the user to press Enter to continue.
Víctor Amador Padilla, victor.amador@edu.uah.es
If data is not perfectly classifiable (for example, if two identical observations have different values in the class column), the code will not finish.
Available information gain methods are:
The formula to calculate the entropy works as follows: \(I_{E} = -\sum_{i}{p_{i} \cdot \log_{2} p_{i}}\)
The formula to calculate gini works as follows: \(I_{G} = 1 - \sum_{i}{p_{i}^{2}}\)
The formula to calculate error works as follows: \(I_{C} = 1 - \max_{i}{(p_{i})}\)
In all three, \(p_{i}\) is the proportion of observations of class \(i\) in the node.
Once the impurity is calculated, the information gain is calculated as follows: $$IG = I_{father} - \sum_{son}{\frac{count(son\ values)}{count(father\ values)} \cdot I_{son}}$$
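As a sketch of these formulas only (the helper function and toy data below are illustrative and not part of the package), the impurity measures and the information gain of a split can be reproduced in base R:
# Toy illustration of the impurity and information-gain formulas above.
impurity <- function(labels, method = "entropy") {
  p <- table(labels) / length(labels)   # class proportions p_i
  switch(method,
         entropy = -sum(p * log2(p)),   # I_E = -sum(p_i * log2(p_i))
         gini    = 1 - sum(p^2),        # I_G = 1 - sum(p_i^2)
         error   = 1 - max(p))          # I_C = 1 - max(p_i)
}

father <- c("car", "car", "truck", "truck", "bike")           # node before the split
sons   <- list(c("car", "car"), c("truck", "truck", "bike"))  # child nodes after the split

# IG = I_father - sum(count(son values) / count(father values) * I_son)
impurity(father) -
  sum(sapply(sons, function(s) length(s) / length(father) * impurity(s)))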
# example code
decision_tree(db3, "VehicleType", 5, "entropy", details = TRUE, waiting = FALSE)
decision_tree(db2, "VehicleType", 4, "gini")
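# A self-contained alternative: db2 and db3 above are assumed to be example
# datasets bundled with the package; the small data frame below is made up for
# illustration and chosen so that it is perfectly classifiable.
vehicles <- data.frame(
  Wheels      = c("two", "four", "four", "two", "four"),
  Motor       = c("yes", "yes", "no", "no", "yes"),
  VehicleType = c("motorbike", "car", "cart", "bicycle", "car")
)
decision_tree(vehicles, "VehicleType", 3, "gini", details = TRUE, waiting = FALSE)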