helpCore: Description of parameters.
Description
The behavior of CORElearn is controlled by several parameters. This is a short overview.
Attribute/feature evaluation
The parameters in this group can be used both in model construction
with CoreModel and in feature evaluation with attrEval. See attrEval
for a description of the relevant evaluation methods. The parameters attrEvaluationInstances, binaryEvaluation,
and binarySplitNumericAttributes
apply to all attribute evaluation methods. In models that need feature evaluation (e.g., trees,
random forests), they affect the selection of splits in the nodes.
The other parameters may be used only in context-sensitive measures, i.e., ReliefF in classification
and RReliefF in regression, and their variants.
Decision/regression tree construction
Several parameters control the construction of the tree model. Some are described here,
but the attribute evaluation, stop building, model, constructive induction, discretization,
and pruning options described in this document also apply.
Splits in trees are always binary; however, the option binaryEvaluation influences the
feature selection for the split. Namely, the best feature for the split is selected using the given
value of binaryEvaluation. If binaryEvaluation=FALSE, the features are evaluated first and
only the best one is then binarized. If binaryEvaluation=TRUE, the features are binarized before
selection. In this case, a search for the best binarization is performed for all considered features, and
the best binarizations found are used for splits. The latter option is computationally more intensive,
but typically does not produce better trees.
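The difference between the two settings can be illustrated with a small Python sketch. It is not CORElearn's implementation: the Gini gain, the function names, and the toy data are assumptions chosen for illustration, and the flag binary_evaluation only mimics the role of binaryEvaluation.

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gain(groups, labels):
    """Impurity decrease when `labels` is partitioned into `groups`."""
    n = len(labels)
    return gini(labels) - sum(len(g) / n * gini(g) for g in groups)

def multiway_gain(column, labels):
    """Score an attribute with one branch per attribute value."""
    groups = {}
    for v, y in zip(column, labels):
        groups.setdefault(v, []).append(y)
    return gain(list(groups.values()), labels)

def best_binarization(column, labels):
    """Exhaustively search two-subset splits; return (score, left subset)."""
    values = sorted(set(column))
    first, rest = values[0], values[1:]   # pin one value to skip mirrored splits
    best = (-1.0, None)
    for r in range(len(rest)):            # at least one value stays on the right
        for extra in combinations(rest, r):
            left = {first, *extra}
            l = [y for v, y in zip(column, labels) if v in left]
            rt = [y for v, y in zip(column, labels) if v not in left]
            g = gain([l, rt], labels)
            if g > best[0]:
                best = (g, frozenset(left))
    return best

def select_split(data, labels, binary_evaluation):
    """Return (attribute, left subset) for the chosen split."""
    if binary_evaluation:
        # Binarize every attribute first, then compare the binary splits.
        scored = {a: best_binarization(col, labels) for a, col in data.items()}
        attr = max(scored, key=lambda a: scored[a][0])
        return attr, scored[attr][1]
    # Evaluate attributes as they are, then binarize only the winner.
    attr = max(data, key=lambda a: multiway_gain(data[a], labels))
    return attr, best_binarization(data[attr], labels)[1]

data = {"x": [1, 1, 2, 2, 1, 2], "y": ["p", "q", "r", "p", "q", "r"]}
labels = ["a", "a", "b", "b", "a", "b"]
for flag in (False, True):
    print(flag, select_split(data, labels, flag))
```

In the sketch, the TRUE branch runs the binarization search once per attribute, while the FALSE branch runs it once in total, which is where the extra cost comes from.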
Stop tree building
During tree construction, nodes are split recursively until a stopping condition is fulfilled.
Models in the tree leaves
The leaves of the tree model can contain various prediction models. For example, instead of predicting the
majority class, one can use naive Bayes in classification, or a linear model in regression, thereby increasing the
expressive power of the tree model.
Constructive induction (a.k.a. feature construction)
The expressive power of tree models can be increased by incorporating additional types of splits. Operator-based
constructive induction is implemented in both classification and regression. The best construct is found with beam search.
At each step, new constructs are evaluated with the selected feature evaluation measure.
The choice of operator types controls the expressions that can appear in the interior tree nodes.
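As a rough illustration of beam search over constructs, the following Python sketch builds conjunctions of boolean features and keeps only the best few at each step. It is a generic sketch under stated assumptions, not CORElearn's code: the conjunction operator, the accuracy-based score, the names beam_width and max_depth, and the toy data are all hypothetical.

```python
def beam_search(base, score, beam_width, max_depth):
    """Search for the best conjunction of features from `base`.

    A construct is a frozenset of feature names read as a conjunction.
    `score` maps a construct to a number (higher is better).
    """
    beam = sorted((frozenset([b]) for b in base), key=score, reverse=True)
    beam = beam[:beam_width]
    best = beam[0]
    for _ in range(max_depth - 1):
        # Extend every construct in the beam by one more feature.
        candidates = {c | {b} for c in beam for b in base if b not in c}
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        if score(beam[0]) > score(best):
            best = beam[0]
    return best

# Toy data: the class is 1 exactly when both A and B are 1.
rows = [
    {"A": 1, "B": 1, "C": 0, "y": 1},
    {"A": 1, "B": 1, "C": 1, "y": 1},
    {"A": 1, "B": 0, "C": 1, "y": 0},
    {"A": 0, "B": 1, "C": 1, "y": 0},
    {"A": 0, "B": 0, "C": 0, "y": 0},
    {"A": 1, "B": 0, "C": 0, "y": 0},
    {"A": 0, "B": 1, "C": 0, "y": 0},
    {"A": 0, "B": 0, "C": 1, "y": 0},
]

def accuracy(construct):
    """Fraction of rows where the conjunction equals the class value."""
    hits = sum(int(all(r[f] for f in construct)) == r["y"] for r in rows)
    return hits / len(rows)

best = beam_search(["A", "B", "C"], accuracy, beam_width=2, max_depth=3)
print(sorted(best))
```

A wider beam explores more constructs per step at a higher cost; a beam of width one reduces to greedy search.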
Attribute discretization and binarization
Some algorithms cannot handle numeric attributes directly, so these attributes have to be discretized. Tree models also use
binary splits in nodes. The discretization algorithm evaluates split candidates and forms intervals of values.
Note that setting discretizationSample=1 forces random selection of the splitting point, which speeds up the algorithm
and may be perfectly acceptable for random forest ensembles.
CORElearn builds binary trees, so multivalued discrete attributes have to be binarized, i.e., their values have to be split into
two subsets, one going left and the other going right in a node. The method used depends on the parameters
and the number of attribute values. Possible methods are exhaustive (if the number of attribute values is less than or equal to
maxValues4Exhaustive), greedy (if the number of attribute values is less than or equal to maxValues4Greedy),
and random (if the number of attribute values is more than maxValues4Exhaustive).
Setting maxValues4Greedy=2 always selects the splitting point randomly.
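To see why exhaustive binarization is limited to attributes with few values, the following Python sketch enumerates every split of k distinct values into two non-empty subsets; there are 2^(k-1) - 1 of them. The function names are hypothetical and the code only illustrates the general idea, not CORElearn's internal procedure. A random split is included as the cheap alternative.

```python
import random
from itertools import combinations

def binary_partitions(values):
    """Yield every split of distinct `values` into two non-empty subsets."""
    values = list(values)
    first, rest = values[0], values[1:]   # pin one value to skip mirrored splits
    for r in range(len(rest)):            # at least one value stays on the right
        for extra in combinations(rest, r):
            left = {first, *extra}
            yield left, set(values) - left

def random_partition(values, rng):
    """Draw one random split into two non-empty subsets."""
    values = list(values)
    if len(values) < 2:
        raise ValueError("need at least two distinct values")
    while True:
        left = {v for v in values if rng.random() < 0.5}
        if 0 < len(left) < len(values):
            return left, set(values) - left

for k in (3, 5, 10, 15):
    count = sum(1 for _ in binary_partitions(range(k)))
    print(k, count, 2 ** (k - 1) - 1)

print(random_partition("abcd", random.Random(0)))
```

The number of candidate splits roughly doubles with every additional attribute value, so a threshold on the number of values keeps the exhaustive search affordable.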
Tree pruning
After the tree is constructed, it is beneficial to prune it to reduce the effect of noise.
Prediction
For some models (decision trees, random forests, naive Bayes, and regression trees) one can smooth the output predictions.
In classification models the output probabilities are smoothed, and in regression the predicted value is smoothed.
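For classification, the effect of smoothing can be illustrated with two standard techniques, Laplace correction and the m-estimate. The Python sketch below is an assumption-laden illustration: this section does not say which smoothing formulas CORElearn applies, and the function names, the prior probabilities, and the value of m are hypothetical.

```python
def laplace(counts):
    """Laplace-corrected class probabilities from class counts in a leaf."""
    total = sum(counts.values())
    k = len(counts)
    return {c: (n + 1) / (total + k) for c, n in counts.items()}

def m_estimate(counts, priors, m):
    """m-estimate: shrink relative frequencies toward prior probabilities."""
    total = sum(counts.values())
    return {c: (n + m * priors[c]) / (total + m) for c, n in counts.items()}

# A small leaf containing three "yes" instances and no "no" instances.
leaf = {"yes": 3, "no": 0}
raw = {c: n / sum(leaf.values()) for c, n in leaf.items()}
print(raw)
print(laplace(leaf))
print(m_estimate(leaf, priors={"yes": 0.5, "no": 0.5}, m=2))
```

Without smoothing, the leaf assigns probability zero to a class it has not seen; both corrections pull the estimate away from the extremes, more strongly when the leaf holds few instances.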
Random forests
A random forest is a fairly complex model whose construction can be controlled with several parameters.
Currently, only the classification version of the algorithm is implemented.
Besides the parameters in this section, most of the parameters that control decision trees also apply (except constructive induction and tree pruning).
General tree ensembles
More general tree ensembles can be constructed in the same manner as random forests. Additional options control sampling,
tree size, and regularization.
Read data directly from files
For very large data sets it is useful to bypass R and read data directly from files, as the standalone learning system CORElearn
does. The supported file formats are C4.5, M5, and the native format of CORElearn. See the documentation at http://lkm.fri.uni-lj.si/rmarko/software/.
Details
Many different parameters are available. Some are general and can be used in many
learning or feature evaluation algorithms. All the values actually used by
the classifier or regressor can be written to a file (or read from it) using
paramCoreIO.
The parameters for the methods are split into several groups and documented below.
References
B. Zadrozny, C. Elkan. Learning and making decisions when costs and probabilities are both unknown.
In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, 2001.