The function discretize returns discretization bounds for numeric attributes; two auxiliary functions, applyDiscretization and intervalMidPoint, work with these bounds.
Discretization can be obtained with one of three methods:
greedy search using a given feature evaluation heuristic, equal width of intervals, or equal number of instances in each interval.
The attributes and the target variable are specified using the formula interface, the target variable name, or its index.
Feature evaluation algorithms available for classification problems
include several variants of the Relief and ReliefF algorithms, gain ratio, Gini index, MDL, DKM, information gain, etc.
For regression problems there are RReliefF, MSEofMean, MSEofModel, MAEofModel, etc.
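The two non-greedy methods can be illustrated in base R. The following is a minimal sketch, not CORElearn's implementation: the helper names equalWidthBounds and equalFrequencyBounds are hypothetical, and using quantile() for the equal frequency case is an assumption about one reasonable way to get intervals with the same number of instances.

```r
# Hypothetical helpers for illustration only; they are not part of CORElearn.
equalWidthBounds <- function(x, bins) {
  r <- range(x, na.rm = TRUE)
  # bins intervals of equal width need bins - 1 interior boundaries
  seq(r[1], r[2], length.out = bins + 1)[-c(1, bins + 1)]
}

equalFrequencyBounds <- function(x, bins) {
  # interior quantiles split the values into groups of roughly equal size
  probs <- seq_len(bins - 1) / bins
  unname(quantile(x, probs = probs, na.rm = TRUE))
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
ew <- equalWidthBounds(x, 4)      # boundaries driven by the outlier 100
ef <- equalFrequencyBounds(x, 5)  # boundaries driven by the ranks of the values

# number of instances falling into each interval
table(cut(x, c(-Inf, ew, Inf)))
table(cut(x, c(-Inf, ef, Inf)))
```

With a single outlier, equal width leaves most instances in the first interval, while equal frequency spreads them evenly; this is the practical difference between the two methods.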
discretize(formula, data, method=c("greedy", "equalFrequency", "equalWidth"), estimator,
           discretizationLookahead=3, discretizationSample=0, maxBins=0, equalDiscBins=4, ...)
applyDiscretization(data, boundsList, noDecimalsInValueName=2)
intervalMidPoint(data, boundsList, midPointMethod=c("equalFrequency", "equalWidth"))

With method="greedy", greedy search using the given feature evaluation heuristic is selected, while "equalFrequency" and "equalWidth"
select equal frequency (the same number of instances in each interval) and equal width discretization, respectively.

discretizationLookahead number of times
(0=try all possibilities). Candidate boundaries are chosen from a random sample of boundaries,
whose size is discretizationSample.

Otherwise binarization
is done greedily starting from the best separation of a single value.

For ReliefF-type measures, binarization of numeric features is performed with discretizationSample randomly
chosen splits. For other measures, the split is searched exhaustively among all possible splits.

data to produce
discrete attributes of type factor. Numeric bounds can be obtained by calling the discretize function.

The "equalFrequency" method selects the middle point so that each half-interval
contains an equal number of instances. The "equalWidth" method sets the middle point to be equally distant from the boundaries.

The function discretize returns a list of discretization bounds for numeric attributes. One component of the list contains the bounds for one attribute.
If an attribute has all values equal, the value NA is returned. If an attribute has all values equal to NA, it is skipped in the returned list.

The function applyDiscretization returns a data set where all numeric attributes are replaced with their discrete versions.

The function intervalMidPoint returns a list of vectors where each vector contains the middle points of the discretized intervals.

In discretize, the parameter formula can be interpreted in three ways, where the formula interface is the most elegant one,
but inefficient and inappropriate for large data sets. See CoreModel for details.
The estimator parameter selects the evaluation heuristic. For classification problems it
must be one of the names returned by infoCore(what="attrEval"), and for
regression problems it must be one of the names returned by infoCore(what="attrEvalReg").
For details see their description in attrEval.
If the vector supplied in maxBins or equalDiscBins is shorter than the number of numeric attributes, the
vector is coerced to the required length.
Some additional parameters, passed through ..., are used by specific evaluation heuristics.
Their list and short descriptions are available by calling helpCore. See the section on attribute evaluation.
The function applyDiscretization takes the discretization bounds obtained with the function discretize and transforms
numeric features in a data set into discrete features.
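The idea of turning numeric columns into factors from a list of bounds can be sketched in base R with cut(). This is only an illustration under stated assumptions, not the code behind applyDiscretization: the helper name applyBoundsSketch is hypothetical, the bounds below are made up rather than produced by discretize, intervals are assumed to be closed on the right, and only attributes named in the list are transformed.

```r
# Hypothetical helper for illustration; not CORElearn's applyDiscretization.
applyBoundsSketch <- function(data, boundsList) {
  for (name in names(boundsList)) {
    bounds <- boundsList[[name]]
    # Assumption: intervals are closed on the right, i.e. (lower, upper]
    data[[name]] <- cut(data[[name]], breaks = c(-Inf, bounds, Inf))
  }
  data
}

# made-up bounds, one list component per numeric attribute
bounds <- list(Sepal.Length = c(5.5, 6.2), Petal.Width = 0.8)
discreteIris <- applyBoundsSketch(iris, bounds)

# k boundaries give k + 1 factor levels
str(discreteIris[, c("Sepal.Length", "Petal.Width")])
```

The sketch does not handle a component equal to NA, which discretize returns for an attribute whose values are all equal.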
The function intervalMidPoint takes the discretization bounds provided by the function discretize and returns
the middle points of the discretization intervals for numeric attributes. The middle points are computed from the data;
for the lowest/highest interval, the minimum/maximum of the values in the data for the particular attribute
is implicitly taken as an additional left/right boundary point.
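The two ways of choosing a middle point can be sketched for a single attribute in base R. This is a hedged illustration, not the package's code: the helper name midPointsSketch is hypothetical, and using the median for "equalFrequency" is an assumption that follows from the requirement that each half-interval contain an equal number of instances.

```r
# Hypothetical helper for illustration; not CORElearn's intervalMidPoint.
midPointsSketch <- function(x, bounds, method = c("equalFrequency", "equalWidth")) {
  method <- match.arg(method)
  x <- x[!is.na(x)]
  # the data minimum and maximum act as the outer left/right boundaries
  edges <- c(min(x), bounds, max(x))
  if (method == "equalWidth") {
    # point equally distant from both boundaries of each interval
    return((head(edges, -1) + tail(edges, -1)) / 2)
  }
  # equalFrequency: the median leaves equally many instances in each half-interval
  interval <- cut(x, breaks = c(-Inf, bounds, Inf))
  as.vector(tapply(x, interval, median))
}

x <- c(1, 2, 3, 4, 10, 20, 30)
midPointsSketch(x, bounds = 5, method = "equalWidth")      # 3.0 17.5
midPointsSketch(x, bounds = 5, method = "equalFrequency")  # 2.5 20.0
```

In this sketch an interval that contains no instances yields NA for the "equalFrequency" method.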
Marko Robnik-Sikonja, Igor Kononenko: Discretization of continuous attributes using ReliefF. Proceedings of ERK'95, Portoroz, Slovenia, 1995.
Some of these references are also available from http://lkm.fri.uni-lj.si/rmarko/papers/
CORElearn,
CoreModel,
attrEval,
helpCore,
infoCore.
library(CORElearn)
# use iris data
# run greedy discretization using the estimator ReliefF with exponential rank distance
discBounds <- discretize(Species ~ ., iris, method="greedy", estimator="ReliefFexpRank")
print(discBounds)
discreteIris <- applyDiscretization(iris, discBounds)
prototypePoints <- intervalMidPoint(iris, discBounds, midPointMethod="equalFrequency")
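# regression example: generate 200 instances of artificial regression data
regData <- regDataGen(200)
# maxBins=2 requests binarization of each numeric attribute
regBounds <- discretize(response ~ ., regData, method="greedy", estimator="RReliefFequalK", maxBins=2)
print(regBounds)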
# print all available estimators
#infoCore(what="attrEval")
#infoCore(what="attrEvalReg")