
This function implements several basic unsupervised methods to convert a continuous variable into a categorical variable (factor) suitable for association rule mining. For convenience, a whole data.frame can be discretized at once (i.e., all numeric columns are discretized).
discretize(x, method = "frequency", breaks = 3,
  labels = NULL, include.lowest = TRUE, right = FALSE, dig.lab = 3,
  ordered_result = FALSE, infinity = FALSE, onlycuts = FALSE,
  categories, ...)

discretizeDF(df, methods = NULL, default = NULL)
x: a numeric vector (continuous variable).
method: discretization method. Available are "interval" (equal interval width), "frequency" (equal frequency), "cluster" (k-means clustering) and "fixed" (breaks specifies the interval boundaries). Note that equal frequency does not produce perfectly equal-sized groups if the data contains duplicated values.
breaks: either the number of categories or a vector with boundaries for discretization (all values outside the boundaries will be set to NA). The argument categories is deprecated, use breaks.
labels: character vector; labels for the levels of the resulting category. By default, labels are constructed using interval notation, e.g. "[a,b)" (or "(a,b]" if right = TRUE). If labels = FALSE, simple integer codes are returned instead of a factor.
include.lowest: logical; should the outermost interval be closed so that a value equal to the lowest break (or the highest break, if right = FALSE) is included?
right: logical; should the intervals be closed on the right (and open on the left) or vice versa?
dig.lab: integer; number of digits used to create labels.
ordered_result: logical; return an ordered factor?
infinity: logical; should the first/last break boundary be changed to -Inf/+Inf?
onlycuts: logical; return only the computed interval boundaries?
...: for method "cluster", further arguments are passed on to kmeans.
df: data.frame; each numeric column in the data.frame is discretized.
methods: named list of lists or a data.frame. The named list contains a list of discretization parameters (see the parameters of discretize) for each numeric column (see details). If no specific discretization is specified for a column, then the default settings for discretize are used. Note: the names have to match the column names exactly. If a data.frame is specified, then the discretization breaks in this data.frame are applied to df.
default: named list; parameters for discretize used for all columns not specified in methods.
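The three automatic methods can be approximated in base R, which may help when deciding between them. The following is only a sketch of the idea behind each method under simple assumptions (no missing values, the variable names are illustrative); it is not the package's implementation.

```r
# Base-R sketch of "interval", "frequency" and "cluster"; not the arules code.
x <- iris$Sepal.Length
k <- 3  # desired number of categories

# "interval": k bins of equal width across the observed range
interval_breaks <- seq(min(x), max(x), length.out = k + 1)

# "frequency": boundaries at equally spaced quantiles
frequency_breaks <- quantile(x, probs = seq(0, 1, length.out = k + 1),
                             names = FALSE)

# "cluster": 1-D k-means, boundaries halfway between neighbouring centers
set.seed(1)  # k-means starts from random centers
centers <- sort(kmeans(x, centers = k)$centers[, 1])
cluster_breaks <- c(min(x),
                    (head(centers, -1) + tail(centers, -1)) / 2,
                    max(x))

# Apply one set of boundaries with the same right/include.lowest settings
counts <- table(cut(x, interval_breaks, right = FALSE, include.lowest = TRUE))
counts
```

Comparing interval_breaks, frequency_breaks and cluster_breaks side by side shows how strongly the boundaries depend on the chosen method.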
A factor representing the categorized continuous variable, with the attributes "discretized:breaks" indicating the breaks used and "discretized:method" giving the method used. If onlycuts = TRUE is used, a vector with the calculated interval boundaries is returned.
discretizeDF returns a discretized data.frame.
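The attributes on the returned factor can be read with attr(). A minimal sketch, assuming the arules package is installed and using the attribute names given above:

```r
library(arules)  # assumption: arules is installed

x <- iris$Sepal.Length
d <- discretize(x, method = "interval", breaks = 3)

# Boundaries and method are stored as attributes on the factor
attr(d, "discretized:breaks")
attr(d, "discretized:method")

# onlycuts = TRUE returns just the boundaries instead of a factor
cuts <- discretize(x, method = "interval", breaks = 3, onlycuts = TRUE)
cuts
```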
discretize only implements unsupervised discretization. See packages arulesCBA, discretization or RWeka for supervised discretization.
discretizeDF applies discretization to each numeric column. Individual discretization parameters can be specified in the form: methods = list(column_name1 = list(method = ,...), column_name2 = list(...)).
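Per-column settings in methods can be combined with default for all remaining numeric columns. A short sketch, assuming the arules package is installed; the particular methods and break counts are arbitrary choices for illustration:

```r
library(arules)  # assumption: arules is installed

irisDisc <- discretizeDF(
  iris,
  # settings for one named column (the name must match the column name)
  methods = list(Petal.Length = list(method = "interval", breaks = 2)),
  # settings for every other numeric column
  default = list(method = "interval", breaks = 4)
)

# Number of categories per column; the factor column Species is left as is
sapply(irisDisc, nlevels)
```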
# NOT RUN {
library(arules)
data(iris)
x <- iris[,1]
### look at the distribution before discretizing
hist(x, breaks = 20, main = "Data")
def.par <- par(no.readonly = TRUE) # save default
layout(mat = rbind(1:2,3:4))
### convert continuous variables into categories (there are 3 types of flowers)
### the default method is equal frequency
table(discretize(x, breaks = 3))
hist(x, breaks = 20, main = "Equal Frequency")
abline(v = discretize(x, breaks = 3,
  onlycuts = TRUE), col = "red")
# Note: the frequencies are not exactly equal because of ties in the data

### equal interval width
table(discretize(x, method = "interval", breaks = 3))
hist(x, breaks = 20, main = "Equal Interval length")
abline(v = discretize(x, method = "interval", breaks = 3,
  onlycuts = TRUE), col = "red")

### k-means clustering
table(discretize(x, method = "cluster", breaks = 3))
hist(x, breaks = 20, main = "K-Means")
abline(v = discretize(x, method = "cluster", breaks = 3,
  onlycuts = TRUE), col = "red")

### user-specified (with labels)
table(discretize(x, method = "fixed", breaks = c(-Inf, 6, Inf),
  labels = c("small", "large")))
hist(x, breaks = 20, main = "Fixed")
abline(v = discretize(x, method = "fixed", breaks = c(-Inf, 6, Inf),
  onlycuts = TRUE), col = "red")
par(def.par) # reset to default
### prepare the iris data set for association rule mining
### use default discretization
irisDisc <- discretizeDF(iris)
head(irisDisc)
### specify discretization for the petal columns
irisDisc <- discretizeDF(iris, methods = list(
  Petal.Length = list(method = "frequency", breaks = 3,
    labels = c("short", "medium", "long")),
  Petal.Width = list(method = "frequency", breaks = 2,
    labels = c("narrow", "wide"))
))
head(irisDisc)
### discretize new data using the same discretization scheme as the
### data.frame supplied in methods. Note: NAs may occur if a new
### value falls outside the range of values observed in the
### originally discretized table (use the argument infinity = TRUE in
### discretize to prevent this).
discretizeDF(iris[sample(1:nrow(iris), 5),], methods = irisDisc)
# }
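The NA behaviour mentioned in the last example can be reproduced with base R alone. In this sketch cut() stands in for discretize(), and the boundaries and new values are hypothetical; it only illustrates why opening the outer boundaries to -Inf/+Inf (the idea behind infinity = TRUE) avoids NAs.

```r
# Base-R sketch; cut() stands in for discretize(), all values are made up.
breaks <- c(4.3, 5.5, 6.7, 7.9)   # hypothetical boundaries from the old data
new_values <- c(4.0, 6.0, 8.5)    # hypothetical new observations

# Values below 4.3 or above 7.9 fall outside every interval and become NA
strict <- cut(new_values, breaks, right = FALSE, include.lowest = TRUE)

# Replace the outer boundaries with -Inf/+Inf so every value has an interval
open_breaks <- c(-Inf, breaks[2:3], Inf)
open <- cut(new_values, open_breaks, right = FALSE, include.lowest = TRUE)

is.na(strict)
is.na(open)
```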