Description

Discretize multivariate continuous data using a grid that captures the joint distribution by preserving clusters in the original data.

Usage
discretize.jointly(
data,
k = c(2:10),
min_level = 1,
max_level = 100,
cluster_method = c("Ball+BIC", "kmeans+silhouette", "PAM"),
grid_method = c("DP approx likelihood 1-way", "DP approx likelihood 2-way",
"DP exact likelihood", "DP Compressed majority", "DP", "Sort+split",
"MultiChannel.WUC"),
eval_method = c("ARI", "purity", "upsilon", "CAIR"),
cluster_label = NULL,
cutoff = 0,
entropy = FALSE,
noise = FALSE,
dim_reduction = FALSE,
scale = FALSE,
variance = 0.5,
nthread = 1
)

Value

A list that contains four items:
D: a matrix of discretized values from the original data. Discretized values are one(1)-based.

grid: a list of numeric vectors of decision boundaries for each variable/dimension.

clabels: a vector of cluster labels for each observation in data.

csimilarity: a similarity score between the clusters from the joint discretization D and the cluster labels clabels. The score is the adjusted Rand index.
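A minimal sketch of inspecting the returned list, assuming the GridOnClusters package is attached; the data are simulated for illustration only:

library(GridOnClusters)
x = rnorm(100)
data = cbind(x, sin(x))
res = discretize.jointly(data, k=3) # joint discretization
head(res$D)                         # one(1)-based discretized values
res$grid                            # decision boundaries per dimension
table(res$clabels)                  # cluster label counts
res$csimilarity                     # adjusted Rand index between D and clabels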
Arguments

data: a numeric matrix for multivariate data or a numeric vector for univariate data. In the case of a matrix, columns are continuous variables and rows are observations.
k: either an integer, an integer vector, or Inf, specifying the number of clusters. The default is a vector of integers from 2 to 10. If k is a single number, data will be grouped into exactly k clusters. If k is an integer vector, an optimal k is chosen among the integers. If k is set to Inf, an optimal k is chosen from 2 to nrow(data). If cluster_label is specified, k is ignored.
min_level: an integer or an integer vector, to specify the minimum number of levels along each dimension. If a vector of size ncol(data), each element is mapped 1:1 to the corresponding dimension in order. If an integer, all dimensions will have the same minimum number of levels.
max_level: an integer or an integer vector, to specify the maximum number of levels along each dimension. It works in the same way as min_level. max_level will be set to the smaller of itself and the number of compressed zones if grid_method is a likelihood approach or "DP Compressed majority".
cluster_method: a character string to specify the clustering method to be used. Ignored if cluster_label is not NULL. Three built-in options are offered:

"Ball+BIC" (default) uses mclust::Mclust (modelNames = "VII" for 2-D or higher dimensions; "V" for 1-D) to cluster the data and the BIC score to select the number of clusters.

"kmeans+silhouette" uses k-means to cluster the data and the average silhouette width to select the number of clusters.

"PAM" uses the partitioning-around-medoids algorithm to perform clustering.
grid_method: a character string to specify the grid discretization method. Default: "DP approx likelihood 1-way". The methods can be roughly separated into three categories: by cluster likelihood, by density, and by SSE (sum of squared errors). See Details for more information.
eval_method: a character string to specify the method used to evaluate the quality of the discretized data.
cluster_label: a vector of labels for each data point or observation. It can be class labels on the input data for supervised learning; it can also be cluster labels for unsupervised learning. If NULL (default), clustering is performed to obtain labels.
cutoff: a numeric value. A grid line is added only when the quality of the line is not smaller than cutoff. It is applicable only to grid_method "DP" or "DP Compressed majority".
entropy: a logical to choose either entropy (TRUE) or likelihood (FALSE, default).
noise: a logical to apply jitter noise to the original data if TRUE. Default: FALSE. It is only applicable to cluster_method "Ball+BIC". When the data contain many duplicated values, adding noise can help Mclust clustering.
dim_reduction: a logical to turn dimension reduction on or off. Default: FALSE.
scale: a logical to specify linear scaling of the variable in each dimension if TRUE. Default: FALSE.
variance: a numeric value to specify the variance of the jitter noise to be added to the data when noise is TRUE. Default: 0.5.
nthread: an integer to specify the number of CPU threads to use. It is automatically adjusted if invalid or exceeding the number of available cores.
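As a minimal sketch of combining several of the arguments above (the specific values are arbitrary illustrations, not recommendations):

x = rnorm(100)
data = cbind(x, sin(x), cos(x))
# the same level bounds for all dimensions
res = discretize.jointly(data, k=3, min_level=2, max_level=4)
# or per-dimension bounds, one element per column of data
res = discretize.jointly(data, k=3, min_level=c(1,2,2), max_level=c(3,4,5))
sapply(res$grid, length) # number of boundaries actually placed per dimension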
Author(s)

Jiandong Wang, Sajal Kumar, and Mingzhou Song
Details

The function implements both the published algorithms described in Wang et al. (2020) and new algorithms for multivariate discretization. The included grid discretization methods can be summarized into three categories:
By Density

"Sort+split" (Wang et al. 2020) sorts clusters by mean in each dimension. It then splits consecutive pairs only if the sum of the error rates of the two clusters is less than or equal to 50%. It is possible that no grid line will be added in a given dimension. The maximum number of lines is the number of clusters minus one.
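A brief usage sketch of this method on simulated data (results vary by run):

x = rnorm(100)
data = cbind(x, sin(x))
res = discretize.jointly(data, k=4, grid_method="Sort+split")
res$grid # a dimension may receive fewer than k - 1 lines, possibly none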
By SSE (Sum of Squared Errors)

"MultiChannel.WUC" splits each dimension by the weighted within-cluster sum of squared distances, using Ckmeans.1d.dp::MultiChannel.WUC(). It is applied to the projection of the data onto each dimension. The channel of each point is defined by its multivariate cluster label.
"DP" orders labels by data in each dimension and then cuts data
into a maximum of max_level bins. It evaluates the quality of each
cut to find a best number of bins.
"DP Compressed majority" orders labels by data in each dimension.
It then compresses labels neighbored by the same label to avoid
discretization within consecutive points of the same cluster label, so as to
greatly reduce runtime of dynamic programming. Then it cuts data into
a maximum of max_level bins, and it evaluates the quality of
each cut by the majority of data to find a best number of bins.
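An illustrative comparison of the two dynamic-programming SSE methods on the same simulated data:

x = rnorm(100)
data = cbind(x, sin(x))
res.dp  = discretize.jointly(data, k=4, grid_method="DP")
res.dpc = discretize.jointly(data, k=4, grid_method="DP Compressed majority")
c(res.dp$csimilarity, res.dpc$csimilarity) # compare discretization quality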
By cluster likelihood

"DP exact likelihood" orders labels by data in each dimension. It then compresses neighboring points with the same label to avoid discretization within consecutive points of the same cluster label, so as to greatly reduce the runtime of dynamic programming. It then cuts the data into a maximum of max_level bins.
"DP approx likelihood 1-way" is a sped-up version of the
"DP exact likelihood" method, but it is not always optimal.
"DP approx likelihood 2-way" is a bidirectional variant of the
"DP approx likelihood" method. It performs approximate dynamic
programming in both the forward and backward directions and selects
the better of the two results. This approach provides additional robustness
compared to the one-directional version, but optimality is not always achieved.
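An illustrative comparison of the three likelihood-based methods on simulated data:

x = rnorm(100)
data = cbind(x, sin(x))
res.exact = discretize.jointly(data, k=4, grid_method="DP exact likelihood")
res.1way  = discretize.jointly(data, k=4, grid_method="DP approx likelihood 1-way")
res.2way  = discretize.jointly(data, k=4, grid_method="DP approx likelihood 2-way")
# the 2-way variant keeps the better of the forward and backward passes
c(res.exact$csimilarity, res.1way$csimilarity, res.2way$csimilarity)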
See Also

See Ckmeans.1d.dp for discretizing univariate continuous data.
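For the univariate case, a short sketch using the Ckmeans.1d.dp package (simulated data, for illustration only):

library(Ckmeans.1d.dp)
x = rnorm(100)
res = Ckmeans.1d.dp(x, k=3) # optimal univariate clustering into 3 groups
res$cluster                 # cluster assignment of each point
res$centers                 # cluster means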
Examples

# using a specified k
x = rnorm(100)
y = sin(x)
z = cos(x)
data = cbind(x, y, z)
discretized_data = discretize.jointly(data, k=5)$D
# using a range of k
x = rnorm(100)
y = log1p(abs(x))
z = tan(x)
data = cbind(x, y, z)
discretized_data = discretize.jointly(data, k=c(3:10))$D
# using k = Inf
x = c()
y = c()
mns = seq(0,1200,100)
for(i in 1:12) {
  x = c(x, runif(n=20, min=mns[i], max=mns[i]+20))
  y = c(y, runif(n=20, min=mns[i], max=mns[i]+20))
}
data = cbind(x, y)
discretized_data = discretize.jointly(data, k=Inf)$D
# using an alternate clustering method to k-means
library(cluster)
x = rnorm(100)
y = log1p(abs(x))
z = sin(x)
data = cbind(x, y, z)
# pre-cluster the data using partition around medoids (PAM)
cluster_label = pam(x=data, diss = FALSE, metric = "euclidean", k = 5)$clustering
discretized_data = discretize.jointly(data, cluster_label = cluster_label)$D