proximus: Proximus

Description

Cluster the rows of a logical matrix using the Proximus algorithm. The compression rate of the algorithm can be influenced by the choice of the maximum cluster radius and the minimum cluster size.

Usage

proximus(x, max.radius = 2, min.size = 1, min.retry = 10,
         max.iter = 16, debug = FALSE)

Value

An object of class proximus with the following components:

nr: the number of rows of the data matrix.
nc: the number of columns of the data matrix.
a: a list containing the approximations (patterns).
a$x: a vector of row (presence set) indexes.
a$y: a vector of column (dominant attribute set) indexes.
a$n: the number of ones in the approximated submatrix.
a$c: the absolute error reduction by the approximation.
max.radius: see arguments.
min.size: see arguments.
rownames: rownames of the data matrix.
colnames: colnames of the data matrix.

Arguments

x: a logical matrix.
max.radius: the maximum number of bits a member in a row set may deviate from its dominant pattern.
min.size: the minimum split size of a row set.
min.retry: number of retries to split a pure rank-one approximation (translates into a resampling rate).
max.iter: the maximum number of iterations for finding a local rank-one approximation.
debug: optional debugging output.

Author

Christian Buchta

Warning

Deep recursions may exhaust your computer.

Details

The intended area of application is the compression of high-dimensional binary data into representative patterns. For instance, purchase incidence (market basket data) or term-document matrices may be preprocessed by Proximus for later association rule mining.

The algorithm is of a recursive partitioning type. Specifically, at each step a binary split is attempted using a local rank-one approximation of the current submatrix (row set). That is a specialization of principal components to binary data which represents a matrix as the outer product of two binary vectors. The node expansion stops if a submatrix is pure, i.e., the column (presence set) vector indicates all the rows and the Hamming distances from the row (dominant attribute set) pattern vector, or the size of the row set, are less than or equal the specified threshold. In the case the rank-one approximation does not result in a split but the radius constraint is violated, the matrix is split using a random row and the radius constraint.

The debug option can be used to gain some insight into how the algorithm proceeds: a right angle bracket indicates a split and the return to a recursion level is indicated by a left one. Leafs in the recursion tree are indicated by an asterisk and retries by a plus sign. The number of retries is bounded by the size of the current set divided by min.retry. Double angle brackets indicate a random split (see above). The numbers between square brackets indicate the current set size, the size of the presence (sub)set, and its radius. The adjoining numbers indicate the depth of the recursion and the count of retries. Finally, a count of the leaf nodes found so far is shown to the right of an asterisk.

References

M. Koyutürk, A. Graham, and N. Ramakrishnan. Compression, Clustering, and Pattern Discovery in Very High-Dimensional Discrete-Attribute Data Sets. IEEE Transactions On Knowledge and Data Engineering, Vol. 17, No. 4, (April) 2005.

Examples

Run this code

x <- matrix(sample(c(FALSE, TRUE), 200, rep=TRUE), ncol=10)
pr <- proximus(x, max.radius=8)
summary(pr)
### example from paper
x <- rlbmat()
pr <- proximus(x, max.radius=8, debug=TRUE)
op <- par(mfrow=c(1,2), pty="s")
lmplot(x, main="Data")
box()
lmplot(fitted(pr)$x, main="Approximation")
box()
par(op)

Run the code above in your browser using DataLab