COMPR: Compressive Orthogonal Matching Pursuit with Replacement

Description

An implementation of the Compressive Orthogonal Matching Pursuit with Replacement algorithm

Usage

COMPR(
  Data,
  ind.col = 1:ncol(Data),
  K,
  Frequencies,
  lower_b,
  upper_b,
  SK_Data,
  maxIter = 300,
  HardThreshold = TRUE,
  options = list(tol_centroid = 1e-08, nIterCentroid = 1500, min_weight = 0, max_weight
    = Inf, nIterLast = 1000, tol_global = 1e-12)
)

Arguments

Data

A Filebacked Big Matrix n x N, data vectors are stored in the matrix columns.

ind.col

Column indeces, which indicate which data vectors are considered for clustering. By default the entire Data matrix.

Number of clusters.

Frequencies

A frequency matrix m x n with frequency vectors in rows.

lower_b

A vector of the lower boundary of data.

upper_b

A vector of the upper boundary.

SK_Data

Data sketch vector of the length 2m. It can be computed using Sketch.

maxIter

Maximum number of iterations in the global optimization with respect to cluster centroid vectors and their weights. Default is 300.

HardThreshold

logical that indicates whether to perform the replacement. Default is TRUE.

options

List of optimization parameters:

tol_centroid is a tolerance value for the centroid optimization. Default is 1e-8.
nIterCentroid is a maximum number of iterations in the centroid optimization (default is 1500).
min_weight is a lower bound for centroid weights (default is 0).
max_weight is an upper bound for centroids weights (default is Inf)
nIterLast is a number of iteration in the global optimization at the last algorithm iteration. Default is 1000.
tol_global is a tolerance value for the global optimization. Default is 1e-12.

Value

A matrix n x K with cluster centroid vectors in columns.

Details

COMPR is an iterative greedy method, which alternates between expanding the cluster centroid set \(C\) with a new element \(c_i\), whose sketch is the most correlated to the residue and the global minimization with respect to cluster centroids \(c_1, \dots, c_K\) and their weights \(w_1, \dots, w_K\). It clusters the data collection into K groups by minimizing the difference between the compressed data version (data sketch) and a linear combination of cluster centroid sketches, i.e. \(\|Sk(Data) - \sum_{i=1}^K w_i \cdot Sk(c_i)\|\).

Examples

Run this code

# NOT RUN {
X = matrix(rnorm(1e5), ncol=1000, nrow = 100)
lb = apply(X, 1, min)
ub = apply(X, 1, max)
X_FBM = bigstatsr::FBM(init = X, ncol=1000, nrow = 100)
out = GenerateFrequencies(Data = X_FBM, m = 20, N0 = ncol(X_FBM))
SK = Sketch(Data = X_FBM, W = out$W)
C  <- COMPR(Data = X_FBM, K = 2, Frequencies = out$W, lower_b = lb, upper_b = ub, SK_Data = SK)
# }

Run the code above in your browser using DataLab