CHICKN_W1: Chromatogram Hierarchical Compressive K-means with Nystrom approximation

Description

An implementation of the complete pipeline of the CHICKN algorithm.

Usage

CHICKN_W1(
  Data,
  K = 2,
  k_total,
  K_W1 = NULL,
  kernel_type = "Gaussian",
  distance_type = "W1",
  Freq = NULL,
  ncores = 2,
  max_neighbors = 32,
  nblocks = 64,
  N0 = 10000,
  max_Nsize = 32,
  DoPreimage = FALSE,
  DIR_output = tempfile(),
  DIR_tmp = tempfile(),
  BIG = FALSE,
  verbose = FALSE,
  ...
)

Arguments

Data

A Filebacked Big Matrix of size n x N, with the N signals stored in columns.

K

Number of clusters at each call of the clustering method. Default is 2.

k_total

An upper bound of the total number of clusters.

K_W1

A Filebacked Big Matrix holding the Nystrom kernel matrix of size \(s \times N\), where N is the number of signals in the training collection and s is the Nystrom sample size. Default is NULL, in which case the matrix is generated by the Nystrom_kernel function.

kernel_type

Kernel function type, either 'Gaussian' or 'Laplacian'. Default is 'Gaussian'.

distance_type

Distance function type. The available types are Wasserstein-1 ('W1') and Euclidean ('Euclide'). The default value is 'W1'.

Freq

A frequency matrix m x n with frequency vectors in rows. If NULL (the default), the frequency vectors are generated by the GenerateFrequencies function.

ncores

Number of cores. Default is 2.

max_neighbors

Number of neighbors used to estimate the kernel parameter gamma. Default is 32.

nblocks

Number of blocks on which the regression is performed. Default is 64.

N0

Number of data vectors used for the variance estimation in EstimSigma. Default is 10000.

max_Nsize

Number of neighbors used to compute consensus chromatograms.

DoPreimage

logical that controls whether to compute the consensus chromatograms. Default is FALSE.

DIR_output

A directory to save the results.

DIR_tmp

A directory for temporary files.

BIG

logical parameter that controls whether the resulting consensus chromatograms are stored as a Filebacked Big Matrix ('Centroid_preimage.bk'). Default is FALSE.

verbose

logical that indicates whether to display the processing steps. Default is FALSE.

...

Additional arguments passed on to COMPR.
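
To make the kernel_type and distance_type arguments concrete, here is a minimal base-R sketch of a Wasserstein-1 distance between two 1-D signals and of the two kernel families built on it. The helper names w1_distance and kernel_value are hypothetical and not part of the package, and the kernel forms exp(-gamma * d^2) and exp(-gamma * d) are assumed, since the exact parameterization used internally is not stated on this page.

```r
# Wasserstein-1 distance between two non-negative 1-D signals on a common grid.
# Assumption: each signal is normalized to unit mass before comparison.
w1_distance <- function(x, y) {
  x <- x / sum(x)
  y <- y / sum(y)
  # In 1-D with unit grid spacing, W1 is the L1 distance between the
  # cumulative distributions.
  sum(abs(cumsum(x) - cumsum(y)))
}

# Hypothetical helper: kernel value for a distance d and parameter gamma.
kernel_value <- function(d, gamma, kernel_type = c("Gaussian", "Laplacian")) {
  kernel_type <- match.arg(kernel_type)
  switch(kernel_type,
    Gaussian  = exp(-gamma * d^2),
    Laplacian = exp(-gamma * d)
  )
}

# Two toy signals: the same peak shifted by two grid points.
x <- c(0, 1, 2, 1, 0, 0, 0)
y <- c(0, 0, 0, 1, 2, 1, 0)

d_w1    <- w1_distance(x, y)                              # transport cost of the shift
d_euc   <- sqrt(sum((x / sum(x) - y / sum(y))^2))         # Euclidean counterpart
k_gauss <- kernel_value(d_w1, gamma = 0.5, kernel_type = "Gaussian")
k_lap   <- kernel_value(d_w1, gamma = 0.5, kernel_type = "Laplacian")
```

In this toy case the W1 distance equals the size of the shift, whereas the Euclidean distance only reflects how much the two peaks overlap. This is one reason a transport-based distance can suit elution profiles that drift along the time axis.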

Value

A list with the following attributes:

gamma is the estimated kernel parameter.
CompressedData is the Nystrom kernel matrix.
sigma is the estimated variance.
Frequency is the frequency matrix m x n.
Clusters is the cluster assignment.

Details

CHICKN_W1 compresses the data by computing a Nystrom kernel approximation and applying the sketching operator of Keriven et al. (2016); see the Nystrom_kernel and Sketch functions. Clusters are then recovered from the compressed data. The kernel function can be based on either the Wasserstein-1 or the Euclidean distance. The following files are generated in the DIR_output directory:

'Cluster_assign_out.bk' is a Filebacked Big Matrix N x (maxLevel + 1), which stores the cluster assignment at each hierarchical level.
'Centroids_out.bk' is a Filebacked Big Matrix with the resulting cluster centroids in columns.
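
The two compression steps described above can be illustrated with a toy base-R sketch. It is a simplified stand-in, not the package's implementation: the "Nystrom" step here is only the raw kernel block between s landmark signals and all N signals (the usual Nystrom normalization is omitted), the Euclidean distance is used for brevity, and the frequency matrix is assumed to act on the s-dimensional compressed columns. All variable names are illustrative.

```r
set.seed(1)

# Toy data: N = 40 signals of length n = 5, stored in columns.
n <- 5
N <- 40
X <- matrix(abs(rnorm(n * N)), nrow = n, ncol = N)

# Step 1 (Nystrom-style compression): Gaussian kernel between s randomly
# chosen landmark signals and all N signals, giving an s x N matrix.
s <- 8
gamma <- 0.1
landmarks <- sample(N, s)
K_sN <- matrix(0, nrow = s, ncol = N)
for (i in seq_len(s)) {
  for (j in seq_len(N)) {
    sq_dist <- sum((X[, landmarks[i]] - X[, j])^2)
    K_sN[i, j] <- exp(-gamma * sq_dist)
  }
}

# Step 2 (sketching): average random Fourier features over the N columns.
m <- 16
Freq <- matrix(rnorm(m * s), nrow = m, ncol = s)  # frequency vectors in rows
proj <- Freq %*% K_sN                             # m x N projections
# Real and imaginary parts of the empirical characteristic function.
sketch <- c(rowMeans(cos(proj)), rowMeans(sin(proj)))
```

The resulting sketch has a fixed length of 2 * m regardless of N, which is what allows the clustering step to operate on a compact summary instead of the full collection.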

References

Permiakova O, Guibert R, Kraut A, Fortin T, Hesse AM, Burger T (2020) "CHICKN: Extraction of peptide chromatographic elution profiles from large scale mass spectrometry data by means of Wasserstein compressive hierarchical cluster analysis." BMC Bioinformatics (under revision).

Examples

# NOT RUN {
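library(chickn)
data("UPS2")
N <- ncol(UPS2)  # number of signals (columns)
n <- nrow(UPS2)  # length of each signal (rows)
# Copy the data into a Filebacked Big Matrix and save its descriptor file.
X_FBM <- bigstatsr::FBM(init = UPS2, ncol = N, nrow = n)$save()
output <- CHICKN_W1(Data = X_FBM, K = 2, k_total = 8, max_neighbors = 10, ncores = 2,
                    N0 = N, DoPreimage = FALSE)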
# }
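
The roles of K and k_total in the call above can be pictured with a small hierarchical splitting sketch in base R. It rests on several assumptions: stats::kmeans stands in for the compressive K-means step, the stopping rule (stop when another full round of K-way splits would exceed k_total) is a guess at the intended behaviour, and the per-level assignment matrix only mimics the N x (maxLevel + 1) layout described for 'Cluster_assign_out.bk'.

```r
set.seed(42)

# Toy data: 60 points in 2-D, one observation per row as stats::kmeans expects.
pts <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 6),  ncol = 2),
  matrix(rnorm(40, mean = 12), ncol = 2)
)
K <- 2L        # clusters produced by each clustering call
k_total <- 8L  # upper bound on the total number of clusters

labels <- rep(1L, nrow(pts))   # level 0: everything in one cluster
levels_out <- list(labels)
repeat {
  ids <- unique(labels)
  # Assumed stopping rule: another round of splits must not exceed k_total.
  if (length(ids) * K > k_total) break
  new_labels <- labels
  next_id <- 0L
  for (id in ids) {
    idx <- which(labels == id)
    if (length(idx) > K) {
      part <- kmeans(pts[idx, , drop = FALSE], centers = K)$cluster
    } else {
      part <- rep(1L, length(idx))  # too small to split further
    }
    new_labels[idx] <- next_id + part
    next_id <- next_id + K
  }
  labels <- new_labels
  levels_out[[length(levels_out) + 1]] <- labels
}

# One column per hierarchical level, one row per data point.
assign_mat <- do.call(cbind, levels_out)
```

With K = 2 and k_total = 8 the loop performs three rounds of splits, so assign_mat has four columns (levels 0 to 3) and its last column contains at most eight distinct labels.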
