An implementation of the complete pipeline of the CHICKN algorithm.
CHICKN_W1(
Data,
K = 2,
k_total,
K_W1 = NULL,
kernel_type = "Gaussian",
distance_type = "W1",
Freq = NULL,
ncores = 2,
max_neighbors = 32,
nblocks = 64,
N0 = 10000,
max_Nsize = 32,
DoPreimage = FALSE,
DIR_output = tempfile(),
DIR_tmp = tempfile(),
BIG = FALSE,
verbose = FALSE,
...
)
Data
A Filebacked Big Matrix n x N.
K
Number of clusters at each call of the clustering method. Default is 2.
k_total
An upper bound on the total number of clusters.
K_W1
A Filebacked Big Matrix. Nystrom kernel matrix \(s \times N\), where N is the number of signals in the training collection and s is the Nystrom sample size. Default is NULL, in which case it is generated using the Nystrom_kernel function.
kernel_type
Kernel function type, one of c('Gaussian', 'Laplacian'). Default is 'Gaussian'.
distance_type
Distance function type. The available types are Wasserstein-1 ('W1') and Euclidean ('Euclide'). Default is 'W1'.
Freq
A frequency matrix m x n with frequency vectors in rows. If NULL, the frequency vectors are generated by the GenerateFrequencies function.
ncores
Number of cores. Default is 2.
max_neighbors
Number of neighbors used to estimate the kernel parameter gamma. Default is 32.
nblocks
Number of blocks on which the regression is performed. Default is 64.
N0
Number of data vectors used for the variance estimation in EstimSigma. Default is 10000.
max_Nsize
Number of neighbors used to compute consensus chromatograms. Default is 32.
DoPreimage
Logical that controls whether to compute the consensus chromatograms. Default is FALSE.
DIR_output
A directory in which to save the results.
DIR_tmp
A directory for temporary files.
BIG
Logical that controls whether the resulting consensus chromatograms are stored as a Filebacked Big Matrix ('Centroid_preimage.bk'). Default is FALSE.
verbose
Logical that indicates whether to display the processing steps. Default is FALSE.
...
Additional arguments passed on to COMPR.
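To make the kernel-related arguments more concrete, the following base R sketch builds an s x N kernel matrix between a random sample of s signals and all N signals, using a Wasserstein-1 distance. It is only an illustration under stated assumptions, not the package's Nystrom_kernel implementation: the helper names w1_distance and landmark_kernel are hypothetical, the signals are assumed to be non-negative and sampled on a shared unit-spaced grid, and the kernel forms exp(-gamma * d^2) (Gaussian) and exp(-gamma * d) (Laplacian) are the usual textbook definitions rather than something confirmed by this page.

```r
# Hypothetical helper: Wasserstein-1 distance between two non-negative
# signals on a shared, unit-spaced grid. Each signal is normalised to
# unit mass; the distance is the summed absolute difference of the
# cumulative sums.
w1_distance <- function(x, y) {
  sum(abs(cumsum(x / sum(x)) - cumsum(y / sum(y))))
}

# Hypothetical helper: s x N kernel matrix between the sampled columns
# ("landmarks") and all columns of X. The kernel formulas are assumptions.
landmark_kernel <- function(X, landmarks, gamma, kernel_type = "Gaussian") {
  D <- sapply(seq_len(ncol(X)), function(j) {
    sapply(landmarks, function(i) w1_distance(X[, i], X[, j]))
  })
  D <- matrix(D, nrow = length(landmarks), ncol = ncol(X))
  if (kernel_type == "Gaussian") exp(-gamma * D^2) else exp(-gamma * D)
}

set.seed(1)
X <- matrix(runif(20 * 50), nrow = 20, ncol = 50)  # toy data: n = 20, N = 50
landmarks <- sample(ncol(X), 5)                    # Nystrom sample, s = 5
K_sN <- landmark_kernel(X, landmarks, gamma = 1)   # gamma fixed by hand here
dim(K_sN)                                          # s rows, N columns
```

In CHICKN_W1 itself, gamma is estimated from the data (see max_neighbors) rather than fixed by hand as in this toy example.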
A list with the following attributes:
gamma
is the estimated kernel parameter.
CompressedData
is the Nystrom kernel matrix.
sigma
is the estimated variance.
Frequency
is the frequency matrix m x n.
Clusters
is the cluster assignment.
CHICKN_W1 compresses the data by computing a Nystrom kernel approximation and applying the sketching operator from Keriven et al. (DBLP:journals/corr/KerivenBGP16). See the Nystrom_kernel and Sketch functions.
The clusters are then recovered by operating on the compressed data.
The kernel function can be based on either the Wasserstein-1 or the Euclidean distance. The function writes the following files to the DIR_output directory:
'Cluster_assign_out.bk' is a Filebacked Big Matrix N x (maxLevel + 1), which stores the cluster assignment at each hierarchical level.
'Centroids_out.bk' is a Filebacked Big Matrix with the resulting cluster centroids in columns.
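The sketching step described above can be illustrated with a short base R example. In the compressive learning framework of Keriven et al., a sketch is the empirical average of complex exponentials exp(i w^T x) evaluated at a set of frequency vectors. The snippet below is a minimal illustration of that idea, assuming data vectors in the columns of X and frequency vectors in the rows of W (matching the layout described for Freq); the function name sketch_mean is hypothetical and this is not the package's Sketch function.

```r
# Hypothetical helper: empirical sketch of the columns of X at the
# frequency vectors stored in the rows of W. Entry j is the mean over
# all data vectors x of exp(1i * sum(W[j, ] * x)).
sketch_mean <- function(X, W) {
  rowMeans(exp(1i * (W %*% X)))
}

set.seed(1)
X <- matrix(rnorm(3 * 100), nrow = 3)          # toy data: 3 x 100
W <- rbind(0, matrix(rnorm(4 * 3), nrow = 4))  # 5 frequency vectors; the first is zero
z <- sketch_mean(X, W)
length(z)                                      # one complex entry per frequency vector
```

The sketch has a fixed length equal to the number of frequency vectors, regardless of how many data vectors are averaged, which is what makes operating on the compressed representation cheap for large N.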
Permiakova O, Guibert R, Kraut A, Fortin T, Hesse AM, Burger T (2020) "CHICKN: Extraction of peptide chromatographic elution profiles from large scale mass spectrometry data by means of Wasserstein compressive hierarchical cluster analysis." BMC Bioinformatics (under revision).
Nystrom_kernel, GenerateFrequencies, hcc_parallel, Preimage, bigstatsr
# NOT RUN {
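data("UPS2")
N <- ncol(UPS2)  # number of signals
n <- nrow(UPS2)  # signal length
# Store the data as a Filebacked Big Matrix
X_FBM <- bigstatsr::FBM(init = UPS2, ncol = N, nrow = n)$save()
# Split into K = 2 clusters per call, up to 8 clusters in total
output <- CHICKN_W1(Data = X_FBM, K = 2, k_total = 8, max_neighbors = 10, ncores = 2,
                    N0 = N, DoPreimage = FALSE)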
# }