segmentClusters: Run the `segmenTier` algorithm.

Description

The main wrapper interface of segmenTier; it calculates segments from a clustering sequence. The function runs the segmentation algorithm once for the indicated parameters. The function segmentCluster.batch allows for multiple runs over different parameters or input clusterings.

Usage

segmentClusters(seq, k = 1, csim, E = 1, S = "ccor", M = 175,
  Mn = 20, a = -2, nui = 1, nextmax = TRUE, multi = "max",
  multib = "max", rm.nui = TRUE, save.matrix = FALSE, verb = 1)

Arguments

seq

Either an integer vector of cluster labels, or a structure of class 'clustering' as returned by clusterTimeseries. The only strict requirement for the first option is that nuisance clusters (which will be treated specially during the dynamic programming routine) have to be '0' (zero).

k

if argument seq is of class 'clustering', the kth clustering will be used; defaults to 1

csim

The cluster-cluster or position-cluster similarity matrix for scoring functions "ccor" and "icor" (option S), respectively. If seq is of class 'clustering' csim is optional and will override the similarity matrices in seq. If argument seq is a simple vector of cluster labels and the scoring function is "icor" or "ccor", an appropriate matrix csim MUST be provided. Finally, for scoring function "ccls" the argument csim will be ignored and the matrix is instead automatically constructed from argument a, and using argument nui for the nuisance cluster.

E

exponent to scale similarity matrices

S

the scoring function to be used: "ccor", "icor" or "ccls"

M

segment length penalty. Note that this is not a strict cut-off but a penalty that must be "overcome" by a good score.

Mn

segment length penalty for the nuisance cluster. Mn<M will allow shorter distances between "real" segments; only used in scoring functions "ccor" and "icor"

a

a cluster "dissimilarity", only used for pure cluster-based scoring without cluster similarity measures in scoring function "ccls"

nui

the similarity score to be used for nuisance clusters in the cluster similarity matrices

nextmax

go backwards while score is increasing before opening a new segment, default is TRUE

multi

handling of multiple k with max. score in forward phase, either "min" or "max"; the usage above sets "max" as the default

multib

handling of multiple k with max. score in back-trace phase, either "min", "max" or "skip"; the usage above sets "max" as the default

rm.nui

remove nuisance cluster segments from final results

save.matrix

store the total score matrix S(i,c) and the backtracing matrix K(i,c); useful during testing, for debugging, or for illustrating the algorithm

verb

level of verbosity, 0: no output, 1: progress messages

Value

Returns a list (class "segments") containing the main result (list item "segments") and additional information (see 'Details'). A plot method exists that plots clusters aligned to time-series and segmentation plots.

Details

This is the main R wrapper function for the 'segmenTier' segmentation algorithm. It takes an ordered sequence of cluster labels and returns segments of consistent clusterings, where cluster-cluster or cluster-position similarities are maximal. Its main input (argument seq) is either a "clustering" object returned by clusterTimeseries (scenario I), or an integer vector of cluster labels (scenario II). The function then runs the dynamic programming algorithm (calculateScore) for a selected scoring function and the corresponding cluster similarity matrix, followed by the back-tracing step (backtrace) to find segment borders.

The main result, list item "segments" of the returned object, is a 3-column matrix, where column 1 is the cluster assignment and columns 2 and 3 are the start and end indices of the segments. For the batch function segmentCluster.batch, the "segments" item is a data.frame containing additional information, see ?segmentCluster.batch.
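As a minimal sketch of working with this matrix, the base-R snippet below derives segment lengths from a hand-made 3-column matrix laid out as described above. The values in sg are made up for illustration and are not output of segmentClusters; the "+ 1" assumes that both start and end indices are inclusive, which is an assumption and not stated here.

```r
# hypothetical segment matrix: column 1 = cluster, columns 2 and 3 = start and end
sg <- matrix(c(3,  10, 120,
               7, 150, 420,
               3, 430, 600), ncol = 3, byrow = TRUE)

# segment lengths, ASSUMING inclusive start and end indices
seg_len <- sg[, 3] - sg[, 2] + 1

# total covered length per cluster label
len_by_cluster <- tapply(seg_len, sg[, 1], sum)
```

With a real result, sg would be taken from the "segments" item of the returned list instead of being typed in by hand.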

As shown in the publication, the parameters M, E and nui have the strongest impact on resulting segment borders. Other parameters can be fine-tuned but had little impact on our test data set.

In the default and tested scenario I, when the input is an object of class "clustering" produced by clusterTimeseries, the cluster-cluster and cluster-position similarity matrices are already provided by this object.

In the second scenario II for custom use, argument seq can be a simple clustering vector, where a nuisance cluster must be indicated by cluster label "0" (zero). The cluster-cluster or cluster-position similarities MUST be provided (argument csim) for scoring functions "ccor" and "icor", respectively. For the simplest scoring function "ccls", a uniform cluster similarity matrix is constructed from arguments a and nui, with cluster self-similarities of 1, "dissimilarities" between different clusters using argument a<0, and nuisance cluster self-similarity of -a.
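The uniform matrix used by "ccls" can be pictured with a few lines of base R. This is only a sketch of the description above, not the package's internal code: the label vector cls and the name csim_sketch are made up for illustration, and the nuisance cluster "0" is deliberately left out because the way its row and column are filled is not fully specified here.

```r
# hypothetical clustering vector for scenario II; 0 marks the nuisance cluster
cls <- c(0L, 0L, 1L, 1L, 1L, 2L, 2L, 0L, 3L, 3L, 3L, 1L)

a <- -2                                   # "dissimilarity" between different clusters (a < 0)
real <- sort(unique(cls[cls != 0L]))      # non-nuisance cluster labels: 1, 2, 3

# uniform similarity matrix: a everywhere, then 1 on the diagonal
csim_sketch <- matrix(a, nrow = length(real), ncol = length(real),
                      dimnames = list(as.character(real), as.character(real)))
diag(csim_sketch) <- 1
```

For "ccls" this construction happens inside segmentClusters, so no csim needs to be passed; the sketch only illustrates what "uniform" means.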

The function returns a list (class "segments") comprising the main result (list item "segments"), "warnings" from the dynamic programming and backtracing phases, and the similarity matrix csim that was used, extended by the nuisance cluster. Optionally (see option save.matrix), it also contains the scoring vectors S1(i,c), the total score matrix S(i,c) and the backtracing matrix K(i,c) for analysis of algorithm performance on novel data sets. Additional convenience data, such as cluster colors and sortings, are reported if argument seq was of class 'clustering'. These allow for convenient inspection of all data processing steps with the plot methods. A plot method exists that plots segments aligned to "timeseries" and "clustering" plots.

References

Machne, Murray & Stadler (2017) <doi:10.1038/s41598-017-12401-8>

Examples


# NOT RUN {
# load example data, an RNA-seq time series from a short genomic region
# of budding yeast
data(primseg436)

# 1) Fourier-transform time series:
## NOTE: reducing official example data set to stay within 
## CRAN example timing restrictions with segmentation below
tset <- processTimeseries(ts=tsd[2500:6500,], na2zero=TRUE, use.fft=TRUE,
                          dft.range=1:7, dc.trafo="ash", use.snr=TRUE)

# 2) cluster time-series into K=12 clusters:
cset <- clusterTimeseries(tset, K=12)

# 3) ... segment it; this takes a few seconds:
segments <- segmentClusters(seq=cset, M=100, E=2, nui=3, S="icor")

# 4) inspect results:
print(segments)
plotSegmentation(tset, cset, segments, cex=.5, lwd=3)

# 5) and get segment border table for further processing:
sgtable <- segments$segments

# }
