Chronologically constrained agglomerative hierarchical clustering on a corpus of documents.
LexCHCca (object, nb.clust=0, min=3, max=NULL, nb.par=5, graph=TRUE, proba=0.05)
object: object of LexCA class
nb.clust: number of clusters (see details). If 0, the tree is cut at the level the user clicks on. If -1, the tree is automatically cut at the suggested level. If a positive integer, the tree is cut into nb.clust clusters (by default 0)
min: minimum number of clusters (by default 3)
max: maximum number of clusters (by default NULL, in which case max is computed as the minimum of 10 and the number of documents divided by 2)
nb.par: number of paragons (para) and specific documents (dist) whose labels are listed (by default 5)
graph: if TRUE, graphs are displayed (by default TRUE)
proba: threshold on the p-value used to select the characteristic words of the clusters and to select the axes when describing the clusters by the axes (by default 0.05)
Returns a list including:
the original active lexical table with a supplementary column called clust containing the partition
description of the clusters by their characteristic words
description of the clusters by the characteristic axes
list of parameters and internal objects
labels of the paragon (para) and specific documents (dist) of each cluster
list with the succession of nodes that are found when reading the tree downward
LexCHCca starts from the coordinates of the documents on the axes of a textual correspondence analysis. The hierarchical tree is built under the constraint that only chronologically contiguous nodes can be grouped. The documents must therefore be ranked in chronological order in the lexical table. The Euclidean metric and the complete linkage method are used.
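The constraint described above can be illustrated with a minimal base R sketch. It is not the code used by LexCHCca: the function name constrained_complete_linkage and the toy coordinates are hypothetical, and the only assumption is that the rows of the coordinate matrix are already in chronological order. At each step, only adjacent clusters are candidates for merging, and the distance between two clusters is the largest Euclidean distance between their members.

```r
# Hypothetical helper for illustration only; not part of Xplortext.
# coords: numeric matrix, one row per document, rows in chronological order.
constrained_complete_linkage <- function(coords) {
  d <- as.matrix(dist(coords))                 # Euclidean distances between documents
  clusters <- as.list(seq_len(nrow(coords)))   # each document starts as its own cluster
  heights <- numeric(0)                        # criterion value at each merge
  merges <- list()                             # members of the two clusters merged
  while (length(clusters) > 1) {
    # Complete-linkage distance between each pair of ADJACENT clusters only
    adj <- vapply(seq_len(length(clusters) - 1), function(i) {
      max(d[clusters[[i]], clusters[[i + 1]]])
    }, numeric(1))
    best <- which.min(adj)                     # closest contiguous pair
    heights <- c(heights, adj[best])
    merges[[length(merges) + 1]] <- list(left = clusters[[best]],
                                         right = clusters[[best + 1]])
    # Merge the pair, keeping the chronological order of the clusters
    clusters[[best]] <- c(clusters[[best]], clusters[[best + 1]])
    clusters[[best + 1]] <- NULL
  }
  list(heights = heights, merges = merges)
}

# Toy example: five documents on a single axis, already in chronological order
toy <- matrix(c(0, 1, 5, 6, 20), ncol = 1)
res <- constrained_complete_linkage(toy)
res$heights
```

Because only neighbouring clusters are compared, two documents that are close on the axes but far apart in time can never be joined before everything between them has been merged.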
The number of clusters is determined either a priori or from the structure of the constrained hierarchical tree. If nb.clust=0, a level for cutting the tree is automatically suggested. It is computed as follows, reading the tree downward. At a given step, the tree could be cut into Q clusters (Q varying between min and max). Two distances are computed: the complete-linkage distance between the two nodes that are separated when passing from Q-1 to Q clusters, and the distance between the two nodes that are separated when passing from Q to Q+1 clusters. The suggested level is the value of Q that maximizes the ratio of the former to the latter. These distances are the values of the aggregation criterion obtained when building the tree bottom-up. The user can cut the tree at the suggested level or at another one.
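The ratio rule can be sketched in base R as follows. This is one reading of the description above, not the package's internal code: the function suggest_cut, its arguments q_min and q_max, and the vector of merge heights are all hypothetical. The heights are assumed to be the criterion values recorded bottom-up, in increasing order, for a tree built on five documents.

```r
# Hypothetical helper for illustration only; not part of Xplortext.
# heights: criterion values of the successive merges, in bottom-up order.
suggest_cut <- function(heights, q_min = 3, q_max = 10) {
  n <- length(heights) + 1                 # number of documents
  q_max <- min(q_max, n - 1)               # a Q -> Q+1 split must still exist
  stopifnot(q_min >= 2, q_min <= q_max)
  q <- seq(q_min, q_max)
  # Reading the tree downward: top_down[k] is the distance between the two
  # nodes that are separated when passing from k to k+1 clusters
  top_down <- rev(heights)
  # Distance for Q-1 -> Q divided by distance for Q -> Q+1
  ratio <- top_down[q - 1] / top_down[q]
  q[which.max(ratio)]
}

# Toy merge heights for five documents (made-up values)
toy_heights <- c(1, 1, 6, 20)
suggested <- suggest_cut(toy_heights, q_min = 2, q_max = 4)
suggested
```

With these toy heights the ratios for Q = 2, 3 and 4 are 20/6, 6/1 and 1/1, so the sketch suggests three clusters: the last split that still separates well-distinct groups before the distances collapse.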
The results include a thorough description of the clusters. Graphs are provided.
The tree is plotted jointly with a barchart of the successive values of the aggregation criterion.
Bécue-Bertaut, M., Kostov, B., Morin, A., & Naro, G. (2014). Rhetorical Strategy in Forensic Speeches: Multidimensional Statistics-Based Methodology. Journal of Classification, 31, 85-106. doi:10.1007/s00357-014-9148-9.
Lebart, L. (1978). Programme d'agrégation avec contraintes. Les Cahiers de l'Analyse des Données, 3, 275-288.
Legendre, P. & Legendre, L. (1998), Numerical Ecology (2nd ed.), Amsterdam: Elsevier Science.
Murtagh F. (1985). Multidimensional Clustering Algorithms. Vienna: Physica-Verlag, COMPSTAT Lectures.
# NOT RUN {
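library(Xplortext)
data(open.question)
# Build the lexical table from text columns 9 and 10, aggregated by age group
res.TD <- TextData(open.question, var.text=c(9,10), var.agg="Age_Group", Fmin=10, Dmin=10,
                   stop.word.tm=TRUE)
# Textual correspondence analysis of the aggregated lexical table
res.LexCA <- LexCA(res.TD, graph=FALSE)
# Chronologically constrained clustering, with the tree cut into 4 clusters
res.ccah <- LexCHCca(res.LexCA, nb.clust=4, min=3)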
# }