LexCHCca: Chronologically Constrained Agglomerative Hierarchical Clustering on Correspondence Analysis Components (LexCHCca)

Description

Chronologically constrained agglomerative hierarchical clustering on a corpus of documents.

Usage

LexCHCca(object, nb.clust=0, min=3, max=NULL, nb.par=5, graph=TRUE, proba=0.05)

Arguments

object

object of LexCA class

nb.clust

number of clusters (see details). If 0, the tree is cut at the level the user clicks on. If -1, the tree is automatically cut at the suggested level. If a (positive) integer, the tree is cut with nb.clust clusters (by default 0)

min

minimum number of clusters (by default 3)

max

maximum number of clusters (by default NULL, in which case max is set to the minimum of 10 and the number of documents divided by 2)

nb.par

number of paragons (para) and specific documents (dist) whose labels are listed for each cluster (by default 5)

graph

if TRUE, graphs are displayed (by default TRUE)

proba

threshold on the p-value used in selecting the characteristic words of the clusters and in selecting the axes when describing the clusters by the axes (by default 0.05)

Value

Returns a list including:

data.clust

the original active lexical table with a supplementary column called clust containing the partition

desc.word

description of the clusters by their characteristic words

desc.axes

description of the clusters by the characteristic axes

call

list of parameters and internal objects

desc.doc

labels of the paragons (para) and specific documents (dist) of each cluster

dendro

list with the succession of nodes that are found when reading the tree downward

Also displays the tree and the two-dimensional correspondence analysis map, in which the documents are colored according to the cluster they belong to.

Details

LexCHCca starts from the coordinates of the documents on the axes of a textual correspondence analysis. The hierarchical tree is built under the constraint that only chronologically contiguous nodes can be grouped. The documents must therefore be ranked in chronological order in the lexical table. The Euclidean metric and the complete linkage method are used.
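The constraint described above can be illustrated with a minimal base R sketch. It is not the code used by LexCHCca: the function constrained_complete and the toy coordinates are hypothetical, and the sketch only assumes that rows are already in chronological order. At each step it merges the pair of adjacent clusters with the smallest complete linkage (maximum Euclidean) distance.

```r
# Hypothetical helper, not part of Xplortext: chronologically constrained
# complete-linkage clustering. Rows of `coord` must be in chronological order.
constrained_complete <- function(coord) {
  d <- as.matrix(dist(coord))                  # Euclidean distances between documents
  clusters <- as.list(seq_len(nrow(coord)))    # each document starts as its own cluster
  heights <- numeric(0)
  merges <- list()
  while (length(clusters) > 1) {
    # Complete linkage distance, computed for adjacent clusters only
    link <- vapply(seq_len(length(clusters) - 1), function(i) {
      max(d[clusters[[i]], clusters[[i + 1]]])
    }, numeric(1))
    k <- which.min(link)                       # closest pair of contiguous clusters
    heights <- c(heights, link[k])
    merges[[length(merges) + 1]] <- list(left = clusters[[k]],
                                         right = clusters[[k + 1]])
    clusters[[k]] <- c(clusters[[k]], clusters[[k + 1]])
    clusters[[k + 1]] <- NULL                  # drop the absorbed cluster
  }
  list(heights = heights, merges = merges)
}

# Toy one-dimensional coordinates (illustrative values only), in time order
coord <- matrix(c(0, 1, 10, 11, 0.5), ncol = 1)
res_tree <- constrained_complete(coord)
res_tree$heights
```

In this toy example the fifth document lies close to the first two, but it can only join them through its chronological neighbour, the fourth document.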

The number of clusters is determined either a priori or from the structure of the constrained hierarchical tree. If nb.clust=0, a level for cutting the tree is suggested and the user chooses the level by clicking on the tree; if nb.clust=-1, the tree is cut at the suggested level. The suggested level is computed as follows, reading the tree downward. At a given step, the tree could be cut into Q clusters (Q varying between min and max). Two distances are computed: the distance between the two nodes that are no longer grouped together when passing from Q-1 to Q clusters, and the distance between the two nodes that are no longer grouped together when passing from Q to Q+1 clusters. Both are complete linkage distances and correspond to the criterion values obtained when building the tree bottom up. The suggested level is the value of Q that maximizes the ratio of the former to the latter. The user can choose to cut the tree at this level or at another one.
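The ratio rule can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's internal code: suggest_cut, q_min and q_max are hypothetical names, and the vector of heights holds made-up criterion values listed in bottom-up merge order.

```r
# Hypothetical helper, not part of Xplortext: suggest a number of clusters
# from the aggregation criterion values (bottom-up order, one per merge).
suggest_cut <- function(heights, q_min = 3, q_max = 10) {
  top <- rev(heights)                    # top[j]: distance split when going from j to j+1 clusters
  q_max <- min(q_max, length(top))       # top[Q] must exist for every candidate Q
  stopifnot(q_min >= 2, q_min <= q_max)
  q <- seq(q_min, q_max)
  ratio <- top[q - 1] / top[q]           # (Q-1 -> Q distance) / (Q -> Q+1 distance)
  list(ratio = setNames(ratio, q), suggested = q[which.max(ratio)])
}

# Made-up criterion values for 8 documents (7 merges), for illustration only
heights <- c(0.2, 0.3, 0.35, 0.5, 1.4, 1.6, 2.0)
res_cut <- suggest_cut(heights, q_min = 3, q_max = 4)
res_cut$suggested
```

With these values the ratio is larger for Q = 4 than for Q = 3, because the distance drops sharply between the third and fourth splits, so four clusters would be suggested.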

The results include a thorough description of the clusters. Graphs are provided.

The tree is plotted jointly with a barchart of the successive values of the aggregation criterion.

References

Bécue-Bertaut, M., Kostov, B., Morin, A., & Naro, G. (2014). Rhetorical Strategy in Forensic Speeches: Multidimensional Statistics-Based Methodology. Journal of Classification, 31, 85-106. doi:10.1007/s00357-014-9148-9.

Lebart, L. (1978). Programme d'agrégation avec contraintes. Les Cahiers de l'Analyse des Données, 3, pp. 275-288.

Legendre, P. & Legendre, L. (1998), Numerical Ecology (2nd ed.), Amsterdam: Elsevier Science.

Murtagh F. (1985). Multidimensional Clustering Algorithms. Vienna: Physica-Verlag, COMPSTAT Lectures.

Examples

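# NOT RUN {
library(Xplortext)
data(open.question)
# Build the lexical table, aggregating the answers by age group
res.TD <- TextData(open.question, var.text=c(9,10), var.agg="Age_Group", Fmin=10, Dmin=10,
        stop.word.tm=TRUE)
# Correspondence analysis of the aggregated lexical table
res.LexCA <- LexCA(res.TD, graph=FALSE)
# Chronologically constrained clustering, cutting the tree into 4 clusters
res.ccah <- LexCHCca(res.LexCA, nb.clust=4, min=3)
# }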