LexHCca: Hierarchical Clustering of Documents on Textual Correspondence Analysis Coordinates (LexHCca)

Description

Agglomerative hierarchical clustering on a corpus of documents.

Usage

LexHCca(object, nb.clust=0, consol=TRUE, iter.max=10, min=3, max=NULL, 
 order=TRUE, nb.par=5, edit.par=FALSE, graph=TRUE, proba=0.05,...)

Arguments

object

object of LexCA class

nb.clust

number of clusters (see details). If 0, the tree is cut at the level the user clicks on. If -1, the tree is automatically cut at the suggested level. If a (positive) integer, the tree is cut with nb.clust clusters (by default 0)

consol

if TRUE, k-means consolidation is performed (by default TRUE)

iter.max

maximum number of iterations in the consolidation step (by default 10)

min

minimum number of clusters (by default 3)

max

maximum number of clusters (by default NULL and then max is computed as the minimum between 10 and the number of documents divided by 2)

order

if TRUE, the clusters are numbered depending on the coordinate of their centroid on the first axis (by default TRUE)

nb.par

number of edited paragons (para) and specific documents (dist) (by default 5)

edit.par

if TRUE, the literal text of the parangon and specific documents are listed in the results (by default FALSE)

graph

if TRUE, graphs are displayed (by default TRUE)

proba

threshold on the p-value used in selecting words, documents, axes and contextual variables when describing the clusters (by default 0.05)

...

other arguments from other methods

Value

Returns a list including:

data.clust

the original active lexical table used in LexCA plus a new column called clust containing the partition

desc.wordvar

description of the clusters by their characteristic words and, if contextual variables were considered in LexCA, description of the partition/clusters by these variables

desc.axes

description of the clusters by the characteristic axes

call

list of internal objects. call$t giving the results for the hierarchical tree; See the first reference for more details

desc.doc

labels of the paragon (para) and specific documents (dist) of each cluster

clust.count

count of documents belonging to each cluster

clust.content

list of the document labels according to the cluster they belong to

docspara

if edit.par=TRUE, description of the clusters by the literal text of the nb.par "para" documents

docsdist

if edit.par=TRUE, description of the clusters by the literal text of the nb.par "dist" documents

Returns the hierarchical tree with a barplot of the successive inertia gains, the CA map of the documents enriched by the tree (3D), the CA map with the document labels colored according to their cluster (2D).

Details

LexHCca starts from the documents coordinates on textual correspondence analysis axes. Euclidean metric and Ward method are used.

The number of clusters is determined either a priori or from the hierarchical tree structure. If nb.clust=0, a level for cutting the tree is automatically suggested. This is computed in the following way, reading the tree downward. At a given step, the tree could be cut into Q clusters (Q varying between min and max). The between-inertia gain when passing from Q-1 to Q clusters and the between-inertia gain when passing from Q to Q+1 clusters are computed. The suggested level corresponds to the maximum value of the ratio between the former and the latter of these inertia-gains. Note that the between-inertia gain when passing from Q to Q+1 clusters is equal to the value of the Ward criterion when passing from Q+1 to Q clusters when building the tree bottom up. In this latter case, a level where to cut the tree is suggested. The user can choose to cut the tree at this level or at another one.

The results include a thorough description of the clusters, taking into account contextual variables. Graphs are provided.

References

Husson F., Le S., Pages J. (2011). Exploratory Multivariate Analysis by Example Using R. Chapman & Hall/CRC.

Lebart, L., Salem, A., & Berry, L. (1998). Exploring textual data. (D. Kluwer, Ed.).

Examples

Run this code

data(open.question)
res.TD<-TextData(open.question, var.text=c(9,10), Fmin=10, Dmin=10, stop.word.tm=TRUE,
        context.quali=c("Gender","Age_Group","Education"), context.quanti=c("Age"))
res.LexCA<-LexCA(res.TD, graph=FALSE, ncp=8)
res.hcca<-LexHCca(res.LexCA, graph=TRUE, nb.clust=5, order=TRUE)

Run the code above in your browser using DataLab