LexHCca: Hierarchical Clustering of Documents on Textual Correspondence Analysis Coordinates

Description

Agglomerative hierarchical clustering on a corpus of documents.

Usage

LexHCca(object, nb.clust=0, consol=FALSE, iter.max=10, min=3, max=NULL,
   kk=Inf, order=TRUE, graph=TRUE, proba=0.05, cluster.CA="rows", description=TRUE,
   nb.par=0, size.par=80, marg.doc=FALSE, seed=12345, ...)

Arguments

object

object of LexCA class

nb.clust

number of clusters (see details). If 0, the tree is cut at the level the user clicks on. If -1, the tree is automatically cut at the suggested level. If a (positive) integer, the tree is cut with nb.clust clusters (by default 0)

consol

if TRUE, a consolidation step is performed after the hierarchical clustering (by default FALSE)

iter.max

maximum number of iterations in the consolidation step (by default 10)

min

minimum number of clusters (by default 3)

max

maximum number of clusters (by default NULL, in which case max is computed as the minimum of 10 and the number of documents divided by 2)

kk

an integer giving the number of clusters used in a k-means preprocessing step before the hierarchical clustering; the top of the hierarchical tree is then constructed from this partition. This is very useful when the number of individuals is high. Note that consolidation cannot be performed if kk is different from Inf, and some graphs are not drawn. By default kk is Inf: no preprocessing is done and all the graphical outputs are given.

order

if TRUE, the clusters are numbered depending on the coordinate of their centroid on the first axis (by default TRUE)

graph

if TRUE, graphs are displayed (by default TRUE)

proba

threshold on the p-value used in selecting words, documents, axes and contextual variables when describing the clusters (by default 0.05)

cluster.CA

if 'rows' or 'docs', clustering is performed on the documents; if 'columns' or 'words', on the words (by default 'rows')

description

if TRUE, the clusters are described by their characteristic words/documents, by the characteristic axes and, if they were considered in LexCA, by the contextual variables (by default TRUE)

nb.par

number of edited paragons (para) and specific documents (dist) (by default 0)

size.par

text size of edited paragons (para) and specific documents (dist) (by default 80)

marg.doc

if FALSE, the marginal frequencies of the documents used when describing the clusters are the frequencies before the TextData selection; if TRUE, the frequencies after the TextData selection (by default FALSE)

seed

seed used to obtain reproducible results when k-means is used (by default 12345)

...

other arguments from other methods

Value

Returns a list including:

data.clust

the original active lexical table used in LexCA plus a new column called clust containing the partition

centers

coordinates of centers from LexCA results for each cluster

clust.count

count of documents/words belonging to each cluster and some statistics

clust.content

list of the document/word labels according to the cluster they belong to

total sum of squares

call

list of internal objects; call$t gives the results for the hierarchical tree. See the first reference for more details

desc.axes

description of the clusters by the characteristic axes

desc.wordvar

if description=TRUE, description of the clusters by their characteristic words, supplementary words and, if contextual variables were considered in LexCA, description of the partition/clusters by these variables

desc.doc

if description=TRUE, description of the clusters by their characteristic documents

wordslabels

labels of the paragon (para) and specific words (dist) of each cluster

docslabels

labels of the paragon (para) and specific documents (dist) of each cluster

docspara

if nb.par>0, description of the clusters by their nb.par "para" documents, showing the first size.par characters of the literal text of each

docsdist

if nb.par>0, description of the clusters by their nb.par "dist" documents, showing the first size.par characters of the literal text of each

In addition, if graph=TRUE, the function displays the hierarchical tree with a barplot of the successive inertia gains, the CA map of the documents enriched by the tree (3D), and the CA map with the document labels colored according to their cluster (2D).

Details

LexHCca starts from the coordinates of the documents on the textual correspondence analysis axes. The Euclidean metric and Ward's method are used.
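The idea of clustering on axis coordinates can be sketched with base R alone. This is a minimal illustration under stated assumptions, not the internal code of LexHCca: the matrix coords is hypothetical toy data standing in for document coordinates on CA axes, and stats::hclust with method "ward.D2" is used as a stand-in for Ward's method on Euclidean distances.

```r
# Hypothetical coordinates: 6 "documents" on 2 axes, forming two obvious groups
# (toy data, not produced by LexCA)
coords <- matrix(c(0.0, 0.1,
                   0.1, 0.0,
                   0.0, 0.0,
                   2.0, 2.1,
                   2.1, 2.0,
                   2.0, 2.0),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:6), c("Dim1", "Dim2")))

# Euclidean distances between documents, then Ward agglomeration
d <- dist(coords, method = "euclidean")
tree <- hclust(d, method = "ward.D2")

# Cut the tree into a fixed number of clusters (analogous to a positive nb.clust)
clust <- cutree(tree, k = 2)

# Document labels grouped by cluster
split(names(clust), clust)
```

In LexHCca itself the coordinates come from the LexCA object, so no distance matrix has to be built by hand.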

The number of clusters is determined either a priori or from the structure of the hierarchical tree. When it is not fixed a priori (nb.clust=0 or nb.clust=-1), a level for cutting the tree is suggested automatically. It is computed as follows, reading the tree downward. At a given step, the tree could be cut into Q clusters (Q varying between min and max). The between-inertia gain when passing from Q-1 to Q clusters and the between-inertia gain when passing from Q to Q+1 clusters are computed. The suggested level is the value of Q that maximizes the ratio of the former gain to the latter. Note that the between-inertia gain when passing from Q to Q+1 clusters is equal to the value of the Ward criterion when merging Q+1 clusters into Q clusters while building the tree bottom-up. With nb.clust=0, the user can cut the tree at the suggested level or at another one by clicking on the tree; with nb.clust=-1, the tree is cut at the suggested level.
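The ratio rule described above can be sketched in base R. This is an illustrative reading of the rule, not the package's implementation. The helper suggest_nb_clust and the toy data are hypothetical, and the inertia gains are assumed to be proportional to the squared heights of a "ward.D2" tree, which is enough here because only ratios of gains are used.

```r
# Toy coordinates standing in for document coordinates on CA axes
# (hypothetical data: three well-separated groups of 10 points)
set.seed(1)
coords <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.2), ncol = 2),
  matrix(rnorm(20, mean = 6, sd = 0.2), ncol = 2)
)
tree <- hclust(dist(coords), method = "ward.D2")

# gain[q]: between-inertia gain when passing from q to q+1 clusters,
# read from the top of the tree downward (assumed proportional to height^2)
gain <- rev(tree$height^2) / 2

# Hypothetical helper: for Q in min..max, compare the gain from Q-1 to Q
# with the gain from Q to Q+1 and keep the Q with the largest ratio
suggest_nb_clust <- function(gain, min = 3, max = 10) {
  max <- min(max, length(gain))
  q <- min:max
  ratio <- gain[q - 1] / gain[q]
  q[which.max(ratio)]
}

suggested <- suggest_nb_clust(gain)
suggested
```

A large ratio at Q means that splitting into Q clusters still separates distinct groups, while splitting further into Q+1 clusters gains comparatively little.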

The results include a thorough description of the clusters, taking into account contextual variables. Graphs are provided.
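The consolidation step controlled by consol and iter.max can be pictured as a k-means run initialized at the centroids of the clusters obtained by cutting the tree. The sketch below uses only base R and hypothetical toy coordinates; it assumes that this is how consolidation works and does not reproduce the package's code.

```r
# Hypothetical coordinates: two well-separated groups of 10 points
set.seed(12345)
coords <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.2), ncol = 2)
)

# Ward tree cut into 2 clusters
tree <- hclust(dist(coords), method = "ward.D2")
part <- cutree(tree, k = 2)

# Centroid of each cluster: one row per cluster, one column per axis
centers <- apply(coords, 2, function(col) tapply(col, part, mean))

# Consolidation: k-means started from those centroids,
# with at most 10 iterations (mirrors the default iter.max = 10)
consolidated <- kmeans(coords, centers = centers, iter.max = 10)

# Cross-tabulate the partition before and after consolidation
table(tree = part, kmeans = consolidated$cluster)
```

When the clusters are already well separated, as in this toy case, the consolidation leaves the partition unchanged; it matters mostly for documents lying near cluster boundaries.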

References

Husson F., Lê S., Pagès J. (2011). Exploratory Multivariate Analysis by Example Using R. Chapman & Hall/CRC. 10.1201/b10345.

Lebart L., Salem A., Berry L. (1998). Exploring Textual Data. Dordrecht: Kluwer. 10.1007/978-94-017-1525-6.

Examples

# NOT RUN {
data(open.question)
res.TD<-TextData(open.question, var.text=c(9,10), Fmin=10, Dmin=10, stop.word.tm=TRUE,
        context.quali=c("Gender","Age_Group","Education"), context.quanti=c("Age"))
res.LexCA<-LexCA(res.TD, graph=FALSE, ncp=8)
res.hcca<-LexHCca(res.LexCA, graph=TRUE, nb.clust=5, order=TRUE)
# }
