createClustersDlg: Cut hierarchical clustering tree into clusters

Description

Cut a hierarchical clustering tree into clusters of documents.

Details

This dialog groups the documents of a tm corpus according to a previously computed hierarchical clustering tree (see corpusClustDlg). It adds a new meta-data variable to the corpus, in which each number identifies a cluster; the same variable is also added to the corpusMetaData data set. If clusters had already been created, they are replaced.

Clusters will be created by starting from the top of the dendrogram, and going through the merge points with the highest position until the requested number of branches is reached.
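The cutting rule described above can be sketched with base R. This is only an illustration, not the dialog's own code: the toy document-term matrix, the use of Euclidean distances with Ward's method, and the variable names are all assumptions made for the example.

```r
# Toy document-term matrix: 6 documents, 4 terms (made-up counts).
dtm <- matrix(c(5, 0, 1, 0,
                4, 1, 0, 0,
                0, 6, 0, 1,
                0, 5, 1, 0,
                1, 0, 7, 3,
                0, 1, 6, 4),
              nrow = 6, byrow = TRUE,
              dimnames = list(paste0("doc", 1:6),
                              c("tax", "vote", "school", "health")))

# Build a tree. The distance and linkage used here are assumptions.
tree <- hclust(dist(dtm), method = "ward.D2")

# Ask for 3 clusters: the two highest merge points are undone.
clusters <- cutree(tree, k = 3)

# Same result when cutting at a height between the 2nd and 3rd highest merges.
heights <- sort(tree$height, decreasing = TRUE)
clusters_by_height <- cutree(tree, h = mean(heights[2:3]))

table(clusters)
```

Requesting k clusters is thus equivalent to cutting the dendrogram horizontally just below its (k - 1)-th highest merge point.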

A window opens to summarize the created clusters, providing information about specific documents and terms for each cluster. Specific terms are those whose observed frequency in the cluster has the lowest probability under a hypergeometric distribution, based on their global frequencies in the corpus and on the number of occurrences of all terms in the considered cluster. All terms with a probability below the value chosen using the third slider are reported, ignoring terms with fewer occurrences in the whole corpus than the value of the fourth slider (these terms can often have a low probability but are too rare to be of interest). The last slider limits the number of terms shown for each cluster.
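As a rough illustration of the hypergeometric criterion, the sketch below computes the probability of a term count at least as extreme as the one observed in a cluster. The function name specificity_prob, its argument names, and the choice of taking the smaller of the two tails are assumptions made for this example; the package may compute the value differently.

```r
# x: occurrences of the term in the cluster
# n: occurrences of all terms in the cluster
# K: occurrences of the term in the whole corpus
# N: occurrences of all terms in the whole corpus
specificity_prob <- function(x, n, K, N) {
  p_low  <- phyper(x, K, N - K, n)                          # P(X <= x)
  p_high <- phyper(x - 1, K, N - K, n, lower.tail = FALSE)  # P(X >= x)
  min(p_low, p_high)
}

# A term seen 50 times in a 10000-occurrence corpus is expected about
# 5 times in a cluster of 1000 occurrences.
p_over    <- specificity_prob(x = 20, n = 1000, K = 50, N = 10000)
p_typical <- specificity_prob(x = 5,  n = 1000, K = 50, N = 10000)
```

Here p_over is very small because 20 occurrences is far above the expected count, while p_typical is large because 5 occurrences is what chance alone would produce.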

The positive or negative character of the association is visible from the sign of the t value, or by comparing the value of the % Term/Level column with that of the Global % column (here, the level is the cluster).

Specific documents are selected using a different criterion from the one used for terms: the documents with the smallest Chi-squared distance to the average vocabulary of the cluster are shown. This is a Euclidean distance, but weighted by the inverse of the prevalence of each term in the whole corpus, and controlling for the documents' different lengths.
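The distance described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name chisq_dist_to_cluster, the toy matrix, and the exact normalization are hypothetical and are not taken from the package source. Every term is assumed to occur at least once in the corpus.

```r
# Toy document-term matrix (made-up counts).
dtm <- matrix(c(5, 0, 1, 0,
                4, 1, 0, 0,
                0, 6, 0, 1,
                0, 5, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("tax", "vote", "school", "health")))

chisq_dist_to_cluster <- function(dtm, members) {
  # Prevalence of each term in the whole corpus.
  col_mass <- colSums(dtm) / sum(dtm)
  # Row profiles: dividing by document length controls for differing lengths.
  profiles <- dtm / rowSums(dtm)
  # Average vocabulary of the cluster, as a profile.
  sub <- dtm[members, , drop = FALSE]
  centre <- colSums(sub) / sum(sub)
  # Squared deviations from the cluster profile, weighted by 1 / prevalence.
  dev <- sweep(profiles[members, , drop = FALSE], 2, centre)
  sqrt(rowSums(sweep(dev^2, 2, col_mass, "/")))
}

d <- chisq_dist_to_cluster(dtm, c("doc1", "doc2"))
sort(d)  # documents closest to the cluster's average vocabulary come first
```

Sorting the distances in increasing order gives the ranking in which the most representative documents of the cluster appear first.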

This dialog can only be used after having created a tree, which is done via the Text Mining->Hierarchical clustering->Create dendrogram... dialog.
