rainette: Corpus clustering based on the Reinert method - Simple clustering

Description

Corpus clustering based on the Reinert method - Simple clustering

Usage

rainette(
  dtm,
  k = 10,
  min_uc_size = 10,
  min_split_members = 5,
  cc_test = 0.3,
  tsj = 3,
  min_members
)

Arguments

dtm

quanteda dfm object of documents to cluster, usually the result of split_segments()

k

maximum number of clusters to compute

min_uc_size

minimum number of forms per document

min_split_members

do not try to split groups with fewer members than this value

cc_test

contingency coefficient value for feature selection

tsj

minimum frequency value for feature selection

min_members

deprecated, use min_split_members instead
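
The cc_test argument is a threshold on a contingency coefficient, but this page does not say how that coefficient is computed. The sketch below assumes Pearson's contingency coefficient, sqrt(chi2 / (chi2 + N)), on a 2 x 2 presence/absence table. The helper contingency_coef() and the toy table are hypothetical and are not part of rainette; they only illustrate what comparing such a coefficient to 0.3 could look like.

```r
# Hypothetical helper (not a rainette function): Pearson's contingency
# coefficient of a contingency table, sqrt(chi2 / (chi2 + N)).
contingency_coef <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  unname(sqrt(chi2 / (chi2 + n)))
}

# Toy table. Rows: two groups of documents.
# Columns: documents where the feature is absent / present.
tab <- matrix(
  c(40, 10,
    15, 35),
  nrow = 2, byrow = TRUE
)

cc <- contingency_coef(tab)
cc

# Assumed selection rule: keep the feature if the coefficient
# reaches the cc_test threshold (default 0.3).
keep_feature <- cc >= 0.3
keep_feature
```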

Value

The result is a list with classes hclust and rainette. Besides the elements of an hclust object, two more elements are available:

uce_groups gives the group of each document for each k
group gives the group of each document for the maximum value of k available
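
As a minimal sketch of how the returned object might be inspected, the code below builds a small dfm from quanteda's data_corpus_inaugural and looks at the two extra elements described above. It assumes a recent quanteda in which dfm() is built from a tokens object. The exact shape of uce_groups is not documented on this page, so it is only displayed with str() rather than indexed.

```r
library(quanteda)
library(rainette)

# Small corpus, split into segments as recommended for the dtm argument
corpus <- head(data_corpus_inaugural, n = 10)
corpus <- split_segments(corpus)

# Tokenise, drop punctuation and stopwords, then build and trim the dfm
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm_trim(dfm(tok, tolower = TRUE), min_termfreq = 3)

res <- rainette(dtm, k = 3)

# The result carries both classes
class(res)

# Number of documents assigned to each group at the largest available k
table(res$group)

# Group assignments for each k (structure shown, not assumed)
str(res$uce_groups, max.level = 1)
```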

Details

See the references for the original articles on the method. Computations and results may differ noticeably from those of the original method; see the package vignettes for details.

The dtm object is automatically converted to boolean.
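
To illustrate what a boolean conversion means, here is a base R sketch on a toy count matrix: every positive count becomes 1, so only the presence or absence of a form in a document is kept. The matrix and its names are made up for the example, and this is not necessarily how rainette performs the conversion internally.

```r
# Toy document-term count matrix (rows: documents, columns: forms)
counts <- matrix(
  c(2, 0, 1,
    0, 3, 0,
    1, 1, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("peace", "war", "nation"))
)

# Presence/absence version: any positive count becomes 1, zeros stay 0
bool <- (counts > 0) * 1
bool
```

On a quanteda dfm, a comparable presence/absence weighting can be obtained with dfm_weight(dtm, scheme = "boolean"), although there is no need to apply it before calling rainette() since the conversion is automatic.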

References

Reinert M., Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte, Cahiers de l'analyse des données, Volume 8, Numéro 2, 1983. http://www.numdam.org/item/?id=CAD_1983__8_2_187_0
Reinert M., Alceste une méthodologie d'analyse des données textuelles et une application: Aurelia De Gerard De Nerval, Bulletin de Méthodologie Sociologique, Volume 26, Numéro 1, 1990. doi:10.1177/075910639002600103

Examples

# NOT RUN {
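# Load quanteda for corpus handling and rainette for split_segments() and rainette()
library(quanteda)
library(rainette)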
corpus <- data_corpus_inaugural
corpus <- head(corpus, n = 10)
corpus <- split_segments(corpus)
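# Recent quanteda versions build a dfm from a tokens object,
# so punctuation and stopwords are removed at the tokens stage
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)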
dtm <- dfm_wordstem(dtm, language = "english")
dtm <- dfm_trim(dtm, min_termfreq = 3)
res <- rainette(dtm, k = 3)
# }