rainette v0.1.2

The Reinert Method for Textual Data Clustering

An R implementation of the Reinert text clustering method. For more details about the algorithm see the included vignettes or Reinert (1990) <doi:10.1177/075910639002600103>.

Readme

Rainette

The package website is available at https://juba.github.io/rainette/.

Rainette is an R package which implements a variant of the Reinert textual clustering method. This method is available in other software such as Iramuteq (free software) or Alceste (commercial, closed source).

Features

  • Simple or double clustering algorithms
  • Plot functions and shiny gadgets to visualise and explore clustering results
  • Utility functions to split a corpus into segments or import a corpus in Iramuteq format

Installation and usage

The package can be installed from CRAN:

install.packages("rainette")

The development version can be installed from GitHub:

remotes::install_github("juba/rainette")

Let's start with an example corpus provided by the excellent quanteda package:

library(quanteda)
data_corpus_inaugural

First, we'll use split_segments to split each text in the corpus into segments of about 40 words (punctuation is taken into account when choosing where to cut):

corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
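To give an intuition for what segmenting a text means, here is a minimal base R sketch. It is not the implementation used by split_segments: the helper name split_words_sketch is hypothetical, it cuts after a fixed number of words, and it ignores punctuation entirely.

```r
# Simplified illustration only: fixed-size word windows, no punctuation handling.
# `split_words_sketch` is a hypothetical helper, not part of rainette.
split_words_sketch <- function(text, segment_size = 40) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # Assign each word to a segment index: 1, 1, ..., 2, 2, ...
  idx <- ceiling(seq_along(words) / segment_size)
  # Paste the words of each segment back together
  unname(vapply(split(words, idx), paste, character(1), collapse = " "))
}

# Toy text of 100 identical words
text <- paste(rep("word", 100), collapse = " ")
segments <- split_words_sketch(text, segment_size = 40)

# Number of segments and number of words in each one
length(segments)
lengths(strsplit(segments, " "))
```

The last segment is shorter than the others because the text length is not a multiple of the segment size.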

Next, we'll compute a document-term matrix and apply some preprocessing with quanteda functions:

dtm <- dfm(corpus, remove = stopwords("en"), tolower = TRUE, remove_punct = TRUE)
dtm <- dfm_wordstem(dtm, language = "english")
dtm <- dfm_trim(dtm, min_termfreq = 3)
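The effect of the last trimming step can be illustrated on a toy matrix in base R. This sketch only mimics the idea of a minimum total term frequency; the matrix and its term names are made up for the example, and it does not call quanteda.

```r
# Toy document-term matrix: 3 segments (rows) x 4 stemmed terms (columns).
# Values and term names are invented for illustration.
m <- matrix(
  c(1, 0, 2, 0,
    1, 1, 0, 0,
    1, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("seg", 1:3), c("peopl", "nation", "govern", "war"))
)

# Total frequency of each term across all segments
term_freq <- colSums(m)

# Keep only the terms occurring at least 3 times overall
m_trimmed <- m[, term_freq >= 3, drop = FALSE]
colnames(m_trimmed)
```

Rare terms carry little information for clustering and removing them keeps the matrix smaller.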

We can then apply a simple clustering on this dtm with the rainette function. We specify the number of clusters (k), the minimum size for a cluster to be split at the next step (min_split_members) and the minimum number of forms in each segment (min_uc_size):

res <- rainette(dtm, k = 6, min_uc_size = 15, min_split_members = 20)
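Each step of the method splits a cluster in two, and the quality of a split is measured by a chi-square value. The base R sketch below shows how such a value can be computed for a given two-group split of a small matrix. It is an illustration under simplifying assumptions, not rainette's code: the helper chi2_split is hypothetical, and the search for the best split is left out.

```r
# Chi-square of the groups x terms table obtained by summing the rows of a
# document-term matrix within each group. Hypothetical helper for illustration.
# Assumes every term occurs at least once (no all-zero column).
chi2_split <- function(m, groups) {
  # Aggregate term counts per group: a 2 x n_terms contingency table
  tab <- rowsum(m, groups)
  # Expected counts under independence of groups and terms
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Toy matrix: segments 1-2 mostly use terms 1-2, segments 3-4 mostly terms 3-4
m <- matrix(
  c(3, 2, 0, 0,
    2, 3, 0, 1,
    0, 0, 3, 2,
    0, 1, 2, 3),
  nrow = 4, byrow = TRUE
)

chi2_split(m, c(1, 1, 2, 2))  # split that follows the block structure
chi2_split(m, c(1, 2, 1, 2))  # split that mixes the two blocks
```

The split that separates the two vocabularies gets a much higher chi-square value than the one that mixes them, which is why a higher value indicates a better split.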

We can use the rainette_explor shiny interface to visualise and explore the different clusterings at each k:

rainette_explor(res, dtm)

We can then use the generated R code to reproduce the displayed clustering visualisation plot:

rainette_plot(res, dtm, k = 5, type = "bar", n_terms = 20, free_scales = FALSE,
    measure = "chi2", show_negative = TRUE, text_size = 10)

Or cut the tree at the chosen k and add a group membership variable to our corpus metadata:

docvars(corpus)$group <- cutree_rainette(res, k = 5)

You can also perform a double clustering, i.e. two simple clusterings produced with different min_uc_size values which are then "crossed" to generate more robust clusters. To do this, use rainette2 either on two rainette results:

res1 <- rainette(dtm, k = 10, min_uc_size = 10, min_split_members = 10)
res2 <- rainette(dtm, k = 10, min_uc_size = 15, min_split_members = 10)
res <- rainette2(res1, res2, max_k = 10, min_members = 20)

Or directly on a dtm with uc_size1 and uc_size2 arguments:

rainette2(dtm, max_k = 10, uc_size1 = 10, uc_size2 = 15, min_members = 20)

You can then use rainette2_explor, rainette2_plot and cutree_rainette2 to explore and visualise the results.
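As a rough intuition for what "crossing" two clusterings means, the base R sketch below cross-tabulates two made-up cluster assignments and keeps only the combinations that contain enough segments. It is a simplified illustration, not what rainette2 does internally; the vectors g1 and g2 are invented, and only the name min_members is borrowed from the call above.

```r
# Hypothetical cluster assignments of 10 segments in two clusterings
g1 <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
g2 <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 1)

# Cross-tabulate: cell [i, j] counts the segments that fall in cluster i of
# the first clustering and cluster j of the second
crossed <- table(g1 = g1, g2 = g2)

# Keep only the crossed clusters with at least min_members segments
min_members <- 2
kept <- as.data.frame(crossed, stringsAsFactors = FALSE)
kept <- kept[kept$Freq >= min_members, ]
kept
```

Segments on which both clusterings agree end up in large crossed clusters, while segments classified inconsistently fall into small cells that are dropped.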

Tell me more

Three vignettes are available: an introduction in English, and an introduction and an algorithm description in French.

Credits

This classification method was created by Max Reinert and is described in several articles. Here are two references:

  • Reinert M., Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte, Cahiers de l'analyse des données, Volume 8, Numéro 2, 1983. http://www.numdam.org/item/?id=CAD_1983__8_2_187_0
  • Reinert M., Alceste une méthodologie d'analyse des données textuelles et une application: Aurelia De Gerard De Nerval, Bulletin de Méthodologie Sociologique, Volume 26, Numéro 1, 1990. https://doi.org/10.1177/075910639002600103

Thanks to Pierre Ratinaud, the author of Iramuteq, for releasing it as free and open source software. Even though the R code has been almost entirely rewritten, it has been an invaluable resource for understanding the algorithms.

Many thanks to Sébastien Rochette for the creation of the hex logo.

Many thanks to Florian Privé for his work on rewriting and optimizing the Rcpp code.

Functions in rainette

Name Description
cutree_rainette Cut a rainette result tree into groups of documents
order_docs Return document indices ordered by CA first axis coordinates
rainette2_complete_groups Complete groups membership with knn classification
import_corpus_iramuteq Import a corpus in Iramuteq format
cutree_rainette2 Cut a rainette2 result object into groups of documents
cluster_tab Split a dtm into two clusters with the Reinert algorithm
split_segments Split a character string or corpus into segments
rainette Corpus clustering based on the Reinert method - Simple clustering
rainette2_plot Generate a clustering description plot from a rainette2 result
rainette2_explor Shiny gadget for rainette2 clustering exploration
rainette_explor Shiny gadget for rainette clustering exploration
switch_docs Switch documents between two groups to maximize chi-square value
rainette_stats Generate cluster keyness statistics from a rainette result
select_features Remove features from dtm of each group based on cc_test and tsj
rainette_plot Generate a clustering description plot from a rainette result
compute_uc Merges uces into uc according to minimum uc size
rainette2 Corpus clustering based on the Reinert method - Double clustering
cutree Cut a tree into groups

Vignettes of rainette

Name
algorithmes.Rmd
introduction_en.Rmd
introduction_usage.Rmd
rainette2_explor.png
rainette2_explor_en.png
rainette_explor_en.png
rainette_explor_en_cloud.png
rainette_explor_pc.png
rainette_explor_pc_cloud.png

Details

Type Package
Date 2021-01-19
License GPL (>= 3)
VignetteBuilder knitr
URL https://juba.github.io/rainette/
BugReports https://github.com/juba/rainette/issues
Encoding UTF-8
RoxygenNote 7.1.1
LinkingTo Rcpp
NeedsCompilation yes
Packaged 2021-01-20 10:39:15 UTC; julien
Repository CRAN
Date/Publication 2021-01-20 12:30:02 UTC
