Rainette

Rainette is an R package which implements a variant of the Reinert textual clustering method. This method is also available in other software such as Iramuteq (free software) or Alceste (commercial, closed source).

Features

  • Simple and double clustering algorithms
  • Plot functions and shiny interfaces to visualise and explore clustering results
  • Utility functions to split a corpus into segments or import a corpus in Iramuteq format

Installation

The package is installable from CRAN.

install.packages("rainette")

The development version is installable from R-universe.

install.packages("rainette", repos = "https://juba.r-universe.dev")

Usage

Let's start with an example corpus provided by the excellent quanteda package.

library(quanteda)
library(rainette)
data_corpus_inaugural

First, we'll use split_segments() to split each document into segments of about 40 words (punctuation is taken into account).

corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
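
To check what the split produced, it can help to compare the number of source documents with the number of segments and to look at the segment metadata. This is a minimal sketch using quanteda's ndoc() and docvars(); the exact metadata columns that split_segments() adds (for example a column recording the source document) are an assumption to verify on your installed version.

```r
library(quanteda)
library(rainette)

# Split each inaugural address into segments of about 40 words
corpus <- split_segments(data_corpus_inaugural, segment_size = 40)

# Number of source documents versus number of segments
n_docs <- ndoc(data_corpus_inaugural)
n_segments <- ndoc(corpus)
c(documents = n_docs, segments = n_segments)

# Metadata carried over to (or added for) each segment
head(docvars(corpus))
```

Each segment is treated as a document in the following steps, so the segment count is the number of rows of the document-term matrix.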

Next, we'll apply some preprocessing and compute a document-term matrix with quanteda functions.

tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 10)
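
The effect of these preprocessing steps is easier to see on a tiny corpus. The sketch below applies the same quanteda calls to three made-up sentences (the toy texts and the lower min_docfreq threshold of 2 are illustrative assumptions, not part of the original example).

```r
library(quanteda)

# Three hypothetical one-sentence documents
toy <- c(
  d1 = "Liberty and union",
  d2 = "Union of states",
  d3 = "The states, united!"
)

# Same steps as above: drop punctuation and English stopwords, then lowercase
tok <- tokens(toy, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
featnames(dtm)

# Keep only the terms that appear in at least two documents.
# "union" and "states" each appear in two documents, so only they survive.
dtm_trimmed <- dfm_trim(dtm, min_docfreq = 2)
featnames(dtm_trimmed)

# Trimming can leave some documents with very few terms, or none
ntoken(dtm_trimmed)
```

On the real corpus, min_docfreq = 10 plays the same role: rare terms are dropped, which shrinks the matrix before clustering.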

We can then apply a simple clustering on this matrix with the rainette() function. We specify the number of clusters (k), and the minimum number of forms in each segment (min_segment_size). Segments which do not include enough forms will be merged with the following or previous one when possible.

res <- rainette(dtm, k = 6, min_segment_size = 15)
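
To give an intuition of what merging short segments means, here is a toy base R sketch. It is not rainette's implementation: the helper merge_small_segments() is hypothetical, it works on a plain vector of segment sizes, and it ignores document boundaries, which the package may handle differently.

```r
# Toy illustration only, NOT rainette's algorithm.
# sizes: number of forms in each consecutive segment
# min_size: minimum number of forms wanted in each merged segment
# Returns an integer group id for each original segment.
merge_small_segments <- function(sizes, min_size) {
  groups <- integer(length(sizes))
  current <- 1L
  acc <- 0
  for (i in seq_along(sizes)) {
    # Add the segment to the group currently being filled
    groups[i] <- current
    acc <- acc + sizes[i]
    # Once the group is large enough, start a new one
    if (acc >= min_size) {
      current <- current + 1L
      acc <- 0
    }
  }
  # A trailing group that is too small is merged with the previous one
  if (acc > 0 && current > 1L) {
    groups[groups == current] <- current - 1L
  }
  groups
}

sizes <- c(20, 5, 12, 30, 4)
merge_small_segments(sizes, min_size = 15)
# 1 2 2 3 3
```

Here the second segment (5 forms) is merged with the following one, and the last segment (4 forms) is merged with the previous one because nothing follows it.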

We can use the rainette_explor() shiny interface to visualise and explore the different clusterings at each k.

rainette_explor(res, dtm, corpus)

The Cluster documents tab lets you browse and filter the documents in each cluster.

We can also directly generate the clusters description plot for a given k with rainette_plot().

rainette_plot(res, dtm, k = 5)

Or we can cut the tree at a chosen k and add a group membership variable to our corpus metadata.

docvars(corpus)$cluster <- cutree(res, k = 5)
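
Once the cluster variable is stored in the corpus metadata, it can be summarised like any other variable. The sketch below rebuilds the objects from the previous steps so that it runs on its own, then counts segments per cluster. The clust_var argument of clusters_by_doc_table() is written as recalled from the package documentation; treat it as an assumption and check ?clusters_by_doc_table.

```r
library(quanteda)
library(rainette)

# Rebuild the objects from the previous steps
corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 10)
res <- rainette(dtm, k = 6, min_segment_size = 15)

# Store the cluster of each segment in the corpus metadata
docvars(corpus)$cluster <- cutree(res, k = 5)

# Number of segments in each cluster (NA, if any, means unassigned)
table(docvars(corpus)$cluster, useNA = "ifany")

# Number of segments of each cluster within each source document
# (argument name assumed, see the lead-in)
head(clusters_by_doc_table(corpus, clust_var = "cluster"))
```

Very unbalanced cluster sizes can be a hint to try another value of k.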

We can also perform a double clustering, i.e. two simple clusterings computed with different min_segment_size values which are then "crossed" to generate more robust clusters. To do this, we use rainette2() on two rainette() results:

res1 <- rainette(dtm, k = 5, min_segment_size = 10)
res2 <- rainette(dtm, k = 5, min_segment_size = 15)
res <- rainette2(res1, res2, max_k = 5)

We can then use rainette2_explor() to explore and visualise the results.

rainette2_explor(res, dtm, corpus)
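
Outside the shiny interface, a double clustering result can also be cut and plotted directly. The sketch below rebuilds the objects so that it runs on its own; the name res_double is only illustrative, and it assumes that a partition with k = 5 was found for this corpus. Note that rainette2() can take a while on a corpus of this size.

```r
library(quanteda)
library(rainette)

# Rebuild the document-term matrix from the previous steps
corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 10)

# Two simple clusterings crossed into a double clustering
res1 <- rainette(dtm, k = 5, min_segment_size = 10)
res2 <- rainette(dtm, k = 5, min_segment_size = 15)
res_double <- rainette2(res1, res2, max_k = 5)

# Cluster membership of each segment for k = 5
groups <- cutree(res_double, k = 5)

# Segments that were not assigned to a crossed cluster show up as NA
table(groups, useNA = "ifany")

# Clusters description plot for the double clustering
rainette2_plot(res_double, dtm, k = 5)
```

If many segments end up unassigned, the function index below lists rainette2_complete_groups(), which completes group membership with a knn classification.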

Tell me more

Two vignettes are available with the package.

Credits

This clustering method was created by Max Reinert and is described in several articles.

Thanks to Pierre Ratinaud, the author of Iramuteq, for providing it as free and open source software. Even though the R code has been almost entirely rewritten, it has been an invaluable resource for understanding the algorithms.

Many thanks to Sébastien Rochette for the creation of the hex logo.

Many thanks to Florian Privé for his work on rewriting and optimizing the Rcpp code.

Monthly Downloads: 552
Version: 0.3.1.1
License: GPL (>= 3)
Maintainer: Julien Barnier
Last Published: March 28th, 2023

Functions in rainette (0.3.1.1)

  • rainette_explor: Shiny gadget for rainette clustering exploration
  • rainette2: Corpus clustering based on the Reinert method - Double clustering
  • rainette2_explor: Shiny gadget for rainette2 clustering exploration
  • rainette2_plot: Generate a clustering description plot from a rainette2 result
  • switch_docs: Switch documents between two groups to maximize chi-square value
  • rainette_plot: Generate a clustering description plot from a rainette result
  • rainette2_complete_groups: Complete groups membership with knn classification
  • select_features: Remove features from dtm of each group based on cc_test and tsj
  • rainette_stats: Generate cluster keyness statistics from a rainette result
  • clusters_by_doc_table: Returns the number of segments of each cluster for each source document
  • merge_segments: Merges segments according to minimum segment size
  • cutree_rainette: Cut a rainette result tree into groups of documents
  • import_corpus_iramuteq: Import a corpus in Iramuteq format
  • cluster_tab: Split a dtm into two clusters with the Reinert algorithm
  • cutree: Cut a tree into groups
  • cutree_rainette2: Cut a rainette2 result object into groups of documents
  • order_docs: Returns document indices ordered by CA first axis coordinates
  • docs_by_cluster_table: Returns, for each cluster, the number of source documents with at least n segments of this cluster
  • rainette: Corpus clustering based on the Reinert method - Simple clustering
  • split_segments: Split a character string or corpus into segments