text (version 1.8.1)

textTopics: BERTopic

Description

textTopics() trains a BERTopic model (via the bertopic Python package) on a text variable in a tibble/data.frame. The function embeds the documents, reduces dimensionality with UMAP, clusters the documents with HDBSCAN, and extracts topic representations using c-TF-IDF, optionally refined with KeyBERT- or MMR-based representations. (EXPERIMENTAL)
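
A minimal call and a first look at its outputs might read as follows (a sketch: my_data and the column name "mytext" are placeholders, and the Python backend must already be set up, e.g. via textrpp_install()):

library(text)

res <- textTopics(
  data = my_data,
  variable_name = "mytext",
  save_dir = "bertopic_out"
)

names(res)       # the components described under Value below
res$topic_info   # topic sizes and top terms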

Usage

textTopics(
  data,
  variable_name,
  embedding_model = "distilroberta",
  representation_model = c("mmr", "keybert"),
  umap_n_neighbors = 15L,
  umap_n_components = 5L,
  umap_min_dist = 0,
  umap_metric = "cosine",
  hdbscan_min_cluster_size = 5L,
  hdbscan_min_samples = NULL,
  hdbscan_metric = "euclidean",
  hdbscan_cluster_selection_method = "eom",
  hdbscan_prediction_data = TRUE,
  num_top_words = 10L,
  n_gram_range = c(1L, 3L),
  stopwords = "english",
  min_df = 5L,
  bm25_weighting = FALSE,
  reduce_frequent_words = TRUE,
  set_seed = 8L,
  save_dir
)

Value

A named list containing:

train_data

The training data used to fit the model (or loaded from disk if available).

preds

A document-by-topic matrix of normalized topic mixtures (LDA-like). Rows typically sum to 1; rows of zeros can occur if no topic mass was assigned.
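
For example, the mixtures can be sanity-checked after fitting (a sketch; res is a textTopics() result):

range(rowSums(res$preds))     # rows typically sum to (approximately) 1
sum(rowSums(res$preds) == 0)  # documents that received no topic mass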

doc_info

Document-level outputs including hard topic labels (-1 indicates outliers).

topic_info

Topic-level outputs including topic sizes and top terms.
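
A sketch of typical post-fit inspection (the column name Topic follows the underlying bertopic convention; verify with names(res$doc_info) in case it differs):

head(res$topic_info)            # largest topics with sizes and top terms
table(res$doc_info$Topic)       # hard topic labels; -1 marks outliers
mean(res$doc_info$Topic == -1)  # share of outlier documents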

model

The fitted BERTopic model object (Python-backed).

model_type

Model identifier (currently "bert_topic").

seed

Random seed used.

save_dir

Directory where artifacts were saved.

Arguments

data

A tibble/data.frame containing a text variable to analyze and, optionally, additional numeric/categorical variables that can be used in later analyses (e.g., testing topic prevalence differences across groups).
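
For example (a sketch; the column names are placeholders):

dat <- tibble::tibble(
  mytext = c("I feel at peace with my life.", "Work has been stressful lately."),
  age    = c(34, 27),
  group  = c("a", "b")
)

The text column is passed via variable_name; the remaining columns stay available for follow-up analyses such as textTopicsTest().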

variable_name

A character string giving the name of the text variable in data to perform topic modeling on.

embedding_model

A character string specifying which embedding model to use. Common options include "miniLM", "mpnet", "multi-mpnet", and "distilroberta". The choice affects topic quality, speed, and memory usage.

representation_model

A character string specifying the topic representation method. Must be one of "mmr" or "keybert".

  • "keybert" uses embedding similarity to select representative words/phrases.

  • "mmr" (Maximal Marginal Relevance) promotes diversity among selected terms.

umap_n_neighbors

Integer. Number of neighbors used by UMAP to balance local versus global structure. Smaller values emphasize local clusters; larger values emphasize global structure.

umap_n_components

Integer. Number of dimensions to reduce to with UMAP (the embedding space used for clustering).

umap_min_dist

Numeric. Minimum distance between embedded points in UMAP. Smaller values typically yield tighter clusters.

umap_metric

Character string specifying the distance metric used by UMAP, e.g. "cosine".

hdbscan_min_cluster_size

Integer. The minimum cluster size for HDBSCAN. Larger values yield fewer, broader topics; smaller values yield more, finer-grained topics.

hdbscan_min_samples

Integer or NULL. Controls how conservative the clustering is: larger values make HDBSCAN declare more points as noise/outliers. If NULL, HDBSCAN defaults to the value of hdbscan_min_cluster_size.

hdbscan_metric

Character string specifying the metric used by HDBSCAN, typically "euclidean" when clustering in reduced UMAP space.

hdbscan_cluster_selection_method

Character string specifying cluster selection strategy. Either "eom" (excess of mass; often yields more stable clusters) or "leaf" (can yield more fine-grained clusters).

hdbscan_prediction_data

Logical. If TRUE, stores additional information enabling approximate topic prediction for new documents (when supported by the underlying pipeline).

num_top_words

Integer. Number of top terms to return per topic.

n_gram_range

Integer vector of length 2 giving the min and max n-gram length used by the vectorizer (e.g., c(1L, 3L)).

stopwords

Character string naming the stopword dictionary to use (e.g. "english").

min_df

Integer. Minimum number of documents a term must appear in to be included in the vectorizer's vocabulary.

bm25_weighting

Logical. If TRUE, uses BM25 weighting in the class-based TF-IDF transformer (can improve term weighting in some corpora).

reduce_frequent_words

Logical. If TRUE, down-weights very frequent words using the class-based TF-IDF transformer.

set_seed

Integer. Random seed used to initialize UMAP (and other stochastic components) for reproducibility.

save_dir

Character string specifying the directory where outputs should be saved. A folder will be created (or reused) to store the fitted model and derived outputs.

Details

Typical tuning levers (a configuration sketch follows this list):

  • More topics / finer clusters: decrease hdbscan_min_cluster_size, decrease umap_n_neighbors, and/or increase umap_n_components.

  • Fewer topics / broader clusters: increase hdbscan_min_cluster_size and/or increase umap_n_neighbors.

  • More phrase-like terms: increase n_gram_range max (e.g., up to 3).

  • Cleaner vocabulary: increase min_df, and use reduce_frequent_words = TRUE.
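
For instance (a sketch; the argument values are illustrative rather than recommendations, and dat / "mytext" are placeholders):

# Finer-grained topics: more local UMAP structure, smaller clusters.
res_fine <- textTopics(
  data = dat, variable_name = "mytext",
  umap_n_neighbors = 10L,
  hdbscan_min_cluster_size = 3L,
  save_dir = "bertopic_fine"
)

# Broader topics with a cleaner vocabulary.
res_broad <- textTopics(
  data = dat, variable_name = "mytext",
  umap_n_neighbors = 30L,
  hdbscan_min_cluster_size = 20L,
  min_df = 10L,
  reduce_frequent_words = TRUE,
  save_dir = "bertopic_broad"
)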

See Also

textTopicsReduce, textTopicsTest, textTopicsWordcloud

Examples

if (FALSE) {
# Fit a BERTopic model to the harmony texts in the example data.
res <- textTopics(
  data = Language_based_assessment_data_8,
  variable_name = "harmonytexts",
  embedding_model = "distilroberta",
  representation_model = "mmr",
  min_df = 3L, # keep terms occurring in at least 3 documents
  save_dir = "bertopic_results"
)

# Inspect the resulting topics.
res$topic_info
}
