textTopics() trains a BERTopic model (via the bertopic Python package) on a
text variable in a tibble/data.frame. The function embeds documents, reduces
dimensionality (UMAP), clusters documents (HDBSCAN), and extracts topic representations
using c-TF-IDF, optionally refined with KeyBERT- or MMR-based representations. (EXPERIMENTAL)
textTopics(
data,
variable_name,
embedding_model = "distilroberta",
representation_model = c("mmr", "keybert"),
umap_n_neighbors = 15L,
umap_n_components = 5L,
umap_min_dist = 0,
umap_metric = "cosine",
hdbscan_min_cluster_size = 5L,
hdbscan_min_samples = NULL,
hdbscan_metric = "euclidean",
hdbscan_cluster_selection_method = "eom",
hdbscan_prediction_data = TRUE,
num_top_words = 10L,
n_gram_range = c(1L, 3L),
stopwords = "english",
min_df = 5L,
bm25_weighting = FALSE,
reduce_frequent_words = TRUE,
set_seed = 8L,
save_dir
)

A named list containing:
- The training data used to fit the model (or loaded from disk if available).
- A document-by-topic matrix of normalized topic mixtures (LDA-like). Rows typically sum to 1; rows of zeros can occur if no topic mass was assigned.
- Document-level outputs, including hard topic labels (-1 indicates outliers).
- Topic-level outputs, including topic sizes and top terms.
- The fitted BERTopic model object (Python-backed).
- The model identifier (currently "bert_topic").
- The random seed used.
- The directory where artifacts were saved.
data: A tibble/data.frame containing a text variable to analyze and,
optionally, additional numeric/categorical variables that can be used in later analyses
(e.g., testing topic prevalence differences across groups).

variable_name: A character string giving the name of the text variable in data
to perform topic modeling on.

embedding_model: A character string specifying which embedding model to use.
Common options include "miniLM", "mpnet", "multi-mpnet",
and "distilroberta". The choice affects topic quality, speed, and memory usage.

representation_model: A character string specifying the topic representation method.
Must be one of "mmr" or "keybert".
"keybert" uses embedding similarity to select representative words/phrases;
"mmr" (Maximal Marginal Relevance) promotes diversity among selected terms.

umap_n_neighbors: Integer. Number of neighbors used by UMAP to balance local versus global structure. Smaller values emphasize local clusters; larger values emphasize global structure.

umap_n_components: Integer. Number of dimensions to reduce to with UMAP (the embedding space used for clustering).

umap_min_dist: Numeric. Minimum distance between embedded points in UMAP. Smaller values typically yield tighter clusters.

umap_metric: Character string specifying the distance metric used by UMAP, e.g.
"cosine".

hdbscan_min_cluster_size: Integer. The minimum cluster size for HDBSCAN. Larger values yield fewer, broader topics; smaller values yield more, finer-grained topics.

hdbscan_min_samples: Integer or NULL. Controls how conservative clustering is.
If NULL, HDBSCAN chooses a default.

hdbscan_metric: Character string specifying the metric used by HDBSCAN, typically
"euclidean" when clustering in the reduced UMAP space.

hdbscan_cluster_selection_method: Character string specifying the cluster selection strategy.
Either "eom" (excess of mass; often yields more stable clusters) or "leaf"
(can yield more fine-grained clusters).

hdbscan_prediction_data: Logical. If TRUE, stores additional information enabling
approximate topic prediction for new documents (when supported by the underlying pipeline).

num_top_words: Integer. Number of top terms to return per topic.

n_gram_range: Integer vector of length 2 giving the minimum and maximum n-gram lengths used by
the vectorizer (e.g., c(1L, 3L)).

stopwords: Character string naming the stopword dictionary to use (e.g., "english").

min_df: Integer. Minimum document frequency for terms included in the vectorizer.

bm25_weighting: Logical. If TRUE, uses BM25 weighting in the class-based TF-IDF
transformer (can improve term weighting in some corpora).

reduce_frequent_words: Logical. If TRUE, down-weights very frequent words in
the class-based TF-IDF transformer.

set_seed: Integer. Random seed used to initialize UMAP (and other stochastic components) for reproducibility.

save_dir: Character string specifying the directory where outputs should be saved. A folder will be created (or reused) to store the fitted model and derived outputs.
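As a sketch of how these arguments map onto the pipeline stages, the call below sets the UMAP and HDBSCAN parameters explicitly. It is not run here: it assumes the text package is installed with its Python backend (bertopic) initialized, and uses the package's bundled example data; the argument values are illustrative, not recommendations.

```r
library(text)

res <- textTopics(
  data = Language_based_assessment_data_8,
  variable_name = "harmonytexts",
  embedding_model = "distilroberta",   # document embeddings
  representation_model = "keybert",    # embedding-based top terms
  umap_n_neighbors = 10L,              # smaller: emphasize local structure
  umap_n_components = 5L,              # dimensionality used for clustering
  hdbscan_min_cluster_size = 8L,       # larger: fewer, broader topics
  num_top_words = 10L,
  min_df = 3L,
  save_dir = "bertopic_results"
)
```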
Typical tuning levers:
- More topics / finer clusters: decrease hdbscan_min_cluster_size,
decrease umap_n_neighbors, and/or increase umap_n_components.
- Fewer topics / broader clusters: increase hdbscan_min_cluster_size
and/or increase umap_n_neighbors.
- More phrase-like terms: increase the n_gram_range maximum (e.g., up to 3).
- Cleaner vocabulary: increase min_df, and use reduce_frequent_words = TRUE.
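Under the same assumptions (text package with a working Python backend), the levers above translate into contrasting calls like these; the specific values are illustrative only:

```r
library(text)

# Finer-grained topics: smaller clusters, more local UMAP structure
finer <- textTopics(
  data = Language_based_assessment_data_8,
  variable_name = "harmonytexts",
  umap_n_neighbors = 8L,
  hdbscan_min_cluster_size = 3L,
  save_dir = "bertopic_finer"
)

# Broader topics with a cleaner vocabulary
broader <- textTopics(
  data = Language_based_assessment_data_8,
  variable_name = "harmonytexts",
  umap_n_neighbors = 30L,
  hdbscan_min_cluster_size = 15L,
  min_df = 10L,
  reduce_frequent_words = TRUE,
  save_dir = "bertopic_broader"
)
```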
See also: textTopicsReduce, textTopicsTest, textTopicsWordcloud
if (FALSE) {
res <- textTopics(
data = Language_based_assessment_data_8,
variable_name = "harmonytexts",
embedding_model = "distilroberta",
representation_model = "mmr",
min_df = 3,
save_dir = "bertopic_results"
)
}
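After fitting, the components described under the return value can be inspected on the result list. The element names below are assumptions (this page does not list them); check names(res) on your own fitted object:

```r
# Element names are assumptions -- confirm with names(res)
names(res)

# res$preds        # document-by-topic mixtures (rows typically sum to 1)
# res$model_type   # model identifier, "bert_topic"
# res$save_dir     # where the fitted model and outputs were saved
```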