This function creates embeddings with sentence-transformers, configures UMAP, HDBSCAN, and CountVectorizer, optionally wires in a representation model, and fits a BERTopic model from R. The returned model can be used with bertopicr helpers.
train_bertopic_model(
  docs,
  embedding_model = "Qwen/Qwen3-Embedding-0.6B",
  embeddings = NULL,
  embedding_batch_size = 32,
  embedding_show_progress = TRUE,
  umap_model = NULL,
  umap_n_neighbors = 15,
  umap_n_components = 5,
  umap_min_dist = 0,
  umap_metric = "cosine",
  umap_random_state = 42,
  hdbscan_model = NULL,
  hdbscan_min_cluster_size = 50,
  hdbscan_min_samples = 20,
  hdbscan_metric = "euclidean",
  hdbscan_cluster_selection_method = "eom",
  hdbscan_gen_min_span_tree = TRUE,
  hdbscan_prediction_data = TRUE,
  hdbscan_core_dist_n_jobs = 1,
  vectorizer_model = NULL,
  stop_words = "all_stopwords",
  ngram_range = c(1, 3),
  min_df = 2L,
  max_df = 50L,
  max_features = 10000,
  strip_accents = NULL,
  decode_error = "strict",
  encoding = "UTF-8",
  representation_model = c("none", "keybert", "mmr", "ollama"),
  representation_params = list(),
  ollama_model = NULL,
  ollama_base_url = "http://localhost:11434/v1",
  ollama_api_key = "ollama",
  ollama_client_params = list(),
  ollama_prompt = NULL,
  top_n_words = 200L,
  calculate_probabilities = TRUE,
  verbose = TRUE,
  seed = NULL,
  timestamps = NULL,
  topics_over_time_nr_bins = 20L,
  topics_over_time_global_tuning = TRUE,
  topics_over_time_evolution_tuning = TRUE,
  classes = NULL,
  compute_reduced_embeddings = TRUE,
  reduced_embedding_n_neighbors = 10L,
  reduced_embedding_min_dist = 0,
  reduced_embedding_metric = "cosine",
  compute_hierarchical_topics = TRUE,
  bertopic_args = list()
)

A list with elements model, topics, probabilities, embeddings, reduced_embeddings_2d, reduced_embeddings_3d, hierarchical_topics, topics_over_time, and topics_per_class.
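A minimal sketch of inspecting this return value, assuming a fitted object named fit; the exact columns of each element are not documented here and are treated as assumptions:

fit <- train_bertopic_model(docs)     # docs: a character vector of documents

str(fit, max.level = 1)               # overview of all returned elements
head(fit$topics)                      # per-document topic assignments (assumed)
dim(fit$probabilities)                # document-topic probability matrix (assumed)
head(fit$reduced_embeddings_2d)       # 2D coordinates for plotting (assumed layout)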
docs: Character vector of documents to model.
embedding_model: Sentence-transformers model name or local path.
embeddings: Optional precomputed embeddings (matrix or array).
embedding_batch_size: Batch size for embedding encoding.
embedding_show_progress: Logical. Show embedding progress bar.
umap_model: Optional pre-built UMAP Python object. If NULL, one is created.
umap_n_neighbors: Number of neighbors for UMAP.
umap_n_components: Number of UMAP components.
umap_min_dist: UMAP min_dist parameter.
umap_metric: UMAP metric.
umap_random_state: Random state for UMAP.
hdbscan_model: Optional pre-built HDBSCAN Python object. If NULL, one is created.
hdbscan_min_cluster_size: HDBSCAN min_cluster_size.
hdbscan_min_samples: HDBSCAN min_samples.
hdbscan_metric: HDBSCAN metric.
hdbscan_cluster_selection_method: HDBSCAN cluster selection method.
hdbscan_gen_min_span_tree: HDBSCAN gen_min_span_tree.
hdbscan_prediction_data: Logical. Whether to generate prediction data.
hdbscan_core_dist_n_jobs: HDBSCAN core_dist_n_jobs.
vectorizer_model: Optional pre-built CountVectorizer Python object.
stop_words: Stop words for CountVectorizer. Use "all_stopwords" to load the bundled multilingual list, "english", or a character vector.
ngram_range: Length-2 integer vector for n-gram range.
min_df: Minimum document frequency for CountVectorizer.
max_df: Maximum document frequency for CountVectorizer.
max_features: Maximum features for CountVectorizer.
strip_accents: Passed to CountVectorizer. Use NULL to preserve umlauts.
decode_error: Passed to CountVectorizer when decoding input bytes.
encoding: Text encoding for CountVectorizer (defaults to "UTF-8").
representation_model: Representation model to use: "none", "keybert", "mmr", or "ollama". See the sketch after this argument list for an example combining several options.
representation_params: Named list of parameters passed to the representation model.
ollama_model: Ollama model name when representation_model = "ollama".
ollama_base_url: Base URL for the Ollama OpenAI-compatible endpoint.
ollama_api_key: API key placeholder for the Ollama OpenAI-compatible endpoint.
ollama_client_params: Named list of extra parameters passed to openai$OpenAI().
ollama_prompt: Optional prompt template for the Ollama OpenAI representation.
top_n_words: Number of top words per topic to keep in the model.
calculate_probabilities: Logical. Whether to calculate topic probabilities.
verbose: Logical. Verbosity for BERTopic.
seed: Optional random seed.
timestamps: Optional vector of timestamps (Date/POSIXt/ISO strings or integer) for topics over time. Defaults to NULL (topics over time disabled). See the sketch after this argument list.
topics_over_time_nr_bins: Number of bins for topics_over_time.
topics_over_time_global_tuning: Logical. Whether to enable global tuning for topics_over_time.
topics_over_time_evolution_tuning: Logical. Whether to enable evolution tuning for topics_over_time.
classes: Optional vector of class labels (character or factor) for topics per class. Defaults to NULL (topics per class disabled).
compute_reduced_embeddings: Logical. If TRUE, computes 2D and 3D UMAP reductions.
reduced_embedding_n_neighbors: Number of neighbors for reduced embeddings.
reduced_embedding_min_dist: UMAP min_dist for reduced embeddings.
reduced_embedding_metric: UMAP metric for reduced embeddings.
compute_hierarchical_topics: Logical. If TRUE, computes hierarchical topics.
bertopic_args: Named list of extra arguments passed to BERTopic().
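The sketch below shows how several of these arguments combine in a single call: a custom stop-word vector, the Ollama representation, and the optional timestamps and classes inputs. The data frame df, its columns (text_clean, date, outlet), the stop-word vector, and the model name "llama3.1" are illustrative assumptions, not package defaults.

fit <- train_bertopic_model(
  docs = df$text_clean,
  stop_words = c("der", "die", "das", "und"),   # custom stop-word vector (assumed)
  representation_model = "ollama",
  ollama_model = "llama3.1",                    # assumed local Ollama model
  ollama_base_url = "http://localhost:11434/v1",
  timestamps = df$date,                         # enables topics_over_time
  classes = df$outlet,                          # enables topics_per_class
  seed = 42
)

head(fit$topics_over_time)
head(fit$topics_per_class)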
# \donttest{
if (requireNamespace("reticulate", quietly = TRUE) &&
    reticulate::py_available(initialize = FALSE) &&
    reticulate::py_module_available("bertopic")) {
  setup_python_environment()
  sample_path <- system.file("extdata", "spiegel_sample.rds", package = "bertopicr")
  df <- readr::read_rds(sample_path)
  texts <- df$text_clean[seq_len(500)]
  fit <- train_bertopic_model(
    texts,
    embedding_model = "Qwen/Qwen3-Embedding-0.6B",
    top_n_words = 3L
  )
  visualize_topics(fit$model, filename = "intertopic_distance_map", auto_open = FALSE)
} else {
  message("Python/bertopic not available. Skipping this example.")
}
# }
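As a follow-up to the example above, fit$model is assumed to be the reticulate handle to the underlying Python BERTopic object (the call to visualize_topics(fit$model, ...) suggests as much), so BERTopic's own methods can be called on it directly:

topic_info <- fit$model$get_topic_info()   # one row per topic: id, size, name
head(topic_info)
fit$model$get_topic(0L)                    # top words and scores for topic 0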