
SportMiner (version 0.1.0)

sm_preprocess_text: Preprocess Text for Topic Modeling

Description

Tokenizes, cleans, and stems text data in preparation for topic modeling. Removes stopwords and numbers, then stems the remaining tokens using the Porter algorithm.
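
The steps described above can be sketched in a few lines of R. This is an illustration only, not SportMiner's actual implementation: the helper name `sketch_preprocess`, its tiny stopword list, and the letter-based tokenizer are assumptions made for the example. Stemming uses `SnowballC::wordStem()` when that package is installed and is skipped otherwise.

```r
# Illustrative sketch only; not the package's real code.
# `sketch_preprocess` and its short stopword list are hypothetical.
sketch_preprocess <- function(texts, min_word_length = 3,
                              stopwords = c("the", "and", "for", "with", "are")) {
  rows <- lapply(seq_along(texts), function(i) {
    # Lowercase, then split on any run of non-letters (drops numbers and punctuation)
    tokens <- unlist(strsplit(tolower(texts[i]), "[^a-z]+"))
    # Drop empty strings, short tokens, and stopwords
    tokens <- tokens[nchar(tokens) >= min_word_length & !(tokens %in% stopwords)]
    if (length(tokens) == 0) return(NULL)
    # Porter stemming when SnowballC is available; otherwise tokens stay unstemmed
    if (requireNamespace("SnowballC", quietly = TRUE)) {
      tokens <- SnowballC::wordStem(tokens, language = "porter")
    }
    # Count each stem within the document
    counts <- table(tokens)
    data.frame(doc_id = i, stem = names(counts), n = as.integer(counts),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

result <- sketch_preprocess(c("The 12 athletes are training for the race.",
                              "Training load and injury risk in 2020."))
print(result)
```

The result has one row per document and stem, which matches the doc_id, stem, and n columns listed under Value.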

Usage

sm_preprocess_text(
  data,
  text_col = "abstract",
  id_col = NULL,
  min_word_length = 3,
  custom_stopwords = NULL
)

Value

A data.frame with columns: doc_id, stem, and n (word count).

Arguments

data

A data.frame containing text data.

text_col

Name of the column containing text to preprocess. Default is "abstract".

id_col

Name of the column containing document IDs. If NULL, a doc_id column will be created. Default is NULL.

min_word_length

Minimum word length to retain. Default is 3.

custom_stopwords

Additional stopwords to remove beyond the standard English stopwords. Default is NULL.
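
As a hedged illustration of the non-default arguments, the sketch below builds a toy data.frame and passes every argument explicitly. The column names `paper_id` and `text` and the stopword choices are made up for this example. The call runs only if SportMiner is installed. Whether custom stopwords are matched before or after stemming is not stated here, so treat the choice of words as an assumption.

```r
# Toy data; the column names "paper_id" and "text" are hypothetical.
papers <- data.frame(
  paper_id = c("p1", "p2"),
  text = c("Sprint performance in elite football players.",
           "A study of recovery strategies after marathon running."),
  stringsAsFactors = FALSE
)

# Only call the function when SportMiner is actually installed.
if (requireNamespace("SportMiner", quietly = TRUE)) {
  processed <- SportMiner::sm_preprocess_text(
    papers,
    text_col = "text",           # column holding the text to preprocess
    id_col = "paper_id",         # reuse existing IDs instead of creating doc_id
    min_word_length = 4,         # keep only words of at least 4 characters
    custom_stopwords = c("study", "players")  # removed in addition to the standard list
  )
  head(processed)
}
```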

Examples

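if (FALSE) {
# Requires API data from sm_search_scopus()
# Example query string; replace it with your own search
query <- 'TITLE-ABS-KEY("sport analytics")'
papers <- sm_search_scopus(query, max_count = 50)
# Uses the defaults: text_col = "abstract", id_col = NULL
processed <- sm_preprocess_text(papers)
}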
