Tokenizes, cleans, and stems text data in preparation for topic modeling. Removes stopwords and numbers, drops words shorter than a minimum length, and stems the remaining words using the Porter algorithm.
sm_preprocess_text(
  data,
  text_col = "abstract",
  id_col = NULL,
  min_word_length = 3,
  custom_stopwords = NULL
)
A data.frame with columns: doc_id, stem, and n (word count).
data: A data.frame containing text data.
text_col: Name of the column containing the text to preprocess. Default is "abstract".
id_col: Name of the column containing document IDs. If NULL, a doc_id column is created. Default is NULL.
min_word_length: Minimum word length to retain. Default is 3.
custom_stopwords: Additional stopwords to remove, beyond the standard English stopwords. Default is NULL.
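To make the described steps concrete, here is a minimal base R sketch of a similar pipeline: tokenize, filter, stem, and count. It is not the package's implementation. The function name `sketch_preprocess`, the tiny stopword list, and the choice to drop only purely numeric tokens are all assumptions made for illustration. Porter stemming is applied through `SnowballC::wordStem()` when that package is installed; otherwise the words are left unstemmed.

```r
# Illustrative sketch only; sketch_preprocess is a hypothetical name,
# not a function exported by the package documented here.
sketch_preprocess <- function(data, text_col = "abstract", id_col = NULL,
                              min_word_length = 3, custom_stopwords = NULL) {
  # Create sequential document IDs when no ID column is named.
  doc_id <- if (is.null(id_col)) {
    as.character(seq_len(nrow(data)))
  } else {
    as.character(data[[id_col]])
  }

  # Tiny stand-in stopword list (assumption); real lists are far longer.
  stopwords <- c("the", "and", "for", "are", "with", "that", "this", "from",
                 custom_stopwords)

  # Tokenize: lowercase, then split on anything that is not a letter or digit.
  tokens <- strsplit(tolower(as.character(data[[text_col]])), "[^a-z0-9]+")
  long <- data.frame(
    doc_id = rep(doc_id, lengths(tokens)),
    word = as.character(unlist(tokens)),
    stringsAsFactors = FALSE
  )

  # Drop missing values, short words, purely numeric tokens, and stopwords.
  keep <- !is.na(long$word) &
    nchar(long$word) >= min_word_length &
    !grepl("^[0-9]+$", long$word) &
    !(long$word %in% stopwords)
  long <- long[keep, , drop = FALSE]

  # Nothing left to count: return an empty result with the expected columns.
  if (nrow(long) == 0) {
    return(data.frame(doc_id = character(0), stem = character(0),
                      n = integer(0), stringsAsFactors = FALSE))
  }

  # Porter stemming via SnowballC if available; identity fallback otherwise.
  stem_fun <- if (requireNamespace("SnowballC", quietly = TRUE)) {
    function(x) SnowballC::wordStem(x, language = "porter")
  } else {
    identity
  }
  long$stem <- stem_fun(long$word)

  # Count occurrences of each stem within each document.
  long$n <- 1L
  counts <- aggregate(n ~ doc_id + stem, data = long, FUN = sum)
  counts <- counts[order(counts$doc_id, counts$stem), ]
  rownames(counts) <- NULL
  counts
}

docs <- data.frame(
  abstract = c("The models model 2 topics", "Topics and the 2024 models"),
  stringsAsFactors = FALSE
)
result <- sketch_preprocess(docs)
print(result)
```

The result has one row per document and stem, which matches the doc_id, stem, and n layout described for the return value.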
if (FALSE) {
  # Requires API data from sm_search_scopus()
  papers <- sm_search_scopus(query, max_count = 50)
  processed <- sm_preprocess_text(papers)
}
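The function can presumably also be tried without API access by passing a hand-built data.frame and setting the documented arguments explicitly. The sketch below uses only the signature shown in the usage section. The column names `paper_id` and `summary`, the sample sentences, and the extra stopwords are made up for illustration. The call is guarded so that it is skipped when the package is not attached.

```r
# Small in-memory input; the column names and text are hypothetical.
papers <- data.frame(
  paper_id = c("p1", "p2"),
  summary = c("Topic models summarise large text collections.",
              "Stemming reduces inflected words to a common stem."),
  stringsAsFactors = FALSE
)

# Guarded so the snippet does nothing when sm_preprocess_text() is unavailable.
if (exists("sm_preprocess_text")) {
  processed <- sm_preprocess_text(
    papers,
    text_col = "summary",           # column holding the text
    id_col = "paper_id",            # reuse existing IDs instead of creating doc_id
    min_word_length = 4,            # stricter than the default of 3
    custom_stopwords = c("large", "common")
  )
  print(head(processed))
}
```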