pre_process: Preprocess a Vector of Text Documents

Description

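Cleans a character vector of raw text documents through a configurable pipeline. Each step is controlled by its own logical argument, and all steps are enabled by default: removing bracketed text, URLs and email addresses, HTML tags, numbers, common emojis, and punctuation; converting to lowercase; removing English stopwords; and lemmatizing.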

Usage

pre_process(
  doc_vector,
  remove_brackets = TRUE,
  remove_urls = TRUE,
  remove_html = TRUE,
  remove_nums = TRUE,
  remove_emojis_flag = TRUE,
  to_lowercase = TRUE,
  remove_punct = TRUE,
  remove_stop_words = TRUE,
  lemmatize = TRUE
)

Value

A character vector of the cleaned and preprocessed text.

Arguments

doc_vector: A character vector where each element is a document.
remove_brackets: A logical value indicating whether to remove text in square brackets.
remove_urls: A logical value indicating whether to remove URLs and email addresses.
remove_html: A logical value indicating whether to remove HTML tags.
remove_nums: A logical value indicating whether to remove numbers.
remove_emojis_flag: A logical value indicating whether to remove common emojis.
to_lowercase: A logical value indicating whether to convert text to lowercase.
remove_punct: A logical value indicating whether to remove punctuation.
remove_stop_words: A logical value indicating whether to remove English stopwords.
lemmatize: A logical value indicating whether to lemmatize words to their dictionary form.
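
To make the pattern-based arguments more concrete, the sketch below approximates a few of them in base R. It is an illustration only, not the package's implementation: the helper name sketch_clean, the regular expressions, and the order of the steps are all assumptions. Stopword removal, emoji removal, and lemmatization are left out because they depend on word lists or dictionaries that this page does not describe.

```r
# Hypothetical helper, NOT part of the package: a rough base-R approximation
# of the bracket, HTML, URL/email, number, lowercase, and punctuation steps.
sketch_clean <- function(x) {
  x <- gsub("\\[[^]]*\\]", " ", x)                          # text in square brackets
  x <- gsub("<[^>]+>", " ", x)                              # HTML tags
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)  # URLs
  x <- gsub("\\S+@\\S+\\.\\S+", " ", x, perl = TRUE)        # email addresses
  x <- gsub("[[:digit:]]+", " ", x)                         # numbers
  x <- tolower(x)                                           # lowercase
  x <- gsub("[[:punct:]]+", " ", x)                         # punctuation
  x <- gsub("\\s+", " ", x, perl = TRUE)                    # collapse whitespace
  trimws(x)
}

sketch_clean(c(
  "This is a test! Visit https://example.com",
  "Email me at test.user@example.org [important]"
))
# [1] "this is a test visit" "email me at"
```

URLs, email addresses, and HTML tags are stripped before punctuation in this sketch, because removing punctuation first would break the patterns that identify them. Whether pre_process applies its steps in the same order is not stated here.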

Examples

raw_text <- c(
  "This is a test! Visit https://example.com",
  "Email me at test.user@example.org [important]"
)

# Basic preprocessing with defaults
clean_text <- pre_process(raw_text)
print(clean_text)

# Keep punctuation and stopwords
clean_text_no_stop <- pre_process(
  raw_text,
  remove_stop_words = FALSE,
  remove_punct = FALSE
)
print(clean_text_no_stop)
