
quickSentiment (version 0.1.0)

pre_process: Preprocess a Vector of Text Documents

Description

This function provides a configurable pipeline for cleaning raw text data. Each step (removing bracketed text, URLs, HTML tags, numbers, emojis, and punctuation; lowercasing; stopword removal; and lemmatization) is controlled by its own logical argument, and every step is enabled by default.

Usage

pre_process(
  doc_vector,
  remove_brackets = TRUE,
  remove_urls = TRUE,
  remove_html = TRUE,
  remove_nums = TRUE,
  remove_emojis_flag = TRUE,
  to_lowercase = TRUE,
  remove_punct = TRUE,
  remove_stop_words = TRUE,
  lemmatize = TRUE
)

Value

A character vector of the cleaned and preprocessed text.

Arguments

doc_vector

A character vector where each element is a document.

remove_brackets

A logical value indicating whether to remove text in square brackets.

remove_urls

A logical value indicating whether to remove URLs and email addresses.

remove_html

A logical value indicating whether to remove HTML tags.

remove_nums

A logical value indicating whether to remove numbers.

remove_emojis_flag

A logical value indicating whether to remove common emojis.

to_lowercase

A logical value indicating whether to convert text to lowercase.

remove_punct

A logical value indicating whether to remove punctuation.

remove_stop_words

A logical value indicating whether to remove English stopwords.

lemmatize

A logical value indicating whether to lemmatize words to their dictionary form.
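
To make the regex-based arguments above more concrete, here is a rough base R sketch of what several of them might do. It is not the package's implementation: `sketch_clean` is a hypothetical helper, and the patterns and the order of the steps are assumptions chosen for illustration. Emoji removal, stopword removal, and lemmatization are left out because they need lookup tables or extra packages.

```r
# Hypothetical helper, NOT part of quickSentiment: approximates a few of the
# cleaning steps with base R regular expressions. Patterns and step order are
# illustrative assumptions only.
sketch_clean <- function(x) {
  x <- gsub("\\[[^]]*\\]", " ", x, perl = TRUE)            # text in square brackets
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE) # URLs
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)              # email addresses
  x <- gsub("<[^>]+>", " ", x, perl = TRUE)                # HTML tags
  x <- gsub("[0-9]+", " ", x, perl = TRUE)                 # numbers
  x <- tolower(x)                                          # lowercase
  x <- gsub("[[:punct:]]+", " ", x, perl = TRUE)           # punctuation
  trimws(gsub("\\s+", " ", x, perl = TRUE))                # collapse whitespace
}

sketch_input <- c(
  "This is a test! Visit https://example.com",
  "Email me at test.user@example.org [important]"
)
sketch_output <- sketch_clean(sketch_input)
print(sketch_output)
# [1] "this is a test visit" "email me at"
```

The output of `pre_process()` on the same input will differ from this sketch, since the defaults also remove stopwords and lemmatize the remaining words.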

Examples

raw_text <- c(
  "This is a test! Visit https://example.com",
  "Email me at test.user@example.org [important]"
)

# Basic preprocessing with defaults
clean_text <- pre_process(raw_text)
print(clean_text)

# Keep punctuation and stopwords
clean_text_no_stop <- pre_process(
  raw_text,
  remove_stop_words = FALSE,
  remove_punct = FALSE
)
print(clean_text_no_stop)
