contentanalysis (version 0.2.1)

calculate_word_distribution: Calculate word distribution across text segments or sections

Description

Calculates the frequency of selected words/n-grams across document sections or equal-length segments.

Usage

calculate_word_distribution(
  text,
  selected_words,
  use_sections = "auto",
  n_segments = 10,
  remove_stopwords = TRUE,
  language = "en"
)

Value

Tibble with columns:

  • segment_id: Segment identifier

  • segment_name: Section name or segment number

  • segment_type: "section" or "equal_length"

  • word: Word/n-gram

  • count: Absolute frequency

  • total_words: Total words in segment

  • relative_frequency: Proportion of the segment's total words (count / total_words)

  • percentage: relative_frequency expressed as a percentage

The returned tibble also carries attributes with metadata about the segmentation used.
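
For instance, the documented columns work directly with dplyr (a minimal sketch, assuming dist holds the result of a call like the ones under Examples; attribute names are package internals, so they are listed rather than guessed):

library(dplyr)

# Peak relative frequency of each tracked word across segments
dist |>
  group_by(word) |>
  slice_max(relative_frequency, n = 1) |>
  select(segment_name, word, count, percentage)

# List the segmentation metadata stored as attributes
names(attributes(dist))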

Arguments

text

Character string or named list. The full document text, or a named list of section texts.

selected_words

Character vector. Words/n-grams to track.

use_sections

Logical or "auto". Use document sections if available (default: "auto").

n_segments

Integer. Number of segments if not using sections (default: 10).

remove_stopwords

Logical. Remove stopwords before analysis (default: TRUE).

language

Character. Language for stopwords (default: "en").
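
A hedged sketch of the named-list form of text, assuming (as the segment_name column suggests) that list names serve as section labels:

sections <- list(
  Introduction = "Machine learning methods are increasingly common ...",
  Methods = "We trained a neural network and report its accuracy ...",
  Results = "Accuracy improved once the neural network was tuned ..."
)

dist <- calculate_word_distribution(
  text = sections,
  selected_words = c("neural network", "accuracy"),
  use_sections = TRUE,  # treat each list element as one section
  language = "en"
)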

Details

The function:

  • Automatically detects if sections are available

  • Removes stopwords before creating n-grams (if requested)

  • Supports unigrams, bigrams, trigrams, etc. (see the sketch after this list)

  • Calculates both absolute and relative frequencies
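
For example, unigrams and bigrams can be tracked in a single call. Because stopwords are dropped before n-grams are built (when remove_stopwords = TRUE), a tracked bigram is matched against consecutive content words. The words below are illustrative, with doc taken from the Examples section:

# Mix n-gram lengths in one selected_words vector
mixed <- c("accuracy", "neural network", "machine learning")
dist <- calculate_word_distribution(doc, mixed,
                                    remove_stopwords = TRUE,
                                    language = "en")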

Examples

if (FALSE) {
# Extract the document text from a PDF
doc <- pdf2txt_auto("paper.pdf")

# Track specific words across sections
words_to_track <- c("machine learning", "neural network", "accuracy")
dist <- calculate_word_distribution(doc, words_to_track)

# Use equal-length segments instead
dist <- calculate_word_distribution(doc, words_to_track,
                                    use_sections = FALSE,
                                    n_segments = 20)
}
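
A follow-up sketch (not part of the package examples): the documented output columns plot naturally with ggplot2.

library(ggplot2)

ggplot(dist, aes(x = factor(segment_id), y = percentage,
                 colour = word, group = word)) +
  geom_line() +
  labs(x = "Segment", y = "Percentage of segment words")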
