contentanalysis (version 0.2.1)

calculate_word_distribution: Calculate word distribution across text segments or sections

Description

Calculates the frequency of selected words/n-grams across document sections or equal-length segments.

Usage

calculate_word_distribution(
  text,
  selected_words,
  use_sections = "auto",
  n_segments = 10,
  remove_stopwords = TRUE,
  language = "en"
)

Value

Tibble with columns:

  • segment_id: Segment identifier

  • segment_name: Section name or segment number

  • segment_type: "section" or "equal_length"

  • word: Word/n-gram

  • count: Absolute frequency

  • total_words: Total words in segment

  • relative_frequency: Proportion of the segment's total words (count / total_words)

  • percentage: relative_frequency expressed as a percentage

The returned tibble also carries attributes with metadata about the segmentation used.
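
For instance, the documented columns work directly with dplyr (a minimal sketch, assuming dist holds the result of a call like the ones under Examples; attribute names are package internals, so they are listed rather than guessed):

library(dplyr)

# Peak relative frequency of each tracked word across segments
dist |>
  group_by(word) |>
  slice_max(relative_frequency, n = 1) |>
  select(segment_name, word, count, percentage)

# List the segmentation metadata stored as attributes
names(attributes(dist))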

Arguments

text

Character string or named list. The full document text, or a named list of section texts.

selected_words

Character vector. Words/n-grams to track.

use_sections

Logical or "auto". Use document sections if available (default: "auto").

n_segments

Integer. Number of segments if not using sections (default: 10).

remove_stopwords

Logical. Remove stopwords before analysis (default: TRUE).

language

Character. Language for stopwords (default: "en").
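
A hedged sketch of the named-list form of text, assuming (as the segment_name column suggests) that list names serve as section labels:

sections <- list(
  Introduction = "Machine learning methods are increasingly common ...",
  Methods = "We trained a neural network and report its accuracy ...",
  Results = "Accuracy improved once the neural network was tuned ..."
)

dist <- calculate_word_distribution(
  text = sections,
  selected_words = c("neural network", "accuracy"),
  use_sections = TRUE,  # treat each list element as one section
  language = "en"
)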

Details

The function:

  • Automatically detects if sections are available

  • Removes stopwords before creating n-grams (if requested)

  • Supports unigrams, bigrams, trigrams, etc. (see the sketch after this list)

  • Calculates both absolute and relative frequencies
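
For example, unigrams and bigrams can be tracked in a single call. Because stopwords are dropped before n-grams are built (when remove_stopwords = TRUE), a tracked bigram is matched against consecutive content words. The words below are illustrative, with doc taken from the Examples section:

# Mix n-gram lengths in one selected_words vector
mixed <- c("accuracy", "neural network", "machine learning")
dist <- calculate_word_distribution(doc, mixed,
                                    remove_stopwords = TRUE,
                                    language = "en")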

Examples

if (FALSE) {
# Extract the document text from a PDF
doc <- pdf2txt_auto("paper.pdf")

# Track specific words across sections
words_to_track <- c("machine learning", "neural network", "accuracy")
dist <- calculate_word_distribution(doc, words_to_track)

# Use equal-length segments instead
dist <- calculate_word_distribution(doc, words_to_track,
                                    use_sections = FALSE,
                                    n_segments = 20)
}
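
A follow-up sketch (not part of the package examples): the documented output columns plot naturally with ggplot2.

library(ggplot2)

ggplot(dist, aes(x = factor(segment_id), y = percentage,
                 colour = word, group = word)) +
  geom_line() +
  labs(x = "Segment", y = "Percentage of segment words")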
