word_counts: Get word counts by speaker role

Description

Reads target words from a CSV file with a word column or from an in-memory character vector, counts them in the selected CHILDES slice, and writes an Excel workbook with four sheets: Word_Frequencies, Dataset_Summary, File_Speaker_Summary, and Run_Metadata. If no word list is provided, every type in the selected slice is counted (FREQ-style "all words" mode).

Usage

word_counts(
  word_list_file = NULL,
  output_file,
  words = NULL,
  collection = NULL,
  language = NULL,
  corpus = NULL,
  age = NULL,
  sex = NULL,
  role = NULL,
  role_exclude = NULL,
  wildcard = FALSE,
  collapse = c("none", "stem"),
  part_of_speech = NULL,
  tier = c("main", "mor"),
  normalize = FALSE,
  per = 1000L,
  zipf = FALSE,
  include_patterns = NULL,
  exclude_patterns = NULL,
  sort_by = c("word", "frequency"),
  min_count = 0L,
  freq_ignore_special = TRUE,
  db_version = "current",
  cache = FALSE,
  cache_dir = NULL,
  ...
)

Value

Invisibly returns output_file after writing the workbook.

Arguments

word_list_file: Optional path to a CSV file with a column named word. If NULL and words is also NULL, all types in the slice are counted.
output_file: Path to the output .xlsx file.
words: Optional character vector of target words/patterns. Ignored if word_list_file is provided. If both are NULL, all types are counted.
collection: Optional CHILDES collection filter.
language: Optional CHILDES language filter, e.g., "eng".
corpus: Optional CHILDES corpus filter, e.g., "Brown".
age: Optional numeric age in months: a single value or a c(min, max) range.
sex: Optional: "male" and/or "female".
role: Optional character vector of roles to include.
role_exclude: Optional character vector of roles to exclude.
wildcard: Logical; treat "%" as any number of characters and "_" as one character (token mode).
collapse: Either "none" or "stem". Using "stem" triggers token mode.
part_of_speech: Optional POS filter, e.g., c("n","v") (token mode).
tier: Which tier to count from: "main" or "mor".
normalize: Logical; if TRUE, add per-N rate columns.
per: Integer denominator for rates (for example 1000 for per-1k).
zipf: Logical; if TRUE, also add Zipf columns (log10 per-billion).
include_patterns: Optional character vector of CHILDES-style patterns, using "%" and "_" to restrict output to matching words (FREQ-style +s).
exclude_patterns: Optional character vector of CHILDES-style patterns to drop from the output.
sort_by: Final sort order: "word" (alphabetical) or "frequency" (descending Total).
min_count: Integer; drop rows with Total < min_count (after counting).
freq_ignore_special: Logical; if TRUE, drop "xxx", "www", and any word starting with 0, &, +, -, or # (FREQ default ignore rules).
db_version: CHILDES database version label to record in metadata.
cache: Logical; if TRUE, cache CHILDES queries on disk.
cache_dir: Optional cache directory when cache = TRUE.
...: Reserved for future extensions; currently unused.
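
To make the normalize, per, and zipf options concrete, here is a minimal sketch of the arithmetic in base R. The counts and the total_tokens denominator are made-up values, and using the total token count of the selected slice as the denominator is an assumption; the function may compute its rate columns differently (for example, per speaker role).

```r
# Illustrative counts for three words (made-up data, not CHILDES output)
counts <- c(the = 120L, go = 30L, dog = 5L)

# Assumed denominator: total tokens in the selected slice (hypothetical value)
total_tokens <- 5000L

# Per-N rate, as requested by normalize = TRUE with per = 1000L
per <- 1000L
rate_per_n <- counts / total_tokens * per

# Zipf value: log10 of the frequency per billion tokens
zipf_value <- log10(counts / total_tokens * 1e9)

print(data.frame(count = counts, rate_per_n = rate_per_n, zipf = round(zipf_value, 2)))
```

Because the Zipf value is a logarithm of a relative frequency, it stays comparable across slices of very different sizes, whereas raw counts do not.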

Details

Counts exact types by default. The function switches to token mode when wildcard = TRUE, collapse = "stem", or part_of_speech is supplied. Set tier = "mor" to count from the MOR tier instead of the main tier.
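
As a rough illustration of the wildcard semantics used by wildcard, include_patterns, and exclude_patterns ("%" for any number of characters, "_" for exactly one), the sketch below translates such patterns into anchored regular expressions in base R. The helpers like_to_regex() and matches_any() are hypothetical names and are not part of this package; the package's own matching (for example, case handling or the order in which include and exclude patterns are applied) may differ.

```r
# Hypothetical helper: turn a CHILDES-style pattern into an anchored regex.
# "%" becomes ".*", "_" becomes ".", and regex metacharacters are escaped.
like_to_regex <- function(pattern) {
  regex_meta <- c(".", "\\", "|", "(", ")", "[", "]", "{", "}", "^", "$", "*", "+", "?")
  chars <- strsplit(pattern, "", fixed = TRUE)[[1]]
  parts <- vapply(chars, function(ch) {
    if (ch == "%") {
      ".*"
    } else if (ch == "_") {
      "."
    } else if (ch %in% regex_meta) {
      paste0("\\", ch)
    } else {
      ch
    }
  }, character(1), USE.NAMES = FALSE)
  paste0("^", paste(parts, collapse = ""), "$")
}

# Hypothetical helper: TRUE for each word that matches at least one pattern.
matches_any <- function(words, patterns) {
  Reduce(
    `|`,
    lapply(patterns, function(p) grepl(like_to_regex(p), words)),
    rep(FALSE, length(words))
  )
}

words <- c("go", "going", "goes", "cat", "hat", "that")

# Keep words matching "go%", then drop the exact form "goes"
keep <- matches_any(words, "go%") & !matches_any(words, "goes")
print(words[keep])  # "go" "going"
```

Anchoring each pattern with "^" and "$" makes "_at" match "cat" and "hat" but not "that", which is the behavior the one-character wildcard implies.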

Examples

if (FALSE) {
# Minimal example (not run during R CMD check)
tmp_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(word = c("the","go")), tmp_csv, row.names = FALSE)

out_file <- tempfile(fileext = ".xlsx")
word_counts(
  word_list_file = tmp_csv,
  output_file    = out_file,
  language       = "eng",
  corpus         = "Brown",
  age            = c(24, 26)
)

# All-words mode (no word list; counts every type in the slice)
out_all <- tempfile(fileext = ".xlsx")
word_counts(
  word_list_file = NULL,
  words          = NULL,
  output_file    = out_all,
  language       = "eng",
  corpus         = "Brown",
  age            = c(24, 26)
)
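
# Token-mode sketch (untested; assembled from the argument descriptions above,
# so treat the exact behavior as an assumption): wildcard patterns supplied as
# an in-memory vector, with per-1,000 rates, sorted by descending frequency.
out_wild <- tempfile(fileext = ".xlsx")
word_counts(
  words       = c("go%", "_at"),
  output_file = out_wild,
  language    = "eng",
  corpus      = "Brown",
  age         = c(24, 26),
  wildcard    = TRUE,
  normalize   = TRUE,
  per         = 1000L,
  sort_by     = "frequency"
)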
}
