Description

Analyzes a Sumerian text for frequently occurring cuneiform sign combinations
(n-grams). The input can be either cuneiform text or transliterated text
(which is automatically converted to cuneiform via as.cuneiform).
The analysis starts with the longest combinations and works down to single
signs, masking already-counted occurrences to avoid reporting subsequences
that are only frequent because they are part of a longer frequent combination.
N-grams are searched within lines only (not across line boundaries).
Usage

ngram_frequencies(x, min_freq = c(6, 4, 2), mapping = NULL)

Value

A data frame with three columns, sorted by descending length, then descending frequency:
Integer. The number of occurrences of the combination.
Integer. The number of signs in the combination.
Character. The cuneiform sign combination
(e.g., "\U0001202D\U00012097\U000120A0").
Arguments

x

Character vector whose elements are the lines of a Sumerian text.
The input can be either cuneiform characters or transliterated text. If no
cuneiform characters (U+12000 to U+1254F) are detected, the input is
automatically converted using as.cuneiform.
Lines starting with # are treated as comments and ignored.
Optional line numbers at the beginning of a line (e.g., "42)\t")
are automatically removed. Spaces are removed before tokenization.
min_freq

Integer vector specifying minimum frequencies (default: c(6, 4, 2)).
The i-th value specifies the minimum frequency for combinations of
length i. For lengths beyond the vector's length, the last value is
used.
The default c(6, 4, 2) means: single signs must occur at least 6
times, pairs at least 4 times, and all longer combinations at least 2
times.
mapping

A data frame containing the sign mapping table, with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file etcsl_mapping.txt is loaded.
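The preprocessing of x and the recycling rule for min_freq described above can be sketched as small helpers. These are illustrative only (the names preprocess_lines and threshold_for are hypothetical, not part of the package):

```r
# Illustrative helpers, not the package's actual internals.

# Preprocess input lines as described for `x`: drop comment lines,
# strip leading line numbers like "42)\t", and remove spaces.
preprocess_lines <- function(x) {
  x <- x[!grepl("^#", x)]
  x <- sub("^[0-9]+\\)\t", "", x)
  gsub(" ", "", x)
}

# Minimum frequency for an n-gram of length n: the last value
# of min_freq is reused beyond the vector's length.
threshold_for <- function(n, min_freq = c(6, 4, 2)) {
  min_freq[min(n, length(min_freq))]
}

preprocess_lines(c("# comment", "42)\tlu2 lugal"))  # "lu2lugal"
threshold_for(1)  # 6
threshold_for(7)  # 2
```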
Details

A “sign” is defined as either a single cuneiform Unicode character (U+12000 to U+1254F) or a character sequence enclosed in mathematical angle brackets (U+27E8 ... U+27E9), which is treated as a single token. All other characters (spaces, X, numbers, punctuation, etc.) are skipped during tokenization.
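Under this definition, tokenization can be sketched with a single regular expression. The helper tokenize_signs is hypothetical, not the package's exported API:

```r
# Illustrative tokenizer: keep cuneiform code points (U+12000-U+1254F)
# and angle-bracket groups (U+27E8 ... U+27E9) as single tokens;
# skip everything else (spaces, X, numbers, punctuation).
tokenize_signs <- function(line) {
  pattern <- "[\U00012000-\U0001254F]|\U000027E8[^\U000027E9]*\U000027E9"
  regmatches(line, gregexpr(pattern, line, perl = TRUE))[[1]]
}

tokenize_signs("\U00012000 X 3 \U00012097")
# the space, the X and the number are skipped; two sign tokens remain
```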
The maximum n-gram length is automatically determined as the length of the longest tokenized line in the input.
The analysis proceeds from the longest combinations down to single signs. When a combination is identified as frequent (i.e., meets the minimum frequency threshold), all occurrences except the first are masked before continuing with shorter combinations. This prevents subsequences from being reported as frequent when their frequency is solely due to a longer frequent combination.
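The masking step might be illustrated as follows. This is a conceptual sketch, not the package's internal code: tokens are held as a character vector per line, and masked positions are overwritten with NA (the choice of placeholder is an assumption here) so shorter n-grams can no longer match them:

```r
# Conceptual sketch: mask every occurrence of a frequent n-gram
# except the first, so its subsequences are not counted again.
# `tokens` is one line's sign tokens; NA marks masked positions.
mask_after_first <- function(tokens, ngram) {
  n <- length(ngram)
  seen <- FALSE
  i <- 1
  while (i <= length(tokens) - n + 1) {
    if (identical(tokens[i:(i + n - 1)], ngram)) {
      if (seen) tokens[i:(i + n - 1)] <- NA_character_
      seen <- TRUE
      i <- i + n          # jump past the matched occurrence
    } else {
      i <- i + 1
    }
  }
  tokens
}

mask_after_first(c("A", "B", "A", "B", "C"), c("A", "B"))
# first "A B" kept, second masked: "A" "B" NA NA "C"
```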
See Also

as.sign_name for converting cuneiform to sign names,
as.cuneiform for converting transliterations to cuneiform,
split_sumerian for tokenizing transliterated text.
Examples

# Read the text "Enki and the World Order"
path <- system.file("extdata", "enki_and_the_world_order.txt", package = "sumer")
text <- readLines(path, encoding = "UTF-8")
cat(text[1:10], sep = "\n")
# Find combinations that appear at least 6 times in the text
freq <- ngram_frequencies(text, min_freq = 6)
freq[1:10, ]