Description

Analyzes a Sumerian text for frequently occurring cuneiform sign combinations
(n-grams). The input can be either cuneiform text or transliterated text
(which is automatically converted to cuneiform via as.cuneiform).
The analysis starts with the longest combinations and works down to single
signs, masking already-counted occurrences to avoid reporting subsequences
that are only frequent because they are part of a longer frequent combination.
N-grams are searched within lines only (not across line boundaries).
Usage

ngram_frequencies(x, min_freq = c(6, 4, 2), mapping = NULL)

Value

A data frame with three columns, sorted by descending length, then descending frequency:
Integer. The number of occurrences of the combination.
Integer. The number of signs in the combination.
Character. The cuneiform sign combination
(e.g., "\U0001202D\U00012097\U000120A0").
Arguments

x

Character vector whose elements are the lines of a Sumerian text.
The input can be either cuneiform characters or transliterated text. If no
cuneiform characters (U+12000 to U+1254F) are detected, the input is
automatically converted using as.cuneiform.
Lines starting with # are treated as comments and ignored.
Optional line numbers at the beginning of a line (e.g., "42)\t")
are automatically removed. Spaces are removed before tokenization.
min_freq

Integer vector specifying minimum frequencies (default: c(6, 4, 2)).
The i-th value specifies the minimum frequency for combinations of
length i. For lengths beyond the vector's length, the last value is
used.
The default c(6, 4, 2) means: single signs must occur at least 6
times, pairs at least 4 times, and all longer combinations at least 2
times.
mapping

A data frame containing the sign mapping table, with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file etcsl_mapping.txt is loaded.
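The preprocessing of x and the recycling rule for min_freq described above can be sketched as small helpers. These are illustrative only (the names preprocess_lines and threshold_for are hypothetical, not part of the package):

```r
# Illustrative helpers, not the package's actual internals.

# Preprocess input lines as described for `x`: drop comment lines,
# strip leading line numbers like "42)\t", and remove spaces.
preprocess_lines <- function(x) {
  x <- x[!grepl("^#", x)]
  x <- sub("^[0-9]+\\)\t", "", x)
  gsub(" ", "", x)
}

# Minimum frequency for an n-gram of length n: the last value
# of min_freq is reused beyond the vector's length.
threshold_for <- function(n, min_freq = c(6, 4, 2)) {
  min_freq[min(n, length(min_freq))]
}

preprocess_lines(c("# comment", "42)\tlu2 lugal"))  # "lu2lugal"
threshold_for(1)  # 6
threshold_for(7)  # 2
```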
Details

A “sign” is defined as either a single cuneiform Unicode character (U+12000 to U+1254F) or a character sequence enclosed in mathematical angle brackets (U+27E8 ... U+27E9), which is treated as a single token. All other characters (spaces, X, numbers, punctuation, etc.) are skipped during tokenization.
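Under this definition, tokenization can be sketched with a single regular expression. The helper tokenize_signs is hypothetical, not the package's exported API:

```r
# Illustrative tokenizer: keep cuneiform code points (U+12000-U+1254F)
# and angle-bracket groups (U+27E8 ... U+27E9) as single tokens;
# skip everything else (spaces, X, numbers, punctuation).
tokenize_signs <- function(line) {
  pattern <- "[\U00012000-\U0001254F]|\U000027E8[^\U000027E9]*\U000027E9"
  regmatches(line, gregexpr(pattern, line, perl = TRUE))[[1]]
}

tokenize_signs("\U00012000 X 3 \U00012097")
# the space, the X and the number are skipped; two sign tokens remain
```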
The maximum n-gram length is automatically determined as the length of the longest tokenized line in the input.
The analysis proceeds from the longest combinations down to single signs. When a combination is identified as frequent (i.e., meets the minimum frequency threshold), all occurrences except the first are masked before continuing with shorter combinations. This prevents subsequences from being reported as frequent when their frequency is solely due to a longer frequent combination.
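The masking step might be illustrated as follows. This is a conceptual sketch, not the package's internal code: tokens are held as a character vector per line, and masked positions are overwritten with NA (the choice of placeholder is an assumption here) so shorter n-grams can no longer match them:

```r
# Conceptual sketch: mask every occurrence of a frequent n-gram
# except the first, so its subsequences are not counted again.
# `tokens` is one line's sign tokens; NA marks masked positions.
mask_after_first <- function(tokens, ngram) {
  n <- length(ngram)
  seen <- FALSE
  i <- 1
  while (i <= length(tokens) - n + 1) {
    if (identical(tokens[i:(i + n - 1)], ngram)) {
      if (seen) tokens[i:(i + n - 1)] <- NA_character_
      seen <- TRUE
      i <- i + n          # jump past the matched occurrence
    } else {
      i <- i + 1
    }
  }
  tokens
}

mask_after_first(c("A", "B", "A", "B", "C"), c("A", "B"))
# first "A B" kept, second masked: "A" "B" NA NA "C"
```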
See Also

as.sign_name for converting cuneiform to sign names,
as.cuneiform for converting transliterations to cuneiform,
split_sumerian for tokenizing transliterated text.
Examples

# Read the text "Enki and the World Order"
path <- system.file("extdata", "enki_and_the_world_order.txt", package = "sumer")
text <- readLines(path, encoding = "UTF-8")
cat(text[1:10], sep = "\n")
# Find combinations that appear at least 6 times in the text
freq <- ngram_frequencies(text, min_freq = 6)
freq[1:10, ]