guess_substr_info: Look Up Translations for All Substrings of a Sumerian Text

Description

Converts a Sumerian text string into cuneiform tokens, generates all contiguous substrings, and looks up the most frequent translation for each substring in one or more dictionaries.

Usage

guess_substr_info(x, dic, mapping = NULL)

Value

A data frame with one row per substring and the following columns:

start: Integer. The token position of the first token in the substring (1-based).
n_tokens: Integer. The number of tokens in the substring.
expr: Character. The concatenated cuneiform tokens of the substring.
type: Character. The grammatical type of the most frequent translation (e.g. "S", "V"), or "" if no translation was found.
translation: Character. The most frequent translation from the dictionaries, or "" if no translation was found.
sign_name: Character. The sign name representation of the substring.

The rows are ordered as in init_substr_info (by n_tokens descending, then start ascending), so that row indices can be computed with substr_position.
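
The ordering can be illustrated with a small base-R sketch. It does not call init_substr_info or substr_position; the helper names enumerate_substrings and row_index are hypothetical and only mimic the documented ordering (n_tokens descending, then start ascending) for a toy input of three tokens.

```r
# Hypothetical helper: enumerate all contiguous substrings of n_total tokens
# in the documented order (n_tokens descending, then start ascending).
enumerate_substrings <- function(n_total) {
  rows <- lapply(seq(n_total, 1), function(n) {
    data.frame(start = seq_len(n_total - n + 1), n_tokens = n)
  })
  do.call(rbind, rows)
}

# Hypothetical helper: closed-form row index for a (start, n_tokens) pair.
# Substrings longer than n_tokens come first and occupy
# 1 + 2 + ... + (n_total - n_tokens) rows.
row_index <- function(start, n_tokens, n_total) {
  k <- n_total - n_tokens
  k * (k + 1) / 2 + start
}

info <- enumerate_substrings(3)
info$row <- row_index(info$start, info$n_tokens, 3)
print(info)
```

Under this ordering, a text of N tokens yields N * (N + 1) / 2 rows, and the computed row column simply counts from 1 to that total.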

Arguments

x: A character string of length 1 containing Sumerian text (transliteration, sign names, or cuneiform characters). May contain brackets as used by skeleton.
dic: A dictionary, a list of dictionaries, or a character vector of file paths to dictionary files. If file paths are given, each file is loaded with read_dictionary. Dictionaries are tried in order: the first dictionary that contains a translation for a given substring wins.
mapping: A data frame containing the sign mapping table with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file etcsl_mapping.txt is loaded.

Details

The function performs the following steps:

1. If dic is a character vector of file paths, the dictionaries are loaded with read_dictionary. If dic is a single data frame, it is wrapped in a list.
2. The input string x is converted to cuneiform with as.cuneiform and split into individual tokens with split_sumerian.
3. A data frame of all contiguous substrings is created with init_substr_info.
4. A sign_name column is added by converting each substring expression with as.sign_name.
5. For each substring, the dictionaries are searched in order. The most frequent translation (highest count among rows with row_type == "trans.") from the first dictionary that contains a match is used to fill in the type and translation columns.
6. Single-token entries of type 4 (numbers and N) receive type "S" and their numeric value as translation, regardless of dictionary content.
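
The lookup rule (first dictionary with a match wins, then the highest count among "trans." rows) can be sketched in base R. This is a minimal illustration with toy data, not the package's implementation: the helper lookup_first_match is hypothetical, and the dictionary column names sign_name, count, type, and translation, as well as the use of sign_name as the lookup key, are assumptions made for this sketch.

```r
# Toy dictionaries. The column names (sign_name, row_type, count, type,
# translation) are assumptions made for this sketch.
dic_a <- data.frame(
  sign_name   = c("LUGAL", "LUGAL", "LUGAL"),
  row_type    = c("trans.", "trans.", "other"),
  count       = c(12, 3, 99),
  type        = c("S", "S", ""),
  translation = c("king", "owner", ""),
  stringsAsFactors = FALSE
)
dic_b <- data.frame(
  sign_name   = c("LUGAL", "KUR"),
  row_type    = c("trans.", "trans."),
  count       = c(50, 7),
  type        = c("S", "S"),
  translation = c("lord", "mountain"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the first dictionary with a match wins; within that
# dictionary the "trans." row with the highest count is returned.
lookup_first_match <- function(key, dics) {
  if (is.data.frame(dics)) dics <- list(dics)  # wrap a single dictionary
  for (d in dics) {
    hits <- d[d$sign_name == key & d$row_type == "trans.", , drop = FALSE]
    if (nrow(hits) > 0) {
      best <- hits[which.max(hits$count), ]
      return(list(type = best$type, translation = best$translation))
    }
  }
  list(type = "", translation = "")  # no dictionary has a translation
}

dics <- list(dic_a, dic_b)
res_lugal <- lookup_first_match("LUGAL", dics)
res_kur   <- lookup_first_match("KUR", dics)
res_none  <- lookup_first_match("E2", dics)
print(c(res_lugal$translation, res_kur$translation, res_none$translation))
```

In this toy data, "LUGAL" resolves to "king" from the first dictionary even though the second dictionary lists "lord" with a higher count, which is why the order of dic matters. The row with count 99 is ignored because its row_type is not "trans.".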

Examples

# Load the built-in dictionary
dic <- read_dictionary()

# Look up translations for all substrings
x <- "lugal kur-ra-ke4"
df <- guess_substr_info(x, dic)

# Show rows that have a translation
df[df$translation != "", ]

# Dictionaries can also be given as file paths. With several paths,
# order them by reliability, because the first match wins.
file1 <- system.file("extdata", "sumer-dictionary.txt", package = "sumer")
df <- guess_substr_info(x, file1)