
sumer (version 1.4.0)

guess_substr_info: Look Up Translations for All Substrings of a Sumerian Text

Description

Converts a Sumerian text string into cuneiform tokens, generates all contiguous substrings, and looks up the most frequent translation for each substring in one or more dictionaries.

Usage

guess_substr_info(x, dic, mapping = NULL)

Value

A data frame with one row per substring and the following columns:

start

Integer. The token position of the first token in the substring (1-based).

n_tokens

Integer. The number of tokens in the substring.

expr

Character. The concatenated cuneiform tokens of the substring.

type

Character. The grammatical type of the most frequent translation (e.g. "S", "V"), or "" if no translation was found.

translation

Character. The most frequent translation from the dictionaries, or "" if no translation was found.

sign_name

Character. The sign name representation of the substring.

The rows are ordered as in init_substr_info (by n_tokens descending, then start ascending), so that row indices can be computed with substr_position.
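The ordering can be illustrated without the package. The following is a minimal base-R sketch, not the package's implementation: enumerate_substrings and row_index are hypothetical helpers written for this illustration, and the closed-form index is derived only from the ordering stated above. The actual signature of substr_position may differ.

```r
# Hypothetical helper: enumerate all contiguous substrings of n_total tokens,
# ordered by n_tokens descending, then start ascending.
enumerate_substrings <- function(n_total) {
  rows <- lapply(seq(n_total, 1), function(n) {
    data.frame(start = seq_len(n_total - n + 1), n_tokens = n)
  })
  do.call(rbind, rows)
}

# Hypothetical helper: row index of the substring that starts at `start`
# and spans `n_tokens` tokens. All longer substrings come first; there are
# k * (k + 1) / 2 of them, where k = n_total - n_tokens.
row_index <- function(start, n_tokens, n_total) {
  k <- n_total - n_tokens
  k * (k + 1) / 2 + start
}

tab <- enumerate_substrings(4)
nrow(tab)   # 4 * 5 / 2 = 10 substrings

i <- row_index(start = 2, n_tokens = 2, n_total = 4)
tab[i, ]    # the row with start = 2, n_tokens = 2
```

Under this ordering the full-length substring is always row 1 and the single tokens occupy the last rows, so an index can be computed directly instead of searching the data frame.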

Arguments

x

A character string of length 1 containing Sumerian text (transliteration, sign names, or cuneiform characters). May contain brackets as used by skeleton.

dic

A dictionary, a list of dictionaries, or a character vector of file paths to dictionary files. If file paths are given, each file is loaded with read_dictionary. Dictionaries are tried in order: the first dictionary that contains a translation for a given substring wins.

mapping

A data frame containing the sign mapping table with columns syllables, name, and cuneiform. If NULL (the default), the package's internal mapping file etcsl_mapping.txt is loaded.

Details

The function performs the following steps:

  1. If dic is a character vector of file paths, the dictionaries are loaded with read_dictionary. If dic is a single data frame, it is wrapped in a list.

  2. The input string x is converted to cuneiform with as.cuneiform and split into individual tokens with split_sumerian.

  3. A data frame of all contiguous substrings is created with init_substr_info.

  4. A sign_name column is added by converting each substring expression with as.sign_name.

  5. For each substring, the dictionaries are searched in order. The most frequent translation (highest count among rows with row_type == "trans.") from the first dictionary that contains a match is used to fill in the type and translation columns.

  6. Single-token entries of type 4 (numbers and N) receive type "S" and their numeric value as translation, regardless of dictionary content.
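Steps 1 and 5 can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the helpers as_dic_list and lookup_first_match are hypothetical, the dictionary columns sign_name, type, and meaning are assumed for the example (only row_type and count are named above), and the toy dictionaries contain made-up counts.

```r
# Toy dictionaries with made-up counts. The columns sign_name, type and
# meaning are assumptions for this sketch; "other" is a placeholder row type.
dic_a <- data.frame(
  sign_name = c("LUGAL", "LUGAL", "LUGAL"),
  row_type  = c("trans.", "trans.", "other"),
  count     = c(12, 3, NA),
  type      = c("S", "S", ""),
  meaning   = c("king", "master", ""),
  stringsAsFactors = FALSE
)
dic_b <- data.frame(
  sign_name = c("LUGAL", "KUR"),
  row_type  = c("trans.", "trans."),
  count     = c(99, 7),
  type      = c("S", "S"),
  meaning   = c("owner", "mountain"),
  stringsAsFactors = FALSE
)

# Step 1 (sketch): wrap a single data frame in a list.
as_dic_list <- function(dic) {
  if (is.data.frame(dic)) list(dic) else dic
}

# Step 5 (sketch): the first dictionary with a match wins; within that
# dictionary, the "trans." row with the highest count is used.
lookup_first_match <- function(key, dic) {
  for (d in as_dic_list(dic)) {
    hits <- d[d$sign_name == key & d$row_type == "trans.", , drop = FALSE]
    if (nrow(hits) > 0) {
      best <- hits[which.max(hits$count), ]
      return(list(type = best$type, translation = best$meaning))
    }
  }
  # No dictionary contains the key: empty strings, as in the Value section.
  list(type = "", translation = "")
}

dics <- list(dic_a, dic_b)
lookup_first_match("LUGAL", dics)$translation  # "king": dic_a wins despite count 99 in dic_b
lookup_first_match("KUR", dics)$translation    # "mountain": found only in dic_b
lookup_first_match("E2", dics)$translation     # "": no match in any dictionary
```

The first call shows why dictionary order matters: counts are compared only within the first dictionary that has a match, never across dictionaries.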

See Also

init_substr_info for creating the substring data frame, substr_position for computing row indices, read_dictionary for loading dictionaries, look_up for interactive dictionary lookup, skeleton for creating translation templates.

Examples

# Load the built-in dictionary
dic <- read_dictionary()

# Look up translations for all substrings
x <- "lugal kur-ra-ke4"
df <- guess_substr_info(x, dic)

# Show rows that have a translation
df[df$translation != "", ]

# Pass a dictionary file path instead of a loaded dictionary.
# With several paths, list them by reliability: the first match wins.
file1 <- system.file("extdata", "sumer-dictionary.txt", package = "sumer")
df <- guess_substr_info(x, file1)
