read_translated_text: Read Annotated Sumerian Translations from Text Files

Description

Reads Word documents (.docx) or plain text files containing annotated Sumerian translations and extracts sign names, grammatical types, and meanings into a structured data frame.

Usage

read_translated_text(file, mapping=NULL)

Value

A data frame with the following columns:

sign_name: The normalized sign name with components separated by hyphens (e.g., "A", "AN", "X-NA")
type: Grammatical type (e.g., "S", "V", "A", "Sx->A")
meaning: The translated meaning of the sign

Arguments

file: A character vector of file paths to .docx or plain text files. Each file must contain translation lines in the format described under Details.
mapping: A data frame containing sign-to-reading mappings with columns name, cuneiform and syllables. If NULL (default), the package's built-in mapping file etcsl_mapping.txt is used.

Details

Input Format

Translation lines in the input files must start with | and follow one of these two formats:

|sign_name: TYPE: meaning

|equation for sign_name: TYPE: meaning

For example:


|a2-tab: S: the double amount of work performance
|me=ME: S: divine force
|AN: S: god of heaven
|na=NA: Sx->A: whose existence is bound to S

Lines not starting with | are ignored. Only the first entry in an equation of sign names is extracted. The following notation is suggested for grammatical types:

S for substantives and noun phrases (e.g., "the old man in the temple")
V for verbs and decorated verbs (e.g., "to go", "to bring the delivery into the temple")
A for adjectives, attributes and subordinate clauses that further define the subject (e.g., "who/which is weak", "whose resource for sustaining life is grain")
Sx->A for a symbol that transforms the preceding noun phrase into an attribute (e.g., "whose resource for sustaining life is S"). Other transformations are denoted accordingly.
N for numbers
D for everything else
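
To check whether annotations follow the suggested notation, a small pattern test can help. The sketch below is only an illustration: is_suggested_type is a hypothetical helper, not a function of the package, and the pattern for transformations (a base type, then "x->", then a base type) is an assumption generalized from the single example Sx->A.

```r
# Hypothetical helper, not part of the sumer package.
# Assumption: transformations look like "<base>x-><base>", as in "Sx->A".
is_suggested_type <- function(type) {
  base_type      <- grepl("^[SVAND]$", type)            # S, V, A, N or D
  transformation <- grepl("^[SVAND]x->[SVAND]$", type)  # e.g. "Sx->A"
  base_type | transformation
}

type_check <- is_suggested_type(c("S", "V", "Sx->A", "noun"))
type_check
```

Since the notation is only suggested, a FALSE here marks an entry worth reviewing, not necessarily an error.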

Processing Steps

Reads text from .docx files or plain text files
Filters lines starting with |
Parses each line into sign name, type, and meaning components
Normalizes transliterated text by removing separators and looking up the sign names from the mapping
Cleans the meaning field by removing everything after a ; or | delimiter
Issues a warning for entries with missing type annotations
Excludes empty sign names from the result
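
The filtering, parsing and cleaning steps above can be sketched in base R roughly as follows. This is a simplified illustration under stated assumptions, not the package's actual implementation: parse_translation_lines is a hypothetical helper, it works on lines that are already in memory (no .docx reading), and it skips the normalization against the mapping, so the sign_name column holds the raw transliteration.

```r
# Hypothetical helper for illustration; not part of the sumer package.
parse_translation_lines <- function(lines) {
  # Keep only lines that start with "|", then drop that marker
  lines <- lines[startsWith(lines, "|")]
  body  <- substring(lines, 2)

  # Split each line at colons into sign name, type and meaning
  parts <- strsplit(body, ":", fixed = TRUE)
  sign <- vapply(parts, function(p) {
    if (length(p) >= 1) trimws(p[1]) else NA_character_
  }, character(1))
  type <- vapply(parts, function(p) {
    if (length(p) >= 3) trimws(p[2]) else NA_character_
  }, character(1))
  meaning <- vapply(parts, function(p) {
    if (length(p) >= 3) trimws(paste(p[3:length(p)], collapse = ":"))
    else if (length(p) == 2) trimws(p[2])
    else NA_character_
  }, character(1))

  # Keep only the first entry of an equation such as "me=ME"
  sign <- sub("=.*$", "", sign)

  # Remove everything after a ";" or "|" delimiter in the meaning
  meaning <- trimws(sub("[;|].*$", "", meaning))

  # Warn about entries without a type annotation
  missing_type <- is.na(type) | type == ""
  if (any(missing_type)) {
    warning(sum(missing_type), " entries have no type annotation")
  }

  # Exclude entries with an empty sign name
  keep <- !is.na(sign) & nzchar(sign)
  data.frame(sign_name = sign[keep], type = type[keep],
             meaning = meaning[keep], stringsAsFactors = FALSE)
}

example_lines <- c(
  "Some running commentary that is ignored",
  "|a2-tab: S: the double amount of work performance",
  "|me=ME: S: divine force; see also the next line",
  "|AN: S: god of heaven"
)
parsed <- parse_translation_lines(example_lines)
parsed
```

In this sketch the equation me=ME yields the raw entry "me"; read_translated_text additionally looks the sign name up in the mapping, which is not reproduced here.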

Examples


# Read translations from a single text document
filename     <- system.file("extdata", "text_with_translations.txt", package = "sumer")
translations <- read_translated_text(filename)

# View the structure
head(translations)

# Filter by grammatical type
nouns <- translations[translations$type == "S", ]
nouns

# Apply some custom unifications (here: removing the word "the")
translations$meaning <- gsub("\\bthe\\b", "", translations$meaning, ignore.case = TRUE)
translations$meaning <- trimws(gsub("\\s+", " ", translations$meaning))

# View the cleaned result
head(translations)

# Convert the result into a dictionary
dictionary   <- convert_to_dictionary(translations)

# View the structure
head(dictionary)