make_dictionary: Create a Sumerian Dictionary from Annotated Text Files

Description

Parses Word documents (.docx) or plain text files containing annotated Sumerian translations and creates a structured dictionary data frame. The function extracts sign names, their cuneiform representations, possible readings, and translations with grammatical types.

Usage

make_dictionary(file, mapping = NULL)

Value

A data frame with the following columns:

sign_name: The normalized Sumerian sign name (e.g., "A", "AN", "ME")
row_type: Type of entry: "cunei." (cuneiform), "reading" (phonetic readings), or "trans." (translation)
count: Number of occurrences for translations; NA for cuneiform and reading entries
type: Grammatical type (e.g., "S", "V", "Sx->A") for translations; empty for other row types
meaning: The cuneiform character(s), reading(s), or translated meaning, depending on row_type

Arguments

file: A character vector of file paths to .docx or plain text files. Files must contain translation lines formatted as described under Input Format.
mapping: A data frame containing sign-to-reading mappings with columns name, cuneiform and syllables. If NULL (default), the package's built-in mapping file etcsl_mapping.txt is used.

Details

Input Format

The input files must contain lines starting with | in the following format:

|sign_name: TYPE: meaning

|equation for sign_name: TYPE: meaning

For example:


|a2-tab: S: the double amount of work performance
|me=ME: S: divine force
|AN: S: god of heaven
|na=NA: Sx->A: whose existence is bound to S

Lines that do not start with | are ignored. If the sign name is given as an equation (e.g., me=ME), only the first entry of the equation is used for the dictionary. The following notation is suggested for grammatical types:

S for substantives and noun phrases (e.g., "the old man in the temple")
V for verbs and decorated verbs (e.g., "to go", "to bring the delivery into the temple")
A for adjectives, attributes and subordinate clauses that further define the subject (e.g., "who/which is weak", "whose resource for sustaining life is grain")
Sx->A for a symbol that transforms the preceding noun phrase into an attribute (e.g., "whose resource for sustaining life is S"). Other transformations are denoted accordingly.
N for numbers
D for everything else.
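
As a rough illustration of the input format above, the following base R sketch splits one annotated line into its three fields. The helper parse_line is hypothetical and not part of the package; it assumes that the sign name and the type contain no colon, and it does not attempt the sign-name normalization that make_dictionary performs.

```r
# Hypothetical helper (not part of the package): split one annotated line
# of the form "|sign_name: TYPE: meaning" into its three fields.
parse_line <- function(line) {
  body  <- sub("^\\|", "", line)                    # drop the leading |
  parts <- strsplit(body, ":", fixed = TRUE)[[1]]
  if (length(parts) < 3) return(NULL)               # not a complete entry

  sign_name <- trimws(parts[1])
  sign_name <- sub("=.*$", "", sign_name)           # keep only the first entry of an equation

  list(
    sign_name = sign_name,
    type      = trimws(parts[2]),
    # the meaning may itself contain ":", so re-join everything after the type
    meaning   = trimws(paste(parts[-(1:2)], collapse = ":"))
  )
}

entry <- parse_line("|na=NA: Sx->A: whose existence is bound to S")
str(entry)
```

Keeping the meaning as "everything after the second colon" is a design choice of this sketch, so that translations containing a colon are not truncated.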

Processing Steps

Extracts text from .docx files or reads plain text
Filters lines starting with |
Excludes lines containing the unknown-sign placeholder X
Replaces standalone numbers in sign names with N (suffix digits like the 2 in jal2 are not affected)
Normalizes sign names and looks up possible readings from the mapping table
Aggregates translations and counts occurrences
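
The steps above can be approximated in base R as follows. This is a minimal sketch under stated assumptions, not the package's actual implementation: the input lines are made up, the regular expressions for the placeholder X and for standalone numbers are guesses at the intended behavior, and the normalization and mapping lookup step is omitted.

```r
# Made-up input; only lines starting with | carry annotations.
lines <- c(
  "A line of running commentary",
  "|AN: S: god of heaven",
  "|AN: S: god of heaven",
  "|AN: S: sky",
  "|X-AN: S: unclear",
  "|2-jal2: N: two parts"
)

# Keep annotated lines only
annotated <- lines[startsWith(lines, "|")]

# Split "|sign: TYPE: meaning" into fields (assumes no ":" in sign or type)
fields <- regmatches(annotated,
                     regexec("^\\|([^:]+): *([^:]+): *(.*)$", annotated))
parsed <- do.call(rbind, lapply(fields, function(m) {
  data.frame(sign_name = trimws(m[2]), type = trimws(m[3]),
             meaning = trimws(m[4]), stringsAsFactors = FALSE)
}))

# Exclude entries whose sign name contains the placeholder X as a separate token
has_x  <- grepl("(^|[^[:alnum:]])X([^[:alnum:]]|$)", parsed$sign_name)
parsed <- parsed[!has_x, ]

# Replace standalone numbers with N; suffix digits such as the 2 in jal2 stay
parsed$sign_name <- gsub("(?<![[:alnum:]])[0-9]+(?![[:alnum:]])", "N",
                         parsed$sign_name, perl = TRUE)

# Aggregate identical translations and count their occurrences
counts <- aggregate(count ~ sign_name + type + meaning,
                    data = transform(parsed, count = 1L), FUN = sum)
counts <- counts[order(counts$sign_name, -counts$count), ]
print(counts)
```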

Output Structure

For each unique sign, the output contains:

One cunei. row with the cuneiform character(s)
One reading row with possible phonetic readings
One or more trans. rows with translations, sorted by frequency
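
To make this layout concrete, the sketch below builds a small data frame by hand with the documented columns and extracts the translations for one sign. The rows are illustrative assumptions, not real output of make_dictionary; in particular, the exact formatting of the reading row is a guess.

```r
# Hand-made stand-in for a dictionary; values are illustrative only.
toy <- data.frame(
  sign_name = rep("AN", 4),
  row_type  = c("cunei.", "reading", "trans.", "trans."),
  count     = c(NA, NA, 2L, 1L),                  # counts apply to translations only
  type      = c("", "", "S", "S"),
  meaning   = c("\U0001202D",                     # Unicode code point of CUNEIFORM SIGN AN
                "an, dingir",                     # assumed formatting of readings
                "god of heaven", "sky"),
  stringsAsFactors = FALSE
)

# Translations for one sign, most frequent first
translations <- subset(toy, sign_name == "AN" & row_type == "trans.")
translations <- translations[order(-translations$count), ]
print(translations[, c("count", "type", "meaning")])
```

The same subsetting pattern should apply to a data frame returned by make_dictionary, since it relies only on the sign_name, row_type and count columns described under Value.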

Examples



library(sumer)

# Create a dictionary from a single text document shipped with the package
filename <- system.file("extdata", "text_with_translations.txt", package = "sumer")
dict <- make_dictionary(filename)

# Use the dictionary
look_up("an", dict)
