sumer (version 1.4.0)

make_dictionary: Create a Sumerian Dictionary from Annotated Text Files

Description

Parses Word documents (.docx) or plain text files containing annotated Sumerian translations and creates a structured dictionary data frame. The function extracts sign names, their cuneiform representations, possible readings, and translations with grammatical types.

Usage

make_dictionary(file, mapping = NULL)

Value

A data frame with the following columns:

sign_name

The normalized Sumerian sign name (e.g., "A", "AN", "ME")

row_type

Type of entry: "cunei." (cuneiform), "reading" (phonetic readings), or "trans." (translation)

count

Number of occurrences for translations; NA for cuneiform and reading entries

type

Grammatical type (e.g., "S", "V", "Sx->A") for translations; empty for other row types

meaning

The cuneiform character(s), reading(s), or translated meaning, depending on row_type

Arguments

file

A character vector of file paths to .docx or text files. Files must contain translation lines that are formatted as described below.

mapping

A data frame containing sign-to-reading mappings with columns name, cuneiform and syllables. If NULL (default), the package's built-in mapping file etcsl_mapping.txt is used.

Details

Input Format

The input files must contain lines starting with | in the following format:

|sign_name: TYPE: meaning

or

|equation for sign_name: TYPE: meaning

For example:


|a2-tab: S: the double amount of work performance
|me=ME: S: divine force
|AN: S: god of heaven
|na=NA: Sx->A: whose existence is bound to S

Lines not starting with | are ignored. Only the first entry in an equation of sign names is used for the dictionary. The following notation is suggested for grammatical types:

  • S for substantives and noun phrases (e.g., "the old man in the temple")

  • V for verbs and decorated verbs (e.g., "to go", "to bring the delivery into the temple")

  • A for adjectives, attributes and subordinate clauses that further define the subject (e.g., "who/which is weak", "whose resource for sustaining life is grain")

  • Sx->A for a symbol that transforms the preceding noun phrase into an attribute (e.g., "whose resource for sustaining life is S"). Other transformations are denoted accordingly.

  • N for numbers

  • D for everything else
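
As an illustration of the input format, the following base R sketch splits annotated lines into sign name, type, and meaning. The helper parse_translation_line is hypothetical and not part of the sumer package; it also assumes that the entries of an equation are separated by "=", as in the examples above. The package's own parser may behave differently.

```r
# Hypothetical helper for illustration; not part of the sumer package.
parse_translation_line <- function(line) {
  # Only lines starting with "|" carry dictionary entries
  if (!startsWith(line, "|")) return(NULL)
  body <- substring(line, 2)
  # Split at the first two colons only, so colons inside the meaning survive
  parts <- regmatches(body, regexec("^([^:]*):([^:]*):(.*)$", body))[[1]]
  if (length(parts) != 4) return(NULL)
  # Assumption: equation entries are separated by "="; keep the first one
  sign <- trimws(strsplit(parts[2], "=", fixed = TRUE)[[1]][1])
  data.frame(
    sign_name = sign,
    type      = trimws(parts[3]),
    meaning   = trimws(parts[4]),
    stringsAsFactors = FALSE
  )
}

lines <- c(
  "|a2-tab: S: the double amount of work performance",
  "|me=ME: S: divine force",
  "A commentary line without the leading marker",
  "|na=NA: Sx->A: whose existence is bound to S"
)

# Parse every line, drop the ignored ones, and stack the rest
rows   <- Filter(Negate(is.null), lapply(lines, parse_translation_line))
parsed <- do.call(rbind, rows)
print(parsed)
```

The sketch keeps the sign name exactly as written; normalization and the lookup of readings are left to make_dictionary.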

Processing Steps

  1. Extracts text from .docx files or reads plain text

  2. Filters lines starting with |

  3. Excludes lines containing the unknown-sign placeholder X

  4. Replaces standalone numbers in sign names with N (suffix digits like the 2 in jal2 are not affected)

  5. Normalizes sign names and looks up possible readings from the mapping table

  6. Aggregates translations and counts occurrences
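
Steps 2, 3, 4 and 6 can be approximated in base R as shown below. This is a rough sketch on made-up input lines, not the package's implementation: the rule used for step 3 (drop an entry when its sign name contains a standalone X) and the regular expressions are assumptions, and step 5 is omitted because it depends on the mapping table.

```r
# Made-up input lines for illustration only
lines <- c(
  "Commentary line that is ignored",
  "|AN: S: god of heaven",
  "|AN: S: god of heaven",
  "|AN: S: heaven",
  "|X: S: unreadable sign",
  "|3-jal2: S: placeholder meaning"
)

# Step 2: keep only lines starting with "|", then drop the marker
entries <- sub("^\\|", "", lines[startsWith(lines, "|")])

# Split each entry into sign name, type and meaning at the first two colons
m <- regmatches(entries, regexec("^([^:]*):([^:]*):(.*)$", entries))
m <- m[lengths(m) == 4]
df <- data.frame(
  sign_name = trimws(vapply(m, `[`, character(1), 2)),
  type      = trimws(vapply(m, `[`, character(1), 3)),
  meaning   = trimws(vapply(m, `[`, character(1), 4)),
  stringsAsFactors = FALSE
)

# Step 3 (assumed rule): drop entries whose sign name has a standalone X
standalone_x <- "(?<![A-Za-z0-9])X(?![A-Za-z0-9])"
df <- df[!grepl(standalone_x, df$sign_name, perl = TRUE), ]

# Step 4: replace standalone numbers with N; digits attached to letters stay
standalone_num <- "(?<![A-Za-z0-9])[0-9]+(?![A-Za-z0-9])"
df$sign_name <- gsub(standalone_num, "N", df$sign_name, perl = TRUE)

# Step 6: count identical (sign, type, meaning) triples, most frequent first
df$count <- 1L
agg <- aggregate(count ~ sign_name + type + meaning, data = df, FUN = sum)
agg <- agg[order(agg$sign_name, -agg$count), ]
rownames(agg) <- NULL
print(agg)
```

The lookarounds in the two patterns are what keep suffix digits such as the 2 in jal2 untouched while a leading 3 becomes N.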

Output Structure

For each unique sign, the output contains:

  • One cunei. row with the cuneiform character(s)

  • One reading row with possible phonetic readings

  • One or more trans. rows with translations, sorted by frequency
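
To show how this layout can be queried, the sketch below builds a small data frame by hand with the documented columns and selects rows by row_type. All values are placeholders chosen for illustration (the cuneiform and reading cells are dummy strings, and the counts are invented); a real dictionary returned by make_dictionary would be filtered the same way.

```r
# Hand-built stand-in for a dictionary; every value here is illustrative
dict <- data.frame(
  sign_name = c("AN", "AN", "AN", "AN"),
  row_type  = c("cunei.", "reading", "trans.", "trans."),
  count     = c(NA, NA, 2L, 1L),
  type      = c("", "", "S", "S"),
  meaning   = c("<cuneiform sign>", "<readings>", "god of heaven", "heaven"),
  stringsAsFactors = FALSE
)

# All translation rows for one sign
trans <- dict[dict$sign_name == "AN" & dict$row_type == "trans.", ]

# The most frequent translation of that sign
best <- trans$meaning[which.max(trans$count)]
print(best)
```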

See Also

read_translated_text for reading translation files, convert_to_dictionary for the aggregation step, read_dictionary for loading a saved dictionary, save_dictionary for saving a dictionary to file, look_up for searching a dictionary

Examples

# Create a dictionary from a single text document
filename <- system.file("extdata", "text_with_translations.txt", package = "sumer")
dict <- make_dictionary(filename)

# Use the dictionary
look_up("an", dict)
