sumer (version 1.3.0)

init_substr_info: Initialize a Data Frame of All Substrings

Description

Creates a data frame containing all contiguous substrings of a token vector, including the full token sequence itself. Each row represents one substring, with its starting position, length in tokens, the concatenated expression, and empty columns for type and translation.

The rows are ordered by n_tokens descending and start ascending, so that the row number can be computed from start and n_tokens using substr_position.

This is an internal helper function.

Usage

init_substr_info(token)

Value

A data frame with \(N(N+1)/2\) rows and the following columns:

start

Integer. The position of the first token in the substring (1-based).

n_tokens

Integer. The number of tokens in the substring.

expr

Character. The concatenated token sequence (without separators).

type

Character. Initialized as an empty string "".

translation

Character. Initialized as an empty string "".

Arguments

token

A character vector of Sumerian tokens (e.g. cuneiform signs).

Details

For a token vector of length \(N\), the function generates all \(N(N+1)/2\) contiguous substrings. The substrings are ordered by n_tokens descending (longest first) and within each group by start ascending. This ordering ensures that the row index of any substring can be computed with the formula

$$\mathrm{row} = \frac{(N - k)(N - k + 1)}{2} + s$$

where \(k\) is the number of tokens (n_tokens) and \(s\) is the starting position (start).
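As a worked check of the formula, take \(N = 3\) with the placeholder tokens a, b, c (chosen here for illustration only). The six substrings in table order are abc, ab, bc, a, b, c, and the formula gives:

$$\begin{aligned}
k = 3,\ s = 1 &: \quad \tfrac{0 \cdot 1}{2} + 1 = 1 \\
k = 2,\ s = 1, 2 &: \quad \tfrac{1 \cdot 2}{2} + s = 2, 3 \\
k = 1,\ s = 1, 2, 3 &: \quad \tfrac{2 \cdot 3}{2} + s = 4, 5, 6
\end{aligned}$$

The first term counts the rows taken up by all substrings longer than \(k\), and \(s\) is the offset within the group of length \(k\).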

The expr column contains the tokens concatenated without separators. The type and translation columns are initialized as empty strings, intended to be filled in later.
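The ordering and the row formula described above can be sketched in base R. This is a minimal illustration, not the package's actual implementation: the names sketch_substr_info and sketch_row_index are hypothetical, the tokens are placeholders, and the sketch assumes a non-empty token vector.

```r
# Hypothetical re-implementation for illustration only;
# the real init_substr_info may be written differently.
sketch_substr_info <- function(token) {
  stopifnot(is.character(token), length(token) >= 1)
  N <- length(token)
  # Lengths from longest to shortest; length k occurs N - k + 1 times.
  n_tokens <- rep(N:1, times = 1:N)
  # Within each length group, start runs 1, 2, ..., N - k + 1.
  start <- sequence(1:N)
  # Concatenate the tokens of each substring without separators.
  expr <- mapply(
    function(s, k) paste(token[s:(s + k - 1)], collapse = ""),
    start, n_tokens
  )
  data.frame(
    start = start,
    n_tokens = n_tokens,
    expr = unname(expr),
    type = "",
    translation = "",
    stringsAsFactors = FALSE
  )
}

# Row index from the formula in the Details section (hypothetical helper).
sketch_row_index <- function(start, n_tokens, N) {
  (N - n_tokens) * (N - n_tokens + 1) / 2 + start
}

token <- c("a", "b", "c")  # placeholder tokens, not real cuneiform signs
df <- sketch_substr_info(token)
df$expr
all(seq_len(nrow(df)) == sketch_row_index(df$start, df$n_tokens, length(token)))
```

Because the row index follows directly from start and n_tokens, a substring's row can be looked up in constant time without searching the table.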

See Also

substr_position for computing the row index from start and n_tokens, skeleton for creating translation templates, make_dictionary for creating dictionaries from filled-in templates

Examples

x <- " ki a. jal2 (e2{kur}) ra. gaba jal2. an ki a"

token <- split_sumerian(as.cuneiform(x))$signs

df <- sumer:::init_substr_info(token)
df

# Verify that substr_position recovers the row indices
N <- length(token)
all(seq_len(nrow(df)) == sumer:::substr_position(df$start, df$n_tokens, N))
