init_substr_info: Initialize a Data Frame of All Substrings

Description

Creates a data frame containing all contiguous substrings of a token vector, including the full token sequence itself. Each row represents one substring, with its starting position, length in tokens, the concatenated expression, and empty columns for type and translation.

The rows are ordered by n_tokens descending and start ascending, so that the row number can be computed from start and n_tokens using substr_position.

This is an internal helper function.

Usage

init_substr_info(token)

Value

A data frame with $N(N+1)/2$ rows, where $N$ is the length of token, and the following columns:

start: Integer. The position of the first token in the substring (1-based).
n_tokens: Integer. The number of tokens in the substring.
expr: Character. The concatenated token sequence (without separators).
type: Character. Initialized as the empty string "".
translation: Character. Initialized as the empty string "".

Arguments

token: A character vector of Sumerian tokens (e.g. cuneiform signs).

Details

For a token vector of length $N$, the function generates all $N(N+1)/2$ contiguous substrings. The substrings are ordered by n_tokens descending (longest first) and within each group by start ascending. This ordering ensures that the row index of any substring can be computed with the formula

$$\mathrm{row} = \frac{(N - k)(N - k + 1)}{2} + s$$

where $k$ is the number of tokens (n_tokens) and $s$ is the starting position (start).
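As a quick sanity check on the formula, the following base R sketch enumerates the (n_tokens, start) pairs in the documented order for a small $N$ and compares the running row number with the value the formula gives. The helper name row_from_formula is hypothetical and not part of the package; the package's own substr_position is not called here.

```r
# Hypothetical helper (not part of the package): row index from the formula above.
row_from_formula <- function(start, n_tokens, N) {
  (N - n_tokens) * (N - n_tokens + 1) / 2 + start
}

N <- 4
# Enumerate substrings longest first, and by start ascending within each length.
k <- rep(N:1, times = 1:N)   # n_tokens: 4, 3, 3, 2, 2, 2, 1, 1, 1, 1
s <- sequence(1:N)           # start:    1, 1, 2, 1, 2, 3, 1, 2, 3, 4
rows <- row_from_formula(s, k, N)

# Each pair should land on its own position in the enumeration.
all(rows == seq_len(N * (N + 1) / 2))
```

The offset $(N - k)(N - k + 1)/2$ counts the rows occupied by all substrings longer than $k$ tokens, which is why adding $s$ gives the row directly.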

The expr column contains the tokens concatenated without separators. The type and translation columns are initialized as empty strings, intended to be filled in later.
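A minimal base R sketch of how a table with this layout could be built, assuming only the column descriptions above. The function name build_substr_table is hypothetical, and the package's actual implementation may differ.

```r
# Hypothetical re-implementation for illustration only; assumes length(token) >= 1.
build_substr_table <- function(token) {
  N <- length(token)
  n_tokens <- rep(N:1, times = 1:N)   # longest substrings first
  start <- sequence(1:N)              # start ascending within each length
  # Concatenate the tokens of each substring without separators.
  expr <- mapply(
    function(s, k) paste(token[s:(s + k - 1)], collapse = ""),
    start, n_tokens
  )
  data.frame(
    start = start,
    n_tokens = n_tokens,
    expr = unname(expr),
    type = "",          # placeholder, to be filled in later
    translation = "",   # placeholder, to be filled in later
    stringsAsFactors = FALSE
  )
}

build_substr_table(c("a", "b", "c"))
```

For the three-token input, the expr column reads "abc", "ab", "bc", "a", "b", "c" from top to bottom, matching the longest-first ordering.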

Examples

x <- " ki a. jal2 (e2{kur}) ra. gaba jal2. an ki a"

token <- split_sumerian(as.cuneiform(x))$signs

df <- sumer:::init_substr_info(token)
df

# Verify that substr_position recovers the row indices
N <- length(token)
all(seq_len(nrow(df)) == sumer:::substr_position(df$start, df$n_tokens, N))