substr_position: Compute Row Index of a Substring in a Substring Data Frame

Description

Computes the row index of a substring in the data frame created by init_substr_info, given its starting position, its length in tokens, and the total number of tokens.

This is an internal helper function.

Usage

substr_position(start, n_tokens, N)

Value

A numeric vector of row indices (1-based).

Arguments

start: Integer (or integer vector). The starting position of the substring (1-based).
n_tokens: Integer (or integer vector). The number of tokens in the substring.
N: Integer. The total number of tokens in the full token sequence.

Details

The data frame returned by init_substr_info is ordered by n_tokens descending and start ascending. This function computes the corresponding row index using the formula

$$\mathrm{row} = \frac{(N - k)(N - k + 1)}{2} + s$$

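where $k$ = n_tokens and $s$ = start. The first term counts the rows occupied by all substrings longer than $k$ tokens (there are $N - j + 1$ substrings of length $j$), and $s$ is the offset within the block of substrings of length $k$.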

The function is vectorized: if start and n_tokens are vectors of the same length, a vector of row indices is returned.
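
As a rough illustration of the formula, the following sketch re-implements it in base R and checks it against a brute-force enumeration of all substrings in the documented order. The name substr_position_sketch is hypothetical and the code is not taken from the package source; it only assumes the ordering described above (n_tokens descending, start ascending).

```r
# Hypothetical re-implementation of the documented formula
# (an assumption for illustration, not the package's actual code).
substr_position_sketch <- function(start, n_tokens, N) {
  (N - n_tokens) * (N - n_tokens + 1) / 2 + start
}

# Enumerate all (start, n_tokens) pairs for N tokens in the documented
# order: n_tokens descending, start ascending within each length.
N <- 4
grid <- do.call(rbind, lapply(N:1, function(k) {
  data.frame(start = seq_len(N - k + 1), n_tokens = k)
}))

# Vectorized call: one row index per (start, n_tokens) pair.
rows <- substr_position_sketch(grid$start, grid$n_tokens, N)

# The computed indices should coincide with the enumeration order.
all(rows == seq_len(nrow(grid)))
```

For example, with N = 4, the substring with start = 3 and n_tokens = 2 is preceded by one substring of length 4 and two of length 3, so the formula gives 3 + 3 = 6.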

Examples


# Create a character vector with tokens
x <- " ki a. jal2 (e2{kur}) ra. gaba jal2. an ki a"
token <- split_sumerian(as.cuneiform(x))$signs
token

N <- length(token)

# Create a data frame with all substrings
df <- sumer:::init_substr_info(token)

# The full string (start=1, n_tokens=N) is in row 1
pos <- sumer:::substr_position(1, N, N)
pos
df$expr[pos]


# The last single token (start=N, n_tokens=1) is in the last row
pos <- sumer:::substr_position(N, 1, N)
pos
df$expr[pos]

# Vectorized call
start <- c(1, 2, 1)
n_tokens <- c(2, 2, 1)
pos <- sumer:::substr_position(start, n_tokens, N)
pos
df$expr[pos]