bigexp_terms: Create a deterministic expansion spec for wide polynomial and interaction models

Description

bigexp_terms() builds a specification object that:

decides which predictors are treated as continuous or categorical,
optionally treats selected variables as blocking factors that enter the model only additively and never in interactions or polynomials,
locks factor levels from the supplied data,
records the contrast settings used when the model matrix is first built, and
constructs a reusable right-hand side (RHS) expression string for a large expansion that can be shared across multiple responses and datasets.

Usage

bigexp_terms(
  formula,
  data,
  factorial_order = 3L,
  polynomial_order = 3L,
  include_pc_2way = TRUE,
  include_pc_3way = FALSE,
  intercept = TRUE,
  blocking = NULL,
  discrete_numeric = NULL,
  audit = c("warn", "error", "none"),
  audit_numeric_rate = 0.9,
  audit_unique_ratio = 0.8,
  audit_min_n = 12L,
  report = TRUE
)

Value

An object of class "bigexp_spec" with components:

formula: expanded formula of the form y ~ <big expansion>, using the response from the input formula.
rhs: right-hand side expansion string (reusable for any response).
vars: character vector of predictor names (including blocking variables) in the order discovered from the formula and data.
is_cat: named logical vector indicating which predictors are treated as categorical (TRUE) versus continuous (FALSE).
levels: list of locked factor levels for categorical predictors.
num_range: 2 x p numeric matrix of ranges for continuous variables (rows c("min","max")).
settings: list of expansion settings, including factorial_order, polynomial_order, include_pc_2way, include_pc_3way, intercept, blocking, and stored contrast information.

Arguments

formula

Main-effects formula, either explicit (y ~ X1 + X2 + G) or using the dot shorthand (y ~ .). The right-hand side should contain main effects only; do not include : (interactions), ^ (factorial shortcuts), I() powers, or inline polynomial expansions. The helper generates interactions and polynomial terms automatically.

data

Data frame used to decide types and lock factor levels.

factorial_order

Integer >= 1. Maximum order of factorial interactions among the non-blocking main effects. For example, 1 gives main effects only, 2 gives up to two-way interactions, 3 gives up to three-way interactions, and so on.
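As a rough illustration of what factorial_order controls, base R's ^ formula operator produces the same kind of interaction expansion. The sketch below uses only terms() from the stats package and borrows the variable names from the examples on this page; it is not the package's internal code.

```r
# Base R sketch: interaction terms generated for orders 2 and 3.
# Illustration only; bigexp_terms() builds its own RHS string.
labels_order2 <- attr(terms(y ~ (X1 + X2 + G)^2), "term.labels")
labels_order3 <- attr(terms(y ~ (X1 + X2 + G)^3), "term.labels")

# Main effects plus all two-way interactions
print(labels_order2)

# The only term added by moving from order 2 to order 3
print(setdiff(labels_order3, labels_order2))
```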

polynomial_order

Integer >= 1. Maximum polynomial degree for continuous non-blocking predictors. A value of 1 means only linear terms; 2 adds squares I(X^2); 3 adds cubes I(X^3); in general, all powers I(X^k) for k from 2 up to polynomial_order are added.
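The rule "all powers I(X^k) for k from 2 up to polynomial_order" can be sketched with a small helper. The function poly_terms() below is hypothetical and written only for this illustration; the package may construct these terms differently.

```r
# Hypothetical helper: list the I(X^k) terms implied by a polynomial order.
poly_terms <- function(vars, order) {
  if (order < 2) return(character(0))  # order 1 means linear terms only
  unlist(lapply(vars, function(v) sprintf("I(%s^%d)", v, 2:order)))
}

print(poly_terms(c("X1", "X2"), 3L))
# "I(X1^2)" "I(X1^3)" "I(X2^2)" "I(X2^3)"

print(poly_terms(c("X1", "X2"), 1L))
# character(0)
```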

include_pc_2way

Logical. If TRUE (default) and polynomial_order >= 2, include partial-cubic two-way terms of the form Z:I(X^2) where X is continuous and Z is another non-blocking predictor.
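To make the Z:I(X^2) pattern concrete, the sketch below enumerates the terms for two continuous predictors and one factor. The helper pc_2way() is hypothetical and assumes Z ranges over every non-blocking predictor other than X; it is not taken from the package source.

```r
# Hypothetical helper: partial-cubic two-way terms Z:I(X^2),
# where X is continuous and Z is any other non-blocking predictor.
pc_2way <- function(continuous, all_vars) {
  unlist(lapply(continuous, function(x) {
    others <- setdiff(all_vars, x)
    if (length(others) == 0) return(character(0))  # no partner available
    paste0(others, ":I(", x, "^2)")
  }))
}

pc_labels <- pc_2way(continuous = c("X1", "X2"), all_vars = c("X1", "X2", "G"))
print(pc_labels)
# "X2:I(X1^2)" "G:I(X1^2)" "X1:I(X2^2)" "G:I(X2^2)"
```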

include_pc_3way

Logical. If TRUE and polynomial_order >= 2, include partial-cubic three-way terms I(X^2):Z:W among non-blocking predictors.

intercept

Logical. If TRUE (default), include an intercept in the expansion; if FALSE, the generated RHS drops the intercept.

blocking

Optional character vector of column names in data to treat as blocking factors. These variables are included in the spec and typed like other predictors (categorical vs continuous), but they enter the model only as additive main effects and never appear in interactions, polynomials, or partial-cubic terms. Important: when using y ~ ., blocking variables are automatically excluded from the "non-blocking" predictor set so they do not trigger a conflict error. When using an explicit RHS (for example y ~ X1 + X2), blocking variables must not also be explicitly listed on the right-hand side.
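The additive-only behavior of blocking variables can be pictured with a hand-built right-hand side. The string assembled below is an assumption about the general shape of such an expansion, not the exact text stored in spec$rhs.

```r
# Sketch: interactions among the main variables, blocking variables added
# only as main effects. The RHS layout here is illustrative.
main_vars  <- c("X1", "X2", "G")
block_vars <- c("Operator", "AmbientTemp")

rhs <- paste0(
  "(", paste(main_vars, collapse = " + "), ")^2 + ",
  paste(block_vars, collapse = " + ")
)
labels <- attr(terms(as.formula(paste("y ~", rhs))), "term.labels")
print(labels)

# Check that no interaction term involves a blocking variable.
interaction_labels <- grep(":", labels, value = TRUE)
any_block_interaction <- any(grepl("Operator|AmbientTemp", interaction_labels))
print(any_block_interaction)
```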

discrete_numeric

Optional specification of "discrete numeric" predictors for downstream sampling (for example in svem_random_table_multi()). These predictors are still treated as numeric for modeling and expansion (that is, they remain continuous in the design matrix and may participate in polynomial and interaction terms). This option only records a finite set of preferred numeric levels to be used when randomly generating recipes. Supply either:

a character vector of predictor names, in which case the allowed levels are inferred as the sorted unique finite values observed in data; or
a named list mapping predictor names to numeric vectors of allowed levels. If an entry is NULL or length zero, levels are inferred from data for that predictor.
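The two accepted input forms can be sketched as follows. Both infer_levels() and resolve_levels() are hypothetical helpers written for this illustration; they follow the rule stated above (sorted unique finite values when levels are not supplied) but are not the package's own functions.

```r
# Hypothetical helpers mirroring the documented inference rule.
infer_levels <- function(x) sort(unique(x[is.finite(x)]))

resolve_levels <- function(spec, data) {
  # A character vector means "infer everything from data".
  if (is.character(spec)) {
    spec <- setNames(vector("list", length(spec)), spec)
  }
  lapply(setNames(names(spec), names(spec)), function(nm) {
    lv <- spec[[nm]]
    if (length(lv) == 0) infer_levels(data[[nm]]) else lv
  })
}

df_levels <- data.frame(D = c(2, 0.5, NA, 4, 1, 2, 0.5))

from_names <- resolve_levels("D", df_levels)
from_list  <- resolve_levels(list(D = c(1, 2)), df_levels)
print(from_names$D)  # 0.5 1 2 4 (inferred from data)
print(from_list$D)   # 1 2 (supplied explicitly)
```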

audit

How to handle suspicious typing / high-cardinality issues when building the spec. One of "warn" (default), "error", or "none". Audits cover numeric-like character/factor columns (including percent strings like "25%"), and very high-cardinality categorical predictors that are likely IDs or mis-typed numerics.

audit_numeric_rate

Numeric in (0, 1). A column stored as character or factor is flagged as numeric-like if at least this fraction of its non-missing values parse as numeric (after stripping commas and an optional trailing %).
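A minimal sketch of this kind of check is shown below. The helper numeric_like_rate() is hypothetical and only approximates the described parsing (remove commas, remove one trailing %, then attempt numeric conversion); the package's actual audit may differ in detail.

```r
# Hypothetical helper: fraction of non-missing values that parse as numeric.
numeric_like_rate <- function(x) {
  x <- as.character(x)
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)
  cleaned <- sub("%$", "", gsub(",", "", trimws(x)))
  parsed  <- suppressWarnings(as.numeric(cleaned))
  mean(!is.na(parsed))
}

col  <- c("25%", "1,200", "3.5", "n/a", NA, "40%")
rate <- numeric_like_rate(col)
print(rate)          # 0.8: four of the five non-missing values parse
print(rate >= 0.9)   # FALSE: below the default audit_numeric_rate
```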

audit_unique_ratio

Numeric in (0, 1). For categorical predictors, warn or error (according to audit) if the number of unique non-missing values divided by the number of non-missing values is at least audit_unique_ratio.

audit_min_n

Integer >= 1. Minimum number of non-missing values required before audits are applied.
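The high-cardinality audit and the audit_min_n guard can be sketched together. The helper unique_ratio_flag() is hypothetical; its defaults copy the argument defaults listed in Usage, and treating columns with too few values as "not flagged" is an assumption about how the guard behaves.

```r
# Hypothetical helper: flag a categorical column whose share of unique
# values is at least `ratio`, skipping columns with too few values.
unique_ratio_flag <- function(x, ratio = 0.8, min_n = 12L) {
  x <- x[!is.na(x)]
  if (length(x) < min_n) return(FALSE)  # too few values to audit
  length(unique(x)) / length(x) >= ratio
}

run_id <- paste0("run", 1:20)       # every value unique: looks like an ID
group  <- rep(c("A", "B"), 10)      # two levels over 20 rows
short  <- c("a", "b", "c")          # fewer than audit_min_n values

flags <- c(
  run_id = unique_ratio_flag(run_id),
  group  = unique_ratio_flag(group),
  short  = unique_ratio_flag(short)
)
print(flags)
```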

report

Logical. If TRUE (default), print a compact summary of the inferred predictor types and settings (via print.bigexp_spec) when bigexp_terms() returns.

Typical workflow

In a typical multi-response workflow you:

Call bigexp_terms() once on your training data to build and lock the expansion (types, levels, contrasts, RHS).
Fit models using spec$formula and the original data (for example, SVEMnet(spec$formula, data, ...) or lm(spec$formula, data)).
For new batches, call bigexp_prepare() with the same spec so that design matrices have exactly the same columns and coding.
For additional responses on the same factor space, use bigexp_formula() to swap the left-hand side while reusing the locked expansion.
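The idea of reusing one right-hand side across several responses can be mimicked in base R. In the sketch below the rhs string is a hand-written stand-in for spec$rhs, and the paste-and-as.formula step only imitates what swapping the left-hand side achieves; it does not call the package.

```r
# Sketch: one shared RHS string, several responses (base R only).
set.seed(10)
df_multi <- data.frame(
  y1 = rnorm(12),
  y2 = rnorm(12),
  X1 = rnorm(12),
  X2 = rnorm(12)
)

rhs <- "(X1 + X2)^2 + I(X1^2) + I(X2^2)"  # stand-in for spec$rhs

fits <- lapply(c(y1 = "y1", y2 = "y2"), function(resp) {
  lm(as.formula(paste(resp, "~", rhs)), data = df_multi)
})

# Both fits share the same set of coefficient names.
same_terms <- identical(names(coef(fits$y1)), names(coef(fits$y2)))
print(same_terms)
```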

Details

The expansion for non-blocking predictors can include:

full factorial interactions among the listed main effects, up to a chosen order;
polynomial terms I(X^k) for continuous predictors up to a chosen degree; and
optional partial-cubic interactions of the form Z:I(X^2) and I(X^2):Z:W.

Predictor types are inferred from data:

factors, characters, and logicals are treated as categorical;
all other numeric predictors are treated as continuous, and their observed ranges are stored.
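The typing rule above can be reproduced in a few lines of base R. This is a sketch of the stated rule, not the package's code; the data frame and column names are made up for the illustration.

```r
# Sketch of the typing rule: factor, character, and logical columns are
# categorical; remaining (numeric) columns are continuous with stored ranges.
df_types <- data.frame(
  X1   = c(0.1, 2.5, 1.3),
  G    = factor(c("A", "B", "A")),
  Flag = c(TRUE, FALSE, TRUE),
  Lot  = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

is_cat <- vapply(
  df_types,
  function(col) is.factor(col) || is.character(col) || is.logical(col),
  logical(1)
)

# 2 x p matrix of ranges for the continuous columns
num_range <- vapply(df_types[!is_cat], range, numeric(2), na.rm = TRUE)
rownames(num_range) <- c("min", "max")

print(is_cat)
print(num_range)
```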

Variables listed in blocking are included in the spec and are typed using the same rules as other predictors (for example, a numeric blocking variable with many distinct values is treated as continuous). However, blocking variables enter the model only as additive main effects, without interactions or polynomial terms, regardless of factorial_order or polynomial_order.

Once built, a "bigexp_spec" can be reused to create consistent expansions for new datasets via bigexp_prepare() and bigexp_formula(). The RHS and contrast settings are locked, so the same spec applied to different data produces design matrices with the same columns in the same order (up to missing levels for specific batches).
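Why locking factor levels matters can be seen with plain model.matrix(). The sketch below is base R only and assumes default treatment contrasts; it shows the general mechanism, not the package's implementation.

```r
# Sketch: a new batch that is missing level "C" still yields the same
# design-matrix columns once its factor is rebuilt with the locked levels.
train <- data.frame(X1 = c(1, 2, 3, 4), G = factor(c("A", "B", "C", "A")))
locked_levels <- levels(train$G)

batch <- data.frame(X1 = c(5, 6), G = c("A", "B"), stringsAsFactors = FALSE)

# Without locking, the batch factor only knows about A and B.
mm_unlocked <- model.matrix(~ X1 + G, transform(batch, G = factor(G)))

# With locking, the absent level keeps its (all-zero) column.
batch$G   <- factor(batch$G, levels = locked_levels)
mm_train  <- model.matrix(~ X1 + G, train)
mm_locked <- model.matrix(~ X1 + G, batch)

print(colnames(mm_unlocked))
print(colnames(mm_locked))
```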

Examples


## Example 1: small design with one factor
set.seed(1)
df <- data.frame(
  y  = rnorm(20),
  X1 = rnorm(20),
  X2 = rnorm(20),
  G  = factor(sample(c("A", "B"), 20, replace = TRUE))
)

## Two-way interactions and up to cubic terms in X1 and X2
spec <- bigexp_terms(
  y ~ X1 + X2 + G,
  data             = df,
  factorial_order  = 2,
  polynomial_order = 3
)

print(spec)

## Example 2: pure main effects (no interactions, no polynomial terms)
spec_main <- bigexp_terms(
  y ~ X1 + X2 + G,
  data             = df,
  factorial_order  = 1,  # main effects only
  polynomial_order = 1   # no I(X^2) or higher
)

## Example 3: blocking factors (categorical and continuous)
set.seed(2)
df_block <- data.frame(
  y           = rnorm(30),
  X1          = rnorm(30),
  X2          = rnorm(30),
  G           = factor(sample(c("A", "B"), 30, replace = TRUE)),
  Operator    = factor(sample(paste0("Op", 1:3), 30, replace = TRUE)),
  AmbientTemp = rnorm(30, mean = 22, sd = 2)  # continuous blocking covariate
)

## Here Operator (categorical) and AmbientTemp (continuous) are blocking factors:
## they enter additively, but do not appear in interactions or polynomials.
spec_block <- bigexp_terms(
  y ~ X1 + X2 + G,
  data             = df_block,
  factorial_order  = 2,
  polynomial_order = 3,
  blocking         = c("Operator", "AmbientTemp")
)

print(spec_block)
spec_block$rhs

## Example 4: discrete numeric predictors (finite numeric support)
## A common case is a numeric process setting that only takes a small set
## of allowed values (e.g., 0.5, 1, 2, 4). Use `discrete_numeric` in
## bigexp_terms() so downstream sampling respects those levels automatically.
# \donttest{
set.seed(3)
D_allowed <- c(0.5, 1, 2, 4)
df_disc <- data.frame(
  y  = rnorm(60),
  D  = sample(D_allowed, 60, replace = TRUE),   # numeric with discrete support
  X1 = rnorm(60),
  G  = factor(sample(c("A", "B"), 60, replace = TRUE))
)

# Record that D should be treated as "discrete numeric" for downstream sampling.
# Levels are inferred automatically from the training data.
spec_disc <- bigexp_terms(
  y ~ D + X1 + G,
  data             = df_disc,
  factorial_order  = 2,
  polynomial_order = 2,
  discrete_numeric = "D"
)


# Fit. The discrete support is expected to propagate into fit$sampling_schema
# (assuming the updated SVEMnet implementation that stores sampling_schema).
fit_disc <- SVEMnet(spec_disc, df_disc, nBoot = 20)

# Score random candidates; sampled D values stay in D_allowed
scored <- svem_score_random(
  objects         = list(y = fit_disc),
  goals           = list(y = list(goal = "max", weight = 1)),
  n               = 2000,
  numeric_sampler = "random",
  verbose         = FALSE
)

table(scored$score_table$D)
stopifnot(all(scored$score_table$D %in% D_allowed))
# }
