sim.data.ppls: Simulate Data for Penalized Partial Least Squares (PPLS)

Description

Generates a training set and a test set with nonlinear relationships between predictors and response, as used in PPLS simulation studies.

Usage

sim.data.ppls(ntrain, ntest, stnr, p, a = NULL, b = NULL)

Value

A list with the following components:

Xtrain: ntrain x p matrix of training predictors (uniform in [-1, 1]).
ytrain: Numeric vector of training responses.
Xtest: ntest x p matrix of test predictors.
ytest: Numeric vector of test responses.
sigma: Standard deviation of the added noise.
a: Linear coefficients used in the simulation.
b: Nonlinear sine coefficients used in the simulation.

Arguments

ntrain: Integer. Number of training observations.
ntest: Integer. Number of test observations.
stnr: Numeric. Signal-to-noise ratio (higher means less noise).
p: Integer. Number of predictors (must be >= 5).
a: Optional numeric vector of length 5. Linear coefficients for the first 5 variables. If NULL, drawn uniformly from [-1, 1].
b: Optional numeric vector of length 5. Nonlinear sine coefficients for the first 5 variables. If NULL, drawn uniformly from [-1, 1].

Details

The function simulates a response variable y from additive linear and sinusoidal effects of the first 5 predictors: $$f(x) = \sum_{j=1}^{5} \left( a_j x_j + \sin(6 b_j x_j) \right)$$ The response y is then generated by adding Gaussian noise to f(x), with the noise standard deviation (returned as sigma) scaled to match the specified signal-to-noise ratio (stnr).

The remaining p - 5 variables do not enter f(x) and act as pure noise variables, which makes the dataset suitable for evaluating variable selection or regularization methods.
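The mechanism described in Details can be sketched in a few lines of base R. This is an illustrative re-implementation, not the package source: the name sketch_sim is hypothetical, and the noise scaling assumes that the signal-to-noise ratio is defined as var(f) / sigma^2, which the text above does not state explicitly.

```r
# Hypothetical sketch of the data-generating mechanism; not the package source.
sketch_sim <- function(n, p, stnr, a = NULL, b = NULL) {
  stopifnot(p >= 5, stnr > 0)
  if (is.null(a)) a <- runif(5, -1, 1)  # linear coefficients
  if (is.null(b)) b <- runif(5, -1, 1)  # sine coefficients

  # Predictors are uniform on [-1, 1]; only the first 5 columns carry signal.
  X  <- matrix(runif(n * p, -1, 1), nrow = n, ncol = p)
  X5 <- X[, 1:5, drop = FALSE]

  # f(x) = sum_j ( a_j * x_j + sin(6 * b_j * x_j) )
  f <- drop(X5 %*% a) + rowSums(sin(6 * sweep(X5, 2, b, `*`)))

  # ASSUMPTION: stnr = var(f) / sigma^2, so sigma = sqrt(var(f) / stnr).
  sigma <- sqrt(var(f) / stnr)
  y <- f + rnorm(n, sd = sigma)

  list(X = X, y = y, f = f, sigma = sigma, a = a, b = b)
}

set.seed(1)
out <- sketch_sim(n = 200, p = 10, stnr = 3)
var(out$f) / out$sigma^2  # recovers stnr by construction (up to floating point)
```

Under this assumed definition, a larger stnr gives a smaller sigma, which matches the description of the stnr argument (higher means less noise).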

Examples

set.seed(123)
sim <- sim.data.ppls(ntrain = 100, ntest = 100, stnr = 3, p = 10)
str(sim)
plot(sim$Xtrain[, 1], sim$ytrain, main = "Effect of x1 on y")