d.spls.simulate: Simulation of a data set

Description

The function d.spls.simulate simulates G mixtures of nondes Gaussians and uses them to build a data set of predictors X and a response y, such that X can be divided into G groups and the values of y depend on the values of X.

Usage

d.spls.simulate(n=200,p=100,nondes=50,sigmaondes=0.05,sigmay=0.5,int.coef=1:5)

Value

A list with the following elements:

X: the concatenated predictors matrix.
y: the response vector.
y0: the noise-free response vector, i.e. y before the noise of uncertainty sigmay is added.
sigmay: the uncertainty on y.
sigmaondes: the standard deviation of the Gaussians.
G: the number of groups.

Arguments

n: a positive integer. n is the number of observations. Default value is 200.
p: a numeric vector of length G. p is the number of variables in each group. Default value is 100.
nondes: a numeric vector of length G. nondes is the number of Gaussians in each mixture. Default value is 50.
sigmaondes: a numeric vector of length G. sigmaondes is the standard deviation of the Gaussians for each group $g$. Default value is 0.05.
sigmay: a real value. sigmay is the uncertainty on y. Default value is 0.5.
int.coef: a numeric vector of the coefficients of the linear combination used to construct the response vector y. Default value is 1:5.

Author

Louna Alsouki, François Wahl

Details

The predictors matrix X is a concatenation of G predictor sub matrices. Each sub matrix is computed as a mixture of Gaussians, i.e. by summing Gaussians of the following form: $$A \exp{(-\frac{(\textrm{xech}-\mu)^2}{2 \sigma^2})},$$ where

$A$ is a numeric vector of random values between 0 and 1,
xech is an element from the sequence of $p(g)$ equally spaced values from 0 to 1. $p(g)$ is the number of variables of the sub matrix $g$, for $g \in \{1, \dots, G\}$,
$\mu$ is a random value in $[0,1]$ representing the mean of the Gaussians,
$\sigma$ is a positive real value specified by the user and representing the standard deviation of the Gaussians.
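
As a rough illustration of the construction described above, the following base R sketch builds one sub matrix by summing Gaussians on an equally spaced grid. It is not the package's source code: the helper name simulate_block is hypothetical, and drawing a new amplitude and a new mean with runif for every observation and every Gaussian is an assumption about details the description leaves open.

```r
# Hypothetical helper: one predictor sub matrix built as a sum of Gaussians.
# Assumption: amplitudes A and means mu are drawn uniformly on [0, 1],
# independently for each observation and each Gaussian.
simulate_block <- function(n, p, nondes, sigma) {
  xech <- seq(0, 1, length.out = p)   # p equally spaced values from 0 to 1
  X <- matrix(0, nrow = n, ncol = p)
  for (i in seq_len(n)) {
    A  <- runif(nondes)               # random amplitudes between 0 and 1
    mu <- runif(nondes)               # random means in [0, 1]
    for (j in seq_len(nondes)) {
      # add the j-th Gaussian, evaluated on the whole grid, to row i
      X[i, ] <- X[i, ] + A[j] * exp(-(xech - mu[j])^2 / (2 * sigma^2))
    }
  }
  X
}

set.seed(1)
Xg <- simulate_block(n = 10, p = 50, nondes = 20, sigma = 0.05)
dim(Xg)
```

Under these assumptions, a smaller sigma gives narrower peaks in each row, which is one way to read the role of sigmaondes.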

The response vector y is a linear combination of the predictors to which we add a noise of uncertainty sigmay. It is computed as follows:

$$y_i= \sigma_y \times V_i +\sum_{g=1}^G \sum_{k=1}^K \textrm{int.coef}_k \times \textrm{sum}X^{g}_{ik},$$ where

$G$ is the number of predictor sub matrices,
$i$ is the index of the observation,
$V$ is a normally distributed vector of 0 mean and unitary standard deviation,
$K$ is the length of the vector int.coef,
$\textrm{sum}X^{g}$ is a matrix of $n$ rows and $K$ columns. The columns of $X^g$ are divided into $K$ equal parts, and column $k$ of $\textrm{sum}X^{g}$ holds, for each row of $X^g$, the sum of the values in the $k$-th part of that row.
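
The formula for y can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the helper simulate_response is hypothetical, it takes the sub matrices as a list, and it splits the columns of each sub matrix into K contiguous blocks of near-equal size with cut, which is only one plausible reading of how the columns are separated.

```r
# Hypothetical helper: response built from a list of sub matrices X_list.
# Assumption: the columns of each sub matrix are split into K contiguous,
# near-equal blocks, and column k of sumX is the row sum over block k.
simulate_response <- function(X_list, int.coef, sigmay) {
  n <- nrow(X_list[[1]])
  K <- length(int.coef)
  y0 <- numeric(n)                    # noise-free response
  for (Xg in X_list) {
    # block index (1..K) of every column of the sub matrix
    block <- cut(seq_len(ncol(Xg)), breaks = K, labels = FALSE)
    sumX <- sapply(seq_len(K), function(k) {
      rowSums(Xg[, block == k, drop = FALSE])
    })
    y0 <- y0 + as.vector(sumX %*% int.coef)
  }
  V <- rnorm(n)                       # zero mean, unit standard deviation
  list(y = y0 + sigmay * V, y0 = y0)
}

set.seed(1)
X_list <- list(matrix(runif(100 * 50), 100, 50),
               matrix(runif(100 * 100), 100, 100))
res <- simulate_response(X_list, int.coef = 1:5, sigmay = 0.5)
length(res$y)
```

Returning both y and y0 mirrors the y and y0 elements listed under Value, so the effect of the noise term can be inspected directly.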

Examples

### load dual.spls library
library(dual.spls)
### one predictors matrix
### parameters
n <- 100
p <- 50
nondes <- 20
sigmaondes <- 0.5
data1 <- d.spls.simulate(n=n,p=p,nondes=nondes,sigmaondes=sigmaondes)

Xa <- data1$X
ya <- data1$y

### plotting the data
plot(Xa[1,],type='l',ylim=c(0,max(Xa)),main='Data', ylab='Xa',col=1)
for (i in 2:n){ lines(Xa[i,],col=i) }

### two predictors matrices
### parameters
n <- 100
p <- c(50,100)
nondes <- c(20,30)
sigmaondes <- c(0.05,0.02)
data2 <- d.spls.simulate(n=n,p=p,nondes=nondes,sigmaondes=sigmaondes)

Xb <- data2$X
X1 <- Xb[,(1:p[1])]
X2 <- Xb[,(p[1]+1):(p[1]+p[2])]
yb <- data2$y

### plotting the full data Xb
plot(Xb[1,],type='l',ylim=c(0,max(Xb)),main='Data', ylab='Xb',col=1)
for (i in 2:n){ lines(Xb[i,],col=i) }

### plotting the first sub matrix X1
plot(X1[1,],type='l',ylim=c(0,max(X1)),main='Data X1', ylab='X1',col=1)
for (i in 2:n){ lines(X1[i,],col=i) }

### plotting the second sub matrix X2
plot(X2[1,],type='l',ylim=c(0,max(X2)),main='Data X2', ylab='X2',col=1)
for (i in 2:n){ lines(X2[i,],col=i) }
