Gen_Data: Data simulator for high-dimensional data

Description

This function generates synthetic datasets from generalized linear models (GLMs) with a user-specified correlation structure. It supports both numerical and categorical features, and the number of features may exceed the sample size.

Usage

Gen_Data(
  n = 200,
  p = 5000,
  sigma = 1,
  num_ctgidx = NULL,
  pos_ctgidx = NULL,
  num_truecoef = NULL,
  pos_truecoef = NULL,
  level_ctgidx = NULL,
  effect_truecoef = NULL,
  correlation = c("ID", "AR", "MA", "CS"),
  rho = 0.5,
  family = c("gaussian", "binomial", "poisson")
)

Arguments

n

Sample size, the number of rows of the feature matrix to be generated.

p

Number of columns of the feature matrix to be generated.

sigma

Parameter for noise level.

num_ctgidx

The number of features that are categorical. Set to FALSE for only numerical features. Default is FALSE.

pos_ctgidx

Vector of indices denoting which columns are categorical.

num_truecoef

The number of features (columns) that affect response. Default is 5.

pos_truecoef

Vector of indices denoting which features (columns) affect the response variable.

level_ctgidx

A vector to indicate the levels of categorical features in 'pos_ctgidx'. Default is 2.

effect_truecoef

Effects for the relevant features in 'pos_truecoef'.

correlation

Correlation structure among features: 'ID' for independent, 'MA' for moving average, 'CS' for compound symmetry, 'AR' for auto-regressive. Default is "ID". For more information, see Details.

rho

Parameter controlling the correlation strength. See details.

family

Model used to generate the response from the synthetic features: 'gaussian' for normally distributed data, 'poisson' for non-negative counts, 'binomial' for binary (0-1) data.

Value

Returns an "sdata" object with

Y

Response variable vector of length $n$.

X

Feature matrix or data frame (a matrix if num_ctgidx = FALSE and a data frame otherwise).

index

Vector of column indices of X for the features that affect the response variable (relevant features).

Beta

Vector of effects for the relevant features.

Details

Simulated data $(y_i, x_i)$ for $i = 1, \ldots, n$ are generated as follows. First, we generate a $p \times 1$ model coefficient vector beta with all entries zero except at the positions specified in pos_truecoef, where the values in effect_truecoef are used. When pos_truecoef is not specified, num_truecoef positions are chosen at random from the coefficient vector. When effect_truecoef is not specified, the strengths of the true model coefficients are set at random following the setting of Xu and Chen (2014): $$\left(4 \cdot \frac{\log n}{\sqrt{n}} + U(0,1)\right) \cdot Z$$ where $U(0,1)$ is a uniform random variable on $(0, 1)$ and $Z$ is a random sign with $P(Z=1) = P(Z=-1) = 1/2$.
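The coefficient step described above can be sketched in base R. This is only an illustration of the stated formula, not the package's internal code; the seed and the specific values of n, p and num_truecoef are assumptions chosen for the example.

```r
# Minimal sketch of the coefficient step; not the package's internal code.
set.seed(1)                # assumed seed, only for a reproducible illustration
n <- 200                   # sample size
p <- 5000                  # number of features
num_truecoef <- 5          # number of relevant features

# Randomly choose the positions of the nonzero coefficients.
pos_truecoef <- sort(sample.int(p, num_truecoef))

# Magnitude 4 * log(n) / sqrt(n) + U(0, 1), multiplied by a random sign Z.
U <- runif(num_truecoef)
Z <- sample(c(-1, 1), num_truecoef, replace = TRUE)
effect_truecoef <- (4 * log(n) / sqrt(n) + U) * Z

# Sparse coefficient vector of length p: zero except at the chosen positions.
beta <- numeric(p)
beta[pos_truecoef] <- effect_truecoef
```

Every nonzero entry of beta has absolute value between 4 * log(n) / sqrt(n) and 4 * log(n) / sqrt(n) + 1, with its sign chosen by a fair coin flip.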

Next, we generate an $n \times p$ feature matrix X according to the choice of correlation, as follows.

Independent (ID): all features are independently generated from $N( 0, 1)$.

Moving average (MA): candidate features $x_1, \ldots, x_p$ are jointly normal, marginally $N(0, 1)$, with

$\mathrm{cov}(x_j, x_{j-1}) = \rho$, $\mathrm{cov}(x_j, x_{j-2}) = \frac{\rho}{2}$ and $\mathrm{cov}(x_j, x_h) = 0$ for $|j-h| \geq 3$.

Compound symmetry (CS): candidate features $x_1, \ldots, x_p$ are jointly normal, marginally $N(0, 1)$, with $\mathrm{cov}(x_j, x_h) = \rho$ if $j$ and $h$ are both in the set of important features and $\mathrm{cov}(x_j, x_h) = \frac{\rho}{2}$ when only one of $j$ or $h$ is in the set of important features.

Auto-regressive (AR): candidate features $x_1, \ldots, x_p$ are jointly normal, marginally $N(0, 1)$, with

$\mathrm{cov}(x_j, x_h) = \rho^{|j-h|}$ for all $j$ and $h$.
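To make the AR and MA structures concrete, the sketch below builds the two covariance matrices in base R and draws correlated features through a Cholesky factor. It is an illustration under assumed values (a small p, rho = 0.5, an arbitrary seed) and a hypothetical helper draw_features; it does not reproduce how Gen_Data constructs X internally.

```r
# Illustrative sketch; draw_features is a hypothetical helper, not part of the package.
set.seed(1)
n <- 500
p <- 6
rho <- 0.5

# AR: cov(x_j, x_h) = rho^|j - h| for all j and h.
Sigma_AR <- rho^abs(outer(1:p, 1:p, "-"))

# MA: rho at lag 1, rho / 2 at lag 2, and zero for |j - h| >= 3.
Sigma_MA <- toeplitz(c(1, rho, rho / 2, rep(0, p - 3)))

# Draw n rows from N(0, Sigma): if R = chol(Sigma), then t(R) %*% R = Sigma,
# so rows of Z %*% R have covariance Sigma when Z has independent N(0, 1) entries.
draw_features <- function(n, Sigma) {
  Z <- matrix(rnorm(n * ncol(Sigma)), nrow = n)
  Z %*% chol(Sigma)
}

X_AR <- draw_features(n, Sigma_AR)
X_MA <- draw_features(n, Sigma_MA)

# Empirical correlations of the first feature with the first four features.
round(cor(X_AR)[1, 1:4], 2)
round(cor(X_MA)[1, 1:4], 2)
```

The Cholesky approach requires a positive definite covariance matrix, which holds for both structures at rho = 0.5; larger values of rho can make the MA matrix indefinite, in which case chol() stops with an error.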

Then, we generate the response variable Y according to its response type. For the Gaussian model, $Y = x^T \cdot \beta + \epsilon$ where $\epsilon \sim N(0, 1)$. For the binary model, let $\pi = P(Y = 1|x)$ and sample Y from Bernoulli($\pi$), where $\mathrm{logit}(\pi) = x^T \cdot \beta$. Finally, for the Poisson model, Y is generated from a Poisson distribution with mean $\lambda = \exp(x^T \cdot \beta)$. For more details, see the reference below.
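The three response types can be sketched in base R from a common linear predictor. The feature matrix, the coefficient values and the seed below are assumptions made for the illustration; the code mirrors the formulas in this section rather than the package's internal implementation.

```r
# Illustrative sketch of the response step under assumed inputs.
set.seed(1)
n <- 200
p <- 10
X <- matrix(rnorm(n * p), nrow = n)    # independent N(0, 1) features (ID)
beta <- c(0.5, -0.5, rep(0, p - 2))    # hypothetical sparse coefficients
eta <- drop(X %*% beta)                # linear predictor x^T beta

# Gaussian: Y = x^T beta + epsilon, with epsilon ~ N(0, 1).
y_gaussian <- eta + rnorm(n)

# Binomial: Y ~ Bernoulli(pi) with logit(pi) = x^T beta; plogis is the inverse logit.
y_binomial <- rbinom(n, size = 1, prob = plogis(eta))

# Poisson: Y ~ Poisson with mean exp(x^T beta).
y_poisson <- rpois(n, lambda = exp(eta))
```

Only the last line of each case differs; the sparse beta and the feature matrix are shared across the three families.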

References

Chen Xu and Jiahua Chen (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening. *Journal of the American Statistical Association*, 109(507), 1257-1269.

Examples

# NOT RUN {
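# Assumption: Gen_Data is provided by the SMLE package.
library(SMLE)
# Simulate data with a binomial response and independent structure.
Data <- Gen_Data(family = "binomial", correlation = "ID")
# Sample correlations among the first five features.
cor(Data$X[, 1:5])
print(Data)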


# }
