SMLE (version 0.4.0)

Gen_Data: Data simulator for high-dimensional GLMs

Description

This function generates synthetic datasets from GLMs with a user-specified correlation structure. It permits both numerical and categorical features, whose quantity can be larger than the sample size.

Usage

Gen_Data(
  n = 200,
  p = 5000,
  sigma = 1,
  num_ctgidx = NULL,
  pos_ctgidx = NULL,
  num_truecoef = NULL,
  pos_truecoef = NULL,
  level_ctgidx = NULL,
  effect_truecoef = NULL,
  correlation = c("ID", "AR", "MA", "CS"),
  rho = 0.5,
  family = c("gaussian", "binomial", "poisson")
)

Arguments

n

Sample size, i.e. the number of rows of the feature matrix to be generated.

p

Number of columns of the feature matrix to be generated.

sigma

Parameter for noise level.

num_ctgidx

The number of features that are categorical. Set to FALSE for only numerical features. Default is FALSE.

pos_ctgidx

Vector of indices denoting which columns are categorical.

num_truecoef

The number of features (columns) that affect response. Default is 5.

pos_truecoef

Vector of indices denoting which features (columns) affect the response variable.

level_ctgidx

A vector to indicate the levels of categorical features in 'pos_ctgidx'. Default is 2.

effect_truecoef

Effects for the relevant features in 'pos_truecoef'.

correlation

Correlation structure among features: correlation = 'ID' for independent, correlation = 'MA' for moving average, correlation = 'CS' for compound symmetry, correlation = 'AR' for auto-regressive. Default is "ID". For more information, see Details.

rho

Parameter controlling the correlation strength. See details.

family

Models to generate the response from the synthetic features: 'gaussian' for normally distributed data, 'poisson' for non-negative counts, 'binomial' for binary (0-1).

Value

Returns a "sdata" object with

Y

Response variable vector of length \(n\)

X

Feature matrix or data frame (a matrix if num_ctgidx = FALSE and a data frame otherwise).

index

Vector of column indices of X for the features that affect the response variable (relevant features).

Beta

Vector of effects for the relevant features.

Details

Simulated data \((y_i, x_i)\) for \(i = 1, \ldots, n\) are generated as follows. First, we generate a \(p \times 1\) model coefficient vector beta with all entries equal to zero except at the positions specified in pos_truecoef, where the values in effect_truecoef are used. When pos_truecoef is not specified, num_truecoef positions are chosen at random from the coefficient vector. When effect_truecoef is not specified, the strengths of the true model coefficients are set at random following Chen's setting: $$\left(4 \cdot \frac{\log{N}}{\sqrt{N}} + U(0,1)\right) \cdot Z$$ where \(U(0,1)\) is a uniform random variable on \((0,1)\) and \(P(Z=1) = P(Z=-1) = 1/2\).
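The coefficient step can be sketched in base R as follows. This is only an illustration of the description above, not the package's internal code: it assumes that N in the formula is the sample size n, and the values of n, p and num_truecoef are chosen for the example.

```r
set.seed(1)
n <- 200            # sample size (assumed to play the role of N in the formula)
p <- 1000           # number of features (illustrative value)
num_truecoef <- 5   # number of relevant features (illustrative value)

# Randomly choose the positions of the nonzero coefficients.
pos_truecoef <- sort(sample(p, num_truecoef))

# Magnitude 4 * log(n) / sqrt(n) + U(0, 1), multiplied by a random sign Z.
magnitude <- 4 * log(n) / sqrt(n) + runif(num_truecoef)
sign_z <- sample(c(-1, 1), num_truecoef, replace = TRUE)
effect_truecoef <- magnitude * sign_z

# Sparse p x 1 coefficient vector: zero except at the chosen positions.
beta <- numeric(p)
beta[pos_truecoef] <- effect_truecoef
```

The constant term keeps every nonzero coefficient away from zero, so the relevant features remain detectable as n grows.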

Next, we generate a \(n \times p\) feature matrix X based on the choice in correlation specified as follows.

Independent (ID): all features are independently generated from \(N( 0, 1)\).

Moving average (MA): candidate features \(x_1,..., x_p\) are joint normal, marginally \(N( 0, 1)\), with

cov\((x_j, x_{j-1}) = \rho\), cov\((x_j, x_{j-2}) = \frac{\rho}{2}\) and cov\((x_j, x_h) = 0\) for \(|j-h| \geq 3\).

Compound symmetry (CS): candidate features \(x_1,..., x_p\) are joint normal, marginally \(N( 0, 1)\), with cov\((x_j, x_h) = \rho\) if \(j\) and \(h\) are both in the set of important features, and cov\((x_j, x_h) = \frac{\rho}{2}\) when only one of \(j\) or \(h\) is in the set of important features.

Auto-regressive (AR): candidate features \(x_1,..., x_p\) are joint normal, marginally \(N( 0, 1)\), with

cov\((x_j, x_h) = \rho^{|j-h|}\) for all \(j\) and \(h\).
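For the AR and MA structures, the covariance matrices described above can be written down directly and used to draw correlated features. The following base R sketch uses a small p for readability; the helper draw_features is hypothetical and is not part of the package, and the CS structure is omitted because its construction depends on the set of important features.

```r
set.seed(2)
n <- 500
p <- 6
rho <- 0.5

# Matrix of lags |j - h| between feature indices.
lag <- abs(outer(seq_len(p), seq_len(p), "-"))

# AR: cov(x_j, x_h) = rho^|j - h|.
sigma_ar <- rho^lag

# MA: rho at lag 1, rho / 2 at lag 2, zero at lag 3 or more.
sigma_ma <- diag(p)
sigma_ma[lag == 1] <- rho
sigma_ma[lag == 2] <- rho / 2

# Hypothetical helper: draw n rows from N(0, sigma) via the Cholesky factor.
draw_features <- function(n, sigma) {
  z <- matrix(rnorm(n * nrow(sigma)), nrow = n)
  z %*% chol(sigma)
}

x_ar <- draw_features(n, sigma_ar)
x_ma <- draw_features(n, sigma_ma)

# Sample correlations of the first feature with features 1 to 3.
round(cor(x_ar)[1, 1:3], 2)
```

With rho = 0.5 both matrices are positive definite, so chol() succeeds; for other values of rho the MA matrix should be checked before it is factorized.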

Then, the response variable Y is generated according to its response type. For the Gaussian model, \(Y = x^T \cdot \beta + \epsilon\), where \(\epsilon \sim N( 0, 1)\). For the binary model, let \(\pi = P(Y = 1|x)\); y is sampled from Bernoulli(\(\pi\)), where \(logit(\pi) = x^T \cdot \beta\). Finally, for the Poisson model, Y is generated from a Poisson distribution whose mean \(\pi\) satisfies the link \(\pi = exp(x^T \cdot \beta)\). For more details, see the reference below.
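The three response types can be illustrated with base R random generators. This is a minimal sketch under stated assumptions: the feature matrix uses the independent (ID) structure, and the coefficient vector beta is a made-up example rather than output of Gen_Data.

```r
set.seed(3)
n <- 200
p <- 10

# Independent N(0, 1) features and a hypothetical sparse coefficient vector.
x <- matrix(rnorm(n * p), nrow = n)
beta <- c(1.5, -1, 0.5, rep(0, p - 3))

# Linear predictor x^T beta for every observation.
eta <- drop(x %*% beta)

# Gaussian: linear predictor plus N(0, 1) noise.
y_gaussian <- eta + rnorm(n)

# Binomial: Bernoulli draws with logit(pi) = eta, i.e. pi = plogis(eta).
y_binomial <- rbinom(n, size = 1, prob = plogis(eta))

# Poisson: counts with mean exp(eta).
y_poisson <- rpois(n, lambda = exp(eta))
```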

References

Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening. Journal of the American Statistical Association, 109(507), 1257-1269.

Examples

# NOT RUN {
# Simulating data with binomial response and independent structure.
Data <- Gen_Data(family = "binomial", correlation = "ID")
cor(Data$X[,1:5])
print(Data)


# }
