Gen_Data: Data simulator for high-dimensional GLMs

Description

This function generates synthetic datasets from GLMs with a user-specified correlation structure. It supports both numerical and categorical features, and the number of features may exceed the sample size.

Usage

Gen_Data(n = 200, p = 5000, num_ctgidx = NULL, pos_ctgidx = NULL,
  num_truecoef = NULL, pos_truecoef = NULL, correlation = c("ID",
  "AR", "MA", "CS"), rho = 0.5, family = c("gaussian", "binomial",
  "poisson"))

Arguments

n

Sample size, number of rows of the data.frame (or matrix).

p

Number of features.

num_ctgidx

The number of features that are categorical. Set to FALSE for only numerical features. Default is FALSE.

pos_ctgidx

Vector of indices denoting which columns are categorical.

num_truecoef

The number of features that affect response. Default is 5.

pos_truecoef

Vector of indices denoting which columns affect the response variable.

correlation

Correlation structure between features: correlation = 'ID' for all variables independent, correlation = 'MA' for moving average, correlation = 'CS' for compound symmetry, correlation = 'AR' for auto-regressive. Default is 'ID' (independent). For more information, see Details.

rho

Correlation parameter for AR(1) data; used only when correlation = "AR". Default is 0.5.

family

Response type. 'gaussian' for normally distributed data, 'poisson' for non-negative counts, 'binomial' for binary (0-1).

Value

Returns a "sdata" object with

Y

Response variable vector of length \(n\).

X

Feature matrix or data frame (matrix if num_ctgidx = FALSE, data frame otherwise).

index

Vector of column indices of X for the features that affect the response variable (causal features).

Beta

Vector of effects for the causal features.

Details

Simulated data \((y_i, x_i)\) for \(i = 1, \ldots, n\) are generated as follows. First, randomly sample num_truecoef important features among the p features; the magnitudes of their effects (the \(\beta\)'s) are drawn from U(0,1) and their signs are chosen at random. Second, generate X using the selected correlation structure (independent, auto-regressive, moving average, or compound symmetry). Then, to generate categorical data, convert numerical features to categorical ones by binning their values: num_ctgidx numerical columns are selected at random and each is converted into a four-level factor.
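The binning step can be illustrated with base R. This is only a sketch: the page does not state which cut points Gen_Data uses, so the quartile breaks and the level labels below are assumptions made for illustration.

```r
set.seed(4)
x <- rnorm(200)                       # one numerical feature

# Assumed binning rule: quartile breaks, giving four groups of similar size.
breaks <- quantile(x, probs = seq(0, 1, by = 0.25))

# Convert to a four-level factor; include.lowest keeps the minimum in the first bin.
x_ctg <- cut(x, breaks = breaks, include.lowest = TRUE,
             labels = c("A", "B", "C", "D"))   # hypothetical labels
table(x_ctg)
```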

Moving average: candidate features \(x_1,..., x_p\) are joint normal, marginally N(0, 1), with \(cov(x_j, x_{j-1}) = \frac{2}{3}\), \(cov(x_j, x_{j-2}) = \frac{1}{3}\) and \(cov(x_j, x_h) = 0\) for \(|j-h| \geq 3\).
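One construction that yields exactly these covariances is to average three consecutive independent N(0, 1) innovations and rescale. The sketch below uses base R only; it is not necessarily how Gen_Data builds the features internally, and the names Z and Sigma are illustrative.

```r
set.seed(1)
n <- 2000
p <- 6

# Independent N(0, 1) innovations; two extra columns feed the last features.
Z <- matrix(rnorm(n * (p + 2)), nrow = n)

# Each feature sums three consecutive innovations, scaled to unit variance.
# Adjacent features share two innovations (cov 2/3), features two apart share one (cov 1/3).
X <- (Z[, 1:p] + Z[, 2:(p + 1)] + Z[, 3:(p + 2)]) / sqrt(3)

# Theoretical covariance matrix implied by the construction.
Sigma <- toeplitz(c(1, 2/3, 1/3, rep(0, p - 3)))

round(cov(X), 2)   # sample covariance, close to Sigma for large n
```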

Compound symmetry: candidate features \(x_1,..., x_p\) are joint normal, marginally N(0, 1), with \(cov(x_j, x_h) = 0.15\) if \(j\) and \(h\) are both in the set of important features and \(cov(x_j, x_h) = 0.3\) when only one of \(j\) or \(h\) is in the set of important features.

Auto-regressive: candidate features \(x_1,..., x_p\) are joint normal, marginally N(0, 1), with \(cov(x_j, x_{j+1}) = \rho\) for all \(j\).
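As a rough illustration of the auto-regressive case, the sketch below draws features whose adjacent covariance is \(\rho\). It assumes the usual AR(1) form \(cov(x_j, x_h) = \rho^{|j-h|}\) for non-adjacent features, which this page does not spell out, and it uses a Cholesky factor from base R rather than whatever Gen_Data does internally.

```r
set.seed(2)
n <- 2000
p <- 5
rho <- 0.5

# Assumed AR(1) covariance: Sigma[j, h] = rho^|j - h|, so adjacent features have covariance rho.
Sigma <- rho^abs(outer(1:p, 1:p, "-"))

# chol() returns upper-triangular R with t(R) %*% R = Sigma,
# so rows of Z %*% R are draws from N(0, Sigma).
R <- chol(Sigma)
X <- matrix(rnorm(n * p), nrow = n) %*% R

round(cor(X), 2)   # adjacent entries should be near rho
```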

Then, generate the response variable Y according to its response type. For the Gaussian model, \(Y = x^T \cdot \beta + \epsilon\) where \(\epsilon \sim N(0,1)\). For the binary model, let \(\pi = P(Y = 1|x)\) and sample Y from Bernoulli(\(\pi\)) where \(logit(\pi) = x^T \cdot \beta\). Finally, for the Poisson model, Y is generated from a Poisson distribution with mean \(\mu = exp(x^T \cdot \beta)\) (log link). For more details, see the reference below.
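The response step for the three families can be sketched in base R as follows. The small dimensions, the independent features, and the variable names (eta, Y_gaussian, and so on) are illustrative assumptions; this mirrors the formulas above rather than the package's internal code.

```r
set.seed(3)
n <- 200
p <- 10
k <- 3                                          # number of causal features

X <- matrix(rnorm(n * p), nrow = n)             # independent features ("ID")
index <- sort(sample(p, k))                     # positions of the causal features
Beta <- runif(k) * sample(c(-1, 1), k, replace = TRUE)  # U(0,1) magnitude, random sign

# Linear predictor x^T beta, using only the causal columns.
eta <- drop(X[, index, drop = FALSE] %*% Beta)

Y_gaussian <- eta + rnorm(n)                            # epsilon ~ N(0, 1)
Y_binomial <- rbinom(n, size = 1, prob = plogis(eta))   # logit(pi) = x^T beta
Y_poisson  <- rpois(n, lambda = exp(eta))               # mean exp(x^T beta)
```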

References

Chen Xu and Jiahua Chen (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening. Journal of the American Statistical Association, 109(507), 1257-1269.

Examples


# NOT RUN {
# Simulate data with a binomial response and independent structure.
Data <- Gen_Data(family = "binomial", correlation = "ID")
cor(Data$X[,1:5])
print(Data)


# }
