This function generates synthetic datasets from GLMs with a user-specified correlation structure. It supports both numerical and categorical features, and the number of features may exceed the sample size.
Gen_Data(n = 200, p = 5000, num_ctgidx = NULL, pos_ctgidx = NULL,
num_truecoef = NULL, pos_truecoef = NULL, correlation = c("ID",
"AR", "MA", "CS"), rho = 0.5, family = c("gaussian", "binomial",
"poisson"))
Sample size, number of rows of the data.frame (or matrix).
Number of features.
The number of features that are categorical. Set to FALSE for only numerical features. Default is FALSE.
Vector of indices denoting which columns are categorical.
The number of features that affect response. Default is 5.
Vector of indices denoting which columns affect the response variable.
Correlation structure between features: correlation = 'ID' for independent features, correlation = 'MA' for moving average, correlation = 'CS' for compound symmetry, and correlation = 'AR' for auto-regressive. Default is "ID" (independent). For more information, see Details.
Parameter for AR(1) data, when correlation = "AR". Default is 0.5.
Response type: 'gaussian' for normally distributed data, 'poisson' for non-negative counts, 'binomial' for binary (0-1) data.
Returns an "sdata" object with
Response variable vector of length \(n\)
Feature matrix or data frame (a matrix if num_ctgidx = FALSE, and a data frame otherwise).
Vector of column indices of X for the features that affect the response variable (causal features).
Vector of effects for the causal features.
Simulated data \((y_i , x_i)\) for \(i = 1, \ldots, n\) are generated as follows:
First, randomly sample num_truecoef important features among the p features. The effects (the \(\beta\)'s) are then generated by drawing each magnitude from U(0,1) and assigning each sign at random.
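The first step can be sketched in base R as follows. This is an illustration only, not the package's internal code: the variable names `magnitude`, `sign_flip`, `beta_true`, and `beta` are hypothetical, and drawing the signs with equal probability is an assumption, since the text only says the sign is chosen randomly.

```r
set.seed(1)
p <- 5000            # number of features
num_truecoef <- 5    # number of causal features

# Randomly choose which columns carry a non-zero effect.
pos_truecoef <- sort(sample.int(p, num_truecoef))

# Magnitudes from U(0, 1); signs +1 or -1 with equal probability (assumption).
magnitude <- runif(num_truecoef, min = 0, max = 1)
sign_flip <- sample(c(-1, 1), num_truecoef, replace = TRUE)
beta_true <- magnitude * sign_flip

# Full coefficient vector: zero everywhere except at the causal positions.
beta <- numeric(p)
beta[pos_truecoef] <- beta_true
```

Keeping the full-length `beta` alongside the index vector makes it straightforward to form the linear predictor later with a single matrix product.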
Second, generate X using the selected correlation structure (independent, auto-regressive, moving average, compound symmetry). Then, to generate categorical data, we convert numerical features to categorical ones by binning their values: we randomly select num_ctgidx numerical columns and convert each of them into a four-level factor.
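The binning step might look like the sketch below. The text does not say where the bin boundaries are placed, so cutting at the sample quartiles is an assumption, as are the level labels "A" to "D" and the names `X_df` and `breaks`.

```r
set.seed(2)
n <- 200
X <- matrix(rnorm(n * 10), nrow = n)   # stand-in numerical feature matrix
num_ctgidx <- 3                        # number of columns to convert

# Randomly pick the columns that become categorical.
pos_ctgidx <- sort(sample.int(ncol(X), num_ctgidx))

# A data frame is needed because a matrix cannot hold factors.
X_df <- as.data.frame(X)
for (j in pos_ctgidx) {
  # Quartile cut points (assumption) give four roughly equal-sized bins.
  breaks <- quantile(X[, j], probs = seq(0, 1, by = 0.25))
  X_df[[j]] <- cut(X[, j], breaks = breaks, include.lowest = TRUE,
                   labels = c("A", "B", "C", "D"))
}

str(X_df[pos_ctgidx])
```

This also shows why the returned feature object is a data frame rather than a matrix once categorical columns are requested.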
Moving average: candidate features \(x_1,..., x_p\) are jointly normal, marginally N(0, 1), with \(cov(x_j, x_{j-1}) = \frac{2}{3}\), \(cov(x_j, x_{j-2}) = \frac{1}{3}\) and \(cov(x_j, x_h) = 0\) for \(|j-h| \geq 3\).
Compound symmetry: candidate features \(x_1,..., x_p\) are jointly normal, marginally N(0, 1), with \(cov(x_j, x_h) = 0.15\) if \(j\) and \(h\) are both in the set of important features and \(cov(x_j, x_h) = 0.3\) when only one of \(j\) or \(h\) is in the set of important features.
Auto-regressive: candidate features \(x_1,..., x_p\) are jointly normal, marginally N(0, 1), with \(cov(x_j, x_{j+1}) = \rho\) for all \(j\).
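As a rough illustration of the moving average and auto-regressive structures above, the sketch below builds features with the stated covariances from independent standard normals. The constructions are assumptions chosen because they reproduce those covariances (an equal-weight average of three neighbours, and a first-order recursion); they are not necessarily how Gen_Data works internally, and the names `Z`, `E`, `X_ma`, and `X_ar` are hypothetical.

```r
set.seed(3)
n <- 5000   # large n so the sample covariances are close to their targets
p <- 6
rho <- 0.5

# Moving average: average three neighbouring independent N(0, 1) draws.
# Adjacent features share two draws (covariance 2/3), features two apart
# share one (covariance 1/3), and features three or more apart share none.
Z <- matrix(rnorm(n * (p + 2)), nrow = n)
X_ma <- (Z[, 1:p] + Z[, 2:(p + 1)] + Z[, 3:(p + 2)]) / sqrt(3)

# Auto-regressive: each feature is rho times the previous one plus fresh
# noise, scaled so that every feature keeps unit variance.
E <- matrix(rnorm(n * p), nrow = n)
X_ar <- E
for (j in 2:p) {
  X_ar[, j] <- rho * X_ar[, j - 1] + sqrt(1 - rho^2) * E[, j]
}

# Empirical covariances of the first feature with its neighbours.
round(cov(X_ma)[1, 1:4], 2)
round(cov(X_ar)[1, 1:2], 2)
```

Building the features directly from independent draws avoids forming and factorising a p by p covariance matrix, which matters when p is in the thousands.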
Then, generate the response variable Y according to its response type. For the Gaussian model, \(Y = x^T \cdot \beta + \epsilon\) where \(\epsilon \sim N(0,1)\). For the binary model, let \(\pi = P(Y = 1|x)\) and sample Y from Bernoulli(\(\pi\)), where \(logit(\pi) = x^T \cdot \beta\). Finally, for the Poisson model, Y is generated from a Poisson distribution with mean \(\pi = exp(x^T \cdot \beta)\). For more details, see the reference below.
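The three response models can be sketched with base R random number generators. The feature matrix and coefficients here are small stand-ins chosen for illustration, and the names `eta`, `y_gaussian`, `y_binomial`, and `y_poisson` are hypothetical.

```r
set.seed(4)
n <- 200
p <- 10
X <- matrix(rnorm(n * p), nrow = n)    # stand-in feature matrix
beta <- c(0.8, -0.5, rep(0, p - 2))    # two causal features, the rest zero

# Linear predictor x^T beta for every observation.
eta <- drop(X %*% beta)

# Gaussian: linear predictor plus N(0, 1) noise.
y_gaussian <- eta + rnorm(n)

# Binomial: Bernoulli draws with logit(pi) = eta; plogis is the inverse logit.
y_binomial <- rbinom(n, size = 1, prob = plogis(eta))

# Poisson: counts with mean exp(eta).
y_poisson <- rpois(n, lambda = exp(eta))
```

All three responses share the same linear predictor; only the distribution around it changes.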
Chen Xu and Jiahua Chen (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening. Journal of the American Statistical Association, 109:507, pages 1257-1269.
# NOT RUN {
# Simulating data with a binomial response and an independent correlation structure.
Data <- Gen_Data(family = "binomial", correlation = "ID")
cor(Data$X[,1:5])
print(Data)
# }