Generates synthetic datasets from generalized linear models (GLMs) with a user-specified correlation structure among the features. Both numerical and categorical features are supported, and the number of features may exceed the sample size.
Gen_Data(
n = 200,
p = 5000,
sigma = 1,
num_ctgidx = NULL,
pos_ctgidx = NULL,
num_truecoef = NULL,
pos_truecoef = NULL,
level_ctgidx = NULL,
effect_truecoef = NULL,
correlation = c("ID", "AR", "MA", "CS"),
rho = 0.5,
family = c("gaussian", "binomial", "poisson")
)
n: Sample size; the number of rows of the feature matrix to be generated.
p: Number of columns of the feature matrix to be generated.
sigma: Parameter for the noise level.
num_ctgidx: The number of features that are categorical. Set to FALSE for only numerical features. Default is FALSE.
pos_ctgidx: Vector of indices denoting which columns are categorical.
num_truecoef: The number of features (columns) that affect the response. Default is 5.
pos_truecoef: Vector of indices denoting which features (columns) affect the response variable.
level_ctgidx: A vector indicating the number of levels of the categorical features in 'pos_ctgidx'. Default is 2.
effect_truecoef: Effects for the relevant features in 'pos_truecoef'.
correlation: Correlation structure among features: 'ID' for independent, 'MA' for moving average, 'CS' for compound symmetry, 'AR' for auto-regressive. Default is "ID". For more information, see details.
rho: Parameter controlling the correlation strength. See details.
family: Model used to generate the response from the synthetic features: 'gaussian' for normally distributed data, 'poisson' for non-negative counts, 'binomial' for binary (0-1) data.
Returns a "sdata" object with:
Response variable vector of length \(n\).
Feature matrix or data frame (a matrix if num_ctgidx = FALSE, and a data frame otherwise).
Vector of column indices of X for the features that affect the response variable (relevant features).
Vector of effects for the relevant features.
Simulated data \((y_i , x_i)\) for \(i = 1, \ldots, n\) are generated as follows:
First, we generate a \(p \times 1\) model coefficient vector \(\beta\) with all entries being zero, except at the positions specified in pos_truecoef, where effect_truecoef is used. When pos_truecoef is not specified, we randomly choose num_truecoef positions from the coefficient vector. When effect_truecoef is not specified, we randomly set the strength of the true model coefficients following Chen's setting:
$$\left(4 \cdot \frac{\log{N}}{\sqrt{N}} + U(0,1)\right) \cdot Z$$
where \(U(0,1)\) is the uniform distribution on \((0,1)\), and \(P(Z=1)=1/2\), \(P(Z=-1)=1/2\).
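The coefficient step can be sketched in base R as follows. This is an illustration only, not the package's source code. It assumes that \(N\) in the formula is the sample size n, and that a separate uniform draw and a separate sign are used for each relevant feature. The variable names simply mirror the arguments.

```r
# Illustrative sketch only (not the package implementation).
# Assumption: N in the formula is the sample size n.
set.seed(1)
n <- 200
p <- 5000
num_truecoef <- 5

# Randomly choose which positions carry a non-zero coefficient.
pos_truecoef <- sort(sample.int(p, num_truecoef))

# Z takes the values +1 and -1 with probability 1/2 each.
z <- sample(c(-1, 1), num_truecoef, replace = TRUE)

# (4 * log(N) / sqrt(N) + U(0, 1)) * Z, one draw per relevant feature.
effect_truecoef <- (4 * log(n) / sqrt(n) + runif(num_truecoef)) * z

# Sparse p x 1 coefficient vector: zero everywhere except the chosen positions.
beta <- numeric(p)
beta[pos_truecoef] <- effect_truecoef
```

With n = 200, each non-zero coefficient has absolute value between 4 * log(200) / sqrt(200) and that quantity plus one.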
Next, we generate an \(n \times p\) feature matrix X according to the choice of correlation, as follows.
Independent (ID): all features are independently generated from \(N( 0, 1)\).
Moving average (MA): candidate features \(x_1,..., x_p\) are joint normal, marginally \(N( 0, 1)\), with cov\((x_j, x_{j-1}) = \rho\), cov\((x_j, x_{j-2}) = \frac{\rho}{2}\) and cov\((x_j, x_h) = 0\) for \(|j-h| \geq 3\).
Compound symmetry (CS): candidate features \(x_1,..., x_p\) are joint normal, marginally \(N( 0, 1)\), with cov\((x_j, x_h) = \rho\) if \(j\) and \(h\) are both in the set of important features, and cov\((x_j, x_h) = \frac{\rho}{2}\) when only one of \(j\) or \(h\) is in the set of important features.
Auto-regressive (AR): candidate features \(x_1,..., x_p\) are joint normal, marginally \(N( 0, 1)\), with cov\((x_j, x_h) = \rho^{|j-h|}\) for all \(j\) and \(h\).
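The ID, MA and AR structures above can be written as explicit covariance matrices. The sketch below uses a hypothetical helper, `make_sigma`, which is not part of the package, and draws the features through a Cholesky factor. The package may sample differently. CS is omitted because it depends on which features are important.

```r
# Hypothetical helper (not part of the package): p x p covariance matrix
# for the ID, AR and MA structures described above.
make_sigma <- function(p, correlation = c("ID", "AR", "MA"), rho = 0.5) {
  correlation <- match.arg(correlation)
  lag <- abs(outer(seq_len(p), seq_len(p), "-"))  # matrix of |j - h|
  switch(correlation,
    ID = diag(p),                                  # independent features
    AR = rho^lag,                                  # rho^|j - h|
    MA = (lag == 0) + rho * (lag == 1) + (rho / 2) * (lag == 2)
  )
}

set.seed(1)
n <- 200
p <- 10
Sigma <- make_sigma(p, "MA", rho = 0.5)

# Rows of X follow N(0, Sigma): chol() returns R with t(R) %*% R = Sigma,
# so multiplying i.i.d. standard normals by R gives the target covariance.
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

# Sample correlations of the first four features, for comparison with Sigma.
round(cor(X[, 1:4]), 2)
```

For the MA structure as written here, the matrix stays positive definite only for moderate values of rho. The default rho = 0.5 is within that range.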
Then, we generate the response variable Y according to its response type. For the Gaussian model, \(Y = x^T \cdot \beta + \epsilon\), where \(\epsilon \sim N( 0, 1)\). For the binary model, let \(\pi = P(Y = 1|x)\); y is sampled from Bernoulli(\(\pi\)), where \(logit(\pi) = x^T \cdot \beta\). Finally, for the Poisson model, Y is generated from a Poisson distribution with mean \(\exp(x^T \cdot \beta)\), i.e. a log link. For more details, see the reference below.
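The three response models can be sketched in base R as below. This is an illustration, not the package's code. The feature matrix and the coefficient vector are small made-up examples, and the Gaussian noise uses unit variance as in the description above.

```r
# Illustrative sketch only: responses from a shared linear predictor.
set.seed(1)
n <- 200
p <- 10
X <- matrix(rnorm(n * p), n, p)          # made-up independent features
beta <- c(0.5, -0.5, rep(0, p - 2))      # made-up sparse coefficients
eta <- drop(X %*% beta)                  # linear predictor x^T beta

# Gaussian: Y = x^T beta + epsilon, with epsilon ~ N(0, 1).
y_gaussian <- eta + rnorm(n)

# Binomial: Y ~ Bernoulli(pi), with logit(pi) = x^T beta.
y_binomial <- rbinom(n, size = 1, prob = plogis(eta))

# Poisson: Y ~ Poisson(exp(x^T beta)).
y_poisson <- rpois(n, lambda = exp(eta))
```

`plogis` is the inverse of the logit function, so `plogis(eta)` gives the success probability \(\pi\) for each observation.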
Chen Xu and Jiahua Chen (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening. Journal of the American Statistical Association, 109(507), 1257-1269.
# NOT RUN {
# Simulate data with a binomial response and independent structure.
Data <- Gen_Data(family = "binomial", correlation = "ID")
# Sample correlations among the first five features (expected to be near zero).
cor(Data$X[, 1:5])
print(Data)
# }