Generate simulated data under a generalized linear model or the Cox proportional hazards model.
generate.data(
n,
p,
support.size = NULL,
rho = 0,
family = c("gaussian", "binomial", "poisson", "cox", "mgaussian", "multinomial",
"gamma", "ordinal"),
beta = NULL,
cortype = 1,
snr = 10,
sigma = NULL,
weibull.shape = 1,
uniform.max = 1,
y.dim = 3,
class.num = 3,
seed = 1
)
A list object comprising:
x: Design matrix of predictors.
y: Response variable.
beta: The coefficients used in the underlying regression model.
n: The number of observations.
p: The number of predictors of interest.
support.size: The number of nonzero coefficients in the underlying regression model. Can be omitted if beta is supplied.
rho: A parameter used to characterize the pairwise correlation in predictors. Default is 0.
family: The distribution of the simulated response: "gaussian" for univariate quantitative response, "binomial" for binary classification response, "poisson" for count response, "cox" for right-censored response, "mgaussian" for multivariate quantitative response, "multinomial" for multi-class classification response, "gamma" for gamma-distributed response, and "ordinal" for ordinal response.
beta: The coefficient values in the underlying regression model. If it is supplied, support.size is ignored.
cortype: The correlation structure.
cortype = 1 denotes the independence structure, where the covariance matrix has \((i,j)\) entry equal to \(I(i = j)\).
cortype = 2 denotes the exponential structure, where the covariance matrix has \((i,j)\) entry equal to \(rho^{|i-j|}\).
cortype = 3 denotes the constant structure, where the off-diagonal entries of the covariance matrix are \(rho\) and the diagonal entries are 1.
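The three structures above can be written out in a few lines of base R. This is a minimal sketch, not the package's internal code: `make_sigma` is a hypothetical helper introduced only for illustration, and drawing the predictors through a Cholesky factor is an assumption about one reasonable way to obtain rows with the stated covariance.

```r
# Hypothetical helper (not part of the package): build the p x p covariance
# matrix that corresponds to each cortype value described above.
make_sigma <- function(p, rho = 0, cortype = 1) {
  idx <- seq_len(p)
  switch(as.character(cortype),
    "1" = diag(p),                          # independence: identity matrix
    "2" = rho^abs(outer(idx, idx, "-")),    # exponential: rho^|i - j|
    "3" = {                                 # constant: rho off the diagonal
      s <- matrix(rho, p, p)
      diag(s) <- 1
      s
    },
    stop("cortype must be 1, 2, or 3")
  )
}

sigma_exp <- make_sigma(4, rho = 0.5, cortype = 2)
sigma_const <- make_sigma(4, rho = 0.5, cortype = 3)

# Draw n predictor rows with covariance sigma_exp: if R = chol(S), then
# t(R) %*% R = S, so Z %*% R has covariance S when Z has identity covariance.
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 4), n, 4) %*% chol(sigma_exp)
```

Printing `sigma_exp` and `sigma_const` side by side shows the difference: the exponential structure decays with the distance between column indices, while the constant structure does not.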
snr: A numerical value controlling the signal-to-noise ratio (SNR). The SNR is defined as the variance of \(x\beta\) divided by the variance of a Gaussian noise: \(\frac{Var(x\beta)}{\sigma^2}\). The Gaussian noise \(\epsilon\) has mean 0 and variance \(\sigma^2\). The noise is added to the linear predictor \(\eta = x\beta\). Default is snr = 10. Note that this argument's effect is overridden if sigma is supplied with a non-null value.
sigma: The variance of the Gaussian noise. The default sigma = NULL implies it is determined by snr.
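The relationship between snr and the noise variance can be checked numerically. The sketch below uses base R only and assumes the sample variance of \(x\beta\) stands in for \(Var(x\beta)\); whether the package uses exactly this estimate is an assumption, and the coefficient values and variable names are illustrative.

```r
set.seed(1)
n <- 2000
p <- 10
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 3), rep(0, p - 3))   # illustrative sparse coefficients
eta <- drop(x %*% beta)               # linear predictor x beta

snr <- 10
noise_var <- var(eta) / snr           # sigma^2 implied by SNR = Var(x beta) / sigma^2
epsilon <- rnorm(n, mean = 0, sd = sqrt(noise_var))
y <- eta + epsilon                    # noise added to the linear predictor

# The empirical ratio should be close to the requested SNR.
empirical_snr <- var(eta) / var(epsilon)
```

Raising snr shrinks `noise_var` proportionally, so the response tracks the linear predictor more closely.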
weibull.shape: The shape parameter of the Weibull distribution. It is used only when family = "cox". Default: weibull.shape = 1.
uniform.max: A parameter controlling the censoring rate. A large value implies a small censoring rate; a small value implies a large censoring rate. It is used only when family = "cox". Default: uniform.max = 1.
y.dim: The dimension of the response. It is used only when family = "mgaussian". Default: y.dim = 3.
class.num: The number of classes. It is used only when family = "multinomial". Default: class.num = 3.
seed: Random seed. Default: seed = 1.
Jin Zhu
For family = "gaussian", the data model is
$$Y = X \beta + \epsilon.$$
The nonzero entries of the underlying regression coefficient \(\beta\) are drawn from the uniform distribution on [m, 100m], where \(m = 5 \sqrt{2\log(p)/n}.\)
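As a rough illustration of the Gaussian model and the coefficient draw described above, the following base R sketch builds a sparse \(\beta\) and a response by hand. It is not the package's implementation: the random choice of the support, the independent standard normal predictors, and the noise standard deviation of 1 are all assumptions made for the example.

```r
set.seed(1)
n <- 200
p <- 20
support.size <- 5

# Lower bound of the coefficient range, m = 5 * sqrt(2 * log(p) / n).
m <- 5 * sqrt(2 * log(p) / n)

# Sparse coefficient vector: support.size nonzero entries drawn from U[m, 100m].
beta <- numeric(p)
support <- sample(p, support.size)    # assumed: support chosen uniformly at random
beta[support] <- runif(support.size, min = m, max = 100 * m)

# Y = X beta + epsilon with independent predictors.
x <- matrix(rnorm(n * p), n, p)
y <- drop(x %*% beta) + rnorm(n, sd = 1)   # sd = 1 is arbitrary here
```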
For family = "binomial", the data model is
$$Prob(Y = 1) = \exp(X \beta + \epsilon)/(1 + \exp(X \beta + \epsilon)).$$
The nonzero entries of the underlying regression coefficient \(\beta\) are drawn from the uniform distribution on [2m, 10m], where \(m = 5 \sqrt{2\log(p)/n}.\)
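The binomial model above amounts to passing the noisy linear predictor through the logistic function and drawing a Bernoulli response. A minimal base R sketch follows; the predictor distribution, the noise standard deviation, and the variable names are assumptions for illustration, not the package's code.

```r
set.seed(1)
n <- 200
p <- 20
support.size <- 5
m <- 5 * sqrt(2 * log(p) / n)

# Nonzero coefficients drawn from U[2m, 10m], as stated for this family.
beta <- numeric(p)
beta[sample(p, support.size)] <- runif(support.size, min = 2 * m, max = 10 * m)

x <- matrix(rnorm(n * p), n, p)
eta <- drop(x %*% beta) + rnorm(n, sd = 1)   # x beta + epsilon; sd = 1 is arbitrary

# plogis(eta) computes exp(eta) / (1 + exp(eta)) in a numerically stable way.
prob <- plogis(eta)
y <- rbinom(n, size = 1, prob = prob)
```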
For family = "poisson", the data is modeled to have a Poisson distribution:
$$Y \sim Poisson(\exp(X \beta + \epsilon)).$$
The nonzero entries of the underlying regression coefficient \(\beta\) are drawn from the uniform distribution on [2m, 10m], where \(m = \sqrt{2\log(p)/n}/3.\)
For family = "gamma", the data is modeled to have a gamma distribution:
$$Y = Gamma(X \beta + \epsilon + 10, shape),$$
where \(shape\) is the shape parameter of the gamma distribution.
The nonzero entries of the underlying regression coefficient \(\beta\) are drawn from the uniform distribution on [2m, 100m], where \(m = \sqrt{2\log(p)/n}.\)
For family = "ordinal", the data is modeled to have an ordinal distribution.
For family = "cox", the model for failure time \(T\) is
$$T = (-\log(U) / \exp(X \beta))^{1/weibull.shape},$$
where \(U\) is a uniform random variable with range [0, 1].
The censoring time \(C\) is generated from the uniform distribution on \([0, uniform.max]\). The censoring status is then defined as \(\delta = I(T \le C)\) and the observed time as \(R = \min\{T, C\}\).
The nonzero entries of the underlying regression coefficient \(\beta\) are drawn from the uniform distribution on [2m, 10m], where \(m = 5 \sqrt{2\log(p)/n}\).
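The survival setup can be simulated directly in base R to see how uniform.max drives the censoring rate. This is a sketch under stated assumptions: failure times use the common inverse-transform form \(-\log(U)/\exp(x\beta)\) raised to the power 1/weibull.shape, the coefficient values are illustrative, and the variable names are not taken from the package.

```r
set.seed(1)
n <- 300
p <- 5
weibull.shape <- 1
uniform.max <- 1

x <- matrix(rnorm(n * p), n, p)
beta <- c(1, -1, 0, 0, 0)             # illustrative coefficients

# Failure time T via inverse transform of a Weibull baseline hazard.
u <- runif(n)
failure_time <- (-log(u) / exp(drop(x %*% beta)))^(1 / weibull.shape)

# Censoring time C ~ U[0, uniform.max].
censor_time <- runif(n, min = 0, max = uniform.max)

# delta = I(T <= C): 1 means the event was observed, 0 means censored.
status <- as.integer(failure_time <= censor_time)
observed_time <- pmin(failure_time, censor_time)   # R = min(T, C)

censoring_rate <- mean(status == 0)
```

Re-running the sketch with a larger `uniform.max` pushes the censoring times later, so fewer observations are censored, which matches the description of that argument.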
For family = "mgaussian", the data model is
$$Y = X \beta + E.$$
The nonzero values of the regression matrix \(\beta\) are sampled from the uniform distribution on [m, 100m], where \(m = 5 \sqrt{2\log(p)/n}.\)
For family = "multinomial", the data model is
$$Prob(Y = 1) = \exp(X \beta + E)/(1 + \exp(X \beta + E)).$$
The nonzero values of the regression coefficient \(\beta\) are sampled from the uniform distribution on [2m, 10m], where \(m = 5 \sqrt{2\log(p)/n}.\)
In the above models, \(\epsilon \sim N(0, \sigma^2)\) and \(E \sim MVN(0, \sigma^2 \times I_{q \times q})\), where \(\sigma^2\) is determined by snr and \(q\) is y.dim.
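# Generate simulated data
# (assumes generate.data() comes from the abess package)
library(abess)
n <- 200
p <- 20
support.size <- 5
dataset <- generate.data(n, p, support.size)
# Inspect the structure of the returned list
str(dataset)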