generate.data: Generate simulated data

Description

Generate simulated data under a generalized linear model or the Cox proportional hazards model.

Usage

generate.data(
  n,
  p,
  support.size = NULL,
  rho = 0,
  family = c("gaussian", "binomial", "poisson", "cox", "mgaussian", "multinomial",
    "gamma", "ordinal"),
  beta = NULL,
  cortype = 1,
  snr = 10,
  sigma = NULL,
  weibull.shape = 1,
  uniform.max = 1,
  y.dim = 3,
  class.num = 3,
  seed = 1
)

Value

A list object comprising:

x: Design matrix of predictors.
y: Response variable.
beta: The coefficients used in the underlying regression model.

Arguments

n: The number of observations.
p: The number of predictors of interest.
support.size: The number of nonzero coefficients in the underlying regression model. Can be omitted if beta is supplied.
rho: A parameter used to characterize the pairwise correlation in predictors. Default is 0.
family: The distribution of the simulated response. "gaussian" for univariate quantitative response, "binomial" for binary classification response, "poisson" for count response, "cox" for right-censored response, "mgaussian" for multivariate quantitative response, "multinomial" for multi-class classification response, "gamma" for positive continuous response, "ordinal" for ordinal response.
beta: The coefficient values in the underlying regression model. If it is supplied, support.size can be omitted.
cortype: The correlation structure. cortype = 1 denotes the independence structure, where the covariance matrix has $(i,j)$ entry equal to $I(i = j)$. cortype = 2 denotes the exponential structure, where the covariance matrix has $(i,j)$ entry equal to $rho^{|i-j|}$. cortype = 3 denotes the constant structure, where the off-diagonal entries of the covariance matrix are $rho$ and the diagonal entries are 1.
snr: A numerical value controlling the signal-to-noise ratio (SNR). The SNR is defined as the variance of $x\beta$ divided by the variance of the Gaussian noise: $\frac{Var(x\beta)}{\sigma^2}$. The Gaussian noise $\epsilon$ has mean 0 and variance $\sigma^2$, and it is added to the linear predictor $\eta = x\beta$. Default is snr = 10. Note that this argument has no effect if sigma is supplied with a non-null value.
sigma: The variance of the gaussian noise. Default sigma = NULL implies it is determined by snr.
weibull.shape: The shape parameter of the Weibull distribution. It works only when family = "cox". Default: weibull.shape = 1.
uniform.max: A parameter controlling the censoring rate. A larger value implies a lower censoring rate; a smaller value implies a higher censoring rate. It works only when family = "cox". Default is uniform.max = 1.
y.dim: The dimension of the response. It works only when family = "mgaussian". Default: y.dim = 3.
class.num: The number of classes. It works only when family = "multinomial". Default: class.num = 3.
seed: Random seed. Default: seed = 1.
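
As an illustration of the cortype and snr arguments, the following base R sketch builds the three covariance structures and derives the noise variance implied by a given snr. The helper make_sigma() and the coefficient values are hypothetical and are not part of the package; the function's internal implementation may differ.

```r
# Hypothetical helper (not part of the package): covariance matrix of the
# predictors for each cortype described above.
make_sigma <- function(p, rho = 0, cortype = 1) {
  idx <- seq_len(p)
  switch(as.character(cortype),
    "1" = diag(p),                        # independence: entry (i, j) is I(i == j)
    "2" = rho^abs(outer(idx, idx, "-")),  # exponential: entry (i, j) is rho^|i - j|
    "3" = {                               # constant: rho off the diagonal, 1 on it
      S <- matrix(rho, p, p)
      diag(S) <- 1
      S
    },
    stop("cortype must be 1, 2, or 3")
  )
}

Sigma2 <- make_sigma(4, rho = 0.5, cortype = 2)
Sigma3 <- make_sigma(4, rho = 0.5, cortype = 3)

# Noise variance implied by snr: sigma^2 = Var(x beta) / snr, where
# Var(x beta) = beta' Sigma beta for predictors with covariance Sigma.
beta <- c(1, 0, -2, 0)  # illustrative coefficients
snr <- 10
signal_var <- drop(t(beta) %*% Sigma2 %*% beta)
sigma2 <- signal_var / snr
```

A larger snr gives a smaller noise variance for the same coefficients, so the simulated response follows the linear predictor more closely.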

Author

Jin Zhu

Details

For family = "gaussian", the data model is $$Y = X \beta + \epsilon.$$ The nonzero entries of the underlying regression coefficient $\beta$ are drawn from the uniform distribution on $[m, 100m]$, where $m = 5 \sqrt{2\log(p)/n}$.
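
The Gaussian model can be sketched in base R as follows. This is a minimal illustration under stated assumptions (independent standard normal predictors, a support chosen uniformly at random, snr = 10), not the package's own code.

```r
set.seed(1)
n <- 200
p <- 20
support.size <- 5
snr <- 10

x <- matrix(rnorm(n * p), n, p)       # independent predictors (as with cortype = 1)
m <- 5 * sqrt(2 * log(p) / n)

beta <- rep(0, p)
nonzero <- sample(p, support.size)    # assumption: support chosen at random
beta[nonzero] <- runif(support.size, m, 100 * m)

# With independent unit-variance predictors, Var(x beta) = sum(beta^2).
sigma2 <- sum(beta^2) / snr
y <- drop(x %*% beta) + rnorm(n, sd = sqrt(sigma2))
```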

For family = "binomial", the data model is $$Prob(Y = 1) = \exp(X \beta + \epsilon)/(1 + \exp(X \beta + \epsilon)).$$ The nonzero entries of the underlying regression coefficient $\beta$ are drawn from the uniform distribution on $[2m, 10m]$, where $m = 5 \sqrt{2\log(p)/n}$.
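
A base R sketch of the binomial model, assuming independent standard normal predictors and illustrative coefficient values (both are assumptions made for this example):

```r
set.seed(1)
n <- 200
p <- 20
snr <- 10

x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 3), rep(0, p - 3))   # illustrative coefficients, not drawn from [2m, 10m]

sigma2 <- sum(beta^2) / snr           # noise variance implied by snr
eta <- drop(x %*% beta) + rnorm(n, sd = sqrt(sigma2))

prob <- plogis(eta)                   # exp(eta) / (1 + exp(eta))
y <- rbinom(n, size = 1, prob = prob) # binary response
```

Because the noise enters inside the logistic link, a smaller snr pushes the success probabilities away from the values implied by $X \beta$ alone.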

For family = "poisson", the response is modeled as a Poisson count whose mean is the exponentiated linear predictor: $$Y \sim Poisson(\exp(X \beta + \epsilon)).$$ The nonzero entries of the underlying regression coefficient $\beta$ are drawn from the uniform distribution on $[2m, 10m]$, where $m = \sqrt{2\log(p)/n}/3$.

For family = "gamma", the data is modeled to have a gamma distribution: $$Y = Gamma(X \beta + \epsilon + 10, shape),$$ where $shape$ is the shape parameter of the gamma distribution. The nonzero entries of the underlying regression coefficient $\beta$ are drawn from the uniform distribution on $[2m, 100m]$, where $m = \sqrt{2\log(p)/n}$.

For family = "ordinal", the data is modeled to have an ordinal distribution.

For family = "cox", the model for failure time $T$ is $$T = (-\log(U) / \exp(X \beta))^{1/weibull.shape},$$ where $U$ is a uniform random variable on [0, 1]. The censoring time $C$ is generated from the uniform distribution on $[0, uniform.max]$. The censoring status is then defined as $\delta = I(T \le C)$ and the observed time as $R = \min\{T, C\}$. The nonzero entries of the underlying regression coefficient $\beta$ are drawn from the uniform distribution on $[2m, 10m]$, where $m = 5 \sqrt{2\log(p)/n}$.
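
The survival mechanism can be sketched in base R. The example assumes failure times come from inverting a Weibull proportional hazards model, that is $T = (-\log(U)/\exp(X \beta))^{1/weibull.shape}$, and it uses illustrative coefficients. It is a sketch, not the package's implementation.

```r
set.seed(1)
n <- 200
p <- 5
weibull.shape <- 1
uniform.max <- 1

x <- matrix(rnorm(n * p), n, p)
beta <- c(1, -1, 0, 0, 0.5)                  # illustrative coefficients

u <- runif(n)
failure_time <- (-log(u) / exp(drop(x %*% beta)))^(1 / weibull.shape)
censor_time <- runif(n, 0, uniform.max)      # censoring time C

status <- as.integer(failure_time <= censor_time)  # delta = I(T <= C)
time <- pmin(failure_time, censor_time)            # R = min(T, C)

censor_rate <- mean(status == 0)             # share of censored observations
```

Raising uniform.max makes censoring times larger on average, so fewer failure times are cut off and censor_rate decreases.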

For family = "mgaussian", the data model is $$Y = X \beta + E.$$ The nonzero values of the regression matrix $\beta$ are sampled from the uniform distribution on $[m, 100m]$, where $m = 5 \sqrt{2\log(p)/n}$.

For family = "multinomial", the data model is $$Prob(Y = 1) = \exp(X \beta + E)/(1 + \exp(X \beta + E)).$$ The nonzero values of the regression coefficient $\beta$ are drawn from the uniform distribution on $[2m, 10m]$, where $m = 5 \sqrt{2\log(p)/n}$.

In the above models, $\epsilon \sim N(0, \sigma^2 )$ and $E \sim MVN(0, \sigma^2 \times I_{q \times q})$, where $\sigma^2$ is determined by the snr and q is y.dim.

Examples

# Generate simulated data
n <- 200
p <- 20
support.size <- 5
dataset <- generate.data(n, p, support.size)
str(dataset)
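
A few further calls, using only the arguments listed under Usage. The sketch assumes that generate.data() is provided by the abess package (the page itself does not name the package), so the calls are guarded and run only if that package is installed. The coefficient vector beta_true is an illustrative choice.

```r
# Illustrative coefficient vector: two nonzero effects among 20 predictors.
beta_true <- c(2, -2, rep(0, 18))

# Assumption: generate.data() comes from the abess package.
if (requireNamespace("abess", quietly = TRUE)) {
  library(abess)

  # Binary response with exponentially decaying correlation among predictors
  dat_bin <- generate.data(n = 200, p = 20, support.size = 5,
                           family = "binomial", rho = 0.5, cortype = 2)
  print(table(dat_bin$y))

  # Supplying beta directly, so support.size is omitted
  dat_fix <- generate.data(n = 200, p = 20, beta = beta_true, seed = 123)
  print(which(dat_fix$beta != 0))
}
```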
