simulateSNPglm: Simulation of SNP data

Description

Simulates SNP data. Interactions of some of the simulated SNPs are then used to specify either a binary or a quantitative response by a logistic or linear regression model, respectively.

Usage

simulateSNPglm(n.obs = 1000, n.snp = 50, list.ia = NULL, list.snp = NULL, 
   beta0 = -0.5, beta = 1.5, maf = 0.25, sample.y = TRUE, p.cutoff = 0.5, 
   err.fun = NULL, rand = NA, ...)

Arguments

n.obs

number of observations that should be generated.

n.snp

number of SNPs that should be generated.

list.ia

a list consisting of numeric vectors (or values) specifying the genotypes of the SNPs that should be explanatory for the response. Each of these vectors must be composed of some of the numbers -3, -2, -1, 1, 2, 3, where 1 denotes the hom

list.snp

a list consisting of numeric vectors specifying the SNPs that compose the interactions used in the regression model. Each of these vectors must have the same length as the corresponding vector in list.ia, and must consist o

beta0

a numeric value specifying the intercept of the regression model.

beta

a non-negative numeric value or vector of the same length as list.ia (i.e. one numeric value for each interaction) specifying the parameters in the regression model.

maf

either an integer, or a vector of length 2 or n.snp specifying the minor allele frequency. If an integer, all the SNPs will have the same minor allele frequency. If a vector of length n.snp, each SNP will have the minor

sample.y

should the values of the response in the logistic regression model be randomly drawn using the probabilities of the respective observations for being a case? If FALSE, then the response value of an observation is 1 if its probabilit

p.cutoff

a probability, i.e. a numeric value between 0 and 1, naming the cutoff for an observation to be called a case if sample.y = FALSE. For details, see sample.y. Ignored if sample.y = TRUE or err.fun

err.fun

a function for generating the error that is added to the linear model to determine the value of the (quantitative) response. If NULL, a logistic regression model is fitted. If specified, a linear model is fitted. Therefore, this arg

rand

a numeric value for setting the random number generator in a reproducible state.

...

further arguments of the function specified by err.fun.

Value

An object of class simSNPglm consisting of
xa matrix with n.obs rows and n.snp columns containing the simulated SNP values.
ya vector of length n.obs composed of the values of the response.
beta0the value of the intercept.
betathe vector of parameters.
iaa character vector naming the explanatory interactions.
mafa vector of length n.snp composed of the minor allele frequencies.
proba vector of length n.obs consisting of the values of Prob($Y = 1$) (if err.fun = NULL).
erra vector of length n.obs composed of the values of the error (if err.fun is specified).
p.cutoffthe value of p.cutoff (if err.fun = NULL and sample.y = FALSE).
err.calla character string naming the call of the error function (if err.fun is specified).

Details

simulateSNPglm first simulates a matrix consisting of n.obs observations and n.snp SNPs, where the minor allele frequencies of these SNPs are given by maf. Note that all SNPs are currently simulated independently of each other such that they are unlinked. Afterwards, the response is determined by a regression model using the specifications of list.ia, list.snp, beta0 and beta. Depending on whether err.fun is specified or not, a linear or a logistic regression model is used, respectively, i.e. the response $Y$ is continuous or binary. By default, a logistic regression model logit(Prob($Y = 1$)) = beta0 + beta[1] * $L_1$ + beta[2] * $L_2$ + ...is fitted, since err.fun = NULL. If both list.ia and list.snp are NULL, then interactions similar to the one considered, e.g., in Nunkesser et al. (2007) or Schwender et al. (2007) are used, i.e. $L_1$ = (SNP6 != 1) & (SNP7 == 1) and $L_2$ = (SNP3 == 1) & (SNP9 == 1) & (SNP10 == 1), by setting list.ia = list(c(-1, 1), c(1, 1, 1)) and list.snp = list(c(6, 7), c(3, 9, 10)). Using the above model Prob($Y = 1$) is computed for each observation, and its value of the response is determined either by a random draw from a Bernoulli distribution using this probability (if sample.y = TRUE), or by evaluating if Prob($Y = 1$) $>$ p.cutoff (if sample.y = FALSE). If err.fun is specified, then the linear model $Y$ = beta0 + beta[1] * $L_1$ + beta[2] * $L_2$ + ...+ error is used to determine the values of the response $Y$, where the values for error are given by the output of a call of err.fun.

References

Nunkesser, R., Bernholt, T., Schwender, H., Ickstadt, K. and Wegener, I. (2007). Detecting High-Order Interactions of Single Nucleotide Polymorphisms Using Genetic Programming. Bioinformatics, 23, 3280-3288. Schwender, H. (2007). Statistical Analysis of Genotype and Gene Expression Data. Dissertation, Department of Statistics, University of Dortmund.

Examples

Run this code

# The simulated data set described in Details.

sim1 <- simulateSNPglm()
sim1

# A bit more information: Table of probabilities of being a case
# vs. numbers of cases and controls.

summary(sim1)

# Calling an observation a case if its probability of being
# a case is larger than 0.5 (the default for p.cutoff).

sim2 <- simulateSNPglm(sample.y = FALSE)
summary(sim2)

# If ((SNP4 != 2) & (SNP3 == 1)), (SNP5 ==3) and
# ((SNP12 !=1) & (SNP9 == 3)) should be the three interactions
# (or variables) that are explanatory for the response,
# list.ia and list.snp are specified as follows.

list.ia <- list(c(-2, 1), 3, c(-1,3))
list.snp <- list(c(4, 3), 5, c(12,9))

# The binary response and the data set consisting of 
# 600 observations and 25 SNPs, where the minor allele
# frequency of each SNP is randomly drawn from a
# uniform distribution with minimum 0.1 and maximum 0.4,
# is then generated by

sim3 <- simulateSNPglm(n.obs = 600, n.snp = 25,
  list.ia = list.ia, list.snp = list.snp, maf = c(0.1, 0.4))
sim3

summary(sim3)
  
# If the response should be quantitative, err.fun has
# to be specified. To use a normal distribution with mean 0
# (default in rnorm) and a standard deviation of 2 
# as the distribution of the error, call

simulateSNPglm(err.fun = rnorm, sd = 2)

Run the code above in your browser using DataLab