simulateSNPcatResponse: Simulation of SNP Data with Categorical Response

Description

Simulates SNP data. Interactions of some of the simulated SNPs are then used to specify a categorical response by level-wise or multinomial logistic regression.

Usage

simulateSNPcatResponse(n.obs = 1000, n.snp = 50, list.ia = NULL,
   list.snp = NULL, withRef = FALSE, beta0 = -0.5, beta = 1.5, 
   maf = 0.25, sample.y = TRUE, rand = NA)
  
# S3 method for simSNPcatResponse
print(x, justify = c("left", "right"), spaces = 2, ...)

Arguments

n.obs

number of observations that should be generated.

n.snp

number of SNPs that should be generated.

list.ia

a list consisting of \(n_{cat}\) objects, where \(n_{cat}\) is the number of levels the response should have. If one interaction of SNPs should be explanatory for a specific level of the response, then the corresponding object in list.ia must be a numeric vector specifying the genotypes of the interacting SNPs by the integers -3, -2, -1, 1, 2, or 3. Here, 1 codes for the homozygous reference genotype, 2 for the heterozygous genotype, and 3 for the homozygous variant genotype; a minus sign in front of one of these numbers means that the corresponding SNP should not be of this genotype. If more than one interaction should be explanatory for a specific level, then the corresponding object of list.ia must be a list containing one such numeric vector for each of the interactions.

If, e.g., one of the vectors is given by c(1, -1, -3) and the corresponding vector in list.snp is c(5, 7, 8), then the corresponding interaction explanatory for a level of the response is given by

(SNP5 == 1) & (SNP7 != 1) & (SNP8 != 3).

For more details, see Details. Must be specified if list.snp is specified. If both list.ia and list.snp are NULL, then the interactions shown in the Details section are used.
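The genotype coding can be illustrated with a few lines of base R. This is only a sketch of how one signed vector and its SNP indices translate into a logical condition; evalInteraction and the toy matrix geno are hypothetical names introduced for illustration and are not part of the package.

```r
# Hypothetical helper (not part of the package): evaluates one interaction
# on a genotype matrix whose entries are coded 1, 2, or 3.
evalInteraction <- function(geno, ia, snp) {
  stopifnot(length(ia) == length(snp), all(abs(ia) %in% 1:3))
  keep <- rep(TRUE, nrow(geno))
  for (k in seq_along(ia)) {
    g <- geno[, snp[k]]
    # Positive code: genotype must equal it; negative code: must differ from it
    keep <- keep & (if (ia[k] > 0) g == ia[k] else g != -ia[k])
  }
  keep
}

# Toy genotype matrix: 4 observations, 8 SNPs, all homozygous reference at first
geno <- matrix(1L, nrow = 4, ncol = 8)
geno[, 7] <- c(2L, 1L, 3L, 2L)
geno[, 8] <- c(1L, 1L, 2L, 3L)

# (SNP5 == 1) & (SNP7 != 1) & (SNP8 != 3)
hit <- evalInteraction(geno, ia = c(1, -1, -3), snp = c(5, 7, 8))
hit
```

In this toy matrix, the second observation fails because SNP7 equals 1, and the fourth fails because SNP8 equals 3.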

list.snp

a list consisting of numeric vectors (if one interaction should be explanatory for a level of the response) or lists of numeric vectors (if there should be more than one explanatory interaction) specifying the SNPs that compose the interactions. list.snp must have the same structure as list.ia, and each entry of list.snp must be an integer between 1 and n.snp. If list.ia is specified but not list.snp, then the first \(n\) SNPs are used to generate the interactions, where \(n\) is the total number of values in list.ia. If neither list.ia nor list.snp is specified, see Details.

withRef

should there be an additional reference group (i.e. a control group) denoted by a zero? If TRUE, a multinomial logistic regression is used to specify the class labels. If FALSE, level-wise logistic regressions are employed to generate the class labels. For details, see Details.

beta0

a numeric value or vector of length(list.ia) specifying the intercept of the logistic regression models.

beta

either a non-negative numeric value or a list of non-negative numeric values specifying the parameters in the logistic regression models. If a numeric value, all parameters (except for the intercept) in all logistic regression models will be equal to this value. If a list, then this list must have the same length as list.ia, and each object must consist of as many numeric values as interactions are specified by the corresponding object in list.ia.

maf

either a single numeric value, or a vector of length 2 or n.snp specifying the minor allele frequency. If a single value, all the SNPs will have the same minor allele frequency. If a vector of length n.snp, each SNP will have the minor allele frequency specified in the corresponding entry of maf. If length 2, then maf is interpreted as the range of the minor allele frequencies, and for each SNP, a minor allele frequency will be randomly drawn from a uniform distribution with the range given by maf.

sample.y

should the values of the response be randomly drawn using the probabilities determined by the logistic regression models? If FALSE, then for each of the n.obs observations, the value of the response is given by the level exhibiting the largest probability at this observation.
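The difference between the two settings can be sketched in base R. The probability matrix prob below is made up for illustration, and the code only mimics the behavior described here; it is not the package's implementation.

```r
# Hypothetical probabilities: one row per observation, one column per level
prob <- rbind(c(0.70, 0.20, 0.10),
              c(0.10, 0.30, 0.60),
              c(0.25, 0.50, 0.25))

# Analogue of sample.y = FALSE: pick the level with the largest probability
y.max <- max.col(prob, ties.method = "first")

# Analogue of sample.y = TRUE: draw one level per observation,
# using the row of prob as sampling probabilities
set.seed(42)
y.draw <- apply(prob, 1, function(p) sample(seq_along(p), size = 1, prob = p))

y.max
y.draw
```

The first variant is deterministic given the probabilities, whereas the second adds sampling noise, so an observation can end up with a level other than its most probable one.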

rand

a numeric value for setting the random number generator in a reproducible state.

x

the output of simulateSNPcatResponse.

justify

a character string specifying whether the column of the summarizing table that names the explanatory interactions should be "left"- or "right"-adjusted.

spaces

integer specifying the distance from the left end of the column mentioned in justify to the position at which the column name is presented.

…

ignored.

Value

An object of class simSNPcatResponse consisting of

x

a matrix with n.obs rows and n.snp columns containing the simulated SNP values.

y

a vector of length n.obs composed of the values of the response.

models

a character vector naming the level-wise logistic regression models.

maf

a vector of length n.snp composed of the minor allele frequencies.

tab.explain

a data frame summarizing the results of the simulation.

Details

simulateSNPcatResponse first simulates a matrix consisting of n.obs observations and n.snp SNPs, where the minor allele frequencies of these SNPs are given by maf.

Note that all SNPs are currently simulated independently of each other such that they are unlinked. Moreover, an observation is currently not allowed to have genotypes/interactions that are explanatory for more than one of the levels of the response. If, e.g., the response has three categories, then an observation can either exhibit one (or more) of the genotypes explaining the first level, or one (or more) of the genotypes explanatory for the second level, or one (or more) of the genotypes explaining the third level, or none of these genotypes.
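As a rough illustration of the SNP simulation step, the following base R sketch expands a maf specification into one frequency per SNP and then draws each SNP independently. The helper expandMaf is hypothetical, the precedence of the length-2 case when n.snp equals 2 is an assumption, and the use of Hardy-Weinberg genotype proportions is also an assumption that this help page does not state.

```r
# Hypothetical helper (not the package's code): one minor allele
# frequency per SNP from a maf specification of length 1, 2, or n.snp.
expandMaf <- function(maf, n.snp) {
  if (length(maf) == 1) return(rep(maf, n.snp))
  # Assumption: a length-2 vector is always read as a range
  if (length(maf) == 2) return(runif(n.snp, min = min(maf), max = max(maf)))
  if (length(maf) == n.snp) return(maf)
  stop("maf must have length 1, 2, or n.snp")
}

set.seed(1)
n.obs <- 800
maf <- expandMaf(c(0.1, 0.4), n.snp = 25)

# Assumption: genotypes 1, 2, 3 follow Hardy-Weinberg proportions.
# Each column is drawn on its own, so the SNPs are unlinked.
geno <- sapply(maf, function(p) {
  sample(1:3, size = n.obs, replace = TRUE,
         prob = c((1 - p)^2, 2 * p * (1 - p), p^2))
})

dim(geno)
```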

Afterwards, the response is generated by employing the specifications of list.ia, list.snp, beta0 and beta.

By default, i.e. if both list.ia and list.snp are NULL, list.ia is set to

list(c(-1, 1), c(1, 1, 1), list(c(-1, 1), c(1, 1, 1))),

and list.snp is set to

list(c(6, 7), c(3, 9, 10), list(c(2, 5), c(1, 4, 8)))

such that the interaction

(SNP6 != 1) & (SNP7 == 1)

is assumed to be explanatory for the first level of the three-categorical response, the interaction

(SNP3 == 1) & (SNP9 == 1) & (SNP10 == 1)

is assumed to be explanatory for the second level, and the interactions

(SNP2 != 1) & (SNP5 == 1) and

(SNP1 == 1) & (SNP4 == 1) & (SNP8 == 1)

are assumed to be explanatory for the third level.

If withRef = FALSE, then for each of the levels, the probability of having this level given that an observation exhibits one, two, ... of the interactions intended to be explanatory for that level is determined using the corresponding logistic regression model. Afterwards, the value of the response for an observation showing one, two, ... of the interactions explanatory for a particular level is randomly drawn using the above probability \(p\) for the particular level and \((1-p)/(n_{cat}-1)\) as probabilities for the other \((n_{cat}-1)\) levels. If an observation exhibits none of the explanatory interactions, its response value is randomly drawn using the probabilities \(\exp(\mathrm{beta0})/(1+\exp(\mathrm{beta0}))\).
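A small worked example may help with the level-wise case. It uses the default values beta0 = -0.5 and beta = 1.5 and assumes that the linear predictor is beta0 plus beta for each exhibited interaction; that reading of the model, and the variable names, are assumptions made for this sketch.

```r
beta0 <- -0.5   # default intercept
beta  <- 1.5    # default parameter, shared by all interactions
n.cat <- 3      # number of response levels

# Observation exhibiting exactly one interaction explanatory for level 1
n.hit <- 1
p <- plogis(beta0 + beta * n.hit)   # logistic function of the linear predictor

# Probability p for level 1, (1 - p) / (n.cat - 1) for each other level
probs <- rep((1 - p) / (n.cat - 1), n.cat)
probs[1] <- p

round(probs, 3)
```

Under these assumptions the observation is assigned to level 1 with a probability of roughly 0.73, and the remaining probability is split evenly between the other two levels.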

If withRef = TRUE, a multinomial logistic regression is used to specify the class labels. In this case, the probabilities \(p_j\), \(j = 1, \ldots, n_{cat}\), are given by \(p_j = \exp(q_j) \cdot p_0\), where the \(q_j\) are the linear predictors (i.e. the values on the logit scale) and \(p_0\) is the probability for the control/reference group, whose reciprocal is \(p_0^{-1} = 1 + \exp(q_1) + \ldots + \exp(q_{n_{cat}})\).
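The multinomial case can be checked numerically with a minimal sketch. It uses the standard multinomial logit normalization, in which the reference probability is 1 / (1 + sum of exp(q_j)); the linear predictors q are hypothetical values chosen for illustration.

```r
# Hypothetical linear predictors for the three non-reference levels
q <- c(1.0, -0.5, -0.5)

# Probability of the reference (control) group, denoted by 0
p0 <- 1 / (1 + sum(exp(q)))

# Probabilities of levels 1, ..., n.cat
p <- exp(q) * p0

# Together with the reference group, the probabilities sum to one
c(p0 = p0, p1 = p[1], p2 = p[2], p3 = p[3])
sum(c(p0, p))
```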

Examples


# NOT RUN {
# The simulated data set described in Details.

sim1 <- simulateSNPcatResponse()
sim1

# Specifying the values of the response by the levels with
# the largest probability.

sim2 <- simulateSNPcatResponse(sample.y = FALSE)
sim2

# If ((SNP4 != 2) & (SNP3 == 1)), (SNP5 == 3), and
# ((SNP12 != 1) & (SNP9 == 3)) should be the three interactions
# (or variables) that are explanatory for the three levels
# of the response, list.ia and list.snp are specified as follows.

list.ia <- list(c(-2, 1), 3, c(-1, 3))
list.snp <- list(c(4, 3), 5, c(12, 9))

# The categorical response and a data set consisting of 
# 800 observations and 25 SNPs, where the minor allele
# frequency of each SNP is randomly drawn from a
# uniform distribution with minimum 0.1 and maximum 0.4,
# is then generated by

sim3 <- simulateSNPcatResponse(n.obs = 800, n.snp = 25,
  list.ia = list.ia, list.snp = list.snp, maf = c(0.1, 0.4))
sim3

# }