scrime (version 1.3.5)

simulateSNPs: Simulation of SNP data

Description

Simulates SNP data in which a specified proportion of cases and controls is explained by a specified set of SNP interactions. Can also be used to simulate a data set with a multi-categorical response, i.e. a data set in which the cases are divided into several classes (e.g., different diseases or subtypes of a disease).

Usage

simulateSNPs(n.obs, n.snp, vec.ia, prop.explain = 1, 
  list.ia.val = NULL, vec.ia.num = NULL, vec.cat = NULL,
  maf = c(0.1, 0.4), prob.val = rep(1/3, 3), list.equal = NULL, 
  prob.equal = 0.8, rm.redundancy = TRUE, shuffle = FALSE, 
  shuffle.obs = FALSE, rand = NA)

Arguments

n.obs

either an integer specifying the total number of observations, or a vector of length 2 specifying the number of cases and the number of controls. If vec.cat is specified, then the partitioning of the number of cases to the different classes can be governed by vec.ia.num. If n.obs is an integer, then \(1 / c\) of the observations will be controls and the remaining observations will be cases, where \(c\) is the total number of groups (including the controls).
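
The split for a scalar n.obs can be sketched in base R. The helper split_obs below is hypothetical (it is not part of scrime), and rounding to the nearest integer is an assumption; the package may round differently when n.obs is not divisible by the number of groups.

```r
# Hypothetical helper, not part of scrime: split a scalar n.obs into
# cases and controls following the 1/c rule described above.
split_obs <- function(n.obs, n.classes = 1) {
  n.groups <- n.classes + 1           # case classes plus the control group
  n.ctrl <- round(n.obs / n.groups)   # assumption: nearest-integer rounding
  c(cases = n.obs - n.ctrl, controls = n.ctrl)
}

split_obs(2000)                 # binary response: c = 2
split_obs(1500, n.classes = 2)  # two case classes: c = 3
```

With a binary response, half of the observations are controls; with two case classes, one third are.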

n.snp

integer specifying the number of SNPs.

vec.ia

a vector of integers specifying the orders of the interactions that explain the cases. For example, c(3, 1, 2, 3) means that a three-way, a one-way (i.e. just a single SNP), a two-way, and another three-way interaction explain the cases.

prop.explain

either a single numeric value or a vector of length(vec.ia) specifying, for each interaction of interest, the proportion of cases among all observations having that interaction. Must be larger than 0.5. E.g., prop.explain = 1 means that only cases have the interactions of interest specified by vec.ia (and list.ia.val). E.g., vec.ia = c(3, 2) and prop.explain = c(1, 0.8) means that only cases have the three-way interaction of interest, while 80% of the observations having the two-way interaction of interest are cases, and 20% are controls.
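
The arithmetic behind prop.explain can be worked through in base R. The helper controls_with_interaction is hypothetical and only restates the definition above; how scrime computes and rounds these counts internally is not stated here and is left open.

```r
# Hypothetical helper, not part of scrime: number of controls that must also
# show an interaction so that a proportion prop.explain of all observations
# showing it are cases.
controls_with_interaction <- function(n.cases, prop.explain) {
  stopifnot(prop.explain > 0.5, prop.explain <= 1)
  n.cases * (1 - prop.explain) / prop.explain
}

controls_with_interaction(200, 1)    # no controls: only cases show it
controls_with_interaction(200, 0.8)  # 200 / (200 + result) equals 0.8
```

For 200 explained cases and prop.explain = 0.8, about 50 controls would show the interaction as well, since 200 / 250 = 0.8.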

list.ia.val

a list of length(vec.ia) specifying the exact interactions. The i-th element of this list must be a vector of length vec.ia[i] consisting of the values 0 (homozygous reference), 1 (heterozygous variant), or 2 (homozygous variant). E.g., vec.ia = c(3, 2), list.ia.val = list(c(2, 0, 1), c(0, 2)), and prob.equal = 1 (see also list.equal) mean that ((SNP1 == 2) & (SNP2 == 0) & (SNP3 == 1)) and ((SNP4 == 0) & (SNP5 == 2)) are the explanatory interactions. If NULL, the genotypes are randomly drawn using the probabilities given by prob.val.

vec.ia.num

a vector of length(vec.ia) specifying the number of cases (not observations) explained by the interactions in vec.ia. If NULL, all the cases are divided into length(vec.ia) groups of about the same size. sum(vec.ia.num) must be smaller than or equal to the total number of cases. Each entry of vec.ia.num must currently be >= 10.

vec.cat

a vector of the same length as vec.ia specifying the subclasses of the cases that are explained by the corresponding interaction in vec.ia. If NULL, no subclasses will be considered. This feature is currently not fully tested, so be careful when specifying vec.cat.

maf

either a single numeric value, or a vector of length 2 or n.snp specifying the minor allele frequencies. If a single value, all SNPs will have the same minor allele frequency. If a vector of length n.snp, each SNP will have the minor allele frequency specified in the corresponding entry of maf. If length 2, then maf is interpreted as the range of the minor allele frequencies, and for each SNP, a minor allele frequency will be randomly drawn from a uniform distribution with the range given by maf. Note: If a SNP belongs to an explanatory interaction, then only the set of observations not explained by this interaction will have the minor allele frequency specified by maf.
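
The three accepted forms of maf can be illustrated with a small base R sketch. The helper expand_maf is hypothetical and not taken from scrime; in particular, the order of the checks (which decides what happens when n.snp is 2) is an assumption.

```r
# Hypothetical helper, not part of scrime: expand maf to one value per SNP.
expand_maf <- function(maf, n.snp) {
  if (length(maf) == 1) {
    rep(maf, n.snp)                    # same MAF for every SNP
  } else if (length(maf) == n.snp) {
    maf                                # one MAF per SNP, used as given
  } else if (length(maf) == 2) {
    runif(n.snp, min(maf), max(maf))   # drawn uniformly from the range
  } else {
    stop("maf must have length 1, 2, or n.snp")
  }
}

expand_maf(0.25, 5)                        # constant MAF
set.seed(42)
summary(expand_maf(c(0.1, 0.4), 50))       # values within [0.1, 0.4]
```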

prob.val

a vector consisting of the probabilities for drawing a 0, 1, or 2, if list.ia.val = NULL, i.e. if the genotypes of the SNPs explaining the case-control status should be randomly drawn. Ignored if list.ia.val is specified. By default, each genotype has the same probability of being drawn.

list.equal

list of the same structure as list.ia.val containing only ones and zeros, where a 1 specifies equality to the corresponding value in list.ia.val, and a 0 specifies non-equality. Thus, the entries of list.equal specify whether the corresponding SNP should be of a particular genotype (when the entry is 1) or should not be of this genotype (when the entry is 0). If NULL, this list will be generated automatically using prob.equal. If, e.g., vec.ia = c(3, 2), list.ia.val = list(c(2, 0, 1), c(0, 2)), and list.equal = list(c(1, 0, 1), c(1, 0)), then the explanatory interactions are given by ((SNP1 == 2) & (SNP2 != 0) & (SNP3 == 1)) and ((SNP4 == 0) & (SNP5 != 2)).
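
How a pair of entries from list.ia.val and list.equal translates into a logical condition can be sketched in base R. The helper has_interaction and the toy matrix geno are hypothetical illustrations, not scrime code.

```r
# Hypothetical helper, not part of scrime: flag the observations (rows of a
# genotype matrix) that show one interaction.
has_interaction <- function(geno, snps, values, equal) {
  hit <- rep(TRUE, nrow(geno))
  for (j in seq_along(snps)) {
    same <- geno[, snps[j]] == values[j]
    # equal = 1 requires the genotype, equal = 0 requires any other genotype
    hit <- hit & (if (equal[j] == 1) same else !same)
  }
  hit
}

# Toy genotype matrix: 4 observations, 3 SNPs, coded 0/1/2.
geno <- matrix(c(2, 1, 1,
                 2, 0, 1,
                 2, 2, 1,
                 0, 1, 1), nrow = 4, byrow = TRUE)

# (SNP1 == 2) & (SNP2 != 0) & (SNP3 == 1)
has_interaction(geno, snps = 1:3, values = c(2, 0, 1), equal = c(1, 0, 1))
```

Only the first and third observations satisfy all three conditions; the second fails SNP2 != 0 and the fourth fails SNP1 == 2.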

prob.equal

a numeric value specifying the probability that a 1 is drawn when generating list.equal. prob.equal is thus the probability for an equal sign.
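
A list of the kind described for list.equal could be drawn as follows. This is a sketch of the idea only; the helper draw_list_equal is hypothetical, and scrime's internal sampling may differ.

```r
# Sketch, not scrime's internal code: draw a list of 0/1 vectors, one vector
# per interaction order in vec.ia, where each entry is 1 with probability
# prob.equal.
draw_list_equal <- function(vec.ia, prob.equal = 0.8) {
  lapply(vec.ia, function(k) rbinom(k, size = 1, prob = prob.equal))
}

set.seed(1)
draw_list_equal(c(3, 2))                  # each entry is 1 with probability 0.8
draw_list_equal(c(3, 2), prob.equal = 1)  # all ones: every condition is "=="
```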

rm.redundancy

logical. Should redundant SNPs be removed from the explaining interactions? It is possible to specify an explaining \(i\)-way interaction for which an interaction between \((i-1)\) of its variables already explains all the cases (and controls) that the \(i\)-way interaction should explain. In this case, the redundant SNP is removed if rm.redundancy = TRUE.

shuffle

logical. By default, the first sum(vec.ia) columns of the generated data set contain the explanatory SNPs in the same order as they appear in this data set. If TRUE, this order will be shuffled.

shuffle.obs

should the observations be shuffled?

rand

integer. Sets the random number generator in a reproducible state.

Value

An object of class simulatedSNPs composed of

data

a matrix with n.obs rows and n.snp columns containing the SNP data.

cl

a vector of length n.obs comprising the case-control status of the observations.

tab.explain

a table naming the explanatory interactions and the numbers of cases and controls explained by them.

ia

character vector naming the interactions.

maf

vector of length n.snp containing the minor allele frequencies.
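
The components listed above can be inspected with ordinary list access. The following is a minimal sketch that assumes the scrime package is installed; the argument values are arbitrary choices for illustration.

```r
# Runs only if scrime is available; argument values are illustrative.
if (requireNamespace("scrime", quietly = TRUE)) {
  sim <- scrime::simulateSNPs(c(500, 500), 20, c(2, 2), rand = 123)

  dim(sim$data)      # observations by SNPs
  table(sim$cl)      # case-control status
  sim$tab.explain    # cases and controls explained per interaction
  sim$ia             # names of the explanatory interactions
  range(sim$maf)     # minor allele frequencies
}
```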

See Also

simulateSNPglm, simulateSNPcatResponse

Examples

# Simulate a data set containing 2000 observations (1000 cases
# and 1000 controls) and 50 SNPs, where one three-way and two 
# two-way interactions are chosen randomly to be explanatory 
# for the case-control status.

sim1 <- simulateSNPs(2000, 50, c(3, 2, 2))
sim1

# Simulate data of 1200 cases and 800 controls for 50 SNPs, 
# where 90% of the observations showing a randomly chosen 
# three-way interaction are cases, and 95% of the observations 
# showing a randomly chosen two-way interaction are cases.

sim2 <- simulateSNPs(c(1200, 800), 50, c(3, 2), 
   prop.explain = c(0.9, 0.95))
sim2

# Simulate a data set consisting of 1000 observations and 50 SNPs,
# where the minor allele frequency of each SNP is 0.25, and
# the interactions 
# ((SNP1 == 2) & (SNP2 != 0) & (SNP3 == 1))   and 
# ((SNP4 == 0) & (SNP5 != 2))
# are explanatory for 200 and 250 of the 500 cases, respectively,
# and for none of the 500 controls.

list1 <- list(c(2, 0, 1), c(0, 2))
list2 <- list(c(1, 0, 1), c(1, 0))
sim3 <- simulateSNPs(1000, 50, c(3, 2), list.ia.val = list1,
    list.equal = list2, vec.ia.num = c(200, 250), maf = 0.25)
