generateSNPs: Simulation of SNP data

Description

Simulates SNP data with genotypes coded by 0, 1 and 2 as well as a binary and a continuous covariate, together with case-control status specified by logistic regression.

Usage

generateSNPs(n, gene.no, block.no, block.size, p.same, p.different = NULL, p.minor, n.sample, SNPtoBETA)

Arguments

n

an integer specifying the number of observations (cases and controls, matched 1:1) to generate. n should be an even number.

gene.no

an integer specifying the number of genes that should be generated.

block.no

an integer specifying the number of blocks per gene.

block.size

an integer specifying the number of SNPs per block.

p.same

either a single numeric value or a numeric vector of length block.size. If a single value, it specifies the neighborhood probability within a block, i.e. the probability that two neighboring alleles within a block are equal, and all SNPs except the first one of each block have this same neighborhood probability. If a vector of length block.size, each SNP of each block has the neighborhood probability given in the corresponding entry of p.same. In this case the argument p.different is ignored, and the probability for neighboring blocks has to be given as the first entry of p.same.

p.different

a numeric value specifying the neighborhood probability for adjacent blocks within a gene. It is used only if p.same is a single value. If p.same is a numeric vector, this argument is ignored and the probability has to be given as the first entry of p.same.

p.minor

a vector of length block.no containing the allele frequencies of the SNPs within a block. All SNPs in a block will have the same allele frequency.

n.sample

an integer specifying the number of simulated subjects from which the n observations (with their case-control status) are drawn.

SNPtoBETA

a matrix of non-negative numeric values of dimension m * 2, with m <= snp.no, consisting of the SNP indices (first column) and the parameters (effect sizes) of these SNPs (second column) used to generate the case-control status.

Value

sim.data: a matrix with n rows and (snp.no+4) columns containing response (case-control status) values, simulated SNP values, continuous matching covariate, binary matching covariate and matchset numbers.
y: a numeric response vector coded with 0 (coding for controls) and 1 (coding for cases) of length n.
x: a numeric n * snp.no matrix containing the simulated SNP data with genotypes coded by 0, 1 and 2.
cov: an n * 2 matrix containing the continuous matching covariate (analogous to age) and the binary matching covariate (analogous to gender).
matchset: a numeric vector of length n containing the matching numbers (1:1 match).
snp.no: number of SNPs in the simulated data set.
SNPtoGene: the mapping matrix of dimension p x 2 comprising the SNP names (first column) and the names of the genes (second column) on which the SNPs are located.
call: the function call.
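
The dimensions listed above follow directly from the arguments. The following base R sketch illustrates them for the argument values used in the example on this page. It assumes that SNPs are numbered consecutively, gene by gene; the labels `SNP1`, `Gene1`, and so on, and the object name `SNPtoGene.sketch`, are illustrative assumptions and not necessarily the names that generateSNPs produces.

```r
# Argument values taken from the example on this page
gene.no <- 5
block.no <- 4
block.size <- 10

# Total number of SNPs: genes x blocks per gene x SNPs per block
snp.no <- gene.no * block.no * block.size

# Hypothetical SNP-to-gene mapping (labels are illustrative only)
snp.per.gene <- block.no * block.size
SNPtoGene.sketch <- cbind(
  SNP  = paste0("SNP", seq_len(snp.no)),
  Gene = paste0("Gene", rep(seq_len(gene.no), each = snp.per.gene))
)

dim(SNPtoGene.sketch)         # snp.no rows and 2 columns

# sim.data holds the response, the SNPs, two covariates and the matchset
ncol.sim.data <- snp.no + 4
```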

Details

generateSNPs generates a matrix consisting of n observations, snp.no=gene.no*block.no*block.size SNPs with genotypes coded by 0, 1 and 2, two automatically generated covariates for adjustment or matching, and the matchset numbers. The neighborhood probabilities for the SNPs are given by p.different and/or p.same, and the allele frequencies of the SNPs are given by p.minor. The allele frequencies (p.minor) and the neighborhood probabilities (p.different and/or p.same) can differ between the blocks of a gene, but the same values are reused for every one of the gene.no genes. The simulated SNP data structure is similar to that in Schwender et al. (2011).
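
To make the block layout concrete, the sketch below expands the block-level allele frequencies to one value per SNP and locates a single SNP index within the gene and block structure. It is written in base R and assumes that SNPs are ordered gene by gene and, within a gene, block by block; the object names `maf.per.snp`, `gene`, `block` and `pos` are introduced here for illustration only.

```r
# Argument values taken from the example on this page
gene.no <- 5
block.no <- 4
block.size <- 10
p.minor <- c(0.1, 0.4, 0.1, 0.4)   # one allele frequency per block

snp.no <- gene.no * block.no * block.size

# One frequency per SNP: constant within a block, repeated for every gene
maf.per.snp <- rep(rep(p.minor, each = block.size), times = gene.no)

# Locate SNP 54 under the assumed ordering
snp   <- 54
gene  <- (snp - 1) %/% (block.no * block.size) + 1
block <- ((snp - 1) %/% block.size) %% block.no + 1
pos   <- (snp - 1) %% block.size + 1
c(gene = gene, block = block, position = pos)
```

Under this assumed ordering, SNP 54 is the fourth SNP of the second block of the second gene, so it would carry the second entry of p.minor.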

The response is determined by a logistic regression model given the SNPs, the binary covariate and the continuous covariate in the sim.cov matrix:

P(Y=1|sim.cov)=exp(sim.cov*beta)/(1+exp(sim.cov*beta))
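
As a small illustration of this formula, the base R sketch below evaluates the probability for a made-up design matrix. The columns of `sim.cov`, the presence of an intercept, and the values in `beta` are assumptions chosen for the example; they are not the values used internally by generateSNPs.

```r
set.seed(1)
n.sample <- 80

# Hypothetical design matrix: intercept, two SNPs, binary and continuous covariate
sim.cov <- cbind(
  intercept  = 1,
  snp1       = rbinom(n.sample, size = 2, prob = 0.1),
  snp2       = rbinom(n.sample, size = 2, prob = 0.4),
  binary     = rbinom(n.sample, size = 1, prob = 0.5),
  continuous = rnorm(n.sample)
)

# Hypothetical coefficients, one per column of sim.cov
beta <- c(-1, 0.9, 0.7, 0.2, 0.1)

# Linear predictor and case probability for every simulated subject
eta    <- drop(sim.cov %*% beta)
p.case <- exp(eta) / (1 + exp(eta))   # equivalent to plogis(eta)
```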

Using the model, P(Y=1|sim.cov) is computed for each of the n.sample subjects. The case or control status of each of these subjects is then determined by a random draw from a Bernoulli distribution with probability P(Y=1|sim.cov). From these n.sample subjects, one case and one control are randomly drawn. This procedure is repeated n/2 times, once for each randomly sampled value of the continuous covariate, so that one case and one control are drawn in each of the n/2 repetitions, yielding the complete response vector of length n.
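
The sampling procedure described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the helper `draw_pair` is hypothetical, the linear predictor is replaced by random normal values instead of being built from SNPs and covariates, and the redraw when no case or no control is available is an assumption.

```r
set.seed(2)
n <- 10          # total number of observations (even)
n.sample <- 80   # candidate subjects simulated per matched pair

draw_pair <- function(n.sample) {
  # Stand-in for the linear predictor of the n.sample candidate subjects
  eta <- rnorm(n.sample)
  # Bernoulli draw of the case-control status
  status <- rbinom(n.sample, size = 1, prob = plogis(eta))
  cases <- which(status == 1)
  controls <- which(status == 0)
  # Assumption: redraw if either group is empty
  if (length(cases) == 0 || length(controls) == 0) {
    return(draw_pair(n.sample))
  }
  # Pick one index per group; sample.int avoids sample()'s
  # special handling of length-one vectors
  c(case    = cases[sample.int(length(cases), 1)],
    control = controls[sample.int(length(controls), 1)])
}

# One case and one control for each of the n/2 matched sets
picked <- t(replicate(n / 2, draw_pair(n.sample)))

y <- rep(c(1, 0), times = n / 2)            # case, control, case, control, ...
matchset <- rep(seq_len(n / 2), each = 2)   # 1, 1, 2, 2, ...
```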

As output, generateSNPs provides a response vector y, a SNP matrix x, a covariate matrix cov and a matchset vector matchset, which can be used directly as input for minPtest; see the example of the minPtest function.

References

Schwender, H. et al. (2011). Testing SNPs and sets of SNPs for importance in association studies. Biostatistics, 12, 18-32.

Examples

# Generate a data set consisting of 100 subjects and 200 SNPs on 5 genes,
# with 4 blocks per gene with block size of 10, i.e. 10 SNPs per block
# yielding 40 SNPs per gene:

# specifying the matrix for 6 SNPs and corresponding parameters (effect size)
# for the generation of case-control status

library(minPtest)  # generateSNPs is provided by the minPtest package

SNP <- c(6, 26, 54, 135, 156, 186)
BETA <- c(0.9, 0.7, 1.5, 0.5, 0.6, 0.8)
SNPtoBETA <- matrix(c(SNP, BETA), ncol = 2, nrow = 6)
colnames(SNPtoBETA) <- c("SNP.item", "SNP.beta")

set.seed(191)
sim1 <- generateSNPs(n = 100, gene.no = 5, block.no = 4, block.size = 10,
                     p.same = 0.9, p.different = 0.75,
                     p.minor = c(0.1, 0.4, 0.1, 0.4),
                     n.sample = 80, SNPtoBETA = SNPtoBETA)

# To see how the output of generateSNPs can be used,
# see the example of the minPtest function.