rbm.haplos.thresholds: Simulation of genetic data based on haplotypes and a libaility model

Description

Simulates genetic data with respect to allele frequency spectrum and linkage disequilibrium pattern observed on given haplotype data under a libaility model

Usage

rbm.haplos.thresholds(haplos, weights = c("SKAT", "constant"), 
                        max.maf.causal = 0.01, p.causal = 0.5, p.protect = 0, 
                        h2, prev, normal.approx = TRUE, size, 
                        replicates, rep.by.causal, verbose = TRUE)

Value

x: A bed matrix with simulated genotypes

Arguments

haplos: A matrix of haplotypes with one row per haplotype and one column per variant
weights: How to weight rare variants (if "constant", all variants have the same weight, if "SKAT", the rarest variants have the highest weights as in the SKAT paper: weights = -0.4*log10(MAF) )
max.maf.causal: The maf threshold to consider a rare variant (set at 0.01 by default), variants with a MAF upper this threshold will have a weight of 0
p.causal: The proportion of causal variants
p.protect: The proportion of protective variants among causal variants
h2: The variance explained by the gene
prev: A vector with the prevalence in each group of individuals
normal.approx: TRUE/FALSE: whether to use the normal approximation to compute thresholds. Set at TRUE by default
size: The sizes of each group of individuals
replicates: The number of simulations to perform
rep.by.causal: The number of time causal variants will be sampled
verbose: Whether to display information about the function actions

Details

nb.causal, p.protect, h2 and prev should be vectors of length corresponding to the number of groups to simulate. If they are of size 1, values will be duplicated.

All monomorphic variants and variants with a MAF higher than max.maf.causal will have a weight of 0. Causal variants are sampled among variants having weights greater than 0. Causal variants in each group of individuals are indicated in x@ped$Causal.

A liability model is built on haplotypes' burden computed on sampled causal variants using each variant's weights, and adjusted on the desired h2. Thresholds from this liability are then chosen to respect the given prev (from a standard normal distribution if normal.approx=TRUE, or using a distribution from 1e6 sampled burdens if normal.approx=FALSE). Please be carreful when using the normal approximation with high h2 values or low prev values. Haplotypes' probabilities in each group of individuals are then computed and two haplotypes are then sampled for each individual based on these probabilities.

To simulate a group of controls, prev needs to be set at 1, regardless of the other arguments.

N replicates will be performed, and to gain in computation time, the same causal variants can be used for multiple replicates as different haplotypes will be sampled for each individual. rep.by.causal indicates the number of replicates to perform for each set of causal variants. To ensure a variability in the simulations, we yet recommend to resample causal variants a few times when many replicates are to be performed. For example, if 1000 replicates are to be performed, we recommend to resample causal variants 20 times.

The phenotype will be stored in @ped$pheno, and the simulation number is @snps$genomic.region.

Examples

Run this code

# \donttest{
  #Load LCT dataset for haplotype matrix
  data(LCT.haplotypes)
  #LCT gene in the EUR population
  LCT.gene.hap <- LCT.hap[which(LCT.sample$super.population=="EUR"), 
                          which(LCT.snps$pos>=136545410 & LCT.snps$pos<=136594750)]

  #Simulation of 100 controls, and two groups of 50 cases with 30% causal variants
  #and with the second group having half h2 and twice the prevalence
  #compared to the first one
  #5 replicates are performed and causal variants are sampled once
  x <- rbm.haplos.thresholds(haplos=LCT.gene.hap, max.maf.causal = 0.01, p.causal=0.3,
                             p.protect=0, h2=c(0.01, 0.01, 0.02), prev=c(1, 0.01, 0.005),
                             size=c(100, 50, 50), replicates = 5, rep.by.causal = 5)
# }

Run the code above in your browser using DataLab