sim.readcounts: Simulate read counts from count data and return a pooldata object

Description

Simulate read counts from count data and return a pooldata object

Usage

sim.readcounts(
  x,
  lambda.cov = rep(50, x@npops),
  overdisp = 1,
  seq.eps = 0,
  exp.eps = 0,
  maf.thr = 0,
  min.rc = 2,
  genome.size = 0,
  verbose = TRUE
)

Value

A pooldata object containing simulated read counts

Arguments

x: A countdata object containing allele count information
lambda.cov: Numeric vector of length npop giving the expected coverage of each pool
overdisp: Numeric value giving overdispersion of coverages (see details)
seq.eps: Numeric value giving the sequencing error rate
exp.eps: Numeric value giving the experimental error leading to unequal contribution of individual to the pool reads
maf.thr: Float giving the MAF threshold for SNP filtering
min.rc: Integer giving the minimal read count for an allele to be considered as true allele
genome.size: Size of the genome (only considered when seq.eps>0 to simulated spurious SNPs generated at monomorphic position)
verbose: If TRUE extra information is printed on the terminal

Details

The function implements a simulation approach similar to that described in Gautier et al. (2021). Read coverages are sampled from a distribution specified by the lambda.cov vector and the overdisp scalar. Note that overdisp is the same for all pop sample but the expected coveragese (specified in the lambda.cov vector) may vary across pool. If overdisp=1 (default), coverages are assumed Poisson distributed with mean (and variance) equal to the value specified in the lambda.cov vector. If overdisp>1, coverages follows a Negative Binomial distribution with a mean equal to lambda and variance equal to overdisp*lambda. Finally, if overdisp<1, no variation in coverage is introduced and all coverages are equal to the value specified in the lambda vector although they may (slightly) vary in the output when seq.eps>0 due to the removal of error reads. The seq.eps parameter control sequencing error rate. Sequencing errors are modeled following Gautier et al. (2021) i.e. read counts for the four possible bases are sampled from a multinomial distribution Multinom(c,{f*(1-eps)+(1-f)*eps/3;f*eps/3+(1-f)*(1-eps),eps/3,eps/3}) where c is the read coverage and f the reference allele frequencies (obtained from the count data). When seq.eps>0, spurious SNPs may be generated at monomorphic positions (the number of which being equal to the size of the genome, provided with the genome.size argument, minus the number of SNPs in the countdata object). These spurious SNPs are simulated using the same error model (Multinom(c,{1-eps,eps/3,eps/3,eps/3}). Only bi-allelic SNPs passing filtering conditions specified by min.rc (which controls the minimal read count for an allele to be deemed as true, i.e. if more than two alleles have >= min.rc counts then the SNP is excluded because non-bi-allelic) and maf.thr (threshold on the major allele frequency computed over all read counts) are included in the output. Experimental error exp.eps control the contribution of individual (assumed diploid) to the pools following the model described in Gautier et al. (2013). The parameter exp.eps corresponds to the coefficient of variation of the individual contributions. For example, in a pool of 10 individuals and a Poisson distributed coverage of mean 100, exp.eps=0.5 correspond to a situation where the 5 most contributing individuals contribute $>2$ times reads than the others. When exp.eps tends toward 0, all individuals contribute equally to the pool and there is no experimental error. Note that the number of (diploid) individuals for each SNP and pop. sample is deduced from the input total count (it may thus differ over SNP when the total counts are not the same).

Examples

Run this code

 #not run