SimGeno: Simulate Genotypes

Description

Simulate SNP genotype data from a pedigree, with optional missingess and errors.

Usage

SimGeno(
  Pedigree,
  nSnp = 400,
  ParMis = 0.4,
  MAF = 0.3,
  CallRate = 0.99,
  SnpError = 5e-04,
  ErrorFM = "version2.0",
  ReturnStats = FALSE,
  OutFile = NA,
  Inherit = "autosomal",
  InheritFile = NA,
  quiet = FALSE
)

Arguments

Pedigree

dataframe, pedigree with the first three columns being id - dam - sire. Column names are ignored, as are additional columns, with the exception of a 'Sex' column when Inherit is not 'autosomal'.

nSnp

number of SNPs to simulate.

ParMis

single number or vector length two with proportion of parents with fully missing genotype. Ignored if CallRate is a named vector.

MAF

minimum minor allele frequency, and allele frequencies will be sampled uniformly between this minimum and 0.5, OR a vector with minor allele frequency at each locus. In both cases, this is the MAF among pedigree founders, the MAF in the sample will deviate due to drift.

CallRate

either a single number for the mean call rate (genotyping success), OR a vector with the call rate at each SNP, OR a named vector with the call rate for each individual. In the third case, ParMis is ignored, and individuals in the pedigree (as id or parent) not included in this vector are presumed non-genotyped.

SnpError

mean per-locus genotyping error rate across SNPs, and a beta-distribution will be used to simulate the number of missing cases per SNP, OR a vector with the genotyping error for each SNP.

ErrorFM

function taking the error rate (scalar) as argument and returning a 3x3 matrix with probabilities that actual genotype i (rows) is observed as genotype j (columns). Inbuilt ones are as used in sequoia 'version2.0', 'version1.3', or 'version1.1'. See details.

ReturnStats

in addition to the genotype matrix, return the input parameters and mean & quantiles of MAF, error rate and call rates.

OutFile

file name for simulated genotypes. If NA (default), return results within R.

Inherit

inheritance pattern, scalar or vector of length nSnp, Defaults to 'autosomal'. An excel file included in the package has inheritance patterns for the X and Y chromosome and mtDNA, and allows custom inheritance patterns. Note that these are experimental, and NOT currently supported by the pedigree reconstruction with sequoia !

InheritFile

file name of file with inheritance patterns, with extension csv, txt, xls or xlsx (the latter two require library xlsx).

quiet

suppress messages.

Value

If ReturnStats=FALSE (the default), a matrix with genotype data in sequoia's input format, encoded as 0/1/2/-9.

If ReturnStats=TRUE, a named list with three elements: list 'ParamsIN', matrix 'SGeno', and list 'StatsOUT':

Frequency in 'observed' genotypes of '1' allele

AF.act

Allele frequency in 'actual' (without genotyping errors & missingness)

SnpError

Error rate per SNP (actual /= observed AND observed /= missing)

SnpCallRate

Non-missing per SNP

IndivError

Error rate per individual

IndivCallRate

Non-missing per individual

Disclaimer

This simulation is highly simplistic and assumes that all SNPs segregate completely independently, that the SNPs are in Hardy-Weinberg equilibrium in the pedigree founders. It assumes that genotyping errors are not due to heritable mutations of the SNPs, and that missingness is random and not e.g. due to heritable mutations of SNP flanking regions. Results based on this simulated data will provide an minimum estimate of the number of SNPs required, and an optimistic estimate of pedigree reconstruction performance.

Details

Please ensure the pedigree is a valid pedigree, for example by first running PedPolish. For founders, i.e. individuals with no known parents, genotypes are drawn according to the provided MAF and assuming Hardy-Weinberg equilibrium. Offspring genotypes are generated following Mendelian inheritance, assuming all loci are completely independent. Individuals with one known parent are allowed: at each locus, one allele is inherited from the known parent, and the other drawn from the genepool according to the provided MAF.

Genotyping errors are generated following a user-definable 3x3 matrix with probabilities that actual genotype \(i\) (rows) is observed as genotype \(j\) (columns). This is specified as ErrorFM, which is a function of SnpError. By default (ErrorFM = "version2.0"), SnpError is interpreted as a locus-level error rate (rather than allele-level), and equals the probability that a homozygote is observed as heterozygote, and the probability that a heterozygote is observed as either homozygote (i.e., the probability that it is observed as AA = probability that observed as aa = SnpError/2). The probability that one homozygote is observed as the other is (SnpError/2\()^2\).

Note that this differs from versions up to 1.1.1, where a proportion of SnpError*3/2 of genotypes were replaced with random genotypes. This corresponds to ErrorFM = "Version111".

Error rates differ between SNPs, but the same error pattern is used across all SNPs, even when inheritance patterns vary. When two or more different error patterns are required, SimGeno should be run on the different SNP subsets separately, and results combined.

Variation in call rates is assumed to follow a highly skewed (beta) distribution, with many samples having call rates close to 1, and a narrowing tail of lower call rates. The first shape parameter defaults to 1 (but see MkGenoErrors), and the second shape parameter is defined via the mean as CallRate. For 99.9% of SNPs to have a call rate of 0.8 (0.9; 0.95) or higher, use a mean call rate of 0.969 (0.985; 0.993).

Variation in call rate between samples can be specified by providing a named vector to CallRate, which supersedes PropLQ in versions up to 1.1.1. Otherwise, variation in call rate and error rate between samples occurs only as side-effect of the random nature of which individuals are hit by per-SNP errors and drop-outs. Finer control is possible by first generating an error-free genotype matrix, and then calling MkGenoErrors directly on subsets of the matrix.

Examples

Run this code

# NOT RUN {
data(Ped_HSg5)
GenoM <- SimGeno(Pedigree = Ped_HSg5, nSnp = 100, ParMis = c(0.2, 0.7))

# }
# NOT RUN {
# Alternative genotyping error model
EFM <- function(E) {   # Whalen, Gorjanc & Hickey 2018
 matrix(c(1-E*3/4, E/4, E/4,
          E/4, 1/2-E/4, 1/2-E/4, E/4,
          E/4, E/4, 1-E*3/4),
          3,3, byrow=TRUE)  }
EFM(0.01)
GenoM <- SimGeno(Pedigree = Ped_HSg5, nSnp = 100, ParMis = 0.2,
 SnpError = 5e-3, ErrorFM = EFM)

# combination of high & low quality SNPs
Geno.HQ <- SimGeno(Ped_HSg5, nSnp=50, MAF=0.3, CallRate=runif(50, 0.7, 1))
Geno.LQ <- SimGeno(Ped_HSg5, nSnp=20, MAF=0.1, CallRate=runif(20, 0.1, 5))
Geno.HQLQ <- merge(Geno.HQ, Geno.LQ, by="row.names")
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab