Simulate SNP genotype data from a pedigree, with optional missingess and errors.
SimGeno(
  Pedigree,
  nSnp = 400,
  ParMis = 0.4,
  MAF = 0.3,
  CallRate = 0.99,
  SnpError = 5e-04,
  ErrorFM = "version2.0",
  ReturnStats = FALSE,
  OutFile = NA,
  Inherit = "autosomal",
  InheritFile = NA,
  quiet = FALSE
)dataframe, pedigree with the first three columns being id - dam - sire. Column names are ignored, as are additional columns, with the exception of a 'Sex' column when Inherit is not 'autosomal'.
number of SNPs to simulate.
single number or vector length two with proportion of parents with fully missing genotype. Ignored if CallRate is a named vector.
minimum minor allele frequency, and allele frequencies will be sampled uniformly between this minimum and 0.5, OR a vector with minor allele frequency at each locus. In both cases, this is the MAF among pedigree founders, the MAF in the sample will deviate due to drift.
either a single number for the mean call rate (genotyping success), OR a vector with the call rate at each SNP, OR a named vector with the call rate for each individual. In the third case, ParMis is ignored, and individuals in the pedigree (as id or parent) not included in this vector are presumed non-genotyped.
mean per-locus genotyping error rate across SNPs, and a beta-distribution will be used to simulate the number of missing cases per SNP, OR a vector with the genotyping error for each SNP.
function taking the error rate (scalar) as argument and returning a 3x3 matrix with probabilities that actual genotype i (rows) is observed as genotype j (columns). Inbuilt ones are as used in sequoia 'version2.0', 'version1.3', or 'version1.1'. See details.
in addition to the genotype matrix, return the input parameters and mean & quantiles of MAF, error rate and call rates.
file name for simulated genotypes. If NA (default), return results within R.
inheritance pattern, scalar or vector of length nSnp,
Defaults to 'autosomal'. An excel file included in the package has
inheritance patterns for the X and Y chromosome and mtDNA, and allows
custom inheritance patterns. Note that these are experimental, and NOT
currently supported by the pedigree reconstruction with
sequoia !
file name of file with inheritance patterns, with extension csv, txt, xls or xlsx (the latter two require library xlsx).
suppress messages.
If ReturnStats=FALSE (the default), a matrix with genotype
  data in sequoia's input format, encoded as 0/1/2/-9.
If ReturnStats=TRUE, a named list with three elements: list
  'ParamsIN', matrix 'SGeno', and list 'StatsOUT':
Frequency in 'observed' genotypes of '1' allele
Allele frequency in 'actual' (without genotyping errors & missingness)
Error rate per SNP (actual /= observed AND observed /= missing)
Non-missing per SNP
Error rate per individual
Non-missing per individual
This simulation is highly simplistic and assumes that all SNPs segregate completely independently, that the SNPs are in Hardy-Weinberg equilibrium in the pedigree founders. It assumes that genotyping errors are not due to heritable mutations of the SNPs, and that missingness is random and not e.g. due to heritable mutations of SNP flanking regions. Results based on this simulated data will provide an minimum estimate of the number of SNPs required, and an optimistic estimate of pedigree reconstruction performance.
Please ensure the pedigree is a valid pedigree, for example by first
  running PedPolish. For founders, i.e. individuals with no
  known parents, genotypes are drawn according to the provided MAF and
  assuming Hardy-Weinberg equilibrium. Offspring genotypes are generated
  following Mendelian inheritance, assuming all loci are completely
  independent. Individuals with one known parent are allowed: at each locus,
  one allele is inherited from the known parent, and the other drawn from the
  genepool according to the provided MAF.
Genotyping errors are generated following a user-definable 3x3 matrix with
  probabilities that actual genotype \(i\) (rows) is observed as genotype
  \(j\) (columns). This is specified as ErrorFM, which is a function
  of SnpError. By default (ErrorFM = "version2.0"),
  SnpError is interpreted as a locus-level error rate (rather than
  allele-level), and equals the probability that a homozygote is observed as
  heterozygote, and the probability that a heterozygote is observed as either
  homozygote (i.e., the probability that it is observed as AA = probability
  that observed as aa = SnpError/2). The probability that one
  homozygote is observed as the other is (SnpError/2\()^2\).
Note that this differs from versions up to 1.1.1, where a proportion of
  SnpError*3/2 of genotypes were replaced with random genotypes. This
  corresponds to ErrorFM = "Version111".
Error rates differ between SNPs, but the same error pattern is used across all SNPs, even when inheritance patterns vary. When two or more different error patterns are required, SimGeno should be run on the different SNP subsets separately, and results combined.
Variation in call rates is assumed to follow a highly skewed (beta)
  distribution, with many samples having call rates close to 1, and a
  narrowing tail of lower call rates. The first shape parameter defaults to 1
  (but see MkGenoErrors), and the second shape parameter is
  defined via the mean as CallRate. For 99.9% of SNPs to have a call
  rate of 0.8 (0.9; 0.95) or higher, use a mean call rate of 0.969 (0.985;
  0.993).
Variation in call rate between samples can be specified by providing a
  named vector to CallRate, which supersedes PropLQ in versions up to
  1.1.1. Otherwise, variation in call rate and error rate between samples
  occurs only as side-effect of the random nature of which individuals are
  hit by per-SNP errors and drop-outs. Finer control is possible by first
  generating an error-free genotype matrix, and then calling
  MkGenoErrors directly on subsets of the matrix.
The wrapper EstConf for repeated simulation and
  pedigree reconstruction; MkGenoErrors for fine control over
  the distribution of genotyping errors in simulated data.
# NOT RUN {
data(Ped_HSg5)
GenoM <- SimGeno(Pedigree = Ped_HSg5, nSnp = 100, ParMis = c(0.2, 0.7))
# }
# NOT RUN {
# Alternative genotyping error model
EFM <- function(E) {   # Whalen, Gorjanc & Hickey 2018
 matrix(c(1-E*3/4, E/4, E/4,
          E/4, 1/2-E/4, 1/2-E/4, E/4,
          E/4, E/4, 1-E*3/4),
          3,3, byrow=TRUE)  }
EFM(0.01)
GenoM <- SimGeno(Pedigree = Ped_HSg5, nSnp = 100, ParMis = 0.2,
 SnpError = 5e-3, ErrorFM = EFM)
# combination of high & low quality SNPs
Geno.HQ <- SimGeno(Ped_HSg5, nSnp=50, MAF=0.3, CallRate=runif(50, 0.7, 1))
Geno.LQ <- SimGeno(Ped_HSg5, nSnp=20, MAF=0.1, CallRate=runif(20, 0.1, 5))
Geno.HQLQ <- merge(Geno.HQ, Geno.LQ, by="row.names")
# }
# NOT RUN {
# }
Run the code above in your browser using DataLab