Simulate SNP genotype data from a pedigree, with optional missingess and errors.
SimGeno(
Pedigree,
nSnp = 400,
ParMis = 0.4,
MAF = 0.3,
CallRate = 0.99,
SnpError = 5e-04,
ErrorFM = "version2.0",
ReturnStats = FALSE,
OutFile = NA,
Inherit = "autosomal",
InheritFile = NA,
quiet = FALSE
)
dataframe, pedigree with the first three columns being id - dam - sire. Column names are ignored, as are additional columns, with the exception of a 'Sex' column when Inherit is not 'autosomal'.
number of SNPs to simulate.
single number or vector length two with proportion of parents with fully missing genotype. Ignored if CallRate is a named vector.
minimum minor allele frequency, and allele frequencies will be sampled uniformly between this minimum and 0.5, OR a vector with minor allele frequency at each locus. In both cases, this is the MAF among pedigree founders, the MAF in the sample will deviate due to drift.
either a single number for the mean call rate (genotyping success), OR a vector with the call rate at each SNP, OR a named vector with the call rate for each individual. In the third case, ParMis is ignored, and individuals in the pedigree (as id or parent) not included in this vector are presumed non-genotyped.
mean per-locus genotyping error rate across SNPs, and a beta-distribution will be used to simulate the number of missing cases per SNP, OR a vector with the genotyping error for each SNP.
function taking the error rate (scalar) as argument and returning a 3x3 matrix with probabilities that actual genotype i (rows) is observed as genotype j (columns). Inbuilt ones are as used in sequoia 'version2.0', 'version1.3', or 'version1.1'. See details.
in addition to the genotype matrix, return the input parameters and mean & quantiles of MAF, error rate and call rates.
file name for simulated genotypes. If NA (default), return results within R.
inheritance pattern, scalar or vector of length nSnp,
Defaults to 'autosomal'. An excel file included in the package has
inheritance patterns for the X and Y chromosome and mtDNA, and allows
custom inheritance patterns. Note that these are experimental, and NOT
currently supported by the pedigree reconstruction with
sequoia
!
file name of file with inheritance patterns, with extension csv, txt, xls or xlsx (the latter two require library xlsx).
suppress messages.
If ReturnStats=FALSE
(the default), a matrix with genotype
data in sequoia's input format, encoded as 0/1/2/-9.
If ReturnStats=TRUE
, a named list with three elements: list
'ParamsIN', matrix 'SGeno', and list 'StatsOUT':
Frequency in 'observed' genotypes of '1' allele
Allele frequency in 'actual' (without genotyping errors & missingness)
Error rate per SNP (actual /= observed AND observed /= missing)
Non-missing per SNP
Error rate per individual
Non-missing per individual
This simulation is highly simplistic and assumes that all SNPs segregate completely independently, that the SNPs are in Hardy-Weinberg equilibrium in the pedigree founders. It assumes that genotyping errors are not due to heritable mutations of the SNPs, and that missingness is random and not e.g. due to heritable mutations of SNP flanking regions. Results based on this simulated data will provide an minimum estimate of the number of SNPs required, and an optimistic estimate of pedigree reconstruction performance.
Please ensure the pedigree is a valid pedigree, for example by first
running PedPolish
. For founders, i.e. individuals with no
known parents, genotypes are drawn according to the provided MAF and
assuming Hardy-Weinberg equilibrium. Offspring genotypes are generated
following Mendelian inheritance, assuming all loci are completely
independent. Individuals with one known parent are allowed: at each locus,
one allele is inherited from the known parent, and the other drawn from the
genepool according to the provided MAF.
Genotyping errors are generated following a user-definable 3x3 matrix with
probabilities that actual genotype \(i\) (rows) is observed as genotype
\(j\) (columns). This is specified as ErrorFM
, which is a function
of SnpError
. By default (ErrorFM
= "version2.0"),
SnpError
is interpreted as a locus-level error rate (rather than
allele-level), and equals the probability that a homozygote is observed as
heterozygote, and the probability that a heterozygote is observed as either
homozygote (i.e., the probability that it is observed as AA = probability
that observed as aa = SnpError
/2). The probability that one
homozygote is observed as the other is (SnpError
/2\()^2\).
Note that this differs from versions up to 1.1.1, where a proportion of
SnpError
*3/2 of genotypes were replaced with random genotypes. This
corresponds to ErrorFM
= "Version111".
Error rates differ between SNPs, but the same error pattern is used across all SNPs, even when inheritance patterns vary. When two or more different error patterns are required, SimGeno should be run on the different SNP subsets separately, and results combined.
Variation in call rates is assumed to follow a highly skewed (beta)
distribution, with many samples having call rates close to 1, and a
narrowing tail of lower call rates. The first shape parameter defaults to 1
(but see MkGenoErrors
), and the second shape parameter is
defined via the mean as CallRate
. For 99.9% of SNPs to have a call
rate of 0.8 (0.9; 0.95) or higher, use a mean call rate of 0.969 (0.985;
0.993).
Variation in call rate between samples can be specified by providing a
named vector to CallRate
, which supersedes PropLQ in versions up to
1.1.1. Otherwise, variation in call rate and error rate between samples
occurs only as side-effect of the random nature of which individuals are
hit by per-SNP errors and drop-outs. Finer control is possible by first
generating an error-free genotype matrix, and then calling
MkGenoErrors
directly on subsets of the matrix.
The wrapper EstConf
for repeated simulation and
pedigree reconstruction; MkGenoErrors
for fine control over
the distribution of genotyping errors in simulated data.
# NOT RUN {
data(Ped_HSg5)
GenoM <- SimGeno(Pedigree = Ped_HSg5, nSnp = 100, ParMis = c(0.2, 0.7))
# }
# NOT RUN {
# Alternative genotyping error model
EFM <- function(E) { # Whalen, Gorjanc & Hickey 2018
matrix(c(1-E*3/4, E/4, E/4,
E/4, 1/2-E/4, 1/2-E/4, E/4,
E/4, E/4, 1-E*3/4),
3,3, byrow=TRUE) }
EFM(0.01)
GenoM <- SimGeno(Pedigree = Ped_HSg5, nSnp = 100, ParMis = 0.2,
SnpError = 5e-3, ErrorFM = EFM)
# combination of high & low quality SNPs
Geno.HQ <- SimGeno(Ped_HSg5, nSnp=50, MAF=0.3, CallRate=runif(50, 0.7, 1))
Geno.LQ <- SimGeno(Ped_HSg5, nSnp=20, MAF=0.1, CallRate=runif(20, 0.1, 5))
Geno.HQLQ <- merge(Geno.HQ, Geno.LQ, by="row.names")
# }
# NOT RUN {
# }
Run the code above in your browser using DataLab