Perform pedigree reconstruction based on SNP data, including parentage assignment and sibship clustering.
sequoia(GenoM = NULL, LifeHistData = NULL, SeqList = NULL,
  MaxSibIter = 10, Err = 1e-04, MaxMismatch = 3, Tfilter = -2,
  Tassign = 0.5, MaxSibshipSize = 100, DummyPrefix = c("F", "M"),
  Complex = "full", UseAge = "yes", args.AP = list(Flatten = TRUE,
  Smooth = TRUE), FindMaybeRel = FALSE, CalcLLR = TRUE,
  quiet = FALSE)
GenoM: numeric matrix with genotype data: one row per individual and one column per SNP, coded as 0, 1, 2 or -9 (missing). Use GenoConvert to convert genotype files created in PLINK using --recodeA, or in Colony's 2-column format, to this format.
LifeHistData: dataframe with 3 columns:
ID: max. 30 characters long,
Sex: 1 = female, 2 = male, 4 = hermaphrodite, any other value = unknown,
BirthYear: (birth or hatching year) integer; negative numbers are interpreted as missing values.
If the species has multiple generations per year, use an integer coding such that the candidate parents' 'birth year' is at least one smaller than that of their putative offspring. Column names are ignored, so ensure the column order is ID - sex - birth year.
SeqList: list with output from a previous run, containing the elements 'Specs', 'AgePriors' and/or 'PedigreePar', as described below, to be used in the current run. If SeqList$Specs is provided, all other input parameter values except MaxSibIter are ignored.
MaxSibIter: number of iterations of sibship clustering, including assignment of grandparents to sibships and avuncular relationships between sibships. Set to 0 to skip this step for now; it is by far the most time-consuming step and may take several hours for large datasets. Clustering continues until convergence or until MaxSibIter is reached.
Err: estimated genotyping error rate. The error model aims to deal with scoring errors typical for SNP arrays.
MaxMismatch: maximum number of loci at which candidate parent and offspring are allowed to be opposite homozygotes. Setting a more liberal threshold can improve performance if the error rate is high, at the cost of decreased speed.
Tfilter: threshold log10-likelihood ratio (LLR) between a proposed relationship and unrelated, used to select candidate relatives. Typically a negative value, because unconditional likelihoods are calculated during the filtering steps. More negative values may decrease non-assignment, but will increase computation time.
Tassign: minimum LLR required for acceptance of a proposed relationship, relative to the next most likely relationship. Higher values result in more conservative assignments. Must be zero or positive.
MaxSibshipSize: maximum number of offspring for a single individual (a generous safety margin is advised).
DummyPrefix: character vector of length 2 with prefixes for dummy dams (mothers) and sires (fathers); maximum 20 characters each.
Complex: either "full" (default), "simp" (simplified, no explicit consideration of inbred relationships), "mono" (monogamous) or "herm" (hermaphrodites, otherwise like "full").
UseAge: either "yes" (default), "no", or "extra" (additional rounds with extra reliance on the age prior; may boost assignments but increases the risk of erroneous assignments); used during full reconstruction only.
args.AP: list with arguments to be passed on to MakeAgePrior.
FindMaybeRel: identify pairs of non-assigned likely relatives after pedigree reconstruction. Can be time-consuming in large datasets. NOTE: from v1.2 the default changed from TRUE to FALSE; GetMaybeRel can now be called separately.
CalcLLR: calculate log-likelihood ratios for all assigned parents (is parent vs. is otherwise related). Time-consuming in large datasets.
quiet: suppress messages: TRUE/FALSE/"verbose".
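As an illustration of the input formats described above, the following is a minimal sketch in base R with made-up toy data. The objects toy_geno and toy_lh and the helper check_inputs are hypothetical and not part of sequoia. The sketch assumes that the row names of the genotype matrix hold the individual IDs, as suggested by the read.table(..., row.names=1) call in Example 2.

```r
# Toy genotype matrix: 4 individuals x 5 SNPs, coded 0/1/2, with -9 = missing.
toy_geno <- matrix(c(0, 1, 2, -9,  1,
                     2, 2, 0,  1,  1,
                     1, 0, 1,  2, -9,
                     0, 0, 2,  1,  2),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(c("A01", "A02", "A03", "A04"), NULL))

# Toy life history data; column order is ID - Sex - BirthYear.
# Sex 3 is read as unknown, a negative birth year as missing.
toy_lh <- data.frame(ID = c("A01", "A02", "A03", "A04"),
                     Sex = c(1, 2, 3, 1),
                     BirthYear = c(2001, 2001, -1, 2003),
                     stringsAsFactors = FALSE)

# Hypothetical helper (not part of sequoia): checks the documented constraints
# and returns the genotyped IDs that have no life history record.
check_inputs <- function(geno, lh) {
  stopifnot(is.matrix(geno), is.numeric(geno),
            all(geno %in% c(0, 1, 2, -9)),
            !is.null(rownames(geno)),
            !anyDuplicated(rownames(geno)),
            ncol(lh) == 3,
            all(nchar(as.character(lh[[1]])) <= 30))
  setdiff(rownames(geno), as.character(lh[[1]]))
}

missing_lh <- check_inputs(toy_geno, toy_lh)

# One possible integer coding for two generations per year (an assumption,
# any coding with parents at least one unit older would do):
year   <- c(2001, 2001, 2002)
season <- c(1, 2, 1)            # 1 = first, 2 = second generation of the year
birth_code <- 2 * (year - min(year)) + season
```

Here missing_lh is empty because every genotyped ID has a life history row, and birth_code gives each generation its own consecutive integer.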
A list with some or all of the following components:
Matrix with age-difference based prior probability ratios, used for full pedigree reconstruction. See MakeAgePrior for details.
Dataframe with pedigree for dummy individuals, as well as their sex, estimated birth year (point estimate, upper and lower bound of 95% confidence interval), number of offspring, and offspring IDs (genotyped offspring only).
Dataframe, duplicated genotypes (with different IDs, duplicate IDs are not allowed). The specified number of maximum mismatches is used here too. Note that this dataframe may include pairs of closely related individuals, and monozygotic twins.
Dataframe, row numbers of duplicated IDs in the life history dataframe. For convenience only, but may signal a problem. The first entry is used.
Individuals in GenoM which were excluded because of a too low genotyping success rate (<50%).
Column numbers of SNPs in GenoM which were excluded because of a too low genotyping success rate (<10%).
Provided dataframe with sex and birth year data.
Dataframe with pairs of individuals who are more likely parent-offspring than unrelated, but which could not be phased due to unknown age difference or sex, or for whom LLR did not pass Tassign.
Dataframe with pairs of individuals who are more likely to be first or second degree relatives than unrelated, but which could not be assigned.
Dataframe with non-assigned parent-parent-offspring trios (both parents are of unknown sex), with columns similar to those of the pedigree.
Vector, IDs in genotype data for which no life history data is provided.
Dataframe with assigned genotyped and dummy parents from Sibship step; entries for dummy individuals are added at the bottom.
Dataframe with assigned parents from Parentage step.
Named vector with parameter values.
Numeric vector, total likelihood of the genotype data at initiation and after each iteration during Parentage.
Numeric vector, total likelihood of the genotype data at initiation and after each iteration during Sibship clustering.
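The genotyping success rates behind the excluded individuals and SNPs listed above can be inspected beforehand. A minimal sketch in base R on made-up data, using the thresholds stated above (<50% for individuals, <10% for SNPs); it assumes both rates are computed on the unfiltered matrix, which may differ from the order in which sequoia applies its filters. The object names are hypothetical.

```r
# Toy genotype matrix: 3 individuals x 4 SNPs, -9 = missing.
geno <- matrix(c( 0,  1, -9, 2,
                 -9, -9, -9, 1,
                  2,  1, -9, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("ind1", "ind2", "ind3"), NULL))

# Proportion of successfully scored genotypes per individual and per SNP.
call_rate_ind <- rowMeans(geno != -9)
call_rate_snp <- colMeans(geno != -9)

# Individuals scored at fewer than 50% of SNPs,
# and column numbers of SNPs scored in fewer than 10% of individuals.
low_ind <- rownames(geno)[call_rate_ind < 0.5]
low_snp <- which(call_rate_snp < 0.1)
```

In this toy matrix, ind2 is scored at only one of four SNPs and the third SNP is missing for everyone, so these are the entries flagged.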
List elements PedigreePar and Pedigree both have the following columns:
Individual ID
Assigned mother, or NA
Assigned father, or NA
Log10-Likelihood Ratio (LLR) of this female being the mother, versus the next most likely relationship between the focal individual and this female (see Details for relationships considered)
idem, for male parent
LLR for the parental pair, versus the next most likely configuration between the three individuals (with one or neither parent assigned)
Number of loci at which the offspring and mother are opposite homozygotes
idem, for father
Number of Mendelian errors between the offspring and the parent pair; includes OH as well as e.g. parents being opposing homozygotes, but the offspring not being a heterozygote. The offspring being OH with both parents is counted as 2 errors.
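The opposite-homozygote and Mendelian-error counts described above can be illustrated on 0/1/2/-9 coded genotype vectors. This is a sketch of one reading of those descriptions, not sequoia's internal code: the functions count_oh, transmittable and count_me are hypothetical, and skipping every locus with a missing genotype in the trio is an assumption.

```r
# Number of loci at which two individuals are opposite homozygotes (0 vs 2),
# ignoring loci where either genotype is missing (-9).
count_oh <- function(g1, g2) {
  ok <- g1 != -9 & g2 != -9
  sum(abs(g1[ok] - g2[ok]) == 2)
}

# Allele copies (0 or 1) that a parent with genotype g can transmit.
transmittable <- function(g) {
  switch(as.character(g), "0" = 0, "1" = c(0, 1), "2" = 1)
}

# Mendelian errors for an offspring-dam-sire trio: 2 if the offspring is OH
# with both parents, 1 if its genotype is otherwise impossible given the pair.
count_me <- function(off, dam, sire) {
  errors <- 0
  for (i in seq_along(off)) {
    if (any(c(off[i], dam[i], sire[i]) == -9)) next   # assumption: skip locus
    oh_dam  <- abs(off[i] - dam[i]) == 2
    oh_sire <- abs(off[i] - sire[i]) == 2
    if (oh_dam && oh_sire) {
      errors <- errors + 2
    } else {
      possible <- outer(transmittable(dam[i]), transmittable(sire[i]), "+")
      if (!(off[i] %in% possible)) errors <- errors + 1
    }
  }
  errors
}

# Made-up genotypes at 6 loci.
off  <- c(0, 1, 2, 1, -9, 2)
dam  <- c(2, 1, 2, 0,  1, 0)
sire <- c(0, 2, 0, 0,  2, 0)

oh_dam_n  <- count_oh(off, dam)
oh_sire_n <- count_oh(off, sire)
me_pair_n <- count_me(off, dam, sire)
```

The fourth locus shows why the pair count can exceed the two OH counts: both parents are homozygous 0 and the offspring is heterozygous, which is a Mendelian error without being OH with either parent.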
While every effort has been made to ensure that sequoia provides what it claims to do, there is absolutely no guarantee that the results provided are correct. Use of sequoia is entirely at your own risk.
For each pair of candidate relatives, the likelihoods are calculated of their being parent-offspring (PO), full siblings (FS), half siblings (HS), grandparent-grandoffspring (GG), full avuncular (niece/nephew - aunt/uncle; FA), half avuncular/great-grandparental/cousins (HA), or unrelated (U). Assignments are made if the log10-likelihood ratio (LLR) between the focal relationship and the most likely alternative exceeds the threshold Tassign.
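The decision rule above can be sketched in a few lines of base R. The log10-likelihood values below are made up for illustration, and the helper llr_vs_next is hypothetical; the thresholds are the defaults from the usage. The filter comparison is simplified, since the package uses unconditional likelihoods in the filtering steps.

```r
# Made-up log10-likelihoods for one pair, one value per candidate relationship.
ll <- c(PO = -120.3, FS = -123.1, HS = -125.9, GG = -126.4,
        FA = -127.0, HA = -129.2, U = -134.8)

# LLR of the focal relationship versus the most likely alternative.
llr_vs_next <- function(ll, focal) {
  unname(ll[focal] - max(ll[names(ll) != focal]))
}

Tfilter <- -2    # default filter threshold (focal versus unrelated)
Tassign <- 0.5   # default assignment threshold (focal versus next most likely)

# Filtering: is PO sufficiently more likely than unrelated?
passes_filter <- unname(ll["PO"] - ll["U"]) > Tfilter

# Assignment: does the LLR versus the best alternative exceed Tassign?
llr_po <- llr_vs_next(ll, "PO")
assign_po <- passes_filter && llr_po > Tassign

# The same rule applied to FS gives a negative LLR, so no assignment.
llr_fs <- llr_vs_next(ll, "FS")
assign_fs <- llr_fs > Tassign
```

With these numbers PO is the most likely relationship and its LLR versus FS is well above 0.5, so the pair would be assigned as parent-offspring; raising Tassign makes the same rule more conservative.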
Further explanation of the various options and interpretation of the output is provided in the vignette.
Huisman, J. (2017) Pedigree reconstruction from SNP data: Parentage assignment, sibship clustering, and beyond. Molecular Ecology Resources 17:1009--1024.
GenoConvert, SnpStats, GetMaybeRel, EstConf, SummarySeq, writeSeq, vignette("sequoia")
# NOT RUN {
# == EXAMPLE 1 ==
data(SimGeno_example, LH_HSg5, package="sequoia")
head(SimGeno_example[,1:10])
head(LH_HSg5)
SeqOUT <- sequoia(GenoM = SimGeno_example,
                  LifeHistData = LH_HSg5, MaxSibIter = 0)
names(SeqOUT)
SeqOUT$PedigreePar[34:42, ]
# }
# NOT RUN {
SeqOUT2 <- sequoia(GenoM = SimGeno_example,
                   LifeHistData = LH_HSg5, MaxSibIter = 10)
SeqOUT2$Pedigree[34:42, ]
# == EXAMPLE 2 ==
# ideally, select 400-700 SNPs: high MAF & low LD
# save in 0/1/2/NA format (PLINK's --recodeA)
GenoM <- GenoConvert(InFile = "inputfile_for_sequoia.raw")
SNPSTATS <- SnpStats(GenoM)
# perhaps after some data-cleaning:
write.table(GenoM, file="MyGenoData.txt", row.names=TRUE, col.names=FALSE)
# later:
GenoM <- as.matrix(read.table("MyGenoData.txt", row.names=1, header=FALSE))
LHdata <- read.table("LifeHistoryData.txt", header=TRUE) # ID-Sex-birthyear
SeqOUT <- sequoia(GenoM, LHdata, Err=0.005, MaxMismatch=10)
SummarySeq(SeqOUT)
writeSeq(SeqOUT, OutFormat = "xls") # requires the xlsx package
# }