sequoia: Pedigree Reconstruction

Description

Perform pedigree reconstruction based on SNP data, including parentage assignment and sibship clustering.

Usage

sequoia(
  GenoM = NULL,
  LifeHistData = NULL,
  SeqList = NULL,
  MaxSibIter = 10,
  Err = 1e-04,
  ErrFlavour = "version2.0",
  MaxMismatch = NA,
  Tfilter = -2,
  Tassign = 0.5,
  MaxSibshipSize = 100,
  DummyPrefix = c("F", "M"),
  Complex = "full",
  UseAge = "yes",
  args.AP = list(Flatten = NULL, Smooth = TRUE),
  FindMaybeRel = FALSE,
  CalcLLR = TRUE,
  quiet = FALSE,
  Plot = NULL
)

Arguments

GenoM

numeric matrix with genotype data: One row per individual, and one column per SNP, coded as 0, 1, 2 or -9 (missing). Use GenoConvert to convert genotype files created in PLINK using --recodeA or in Colony's 2-column format to this format.

LifeHistData

Dataframe with 3 columns (optionally 5):

ID: max. 30 characters long,
Sex: 1 = females, 2 = males, other = unknown, except 4 = hermaphrodite,
BirthYear: birth or hatching year, integer, with missing values as NA or any negative value.
BY.min: minimum birth year, only used if BirthYear is missing
BY.max: maximum birth year, only used if BirthYear is missing

If the species has multiple generations per year, use an integer coding such that the candidate parents' `Birth year' is at least one smaller than their putative offspring's. Column names are ignored, so ensure column order is ID - sex - birth year (- BY.min - BY.max).

SeqList

list with output from a previous run, containing the elements `Specs', `AgePriors' and/or `PedigreePar', as described below, to be used in the current run. If SeqList$Specs is provided, all other input parameter values except MaxSibIter are ignored.

MaxSibIter

number of iterations of sibship clustering, including assignment of grandparents to sibships and avuncular relationships between sibships. Set to 0 to not (yet) perform this step, which is by far the most time consuming and may take several hours for large datasets. Clustering continues until convergence or until MaxSibIter is reached.

Err

estimated genotyping error rate, as a single number or 3x3 matrix. If a matrix, this should be the probability of observed genotype (columns) conditional on actual genotype (rows). Each row must therefore sum to 1.

ErrFlavour

function that takes Err as input, and returns a 3x3 matrix of observed (columns) conditional on actual (rows) genotypes, or choose from inbuilt ones as used in sequoia 'version2.0', 'version1.3', or 'version1.1'. Ignored if Err is a matrix. See ErrToM.

MaxMismatch

DEPRECATED AND IGNORED. Now calculated using CalcMaxMismatch.

Tfilter

threshold log10-likelihood ratio (LLR) between a proposed relationship versus unrelated, to select candidate relatives. Typically a negative value, related to the fact that unconditional likelihoods are calculated during the filtering steps. More negative values may decrease non-assignment, but will increase computational time.

Tassign

minimum LLR required for acceptance of proposed relationship, relative to next most likely relationship. Higher values result in more conservative assignments. Must be zero or positive.

MaxSibshipSize

maximum number of offspring for a single individual (a generous safety margin is advised).

DummyPrefix

character vector of length 2 with prefixes for dummy dams (mothers) and sires (fathers); maximum 20 characters each.

Complex

either "full" (default), "simp" (simplified, no explicit consideration of inbred relationships), "mono" (monogamous) or "herm" (hermaphrodites, otherwise like "full").

UseAge

either "yes" (default), "no", or "extra" (additional rounds with extra reliance on ageprior, may boost assignments but increased risk of erroneous assignments); used during full reconstruction only.

args.AP

list with arguments to be passed on to MakeAgePrior.

FindMaybeRel

DEPRECATED, advised to run GetMaybeRel separately. TRUE/FALSE to identify pairs of non-assigned likely relatives after pedigree reconstruction. Can be time-consuming in large datasets.

CalcLLR

calculate log-likelihood ratios for all assigned parents (genotyped + dummy; parent vs. otherwise related). Time-consuming in large datasets. Can be done separately with CalcOHLLR.

quiet

suppress messages: TRUE/FALSE/"verbose".

Plot

display plots from SnpStats, MakeAgePrior, and SummarySeq. Defaults (NULL) to TRUE when quiet=FALSE or "verbose", and FALSE when quiet=TRUE. If you get errors an error 'figure margins too large', enlarge the plotting area (drag with mouse). 'invalid graphics state' error can be dealt with by clearing the plotting area with dev.off().

Value

A list with some or all of the following components:

AgePriors

Matrix with age-difference based probability ratios for each relationship, used for full pedigree reconstruction; see MakeAgePrior for details. When running only parentage assignment (MaxSibIter=0) the returned AgePriors has been updated to incorporate the information of the assigned parents, and is ready for use during full pedigree reconstruction.

DummyIDs

Dataframe with pedigree for dummy individuals, as well as their sex, estimated birth year (point estimate, upper and lower bound of 95% confidence interval), number of offspring, and offspring IDs (genotyped offspring only).

DupGenotype

Dataframe, duplicated genotypes (with different IDs, duplicate IDs are not allowed). The specified number of maximum mismatches is used here too. Note that this dataframe may include pairs of closely related individuals, and monozygotic twins.

DupLifeHistID

Dataframe, row numbers of duplicated IDs in life history dataframe. For convenience only, but may signal a problem. The first entry is used.

ErrM

Error matrix; probability of observed genotype (columns) conditional on actual genotype (rows)

ExcludedInd

Individuals in GenoM which were excluded because of a too low genotyping success rate (<50%).

ExcludedSNPs

Column numbers of SNPs in GenoM which were excluded because of a too low genotyping success rate (<10%).

LifeHist

Provided dataframe with sex and birth year data.

LifeHistPar

LifeHist with additional columns 'Sexx' (inferred Sex when assigned as part of parent-pair), 'BY.est' (mode of birth year probability distribution), 'BY.lo' (lower limit of 95% highest density region), 'BY.hi' (higher limit), inferred after parentage assignment. 'BY.est' is NA when the probability distribution is flat between 'BY.lo' and 'BY.hi'.

LifeHistSib

as LifeHistPar, but estimated after full pedigree reconstruction

MaybeParent

Dataframe with pairs of individuals who are more likely parent-offspring than unrelated, but which could not be phased due to unknown age difference or sex, or for whom LLR did not pass Tassign.

MaybeRel

Dataframe with pairs of individuals who are more likely to be first or second degree relatives than unrelated, but which could not be assigned.

MaybeTrio

Dataframe with non-assigned parent-parent-offspring trios (both parents are of unknown sex), with similar columns as the pedigree

NoLH

Vector, IDs in genotype data for which no life history data is provided.

Pedigree

Dataframe with assigned genotyped and dummy parents from Sibship step; entries for dummy individuals are added at the bottom.

PedigreePar

Dataframe with assigned parents from Parentage step.

Specs

Named vector with parameter values.

TotLikParents

Numeric vector, Total likelihood of the genotype data at initiation and after each iteration during Parentage.

TotLikSib

Numeric vector, Total likelihood of the genotype data at initiation and after each iteration during Sibship clustering.

AgePriorExtra

As AgePriors, but including columns for grandparents and avuncular pairs. NOT updated after parentage assignment, but returned as used during the run.

List elements PedigreePar and Pedigree both have the following columns:

Individual ID

dam

Assigned mother, or NA

sire

Assigned father, or NA

LLRdam

Log10-Likelihood Ratio (LLR) of this female being the mother, versus the next most likely relationship between the focal individual and this female (see Details for relationships considered)

LLRsire

idem, for male parent

LLRpair

LLR for the parental pair, versus the next most likely configuration between the three individuals (with one or neither parent assigned)

OHdam

Number of loci at which the offspring and mother are opposite homozygotes

OHsire

idem, for father

MEpair

Number of Mendelian errors between the offspring and the parent pair, includes OH as well as e.g. parents being opposing homozygotes, but the offspring not being a heterozygote. The offspring being OH with both parents is counted as 2 errors.

Disclaimer

While every effort has been made to ensure that sequoia provides what it claims to do, there is absolutely no guarantee that the results provided are correct. Use of sequoia is entirely at your own risk.

Details

Dummy parents of sibships are denoted by F0001, F0002, ... (mothers) and M0001, M0002, ... (fathers), are appended to the bottom of the pedigree, and may have been assigned real or dummy parents themselves (i.e. sibship-grandparents). A dummy parent is not assigned to singletons.

For each pair of candidate relatives, the likelihoods are calculated of them being parent-offspring (PO), full siblings (FS), half siblings (HS), grandparent-grandoffspring (GG), full avuncular (niece/nephew - aunt/uncle; FA), half avuncular/great-grandparental/cousins (HA), or unrelated (U). Assignments are made if the likelihood ratio (LLR) between the focal relationship and the most likely alternative exceed the threshold Tassign.

Further explanation of the various options and interpretation of the output is provided in the vignette.

References

Huisman, J. (2017) Pedigree reconstruction from SNP data: Parentage assignment, sibship clustering, and beyond. Molecular Ecology Resources 17:1009--1024.

Examples

Run this code

# NOT RUN {
# ===  EXAMPLE 1: simulate data  ===
data(SimGeno_example, LH_HSg5, package="sequoia")
head(SimGeno_example[,1:10])
head(LH_HSg5)
SeqOUT <- sequoia(GenoM = SimGeno_example, Err = 0.005,
                  LifeHistData = LH_HSg5, MaxSibIter = 0)
names(SeqOUT)
SeqOUT$PedigreePar[34:42, ]

# compare to true (or old) pedigree:
PC <- PedCompare(Ped_HSg5, SeqOUT$PedigreePar)
PC$Counts["GG",,]

# }
# NOT RUN {
SeqOUT2 <- sequoia(GenoM = SimGeno_example, Err = 0.005,
                  LifeHistData = LH_HSg5, MaxSibIter = 10)
SeqOUT2$Pedigree[34:42, ]

PC2 <- PedCompare(Ped_HSg5, SeqOUT2$Pedigree)
PC2$Counts["GT",,]

# important to run with (approx.) correct genotyping error rate:
SeqOUT2.b <- sequoia(GenoM = SimGeno_example, #  Err = 1e-4 by default
                  LifeHistData = LH_HSg5, MaxSibIter = 10)
PC2.b <- PedCompare(Ped_HSg5, SeqOUT2.b$Pedigree)
PC2.b$Counts["GT",,]


# ===  EXAMPLE 2: real data  ===
# ideally, select 400-700 SNPs: high MAF & low LD
# save in 0/1/2/NA format (PLINK's --recodeA)
GenoM <- GenoConvert(InFile = "inputfile_for_sequoia.raw",
                     InFormat = "raw")  # can also do Colony format
SNPSTATS <- SnpStats(GenoM)
# perhaps after some data-cleaning:
write.table(GenoM, file="MyGenoData.txt", row.names=T, col.names=F)

# later:
GenoM <- as.matrix(read.table("MyGenoData.txt", row.names=1, header=F))
LHdata <- read.table("LifeHistoryData.txt", header=T) # ID-Sex-birthyear
SeqOUT <- sequoia(GenoM, LHdata, Err=0.005)
SummarySeq(SeqOUT)

writeSeq(SeqOUT, folder="sequoia_output")  # several text files

# runtime:
SeqOUT$Specs$TimeEnd - SeqOUT$Specs$TimeStart
# }
# NOT RUN {
# }

Run the code above in your browser using DataLab