EstConf: Estimate confidence probability

Description

Estimate confidence and assignment error rate by repeatedly simulating genotype data from a reference pedigree using SimGeno, reconstructing a pedigree from this using sequoia, and counting the number of mismatches using PedCompare.

Usage

EstConf(
  Pedigree = NULL,
  LifeHistData = NULL,
  args.sim = list(nSnp = 400, SnpError = 0.001, ParMis = c(0.4, 0.4)),
  args.seq = list(MaxSibIter = 10, Err = 0.001, Tassign = 0.5),
  nSim = 10,
  quiet = TRUE
)

Arguments

Pedigree

Reference pedigree from which to simulate, dataframe with columns id-dam-sire. Additional columns are ignored

LifeHistData

Dataframe with id, sex (1=female, 2=male, 3=unknown), and birth year.

args.sim

list of arguments to pass to SimGeno, such as nSnp (number of SNPs), SnpError (genotyping error rate) and ParMis (proportion of non-genotyped parents). Set to NULL to use all default values.

args.seq

list of arguments to pass to sequoia, such as MaxSibIter (max no. sibship clustering iterations, '0' for parentage assignment only) and Err (assumed genotyping error rate). May include (part of) SeqList, the list of sequoia output (i.e. as a list-within-a-list). Set to NULL to use all default values.

nSim

number of rounds of simulations to perform.

quiet

suppress messages. 'very' also suppresses the simulation counter; TRUE just runs SimGeno and sequoia quietly.

Value

a list, with the main results in dataframe ConfProb and array PedErrors. ConfProb has 7 columns:

id.cat, dam.cat, sire.cat

Category of the focal individual, dam, and sire, in the pedigree inferred based on the simulated data. Coded as G=genotyped, D=dummy, X=none

dam.conf

Probability that the dam is correct, given the categories of the assigned dam and sire (ignoring whether or not the sire is correct). Rounded to nchar(N) significant digits

sire.conf

as dam.conf, for the sire

pair.conf

Probability that both dam and sire are correct, given their categories

N

Number of individuals per category-combination, across all nSim simulations

array PedErrors has three dimensions:

class

FalseNeg(atives): could have been assigned but was not (individual + parent both genotyped or dummyfiable; P1only in PedCompare).
FalsePos(itives): no parent in reference pedigree, but one was assigned based on the simulated data (P2only)
Mismatch: different parents between the pedigrees

cat

Category of individual + parent, as a two-letter code where the first letter indicates the focal individual and the second the parent; G=Genotyped, D=Dummy, T=Total

parent

dam or sire
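
The three dimensions can be addressed by name. The sketch below builds a toy array with the same layout to show the indexing; the values and the exact set of category codes used here are assumptions for illustration, not output of EstConf.

```r
# Toy array (hypothetical values) laid out as class x cat x parent
PedErrors <- array(0, dim = c(3, 5, 2),
                   dimnames = list(class  = c("FalseNeg", "FalsePos", "Mismatch"),
                                   cat    = c("GG", "GD", "DG", "DD", "TT"),  # assumed codes
                                   parent = c("dam", "sire")))
PedErrors["FalseNeg", "GG", "dam"]  <- 0.02
PedErrors["Mismatch", "TT", "sire"] <- 0.01

# A single rate: false negatives for genotyped individuals with genotyped dams
fn.GG.dam <- PedErrors["FalseNeg", "GG", "dam"]

# All error classes for the 'Total' category: a class x parent matrix
err.total <- PedErrors[, "TT", ]
```

With real output, the same subsetting would be applied to the PedErrors element of the returned list.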

The other list elements are:

Pedigree.reference

the pedigree from which data was simulated

Pedigree.inferred

a list with for each iteration the inferred pedigree based on the simulated data

SimSNPd

a list with for each iteration the IDs of the individuals simulated to have been genotyped

RunParams

a list with the current call to EstConf, as well as the default parameter values for EstConf, SimGeno, and sequoia.

RunTime

sequoia runtime per simulation in seconds, as measured by system.time()['elapsed'].

Assumptions

Because the actual true pedigree is (typically) unknown, the provided reference pedigree is used as a stand-in and assumed to be the true pedigree, with unrelated founders. It is also assumed that the probability of being genotyped is equal for all parents; in each iteration, a new random set of parents (proportion set by ParMis) is treated as non-genotyped. In addition, SNPs are assumed to segregate independently.
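
The redrawing of non-genotyped parents in each iteration can be pictured with the base R sketch below. The parent IDs are hypothetical, and dropping exactly round(ParMis * n) parents is an assumption made for illustration; SimGeno may select them differently.

```r
set.seed(42)                       # only to make the toy example repeatable
dams <- paste0("dam", 1:10)        # hypothetical parent IDs
ParMis.dam <- 0.4                  # proportion of dams without a genotype

# Assumption: a fixed number of parents is dropped, with equal probability each
n.mis <- round(ParMis.dam * length(dams))

# A new random set of non-genotyped dams for each of three iterations
NonGenotyped <- lapply(1:3, function(i) sample(dams, size = n.mis))
```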

Details

The confidence probability is taken as the number of correct (matching) assignments, divided by all assignments made in the observed (inferred-from-simulated) pedigree. In contrast, the false negative & false positive assignment rates are proportions of the number of parents in the true (reference) pedigree. Each rate is calculated separately for dams & sires, and separately for each category (Genotyped/Dummy(fiable)/X (none)) of individual, parent and co-parent.
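
As a minimal sketch of these definitions, the base R snippet below compares the dams in a toy reference pedigree with those in a toy inferred pedigree. The data and the helper name calc_rates are hypothetical; EstConf itself relies on PedCompare and additionally splits the counts by category.

```r
# Toy data (hypothetical): reference vs. inferred dams for six individuals
ped.ref <- data.frame(id  = paste0("i", 1:6),
                      dam = c("d1", "d1", "d2", "d2", NA, NA),
                      stringsAsFactors = FALSE)
ped.inf <- data.frame(id  = paste0("i", 1:6),
                      dam = c("d1", "d1", "d2", NA, "d3", NA),
                      stringsAsFactors = FALSE)

calc_rates <- function(ref, inf) {
  both     <- merge(ref, inf, by = "id", suffixes = c(".ref", ".inf"))
  in.ref   <- !is.na(both$dam.ref)   # parent present in reference pedigree
  assigned <- !is.na(both$dam.inf)   # parent assigned in inferred pedigree
  match    <- in.ref & assigned & both$dam.ref == both$dam.inf
  c(# denominator: assignments made in the inferred pedigree
    conf     = sum(match) / sum(assigned),
    # denominators: parents in the reference pedigree
    FalseNeg = sum(in.ref & !assigned) / sum(in.ref),
    FalsePos = sum(assigned & !in.ref) / sum(in.ref),
    Mismatch = sum(in.ref & assigned & !match) / sum(in.ref))
}

rates <- calc_rates(ped.ref, ped.inf)
```

In this toy case three of the four assigned dams match the reference, so the confidence is 0.75, while one reference dam was missed and one dam was assigned where the reference has none.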

This function does not know which individuals in Pedigree are genotyped, so the confidence probabilities need to be added to the Pedigree by the user as shown in the example at the bottom.

A confidence of '1' means that all assignments on simulated data were correct for that category-combination. It should be interpreted as (and perhaps modified to) \(> 1 - 1/N\), where sample size N is given in the last column of the ConfProb and PedErrors dataframes in the output. The same applies to a false negative/positive rate of '0'.
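
One possible way to apply this interpretation is sketched below on a toy stand-in for ConfProb. The values and the helper name cap_conf are hypothetical, and replacing '1' by the bound itself is just one option; the bound is a lower limit, not an estimate.

```r
# Toy stand-in (hypothetical values) for the ConfProb dataframe
ConfProb <- data.frame(id.cat    = c("G", "G", "D"),
                       dam.cat   = c("G", "D", "G"),
                       sire.cat  = c("G", "G", "G"),
                       dam.conf  = c(1, 0.95, 1),
                       sire.conf = c(0.98, 1, 0.9),
                       pair.conf = c(0.98, 0.95, 0.9),
                       N         = c(200, 40, 10),
                       stringsAsFactors = FALSE)

# Replace a confidence of exactly 1 by the lower bound 1 - 1/N
cap_conf <- function(conf, N) {
  ifelse(!is.na(conf) & conf == 1, 1 - 1 / N, conf)
}

ConfProb$dam.conf.adj  <- cap_conf(ConfProb$dam.conf,  ConfProb$N)
ConfProb$sire.conf.adj <- cap_conf(ConfProb$sire.conf, ConfProb$N)
```

A category-combination seen only 10 times then gets a bound of 0.9 rather than 1, which makes explicit how little a perfect score means when N is small.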

Examples

# NOT RUN {
data(Ped_HSg5, LH_HSg5, package="sequoia")

## Example A: parentage assignment only
conf.A <- EstConf(Pedigree = Ped_HSg5, LifeHistData = LH_HSg5,
   args.sim = list(nSnp = 100, SnpError = 5e-3, ParMis=c(0.2, 0.5)),
   args.seq = list(MaxSibIter = 0, Err=1e-3, Tassign=0.5),
   nSim = 2)

# parent-pair confidence, per category:
conf.A$ConfProb

# calculate (correct) assignment rates (ignores co-parent)
1 - apply(conf.A$PedErrors, c(1,3), sum, na.rm=TRUE)

## Example B: with sibship clustering, based on sequoia inferred pedigree
RealGenotypes <- SimGeno(Ped = Ped_HSg5, nSnp = 100,
                         ParMis=c(0.19,0.53), SnpError = 6e-3)
SeqOUT <- sequoia(GenoM = RealGenotypes,
                  LifeHistData = LH_HSg5,
                  Err=5e-3, MaxSibIter=10)

conf.B <- EstConf(Pedigree = SeqOUT$Pedigree,
                  LifeHistData = LH_HSg5,
                  args.sim = list(nSnp = 100, SnpError = 5e-3,
                                  ParMis = c(0.2, 0.5)),
                  args.seq = list(Err = 5e-3, MaxSibIter = 10),
                  nSim = 3)
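# add the category columns (id.cat, dam.cat, sire.cat) to the inferred pedigree
Ped.withConf <- getAssignCat(Pedigree = SeqOUT$Pedigree,
                             Genotyped = rownames(RealGenotypes))
# look up the confidence probabilities for each category-combination
Ped.withConf <- merge(Ped.withConf, conf.B$ConfProb, all.x=TRUE)
# keep and order the columns of interest
Ped.withConf <- Ped.withConf[, c("id","dam","sire", "dam.conf", "sire.conf",
                                 "id.cat", "dam.cat", "sire.cat")]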
# }