pedigree_loglikelihood: Calculate the log-likelihoods of pedigrees

Description

For one or more pedigrees, this function calculates the natural logarithm of the pedigree likelihood that is on page 117 of (Lange, 2002), given inputs that correspond to the terms in this formula.

Usage

pedigree_loglikelihood(
  dat,
  geno_freq,
  trans,
  penet,
  monozyg = NULL,
  sum_loglik = TRUE,
  ncores = 1,
  load_balancing = TRUE
)

Value

Either a named vector giving the log-likelihood of each family or the sum of these log-likelihoods, depending on sum_loglik (see above).

Arguments

dat

A data frame with rows corresponding to people and columns corresponding to the following variables (other variables can be included but will be ignored), which will be coerced to character type:

family (optional), an identifier for each person's family, constant within families. If this variable is not supplied then dat will be treated as a single pedigree.
indiv, an individual identifier for each person. If there are any duplicated identifiers in the dataset then the family and an underscore (_) will be prepended to all identifiers, and if any duplicates remain after this then the function will stop executing, with an error message.
mother, the individual identifier of each person's mother, or missing (NA) for founders.
father, the individual identifier of each person's father, or missing (NA) for founders.

geno_freq

A vector of strictly positive numbers that sum to 1. If the possible genotypes of the underlying genetic model are 1:length(geno_freq) then geno_freq[j] is interpreted as the population frequency of genotype j, so geno_freq is essentially the function Prior in the pedigree likelihood on page 117 of (Lange, 2002). For certain genetic models that often occur in applications, these genotype frequencies can be calculated by geno_freq_monogenic, geno_freq_phased, etc.

trans

An ngeno^2 by ngeno matrix of non-negative numbers whose rows all sum to 1, where ngeno = length(geno_freq) is the number of possible genotypes. The rows of trans correspond to joint parental genotypes and the columns correspond to offspring genotypes. If the possible genotypes are 1:length(geno_freq) then the element trans[ngeno * gm + gf - ngeno, go] is interpreted as the conditional probability that a person has genotype go, given that his or her biological mother and father have genotypes gm and gf, respectively. So trans is essentially the transmission function Tran on page 117 of (Lange, 2002). For certain genetic models that often occur in applications, this transmission matrix can be calculated by trans_monogenic, trans_phased, etc.

penet

An nrow(dat) by length(geno_freq) matrix of non-negative numbers. The element penet[i,j] is interpreted as the conditional probability (or probability density) of the phenotype of the person corresponding to row i of dat, given that his or her genotype is j (where the possible genotypes are 1:length(geno_freq)). Therefore, penet is essentially the penetrance function Pen on page 117 of (Lange, 2002). If any row of penet consists entirely of zeroes then the likelihood is 0, so the returned log-likelihood will be -Inf. Note that genotype data can be incorporated into penet by regarding observed genotypes as part of the phenotype, i.e. by regarding observed genotypes as (possibly noisy) measurements of the underlying true genotypes. For example, if the observed genotype of person i is 1 (and if genotype measurement error is negligible) then penet[i,j] should be 0 for j != 1 and penet[i,1] should be the same as if person i were ungenotyped.

monozyg

An optional list that can be used to specify genetically identical persons, such as monozygotic twins, monozygotic triplets, a monozygotic pair within a set of dizygotic triplets, etc. Each element of the list should be a vector containing the individual identifiers of a group of genetically identical persons, e.g. if dat contains six sets of monozygotic twins and one set of monozygotic triplets then monozyg will be a list with seven elements, one element a vector of length three and the other six elements all vectors of length two. The order of the list and the orders within its elements do not affect the output of the function. Each group of genetically identical persons should contain two or more persons, the groups should not overlap, and all persons in each group must have the same (non-missing) parents.

sum_loglik

A logical flag. Return a named vector giving the log-likelihood of each family if sum_loglik is FALSE, or return the sum of these log-likelihoods if sum_loglik is TRUE (the default).

ncores

The number of cores to be used, with ncores = 1 (the default) corresponding to non-parallel computing. When ncores > 1, the parallel package is used to parallelize the calculation by dividing the pedigrees among the different cores.

load_balancing

A logical flag. When ncores > 1, parallelization is achieved either with the function parallel::parLapply (if load_balancing is FALSE) or with the load-balancing function parallel::parLapplyLB (if load_balancing is TRUE, the default). The load-balancing version will usually, but not always, be faster.

Details

This function provides a fast and general implementation of the Elston-Stewart algorithm to calculate the log-likelihoods of potentially large and complex pedigrees. General references for the Elston-Stewart algorithm are (Elston & Stewart, 1971), (Lange & Elston, 1975) and (Cannings et al., 1978).

Each family within dat should be a complete pedigree, meaning that each person should either have both parental identifiers missing (if a founder) or both non-missing (if a non-founder), and each (non-missing) mother or father should have a corresponding row of dat.

Observed genotypes should be incorporated into penet, as described above.

The function can handle pedigree loops, such as those caused by inbreeding or by two sisters having children with two brothers from an unrelated family (see (Totir et al., 2009) for a precise definition), though pedigrees with more than a few loops could greatly reduce the speed of the calculation.

In geno_freq, trans and penet, the order of the possible genotypes must match, in the sense that the genotype that corresponds to element j of geno_freq must also correspond to column j of trans and penet, for each j in 1:length(geno_freq).

Sex-specific genetics, such as X-linked genes or genetic loci with sex-specific recombination fractions, can be modelled by letting genotypes 1:nm be the possible male genotypes and letting (nm+1):(nm+nf) be the possible female genotypes, where nm and nf are the number of possible genotypes for males and females, respectively. Then, for example, penet[i,j] will be 0 if j %in% 1:nm and row i of dat corresponds to a female, and penet[i,j] will be 0 if j %in% (nm+1):(nm+nf) and row i of dat corresponds to a male.

References

Cannings C, Thompson E, Skolnick M. Probability functions on complex pedigrees. Advances in Applied Probability, 1978;10(1):26-61.

Elston RC, Stewart J. A general model for the genetic analysis of pedigree data. Hum Hered. 1971;21(6):523-542.

Lange K. Mathematical and Statistical Methods for Genetic Analysis (second edition). Springer, New York. 2002.

Lange K, Elston RC. Extensions to pedigree analysis I. Likehood calculations for simple and complex pedigrees. Hum Hered. 1975;25(2):95-105.

Totir LR, Fernando RL, Abraham J. An efficient algorithm to compute marginal posterior genotype probabilities for every member of a pedigree with loops. Genet Sel Evol. 2009;41(1):52.

Examples

Run this code

# Load pedigree files and penetrance matrices
data("dat_small", "penet_small", "dat_large", "penet_large")

# Settings for a single biallelic locus in Hardy-Weinberg equilibrium
# and with a minor allele frequency of 10%
geno_freq <- geno_freq_monogenic(c(0.9, 0.1))
trans <- trans_monogenic(2)

# In dat_small, ora024 and ora027 are identical twins, and so are aey063 and aey064
monozyg_small <- list(c("ora024", "ora027"), c("aey063", "aey064"))

# Calculate the log-likelihoods for 10 families, each with approximately
# 100 family members
pedigree_loglikelihood(
  dat_small, geno_freq, trans, penet_small, monozyg_small, sum_loglik = FALSE, ncores = 2
)

# Calculate the log-likelihood for one family with approximately 10,000 family members
# Note:  this calculation should take less than a minute on a standard desktop computer
# Note:  parallelization would achieve nothing here because there is only one family
str(dat_large)
# \donttest{
system.time(
  ll <- pedigree_loglikelihood(dat_large, geno_freq, trans, penet_large)
)
ll
# }

Run the code above in your browser using DataLab