zoorun: Run the ZooRoH model

Description

Apply the defined model on a group of individuals: parameter estimation, computation of realized autozygosity and homozygous-by-descent probabilities, and identification of HBD segments (decoding). It is also possible to apply the model to a pair of phased haplotypes with the 'ibd' option.

Usage

zoorun(
  zoomodel,
  zooin,
  ids = NULL,
  parameters = TRUE,
  fb = TRUE,
  vit = TRUE,
  localhbd = FALSE,
  nT = 1,
  optim_method = "L-BFGS-B",
  maxiter = 100,
  minmix = 1,
  maxr = 1e+08,
  ibd = FALSE,
  ibdpairs = NULL,
  haploid = FALSE,
  RecTable = FALSE,
  trim_ad = FALSE,
  hemiprob = 0
)

Value

The function return a zoores object with several slots accesses by the "@" symbol. The three main results are zoores@realized (the matrix with partitioning of the genome in different HBD classes for each individual), zoores@hbdseg (a data frame with identified HBD segments) and zoores@hbdp (a list of matrices with HBD probabilities per SNP and per class).

Here is a list with all the slots and their description:

zoores@nind the number of individuals in the analysis,
zoores@ids a vector containing the numbers of the analyzed individuals (their position in the data file),
zoores@mixc the (estimated) mixing coefficients per class for all individuals,
zoores@krates the (estimated) rates for the exponential distributions associated with each HBD or non-HBD class for all individuals,
zoores@niter the number of iterations for estimating the parameters - the number of calls of the forward algorithm (per individual),
zoores@modlik a vector containing the likelihood of the model for each individual,
zoores@modbic a vector containing the value of the BIC for each individual,
zoores@realized a matrix with estimated realized autozygosity per HBD class (columns) for each individual (rows). These values are obtained with the Forward-Backward algorithm - fb option),
zoores@hbdp a list of matrices with the local probabilities to belong to an underlying hidden state (computed for every class and every individual). Each matrix has one row per class and one column per snp. To access the matrix from individual i, use the brackets "[[]]", for instance zoores@hbdp[[i]],
zoores@hbdseg a data frame with the list of identified HBD segments with the Viterbi algorithm (the columns are the individual number, the chromosome number, the first and last SNP of the segment, the positions of the first and last SNP of the segment, the number of SNPs in the segment, the length of the segment, the HBD state of the segment). In case of IBD, the first column indicates the number of the pair of haplotypes,
zoores@optimerr a vector indicating whether optim ran with or without error (0 indicates successful completion / 1 indicates that the iteration limit has been reached / 51 and 52 indicate warnings from the "L-BFGS-B" method / 99 indicates numerical problem). See the optim R function for more details.
zoores@sampleids is a vector with the names of the samples (when provided in the zooin object through the zoodata function). When an IBD analysis is run, it indicates the pairs of haplotypes separated by "_". For diploid individuals, it combines individual ID, haplotype ID (1 or 2) for the two individuals.

Arguments

zoomodel: A valid zmodel object as defined by the zoomodel function. The model indicates whether rates of exponential distributions are estimated or predefined, the number of classes, the starting values for mixing coefficients and rates, the error probabilities. See "zoomodel" for more details.
zooin: A valid zdata object as obtained by the zoodata function. See "zoodata" for more details.
ids: An optional argument indicating the individual (its position in the data file) that must be proceeded. It can also be a vector containing the list of numbers that must be proceeded. By default, the model runs for all individuals.
parameters: Specifies whether the parameters are estimated by optimization with the L-BFGS-B method from the optim function (optional argument - true by default). If the user doesn't want to estimate the parameters he must set parameters=FALSE. In that case, the forward-backaward and Viterbi algorithms are run with the provided parameters.
fb: A logical indicating whether the forward-backward algorithm is run (optional argument - true by default). The Forward-Backward algorithm estimates the local probabilities to belong to each HBD or non-HBD class. By default, the function returns only the HBD probabilities for each class, averaged genome-wide, and corresponding to the realized autozygosity associated with each class. To obtain HBD probabilities at every marker position, the option localhbd must be set to true (this generates larger outputs).
vit: A logical indicating whether the Viterbi algorithm is run (optional argument - false by default). The Viterbi algorithm performs the decoding (determining the underlying class at every marker position). Whereas the Forward-Backward algorithms provide HBD probabilities (and how confident a region can be declared HBD), the Viterbi algorithm assigns every marker position to one of the defined classes (HBD or non-HBD). When informativity is high (many SNPs per HBD segments), results from the Forward-Backward and the Viterbi algorithm are very similar. The Viterbi algorithm is best suited to identify HBD segments. To estimate realized inbreeding and determine HBD status of a position, we recommend to use the Forward-Backward algorithm that better reflects uncertainty.
localhbd: A logical indicating whether the HBD probabilities for each individual at each marker are returned when using the Forward-Backward algorithm (fb option). This is an optional argument that is false by default.
nT: Indicates the number of threads used when running RZooRoH in parallel (optional argument - one thread by default).
optim_method: Indicates which method the optim R function will use to estimate the parameters of the model ("L-BFGS-B" by default). The possible methods are "Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN" and "Brent". Type "? optim" to have more information. In our experience, the "L-BFGS-B" method works well but the method achieving the best likelihood is variable (according to the data sets, the model, the priors, the constraints). The same goes for the efficiency (speed). When the zoorun does not converge, you can test with another method. Note that the only method allowing to put constraints on parameters is "L-BFGS-B" (other methods are unconstrained).
maxiter: Indicates the maximum number of iterations when estimating the parameters with the R optim function (optional argument - 100 by default). Iterations are not defined identically across methods. For instance, in one iteration of the "L-BFGS-B" method, the likelihood of the model, estimated with the forward algorithm, is evaluated multiple times. So, a value of 100 iterations is good for the "L-BFGS-B" method but larger values are required for some other algorithms.
minmix: This indicates the minimal value for the mixing coefficients. By default it is set to 0 with the classical mixkl and kl models (unconstrained). However, when using the step option or the "Interval" HBDclass, the values is set to 1e-16 to avoid numerical problems. Note that constraints are only allowed with the "L-BFGS-B" method from optim.
maxr: This indicates the maximum difference between rates of successive classes. It is an optional argument set to an arbitrarily large value (100000000). Adding such constraints might slow down the speed of convergence and we recommend to run first without this constraint (constraints are only allowed with the "L-BFGS-B" method from optim).
ibd: A logical indicating whether the function will be used to compute IBD between pairs of phased haplotypes instead of HBD within individuals. In that case, the user must provide a matrix with the pairs of haplotypes that will be analyzed. This option can only be used if phased data are provided as input with the zformat set to "vcf" or "haps". This is an optional parameter set to false by default.
ibdpairs: A matrix with four columns, indicating the pair of haplotypes being analyzed in an IBD analysis. Haplotypes are indicated by two columns, one column for the id of the individuals and a second column for the haplotype number within individual (1 or 2). The first and third columns indicate the id of the individuals carrying the first and second haplotype, respectively. The second and four columns indicates the haplotype numbers within the first and second individuals, respectively. With haploid data, the matrix must have only two columns indicating the number of the first and second haplotypes from the pairs. This is an optional parameter, the matrix must be provided only when the ibd option is true.
haploid: This is an optional parameter indicating whether haplotypes belong to an haploid organisms or chromosome (false by default). It can be used only in combination with the 'ibd' option and requires phased data as input with the zformat set to "haps". When haploid is true, then the ibdpairs matrix has only two columns indicating simply the haplotype numbers. When haploid is true the number of haplotypes can be uneven while even numbers are required when haploid is set to false.
RecTable: This is an optional parameter indicating whether a finite number of genetic distances are used (false by default). This function can be used only with the "Interval" HBDclass. The "Interval" option can be slow, in particular if large "intervals" of generations are defined. To speed up computations, some variables are precomputed for a finite set of genetic distances, select to cover a broad range of possible values. The real genetic distance between two genetic markers is then replaced by the closest value in the table (the difference between the true and used genetic distances being also lower than 10"%").
trim_ad: This is an option still under evaluation (for testing only)
hemiprob: This is an option still under evaluation (for testing only)

Examples

Run this code


# Start with a small data set with six individuals and external frequencies.
freqfile <- (system.file("exdata","typsfrq.txt",package="RZooRoH"))
typfile <- (system.file("exdata","typs.txt",package="RZooRoH"))
frq <- read.table(freqfile,header=FALSE)
typ <- zoodata(typfile,supcol=4,chrcol=1,poscol=2,allelefreq=frq$V1)
# Define a model with two HBD classes with rates equal to 10 and 100.
Mod2L <- zoomodel(K=2,base_rate=10)
# Run the model on all individuals.
typ.res <- zoorun(Mod2L, typ)
# Observe some results: likelihood, realized autozygosity in different
# HBD classes and identified HBD segments.
typ.res@modlik
typ.res@realized
typ.res@hbdseg
# Define a model with one HBD and one non-HBD class and run it.
Mod1R <- zoomodel(K=1,predefined=FALSE)
typ2.res <- zoorun(Mod1R, typ)
# Print the estimated rates and mixing coefficients.
typ2.res@krates
typ2.res@mixc
# Get the name and location of a second example file and load the data:
myfile <- (system.file("exdata","genoex.txt",package="RZooRoH"))
ex2 <- zoodata(myfile)
# Run RZooRoH to estimate parameters on your data with the 1 HBD and 1 non-HBD
# class (parameter estimation with optim).
my.mod1R <- zoomodel(predefined=FALSE,K=1,krates=c(10))
my.res <- zoorun(my.mod1R, ex2, fb = FALSE, vit = FALSE)
# The estimated rates and mixing coefficients:
my.res@mixc
my.res@krates
# Run the same model and run the Forward-Backward alogrithm to estimate
# realized autozygosity and the Viterbi algorithm to identify HBD segments:
my.res2 <- zoorun(my.mod1R, ex2)
# The table with estimated realized autozygosity:
my.res2@realized
# Run a model with 3 layers (3 HBD classes / 1 non-HBD class) and estimate
# the rates of HBD classes with one thread:
my.mod3L <- zoomodel(predefined=FALSE,K=3,krates=c(16,64,256))
my.res3 <- zoorun(my.mod3L, ex2, fb = FALSE, vit = FALSE, nT =1)
# The estimated rates for the 3 classes and the 20 individuals:
my.res3@krates
# Run a model with 4 layers and predefined rates.
# The model is run only for a subset of four selected individuals.
# The parameters are estimated, the Forward-Backward alogrithm is ued to
# estimate realized autozygosity and the Viterbi algorithm to identify
# HBD segments. One thread is used.
mix4L <- zoomodel(K=4,base=10)
my.res4 <- zoorun(mix4L,ex2,ids=c(7,12,16,18), nT = 1)
# The table with all identified HBD segments:
my.res4@hbdseg

Run the code above in your browser using DataLab