Help the user to create a model for RZooRoH, including default parameters. The output is a zmodel object necessary to run RZooRoH.
zoomodel(
predefined = TRUE,
K = 10,
mix_coef = rep(0, K),
base_rate = 2,
krates = rep(0, K),
err = 0.001,
seqerr = 0.001,
step = FALSE,
XM = rep(0, K),
HBDclass = "SingleRate"
)
The function return an object that defines a model for RZooRoH and including the following elements: zmodel@typeModel equal to "kl", "mixkl" or "step_mixkl" according to the selected model, zmodel@mix_coef an array with mixing coefficients, zmodel@krates an array with the rates of the HBD classes, zmodel@err the parameter defining the probability to observe an heterozygous genotype in an HBD class, and zmodel@seqerr the parameter defining the probability of sequencing error per read, zmodel@XM the incidence matrix relating individual layers to blocks or steps, zmodel@typeClass indicating the type of HBD classes,
Logical (TRUE or FALSE) to define whether rates of HBD and non-HBD-classes will be estimated by the model ("kl" model) or whether the rates of these classes are fixed and pre-defined by the user ("mixkl" model). The default value is "predefined = TRUE".
The number of layers with the nested 1R model implemented in 2022, this represents thus the number of HBD classes. By default, K is set to 10 but this is not optimal for all data sets. Hence, we recommend to the users to select their own value. If K is set to 1 and rates are estimated, RZooRoH will use the same rate for the HBD and the non-HBD class (so-called 1R model).
The starting value for the mixing coefficients in each layer. The mixing coefficients determine the frequency of the segments from different classes, they determine the probability to start a new HBD segment in a given layer (after reaching that layer after a coancestry change). The mixing coefficients can also be interpreted as the rate of inbreeding in their respective layer and should take values between 0 and 1. The function expects K mixing coefficients (e.g. the number of layers). The default values are 0.01 for HBD classes (smaller values may be better for large values of K). In case the parameters are not estimated (e.g. when running the forward-backward or the Viterbi algorithm alone), these are the mixing coefficients used by the RZooRoH model. Is it possible to force some mixing coefficient to take the same value with the step option. In that case, the expected number of mixing coefficients might be different.
is a integer used to define the rates of successive layers or HBD classes (see krates below). This parameter is most useful when using a model with predefined rates. The rate of each HBD class will be equal to the base_rate raised to the exponent k (the class number). The non-HBD class will have the same rate as the last HBD class. For instance, with a base_rate of 2 and four layers, we have the following rates: 2, 4, 8, 16 and 16. Similarly, with a base_rate of 10 and three layers, we have 10, 100, 1000 and 1000. With this method, more HBD classes are defined for more recent ancestors (for which we have more information to estimate R) and less for ancient HBD classes (it doesn't make sense to try to distinguish R = 1000 from R = 1010). In addition, since the expected length of HBD segments is expected to be 1/R, the ratio between successive expected HBD lengths remains the same. This ratio also determines the ability of the model to distinguish segments from distinct classes. By keeping the ratio constant, the aptitude to discriminate between HBD classes is also constant. In addition, this method allows to cover a wide range of generations in the past, with more emphasis on recent ancestors. The default value for the base_rate is 2.
Is an array with a rate in each layer. The function expects K positive rates. These rates are parameters of the exponential distribution that together with the distance in centimorgans defines the probability to end a HBD segments between two markers. Each layer / HBD class has a distinct rate. Therefore, the expected length of HBD classes is defined by the rates. The expected length is equal to 1/R. These krates are associated with the age of the common ancestor of the HBD segment. The rate is approximately equal to the size of the inbreeding loop (twice the number of generations to the common ancestor) when the map is given in Morgans. This is not an exact dating but an approximation, in simplified conditions. By default, the rates are defined by the base_rate parameter (2, 4, 8, 16, ...).
Indicates the error term, the probability to observe an heterozygous genotype in a HBD segment. The genotype could be heterozygous due to a mutation occuring on the path to the common ancestor. It can also be associated with a genotype calling error or a technical error. In case GP or GL formats are used (with genotyped probabilities or phred scores) or when an AD format is used (based on read counts), this error term still represents the probability to observe an heterozygous genotype in a HBD segment. When an heterozygous genotype was called with a probability equal to 1.00, this heterozygosity in an HBD track might be associated to a mutation or to errors not accounted for by the model used to estimate the genotype probabilities (e.g., GATK). The emission probability to observe a heterozygous genotype in an HBD class will never go below the error term. The default value is 0.001.
This parameters is used only with the AD format. In the AD format the user gives the number of reads for both alleles. A simple model is then used to estimate the genotype probabilities based on the read counts. In that model, the seqerr represents the probability to have a sequencing error in one read. The default value is 0.001.
Logical (TRUE or FALSE). This allows to use a step function to model the mixing coefficients. First a model with many layers is defined. For example, one layer for each even number (50 layers from 2 to 100 for example). Although such a model allows a finer resolution, it requires the estimation of many parameters (probably without enough information to estimate precisely all mixing coefficients). Therefore, we propose a stepfunction that forces mixing coefficients to be constant for several consecutive layers (by blocks of layers). When this option is used, K represents the total number of fitted layers, all their rates must be defined. Regarding the mixing coefficients, one mixing coefficient must be defined per step (block). An incidence matrix relating the full set of mixing coefficients from HBD classes from all layers to the steps or blocks must be provided.
Is an incidence matrix relating the full set of mixing coefficients to the reduced set of mixing coefficients (corresponding to steps or blocks).
By default, this value is set to "SingleRate". This means that the layer is defined by a single rate (like an average or global rate for that layer). With the option "Interval", one rate is first defined for every past generation. If we assume discrete past generations (1, 2, 3, 4, 5, ...), the corresponding rates would approximately be equal to 2, 4, 6, ... We insist that this is a crude approximation. However, it allows easier interpretation of the results, for instance by understanding better how many generations are approximately captured per layer. One layer corresponds then to multiple "generations" and is defined as an interval going from generations g1 to g2. In this approach, we use only discrete generations and start at generation 1. The values in the krates vector correspond then to the last generation in each layer. For example, setting krates to (10,20,50,100) would define four layers from generations 1 to 10, 11 to 20, 21 to 50 and 51 to 100. To obtain the corresponding rates, these values are multiplied by two (inside the function). The rates used by ZooRoH would then go from 2 to 20, 22 to 40, 42 to 100 and 102 to 200.
# To define a the default model, with 10 layers (10 HBD and 1 non-HBD class)
# and with pre-defined rates for HBD classes with a base of 2 (2, 4, 8, ...):
Mix10L <- zoomodel()
# To see the parameters of the defined model, just type:
Mix10L
# To define a model with four layers and using a base of 10 to define
# rates (10, 100, 1000, ...):
Mix4L <- zoomodel(K=4,base=10)
# To define a model with two classes, with estimation of rates for HBD classes
# and starting with a rate 10:
my.mod1R <- zoomodel(predefined=FALSE,K=1,krates=c(10))
Run the code above in your browser using DataLab