scan_hh_full: Compute iHH, iES and inES over a whole chromosome without cut-offs

Description

Compute integrated EHH (iHH), integrated EHHS (iES) and integrated normalized EHHS (inES) for all markers of a chromosome (or linkage group). This function uses a slightly different algorithm than scan_hh: it sidesteps the calculation of EHH and EHHS values and their subsequent integration, so no cut-offs relying on these values can be specified. Instead, it computes the full lengths of pairwise shared haplotypes and averages them afterwards.

This function is (as yet) exclusively intended for the study of general properties of these statistics using simulated data. The omission of all cut-offs is not recommended for a scan on experimental data.

Usage

scan_hh_full(
  haplohh,
  phased = TRUE,
  polarized = TRUE,
  maxgap = NA,
  discard_integration_at_border = TRUE,
  geometric.mean = FALSE,
  threads = 1
)

Arguments

haplohh

an object of class haplohh (see data2haplohh)

phased

logical. If TRUE (default) chromosomes are expected to be phased. If FALSE, the haplotype data is assumed to consist of pairwise ordered chromosomes belonging to diploid individuals. EHH(S) is then estimated over individuals which are homozygous at the focal marker.

polarized

logical. TRUE by default. If FALSE, use major and minor allele instead of ancestral and derived. If there are more than two alleles then the minor allele refers to the second-most frequent allele.

maxgap

maximum allowed gap in bp between two markers. If exceeded, further calculation of EHH(S) is stopped at the gap (default = NA, i.e. no limitation).

discard_integration_at_border

logical. If TRUE (default) and the computation of any of the statistics reaches the first or last marker or a gap larger than maxgap, iHH, iES and inES are all set to NA.

geometric.mean

logical. If FALSE (default), the standard arithmetic mean is used to average shared haplotype lengths. If TRUE, the geometric mean is used instead (iES values are undefined in this case). Note that usage of the geometric mean has not yet been studied formally and should be considered experimental!
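
The difference between the two averaging modes can be illustrated with a few lines of base R. This is only a sketch on made-up lengths: the vector shared_lengths is hypothetical, and how the package itself treats pairs with a shared length of zero is not stated here.

```r
# Hypothetical shared haplotype lengths (in bp) for three pairs of haplotypes.
shared_lengths <- c(300, 300, 200)

# Arithmetic mean: the default averaging (geometric.mean = FALSE).
arithmetic_mean <- mean(shared_lengths)

# Geometric mean: exponential of the mean of the log-lengths.
# It never exceeds the arithmetic mean and is pulled down by short lengths.
geometric_mean <- exp(mean(log(shared_lengths)))

print(c(arithmetic = arithmetic_mean, geometric = geometric_mean))
```

Because the geometric mean weights short shared haplotypes more heavily, the two modes can rank markers differently.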

threads

number of threads to parallelize computation

Value

The returned value is a data frame with markers in rows and the following columns:

1. chromosome name
2. position in the chromosome
3. sample frequency of the ancestral / major allele
4. sample frequency of the second-most frequent remaining allele
5. number of evaluated haplotypes at the focal marker for the ancestral / major allele
6. number of evaluated haplotypes at the focal marker for the second-most frequent remaining allele
7. iHH of the ancestral / major allele
8. iHH of the second-most frequent remaining allele
9. iES (used by Sabeti et al. 2007)
10. inES (used by Tang et al. 2007)

Note that for unphased data the evaluation is restricted to haplotypes of homozygous individuals, which reduces the power to detect selection, particularly for iHS (for appropriate parameter settings see the main vignette and Klassmann et al. (2020)).

Details

Integrated EHH (iHH), integrated EHHS (iES) and integrated normalized EHHS (inES) are computed for all markers of the chromosome (or linkage group). This function sidesteps the computation of EHH and EHHS values and their stepwise integration. Instead, the lengths of all shared haplotypes are computed and afterwards averaged. In the absence of missing values, the statistics are identical to those calculated by scan_hh with settings limehh = 0, limehhs = 0 and interpolate = FALSE, yet this function is faster. The former two settings are, however, not recommended for application to experimental data (see vignette).
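
The idea of averaging pairwise shared haplotype lengths can be sketched in base R on a toy data set. This is not the package's implementation: the helper names shared_length and mean_shared_length are hypothetical, the shared length is assumed to run between the outermost markers at which two haplotypes are still identical, and border handling (discard_integration_at_border) is ignored.

```r
# Toy data: 4 haplotypes (rows) x 5 markers (columns), alleles coded 0/1.
haplo <- rbind(
  c(0, 1, 1, 0, 1),
  c(0, 1, 1, 0, 0),
  c(1, 1, 1, 0, 1),
  c(0, 0, 0, 1, 1)
)
positions <- c(100, 200, 300, 400, 500)

# Hypothetical helper: length (in bp) of the stretch around the focal marker
# on which haplotypes a and b are identical (0 if they differ at the focal marker).
shared_length <- function(a, b, focal, positions) {
  same <- a == b
  if (!same[focal]) return(0)
  left <- focal
  while (left > 1 && same[left - 1]) left <- left - 1
  right <- focal
  while (right < length(same) && same[right + 1]) right <- right + 1
  positions[right] - positions[left]
}

# Hypothetical helper: arithmetic mean of shared lengths over all pairs of
# haplotypes that carry the given allele at the focal marker.
mean_shared_length <- function(haplo, positions, focal, allele) {
  carriers <- which(haplo[, focal] == allele)
  if (length(carriers) < 2) return(NA_real_)
  pairs <- combn(carriers, 2)
  lengths <- apply(pairs, 2, function(p) {
    shared_length(haplo[p[1], ], haplo[p[2], ], focal, positions)
  })
  mean(lengths)
}

# iHH-like value for allele 1 at the third marker.
ihh_like <- mean_shared_length(haplo, positions, focal = 3, allele = 1)
print(ihh_like)
```

No EHH curve is built or integrated in this sketch; each pair contributes a single length, which is why cut-offs on EHH or EHHS values have no place in this approach.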

If discard_integration_at_border is set to TRUE and the extension of shared haplotypes reaches a border (i.e. chromosomal boundaries or a gap larger than maxgap), this function discards all statistics, while scan_hh handles each statistic independently.

scan_hh "removes" chromosomes with missing values from further calculations, while this function treats each missing value as a different allele. This yields a somewhat faster decay of all statistics with respect to the distance to the focal marker.
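
The effect of treating a missing value as a distinct allele can be illustrated on a single pair of haplotypes. This is a minimal sketch in base R with hypothetical helper names (matches, shared_length) and made-up data; it only mimics the behaviour described above and does not reproduce the package's internal code.

```r
positions <- c(100, 200, 300, 400, 500)

hap_a          <- c(0, 1, 1, 0, 1)
hap_b_complete <- c(0, 1, 1, 0, 1)   # no missing values
hap_b_missing  <- c(0, 1, 1, NA, 1)  # fourth marker is missing

# Hypothetical helper: a missing value never matches anything,
# i.e. it behaves like an allele of its own.
matches <- function(a, b) !is.na(a) & !is.na(b) & a == b

# Hypothetical helper: length (in bp) of the stretch around the focal marker
# on which the two haplotypes match.
shared_length <- function(a, b, focal, positions) {
  same <- matches(a, b)
  if (!same[focal]) return(0)
  left <- focal
  while (left > 1 && same[left - 1]) left <- left - 1
  right <- focal
  while (right < length(same) && same[right + 1]) right <- right + 1
  positions[right] - positions[left]
}

len_complete <- shared_length(hap_a, hap_b_complete, focal = 2, positions)
len_missing  <- shared_length(hap_a, hap_b_missing, focal = 2, positions)
print(c(complete = len_complete, with_missing = len_missing))
```

In this toy case the missing value cuts the shared stretch short at the third marker, which corresponds to the faster decay mentioned above.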

References

Gautier, M. and Naves, M. (2011). Footprints of selection in the ancestral admixture of a New World Creole cattle breed. Molecular Ecology, 20, 3128-3143.

Klassmann, A. and Vitalis, R. and Gautier, M. (2020). Detecting selection using Extended Haplotype Homozygosity (EHH)-based statistics on unphased or unpolarized data. Preprint. https://doi.org/10.22541/au.158584282.24875401.

Sabeti, P.C. et al. (2002). Detecting recent positive selection in the human genome from haplotype structure. Nature, 419, 832-837.

Sabeti, P.C. et al. (2007). Genome-wide detection and characterization of positive selection in human populations. Nature, 449, 913-918.

Tang, K. and Thornton, K.R. and Stoneking, M. (2007). A New Approach for Using Genome Scans to Detect Recent Positive Selection in the Human Genome. PLoS Biology, 5(7), e171.

Voight, B.F. and Kudaravalli, S. and Wen, X. and Pritchard, J.K. (2006). A map of recent positive selection in the human genome. PLoS Biology, 4, e72.

Examples

library(rehh)
# example haplohh object (280 haplotypes, 1424 SNPs)
# see ?haplohh_cgu_bta12 for details
data(haplohh_cgu_bta12)
# compute iHH, iES and inES for all markers without cut-offs
scan <- scan_hh_full(haplohh_cgu_bta12)
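
As described in Details, scan_hh called with limehh = 0, limehhs = 0 and interpolate = FALSE should give the same statistics in the absence of missing values. The following sketch runs both functions on the example data and prints their elapsed times and first rows. It assumes the rehh package is installed; the requireNamespace guard and the variable names are only a convenience of this sketch. No equality is asserted, since the example data may contain missing values.

```r
rehh_available <- requireNamespace("rehh", quietly = TRUE)

if (rehh_available) {
  # load the example haplohh object shipped with the package
  data(haplohh_cgu_bta12, package = "rehh")

  # scan without cut-offs via pairwise shared haplotype lengths
  time_full <- system.time(
    scan_full <- rehh::scan_hh_full(haplohh_cgu_bta12)
  )

  # standard scan with the cut-offs and interpolation switched off
  time_std <- system.time(
    scan_std <- rehh::scan_hh(
      haplohh_cgu_bta12,
      limehh = 0,
      limehhs = 0,
      interpolate = FALSE
    )
  )

  # elapsed time in seconds for each function
  print(c(scan_hh_full = time_full[["elapsed"]],
          scan_hh = time_std[["elapsed"]]))

  # first rows of each result for a visual comparison
  print(head(scan_full))
  print(head(scan_std))
}
```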