For a given molecular dataset \(\boldsymbol{M}\) (coded as 0, 1 and 2), this function produces a reduced molecular matrix by eliminating "redundant" markers through pruning. It finds and drops some of the SNPs that are in high linkage disequilibrium (LD).
snp.pruning(
M = NULL,
map = NULL,
marker = NULL,
chrom = NULL,
pos = NULL,
method = c("correlation"),
criteria = c("callrate", "maf"),
pruning.thr = 0.95,
by.chrom = FALSE,
window.n = 50,
overlap.n = 5,
iterations = 10,
seed = NULL,
message = TRUE
)
The function returns a list with two elements:
Mpruned: a matrix containing the pruned marker matrix.
map: a data frame containing the pruned map.
M: a matrix with marker data of full form (\(n \times p\)), with \(n\) individuals and \(p\) markers. Individual and marker names are assigned to rownames and colnames, respectively. Data in the matrix is coded as 0, 1, 2 (integer or numeric) (default = NULL).
map: (optional) a data frame with the map information, with \(p\) rows. If NULL, a dummy map is generated considering a single chromosome and sequential positions for markers. A map is mandatory if by.chrom = TRUE, in which case chrom must also be non-null.
marker: a character indicating the name of the column in data frame map with the identification of markers. This is mandatory if map is provided (default = NULL).
chrom: a character indicating the name of the column in data frame map with the identification of chromosomes. This is mandatory if map is provided (default = NULL).
pos: a character indicating the name of the column in data frame map with the identification of marker positions (default = NULL).
method: a character indicating the method (or algorithm) to be used as reference for identifying redundant markers. The only method currently available is based on correlations (default = "correlation").
criteria: a character indicating the criteria used to choose which marker to drop from a detected redundant pair. Options are: "callrate" (the marker with fewer missing values will be kept) and "maf" (the marker with higher minor allele frequency will be kept) (default = "callrate").
pruning.thr: the correlation threshold; pairs of markers with a Pearson's correlation larger than this value are treated as redundant (default = 0.95).
by.chrom: if TRUE, the pruning is performed independently by chromosome (default = FALSE).
window.n: a numeric value with the number of markers to consider in each window when pruning (default = 50).
overlap.n: a numeric value with the number of markers shared between consecutive windows (default = 5).
iterations: an integer indicating the number of sequential times the pruning procedure should be executed on the remaining markers. If no markers are dropped in a given iteration/run, the algorithm will stop (default = 10).
seed: an integer to be used as seed for reproducibility. If the criteria has the same value for both markers of a given pair, one will be dropped at random (default = NULL).
message: if TRUE, diagnostic messages are printed on screen (default = TRUE).
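The window.n and overlap.n arguments describe a sliding-window layout over the markers. The sketch below shows one plausible way such windows could be indexed. The helper make.windows is hypothetical and is not part of the package, whose internal indexing may differ.

```r
# Hypothetical helper (an assumption, not the package's code): start and end
# indices of consecutive windows of `window.n` markers, where neighbouring
# windows share `overlap.n` markers.
make.windows <- function(p, window.n = 50, overlap.n = 5) {
  stopifnot(p >= 1, overlap.n >= 0, window.n > overlap.n)
  step <- window.n - overlap.n                   # shift between window starts
  starts <- seq(1, max(1, p - overlap.n), by = step)
  ends <- pmin(starts + window.n - 1, p)         # last window is truncated at p
  data.frame(start = starts, end = ends)
}

# 120 markers with the documented defaults for window.n and overlap.n.
w <- make.windows(p = 120, window.n = 50, overlap.n = 5)
print(w)
```

With these settings each window starts 45 markers after the previous one, so the last 5 markers of one window are revisited as the first 5 of the next.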
Pruning is recommended because redundant markers can affect the quality of the matrices used in downstream analyses. The algorithm uses the Pearson's correlation between markers as a proxy for LD. When the correlation of a pair of markers is higher than the selected threshold, one marker of the pair is eliminated according to the chosen criteria: call rate or minor allele frequency. In case of a tie, one marker is dropped at random.
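The pairwise rule described above can be illustrated with a small base R sketch for a single block of markers. The helper prune.block and the toy matrix are hypothetical and are not the package's implementation. The sketch compares the raw correlation against the threshold, following the wording above, and estimates allele frequency as the column mean divided by 2, which assumes the 0/1/2 coding counts copies of one allele.

```r
# Hypothetical helper (an assumption, not the package's code): greedy pruning
# of one block of markers coded 0/1/2, with NA for missing calls.
prune.block <- function(M, thr = 0.95, criteria = c("callrate", "maf"),
                        seed = NULL) {
  criteria <- match.arg(criteria)
  if (!is.null(seed)) set.seed(seed)
  # Scores used to decide which marker of a redundant pair is kept.
  callrate <- colMeans(!is.na(M))
  p.alt <- colMeans(M, na.rm = TRUE) / 2   # frequency of the counted allele
  maf <- pmin(p.alt, 1 - p.alt)
  score <- if (criteria == "callrate") callrate else maf
  # Pearson correlations computed on pairwise complete observations.
  R <- cor(M, use = "pairwise.complete.obs")
  keep <- rep(TRUE, ncol(M))
  for (i in seq_len(ncol(M) - 1)) {
    for (j in (i + 1):ncol(M)) {
      # Skip pairs already pruned, undefined correlations and weak correlations.
      if (!keep[i] || !keep[j] || is.na(R[i, j]) || R[i, j] <= thr) next
      if (score[i] == score[j]) {
        drop <- sample(c(i, j), 1)             # tie: drop one at random
      } else {
        drop <- if (score[i] < score[j]) i else j
      }
      keep[drop] <- FALSE
    }
  }
  M[, keep, drop = FALSE]
}

# Toy data: m2 duplicates m1 except for one missing call.
M.toy <- cbind(
  m1 = c(0, 1, 2, 1, 0, 2),
  m2 = c(0, 1, 2, 1, 0, NA),
  m3 = c(2, 0, 1, 0, 2, 1)
)
pruned <- prune.block(M.toy, thr = 0.95, criteria = "callrate", seed = 1)
colnames(pruned)
```

In this toy case m1 and m2 are perfectly correlated on the individuals they share, and m2 has the lower call rate, so m2 is the marker removed.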
Filtering markers with qc.filtering before pruning is strongly recommended. Poor quality markers (e.g., monomorphic markers) may prevent correlations from being calculated and may affect which markers are eliminated.
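The problem with monomorphic markers can be seen in a minimal base R sketch: a marker with zero variance has an undefined Pearson's correlation with every other marker. The toy matrix and the manual filter below are illustrative assumptions only; in practice qc.filtering performs this kind of screening.

```r
# Toy data (hypothetical): one polymorphic and one monomorphic marker.
M.mono <- cbind(
  poly = c(0, 1, 2, 1, 0, 2),
  mono = c(2, 2, 2, 2, 2, 2)    # every individual has the same genotype
)

# cor() warns that the standard deviation is zero; the warning is silenced
# here only to keep the output short.
r <- suppressWarnings(cor(M.mono))
is.na(r["poly", "mono"])

# A simple manual screen: keep markers with more than one observed genotype.
polymorphic <- apply(M.mono, 2, function(x) length(unique(x[!is.na(x)])) > 1)
M.poly <- M.mono[, polymorphic, drop = FALSE]
colnames(M.poly)
```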
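# Load the package that provides qc.filtering(), snp.pruning() and geno.pine655.
library(ASRgenomics)
# Read and filter genotypic data.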
M.clean <- qc.filtering(
M = geno.pine655,
maf = 0.05,
marker.callrate = 0.20, ind.callrate = 0.20,
Fis = 1, heterozygosity = 0.98,
na.string = "-9",
plots = FALSE)$M.clean
# Prune correlations > 0.9.
Mpr <- snp.pruning(
M = M.clean, pruning.thr = 0.90,
by.chrom = FALSE, window.n = 40, overlap.n = 10)
head(Mpr$map)
Mpr$Mpruned[1:5, 1:5]