cleanclust: Clean, Impute, and Filter Markers

Description

Prepare marker data for use with amltest. This function removes markers with a high proportion of missing values, imputes the remaining missing values with the sample mean of each marker, removes markers with very little variation, and, if necessary, re-encodes each marker so that the minor allele is 1 and the major allele is 0.

Usage

cleanclust(marker, nafrac=0.2, mafb=0.1, corbnd=0.5, method="complete")

Arguments

marker

A matrix or data frame containing the marker information. The number of rows should equal the number of lines and the number of columns should equal the number of markers. Each element should take a value between 0 and 1, preferably with the minor allele encoded as 1 and the major allele as 0. If the minor allele is encoded as 0 instead for a marker, cleanclust replaces that column with 1 minus its original values. Each column must have a unique name to identify the marker.

nafrac

The maximum proportion of missing values allowed for a marker. Markers with a higher proportion of missing values will be removed. The default is 0.2.

mafb

The minimum minor allele frequency. Markers with a lower minor allele frequency will be removed. The default is 0.1.

corbnd

The bound used for cutting the dendrogram after the hierarchical clustering. The default is 0.5. See Details.

method

The clustering method passed to hclust. The value can be one of "complete", "average" or "single". The default is "complete".
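The cleaning steps controlled by nafrac and mafb can be sketched in base R as follows. This is an illustration only, not the package source: the toy matrix, the variable names, the order of the steps, and the use of the column mean as the frequency of the allele coded 1 are all assumptions made for the example.

```r
# Illustrative sketch only; not the cleanclust implementation.
# Hypothetical toy data: 10 lines (rows) by 4 markers (columns), coded 0/1.
toy <- cbind(
  m1 = c(1, 0, 0, 0, 1, 0, 0, 1, 0, NA),   # 10% missing
  m2 = c(NA, NA, NA, 1, 0, 0, 1, 0, 0, 0), # 30% missing
  m3 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),    # no variation
  m4 = c(1, 1, 1, 0, 1, 1, 1, 0, 1, 1)     # allele coded 1 is the majority
)
nafrac <- 0.2
mafb <- 0.1

# Step 1: drop markers whose proportion of missing values exceeds nafrac.
na_prop <- colMeans(is.na(toy))
kept <- toy[, na_prop <= nafrac, drop = FALSE]

# Step 2: replace each remaining missing value with that marker's sample mean.
col_mean <- colMeans(kept, na.rm = TRUE)
for (j in seq_len(ncol(kept))) {
  kept[is.na(kept[, j]), j] <- col_mean[j]
}

# Step 3: flip markers whose mean exceeds 0.5 (assumed to mean the allele
# coded 1 is the major allele), so that the minor allele is coded 1.
flip <- colnames(kept)[colMeans(kept) > 0.5]
kept[, flip] <- 1 - kept[, flip, drop = FALSE]

# Step 4: drop markers whose minor allele frequency is below mafb.
maf <- colMeans(kept)
cleaned <- kept[, maf >= mafb, drop = FALSE]

colnames(cleaned)  # markers that survive the cleaning
flip               # markers that were re-encoded
```

In this toy example m2 is dropped for missingness, m3 is dropped for lack of variation, and m4 is re-encoded before the frequency filter is applied.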

Value

newmarker: The new marker matrix after removing markers with a high proportion of missing values or low minor allele frequency, replacing missing values with sample means, and possibly removing some markers to avoid multiple highly correlated markers.
flip: A vector of names of the markers for which the minor and major alleles have been flipped. Other functions in this package require the minor allele to be encoded as 1 and the major allele as 0. If the opposite is the case for a marker, its values are flipped and its name is given in this vector.
tagged: A vector of integers indicating which columns (markers) from the original marker matrix are retained in newmarker.

Details

This is a simplified version of the Hclust method described in the paper "Characterization of Multilocus Linkage Disequilibrium" by Rinaldo et al. (2005), tailored for use with amltest and other functions in this package. The R code for the original Hclust package, which provides more functionality, can be found at http://www.epic.Pitt.ed/Accompaniment/hclust/hclust.ht.

The function cleanclust provides two main utilities. The first is to clean and impute the marker data: it removes markers with a high proportion of missing values or a very low minor allele frequency, and imputes the remaining missing values with the sample mean of each marker. The second is to remove some markers when necessary so that no two retained markers are highly correlated. As with other LASSO-type methods, the performance of the adaptive mixed LASSO can be improved when the predictors are not highly correlated. This process follows that of Rinaldo et al. (2005). The correlation between each pair of markers is calculated, and $r=1-cor^2$ is used as the distance between markers to perform hierarchical clustering with hclust. The resulting dendrogram is cut to form clusters according to the bound on $cor^2$, corbnd. Specifically, higher corbnd values will result in more clusters being formed and more markers in the output. One marker is retained for each cluster in newmarker.
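The clustering step can be sketched with base R functions as follows. This is a minimal illustration, not the package source. The toy matrix and variable names are hypothetical, and two details are assumptions that the text above does not confirm: that the bound on $cor^2$ translates to a cut height of 1 - corbnd on the distance scale, and that the first marker in each cluster is the one retained.

```r
# Illustrative sketch only; not the cleanclust implementation.
# Hypothetical toy data: 6 lines by 4 markers, already cleaned and imputed.
toy <- cbind(
  m1 = c(0, 0, 1, 1, 0, 1),
  m2 = c(0, 0, 1, 1, 0, 1),  # identical to m1, so cor^2 = 1
  m3 = c(1, 0, 0, 1, 1, 0),
  m4 = c(0, 1, 0, 0, 1, 1)
)
corbnd <- 0.5

# Distance between markers: 1 minus the squared pairwise correlation.
d <- as.dist(1 - cor(toy)^2)

# Hierarchical clustering of the markers.
hc <- hclust(d, method = "complete")

# Assumption: cut the dendrogram at height 1 - corbnd, so that markers in the
# same cluster have pairwise cor^2 of at least corbnd under complete linkage.
cluster_id <- cutree(hc, h = 1 - corbnd)

# Assumption: keep the first marker encountered in each cluster.
tagged <- which(!duplicated(cluster_id))
newmarker <- toy[, tagged, drop = FALSE]

cluster_id           # cluster membership of each marker
colnames(newmarker)  # one representative marker per cluster
```

Here m1 and m2 fall into the same cluster because they are perfectly correlated, so only one of them is kept, while m3 and m4 each form their own cluster.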

References

Rinaldo, A., Bacanu, S.-A., Devlin, B., Sonpar, V., Wasserman, L. and Roeder, K. (2005), Characterization of multilocus linkage disequilibrium. Genetic Epidemiology, 28: 193-206.

Wang, D., Eskridge, K.M. and Crossa, J. (2011) Identifying QTLs and Epistasis in Structured Plant Populations Using Adaptive Mixed LASSO. Journal of Agricultural, Biological, and Environmental Statistics, 16:170-184.

Wang, D., et al. (2012) Prediction of genetic values of quantitative traits with epistatic effects in plant breeding populations. Heredity, 109: 313-319.

Examples

     ## process the markers in the wheat data set.
     data("wheat")
     clmarker<- cleanclust(wheat$marker, nafrac=0.2, mafb=0.1, corbnd=0.5, method="complete")
