codeGeno: Recode genotypic data, imputation of missing values and preselection of markers

Description

This function combines all algorithms for processing of marker data within synbreed package. Raw marker data is a matrix with elements of arbitrary format (e.g. alleles coded as pair of observed alleles "A/T","G/C", ... , or by genotypes "AA", "BB", "AB"). The function is limited to biallelic markers with a maximum of 3 genotypes per locus. Raw data is recoded into the number of copies of a reference allele, i.e. 0, 1 and 2. Imputation of missing values can be done by random sampling from allele distribution, the Beagle software or family information (see details). Additional preselection of markers can be carried out according to the minor allele frequency and/or fraction of missing values.

Usage

codeGeno(gpData,impute=FALSE,
         impute.type=c("random","family","beagle","beagleAfterFamily","beagleNoRand",
                       "beagleAfterFamilyNoRand","fix"),
         replace.value=NULL, maf=NULL, nmiss=NULL, label.heter="AB",
         reference.allele="minor", keep.list=NULL, keep.identical=TRUE, verbose=FALSE,
         minFam=5, showBeagleOutput=FALSE, tester=NULL, print.report=FALSE, check=FALSE)

Arguments

gpData

object of class gpData with arbitrary coding in element geno. Missing values have to be coded as NA.

impute

logical. Should missing value be replaced by imputing?

impute.type

character with one out of "fix", "random" , "family", "beagle", "beagleAfterFamily" , "beagleAfterFamilyNoRand", "beagleAfterFamilyNoRand" (default = <

replace.value

numeric scalar to replace missing values in case impute.type="fix".

maf

numeric scalar. Threshold to discard markers due to the minor allele frequency (MAF). Markers with a MAF < maf are discarded, thus maf in [0,0.5]. If map in gpData is available, markers are

nmiss

numeric scalar. Markers with more than nmiss fraction of missing values are discarded, thus nmiss in [0,1]. If map in gpData is available, markers are also removed from map.

label.heter

This is either a scalar or vector of characters to identify heterozygous genotypes or a function returning TRUE if an element of the marker matrix is the heterozygous genotype. Defining a function is useful, if number of unique heterozygous g

reference.allele

Define the reference allele which is used for the coding. Default is "minor", i.e. data is coded by the number of copies of the minor allele. Alternatively, reference.allele can specify a single character defining the reference a

keep.list

A vector with the names of markers, which should be kept during the process of coding and filtering.

keep.identical

logical. Should duplicated markers be kept? NOTE: From a set of identical markers (with respect to the non-missing alleles) the one with the smallest number of missing values is kept. For those with an identical number of missing values, the

verbose

logical. If TRUE verbose output is generated during the steps of the algorithm. This is useful to obtain numbers of discarded markers due to different criteria.

minFam

For impute.type family and beagleAfterFamily, each family should have at least minFam members with available information for a marker to impute missing values according to the family. The default is 5.

showBeagleOutput

logical. Would you like to see the output of the Beagle software package? The default is FALSE.

tester

This option is in testing mode at the moment.

print.report

logical. Should a file SNPreport.txt be generated containing further information on SNPs. This includes SNP name, original coding of major and minor allele, MAF and number of imputed values.

check

This option has as default FALSE. If something seems to be wrong with the coding, with the option check=TRUE the function tries to catch the error.

Value

An object of class gpData containing the recoded marker matrix. If maf or nmiss were specified or keep.identical=FALSE, dimension of geno and map may be reduced due to selection of markers. The genotype which is homozygous for the minor allele is coded as 2, the other homozygous genotype is coded as 0 and heterozygous genotype is coded as 1.

Details

Coding of genotypic data is done in the following order (depending on choice of arguments; not all steps are performed): 1. Discarding markers with fraction > nmiss of missing values 2. Recoding alleles from character/factor/numeric into the number of copies of the minor alleles, i.e. 0, 1 and 2. In codeGeno, in the first step heterozygous genotypes are coded as 1. From the other genotypes, the less frequent genotype is coded as 2 and the remaining genotype as 0. Note that function codeGeno will terminate with an error whenever more than three genotypes are found. 2.1 Discarding duplicated markers if keep.identical=FALSE before starting of the imputing step. From identical marker based on pairwise complete oberservations one is discarded randomly. For getting identical results use the function set.seed() before code.geno(). 3. Replace missing values by replace.value or impute missing values according to one of the following methods: Imputing is done according to impute.type [object Object],[object Object],[object Object],[object Object],[object Object],[object Object] 4. Recoding of alleles after imputation, if necessary due to changes in allele frequencies caused by the imputed alleles 5. Discarding markers with a minor allele frequency of <= maf 6. Discarding duplicated markers if keep.identical=FALSE. From identical marker based on pairwise complete oberservations one is discarded randomly. For getting identical results use the function set.seed() before code.geno(). 7. Restoring original data format (gpData, matrix or data.frame) Information about imputing is reported after a call of codeGeno. Note: Beagle is included in the synbreed package. Once required, Beagle is called using path.package().

References

S R Browning and B L Browning (2007) Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. Am J Hum Genet 81:1084-1097 B L Browning and S R Browning (2013) Improving the accuracy and efficiency of identity by descent detection in population data. Genetics 194(2):459-471

Examples

Run this code

# create marker data for 9 SNPs and 10 homozygous individuals
snp9 <- matrix(c(
  "AA",   "AA",   "AA",   "BB",   "AA",   "AA",   "AA",   "AA",  NA,
  "AA",   "AA",   "BB",   "BB",   "AA",   "AA",   "BB",   "AA",  NA,
  "AA",   "AA",   "AB",   "BB",   "AB",   "AA",   "AA",   "BB",  NA,
  "AA",   "AA",   "BB",   "BB",   "AA",   "AA",   "AA",   "AA",  NA,
  "AA",   "AA",   "BB",   "AB",   "AA",   "BB",   "BB",   "BB",  "AB",
  "AA",   "AA",   "BB",   "BB",   "AA",   NA,     "BB",   "AA",  NA,
  "AB",   "AA",   "BB",   "BB",   "BB",   "AA",   "BB",   "BB",  NA,
  "AA",   "AA",    NA,    "BB",    NA,    "AA",   "AA",   "AA",  "AA",
  "AA",    NA,     NA,    "BB",   "BB",   "BB",   "BB",   "BB",  "AA",
  "AA",    NA,    "AA",   "BB",   "BB",   "BB",   "AA",   "AA",  NA),
  ncol=9,byrow=TRUE)

# set names for markers and individuals
colnames(snp9) <- paste("SNP",1:9,sep="")
rownames(snp9) <- paste("ID",1:10+100,sep="")

# create object of class 'gpData'
gp <- create.gpData(geno=snp9)

# code genotypic data
gp.coded <- codeGeno(gp,impute=TRUE,impute.type="random")

# comparison
gp.coded$geno
gp$geno

# example with heterogeneous stock mice
library(synbreedData)
data(mice)
summary(mice)
# heterozygous values must be labeled  (may run some seconds)
mice.coded <- codeGeno(mice,label.heter=function(x) substr(x,1,1)!=substr(x,3,3))


# example with maize data and imputing by family
data(maize)
# first only recode alleles
maize.coded <- codeGeno(maize,label.heter=NULL)

# set 200 random chosen values to NA
set.seed(123)
ind1 <- sample(1:nrow(maize.coded $geno),200)
ind2 <- sample(1:ncol(maize.coded $geno),200)
original <- maize.coded$geno[cbind(ind1,ind2)]

maize.coded$geno[cbind(ind1,ind2)] <- NA
# imputing of missing values by family structure
maize.imputed <- codeGeno( maize.coded,impute=TRUE,impute.type="family",label.heter=NULL)


# compare in a cross table
imputed <- maize.imputed$geno[cbind(ind1,ind2)]
(t1 <- table(original,imputed) )
# sum of correct replacements
sum(diag(t1))/sum(t1)

# compare with random imputation
maize.random <- codeGeno(maize.coded,impute=TRUE,impute.type="random",label.heter=NULL)
imputed2 <- maize.random$geno[cbind(ind1,ind2)]
(t2 <- table(original,imputed2) )
# sum of correct replacements
sum(diag(t2))/sum(t2)

Run the code above in your browser using DataLab