raw.data: Preparation of genomic data

Description

This function gets genomic data ready to be used in packages or softwares that perform genomic predictions

Usage

raw.data(data, frame = c("long","wide"), hapmap=NULL, base=TRUE, sweep.sample=1,
        call.rate=0.95, maf=0.05, imput=TRUE, imput.type=c("wright", "mean","knni"), 
        outfile=c("012","-101","structure"), plot = FALSE)

Arguments

data

object of class matrix

frame

Format of genomic data to be imputed. Two formats are currently supported. "long" is used for objects with sample ID (1st column), marker ID (2nd column), fist allele (3rd column) and second allele (4th column). While "wide" is for objects with dimension $n \times m$ where markers must be in columns and individuals in rows

hapmap

matrix. Object with information on SNPs, chromosome and position

base

logical. Are genotypes coded as nitrogenous bases? if TRUE, data are converted to numeric. If FALSE, it follows to clean up

sweep.sample

numeric. Threshold for removing samples from data by missing rate. Samples with missing rate above the defined threshold are removed from dataset

call.rate

numeric. Threshold for removing marker by missing genotype rate. SNP with call rate below threshold are removed from dataset. Must be between 0, 1

maf

Threshold for removing SNP by minor allele frequency. Must be between 0, 1

imput

Should imputation of missing data be performed?. Default is TRUE

imput.type

Type of imputation. It can be "wright", "mean" or "knni". See details

outfile

character. Type of output to be produced. "012" outputs matrix coded as 0 to AA, 1 to Aa and 2 to aa. "-101" presents marker matrix coded as -1, 0 and 1 to aa, Aa and AA, respectively. "structure" returns a matrix suitable for STRUCTURE Software. For this, each individual is splited in two rows, one for each allele. Nitrogenous bases are then recoded to a specific number, so A is 1, C is 2, G is 3 and T is 4. This format is only acceptable if base are TRUE

plot

If TRUE, a graphical output of quality control is produced.

Value

Returns a properly coded marker matrix output and a report specifying which individuals are removed by sweep.sample and which markers are removed by "call.rate" and maf. Also, a plot with proportion of removed markers and imputed data, for each chromosome, when the map is included, is produced when plot is TRUE

Details

The function allows flexible input of genomic data. Data might be in long format with 4 columns or in wide format where markers are in columns and individuals in rows. Both numeric and nitrogenous bases are accepted. Samples and markers can be eliminated based on missing data rate. Markers can also be eliminated based on the frequency of the minor allele. Three methods of imputation are currently implemented. One is carried out through combination of allele frequency and individual observed heterozygosity estimated from markers. $$ p(x_{ij}) = \left \{ \begin{array}{ll} 0 = (1 - p_j)^2 + p_j (1 - p_j) F_i \\ 1 = 2 p_j (1 - p_j) - 2 p_j (1 - p_j) F_i\\ 2 = p_j^2 + p_j (1 - p_j) F_i \end{array} \right. $$ Hence, for missing values, genotypes are imputed based on their probability of occurrence. This probability depends both on genotype frequency and inbreeding of the individual a specific locus. The second method is based on mean of SNP. Thus, each missing point in a SNP $j$ is replaced by mean of SNP $j$ $$x_{ij} = 2p_j$$ The "knni" imputes missing markers using the mean of the k-nearest markers. Nearest markers are found by computing the Euclidian distance between markers. If you use this option, please refer to the package impute (Hastie et al. 2017) in publications.

Examples

Run this code

# NOT RUN {
data(maize.line)
M <- as.matrix(maize.line)
mrc <- raw.data(M, frame="long", base=TRUE, sweep.sample= 0.8, 
         call.rate=0.95, maf=0.05, imput=FALSE, outfile="-101")

# }

Run the code above in your browser using DataLab