pcrelate: PC-Relate: Model-Free Estimation of Recent Genetic Relatedness

Description

pcrelate is used to estimate kinship coefficients, IBD sharing probabilities, and inbreeding coefficients using genome-wide SNP data. PC-Relate accounts for population structure (ancestry) among sample individuals through the use of ancestry representative principal components (PCs) to provide accurate relatedness estimates due only to recent family (pedigree) structure.

Usage

pcrelate(genoData, pcMat = NULL, freq.type = "individual", scale = "overall", ibd.probs = TRUE, scan.include = NULL, training.set = NULL, scan.block.size = 5000, 
			snp.include = NULL, chromosome = NULL, snp.block.size = 10000, 
			MAF = 0.01, write.to.gds = FALSE, gds.prefix = NULL, 
			correct = TRUE, verbose = TRUE)

Arguments

genoData

An object of class GenotypeData from the package GWASTools containing the genotype data for SNPs and samples to be used for the analysis. This object can easily be created from a matrix of SNP genotype data, PLINK files, or GDS files. Alternatively, this could be an object of class SeqVarData from the package SeqVarTools containing the genotype data for the sequencing variants and samples to be used for the analysis.

pcMat

An optional matrix of principal components (PCs) to be used for ancestry adjustment. Each column represents a PC, and each row represents an individual. IDs for each individual must be set as the row names of the matrix.

freq.type

A character string taking the values 'individual' or 'population' indicating whether genotype values should be adjusted by individual-specific allele frequencies or population average allele frequencies. This should be set to 'individual' (the default) in order to do a PC-Relate analysis; see 'Details' for more information.

scale

A character string taking the values 'overall', 'variant', or 'none' indicating how genotype values should be standardized. This should be set to 'overall' (the default) in order to do a PC-Relate analysis; see 'Details' for more information.

ibd.probs

Logical indicator of whether pairwise IBD sharing probabilities (k0, k1, k2) should be estimated; the default is TRUE.

scan.include

A vector of IDs for samples to include in the analysis. If NULL, all samples in genoData are included.

training.set

An optional vector of IDs identifying which samples to use for estimation of the ancestry effect when estimating individual-specific allele frequencies. If NULL, all samples in scan.include are used. See 'Details' for more information.

scan.block.size

The number of individuals to read-in/analyze at once; the default value is 5000. See 'Details' for more information.

snp.include

A vector of SNP IDs to include in the analysis. If NULL, see chromosome for further details.

chromosome

A vector of integers specifying which chromosomes to analyze. This parameter is only considred when snp.include is NULL; if chromosome is also NULL, then all SNPs are included.

snp.block.size

The number of SNPs to read-in/analyze at once. The default value is 10000.

MAF

Minor allele frequency filter. When freq.type is 'individual', if an individual's estimated individual-specific minor allele frequency at a SNP is less than this value, that SNP will be excluded from the analysis for that individual. When freq.type is 'population', any SNP with a population minor allele frequency less than this value will be excluded from the analysis. The default value is 0.01.

write.to.gds

Logical indicator of whether the output should be written to GDS files. If FALSE (the default), then the output is returned to the R console as expected. See 'Details' for more information.

gds.prefix

File path specifying where to save the output when write.to.gds = TRUE. If NULL, the prefix 'tmp' is used. See 'Details' for more information.

correct

Logical indicator of whether to implement a small sample correction.

verbose

Logical indicator of whether updates from the function should be printed to the console; the default is TRUE.

Value

sample.id: A vector of IDs for samples included in the analysis.
kinship: A matrix of estimated pairwise kinship coefficients. The order of samples matches sample.id.
ibd.probs: A matrix of estimated pairwise IBD sharing probabilities; the lower triangle gives k0 (the probability of sharing 0 alleles IBD), the upper triangle gives k2 (the probability of sharing 2 alleles IBD), and the diagonal is missing. The order of samples matches sample.id. This matrix is returned only if ibd.probs = TRUE in the input.
nsnp: A matrix specifying the the number of SNPs used to estimate the relatedness measures for each pair of individuals. The order of samples matches sample.id.
kincorrect: A vector specifying the correction factors used for the small sample correction, or NULL.
k2correct: A vector specifying the correction factors used for the small sample correction, or NULL.
call: The function call passed to pcrelate.
method: A character string. Either 'PC-Relate' or 'Unadjusted' identifying which method was used for computing relatedness estimates. 'Unadjusted' is used when pcMat = NULL and corresponds to an assumption of population homogeneity.

Details

The basic premise of PC-Relate is to estimate kinship coefficients, IBD sharing probabilities, and inbreeding coefficients that reflect recent family (pedigree) relatedness by conditioning out genetic similarity due to distant population structure (ancestry) with ancestry representative principal components (PCs).

It is important that the PCs used in pcMat to adjust for ancestry are representative of ancestry and NOT family structure, so we recommend using PCs calculated with PC-AiR.

It is important that the order of individuals in the matrix pcMat matches the order of individuals in genoData. In order to perform relatedness estimation, allele frequency estimates are required for centering and scaling genotype values. When freq.type is 'individual', individual-specific allele frequencies calculated for each individual at each SNP using the PCs specified in pcMat are used. When freq.type is 'population', population average allele frequencies calculated at each SNP are used for all individuals. (Note that when freq.type is set to 'population' there is no ancestry adjustment and the relatedness estimates will be confounded with population structure (ancestry)). There are muliple choices for how genotype values are scaled. When scale is 'variant', centered genotype values at each SNP are divided by their expected variance under Hardy-Weinberg equilibrium. When scale is 'overall', centered genotype values at all SNPs are divided by the average across all SNPs of their expected variances under Hardy-Weinberg equilibrium; this scaling leads to more stable behavior when using low frequency variants. When scale is 'none', genotype values are only centered and not scaled; this won't provide accurate kinship coefficient estimates but may be useful for other purposes. At a particular SNP, the variance used for scaling is either calculated separately for each individual using their individual-specific allele frequncies (when freq.type is 'individual') or once for all individuals using the population average allele frequency (when freq.type is 'population'). Set freq.type to 'individual' and scale to 'overall' to perform a standard PC-Relate analysis; these are the defaults. If freq.type is set to 'individual' and scale is set to 'variant', the estimators are very similar to REAP. If freq.type is set to 'population' and scale is set to 'variant', the estimators are very similar to EIGENSOFT.

The optional input training.set allows the user to specify which samples are used to estimate the ancestry effect when estimating individual-specific allele frequencies (if freq.type is 'individual') or to estimate the population allele frequency (if freq.type is 'population'. Ideally, training.set is a set of mutually unrelated individuals. If prior information regarding pedigree structure is available, this can be used to select training.set, or if pcair was used to obtain the PCs, then the individuals in the PC-AiR 'unrelated subset' can be used. If no prior information is available, all individuals should be used.

The scan.block.size can be specified to alleviate memory issues when working with very large data sets. If scan.block.size is smaller than the number of individuals included in the analysis, then individuals will be analyzed in separate blocks. This reduces the memory required for the analysis, but genotype data must be read in multiple times for each block (to analyze all pairs), which increases the number of computations required. NOTE: if individuals are broken up into more than 1 block, write.to.gds must be TRUE (see below).

If write.to.gds = TRUE, then the output is written to two GDS files rather than returned to the R console. Use of this option requires the gdsfmt package. The first GDS file, named ``_freq.gds'', contains the individual-specific allele frequency estimates for each individual at each SNP (when freq.type is 'individual') or the population allele frequency estimates at each SNP (when freq.type is 'population'. The second GDS file, named ``_pcrelate.gds'', contains the PC-Relate output as described in Value below.

References

Conomos M.P., Reiner A.P., Weir B.S., & Thornton T.A. (2016). Model-free Estimation of Recent Genetic Relatedness. American Journal of Human Genetics, 98(1), 127-148. Gogarten, S.M., Bhangale, T., Conomos, M.P., Laurie, C.A., McHugh, C.P., Painter, I., ... & Laurie, C.C. (2012). GWASTools: an R/Bioconductor package for quality control and analysis of Genome-Wide Association Studies. Bioinformatics, 28(24), 3329-3331.

Examples

Run this code

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# run PC-AiR
mypcair <- pcair(genoData = HapMap_genoData, kinMat = HapMap_ASW_MXL_KINGmat, 
                divMat = HapMap_ASW_MXL_KINGmat)
# run PC-Relate
mypcrel <- pcrelate(genoData = HapMap_genoData, pcMat = mypcair$vectors[,1],
				training.set = mypcair$unrels)
close(HapMap_genoData)

Run the code above in your browser using DataLab