duplicateDiscordanceAcrossDatasets: Duplicate discordance across datasets

Description

Finds the number of discordant genotypes by SNP in pairs of duplicate scans of the same subject across multiple datasets.

Usage

duplicateDiscordanceAcrossDatasets(genoData1, genoData2,
  match.snps.on=c("position", "alleles"),
  subjName.cols, snpName.cols=NULL,
  one.pair.per.subj=TRUE, minor.allele.only=FALSE,
  missing.fail=c(FALSE, FALSE),
  scan.exclude1=NULL, scan.exclude2=NULL,
  snp.exclude1=NULL, snp.exclude2=NULL,
  snp.include=NULL,
  verbose=TRUE)

minorAlleleDetectionAccuracy(genoData1, genoData2,
  match.snps.on=c("position", "alleles"),
  subjName.cols, snpName.cols=NULL,
  missing.fail=TRUE,
  scan.exclude1=NULL, scan.exclude2=NULL,
  snp.exclude1=NULL, snp.exclude2=NULL,
  snp.include=NULL,
  verbose=TRUE)

Arguments

genoData1

GenotypeData object containing the first dataset.

genoData2

GenotypeData object containing the second dataset.

match.snps.on

One or more of ("position", "alleles", "name") indicating how to match SNPs. "position" will match SNPs on chromosome and position, "alleles" will also require the same alleles (but A/B designations need not be the same), and "name" will match on the columns given in snpName.cols.

subjName.cols

2-element character vector indicating the names of the annotation variables that will be identical for duplicate scans in the two datasets. (Alternatively, one character value that will be recycled).

snpName.cols

2-element character vector indicating the names of the annotation variables that will be identical for the same SNPs in the two datasets. (Alternatively, one character value that will be recycled).

one.pair.per.subj

A logical indicating whether a single pair of scans should be randomly selected for each subject with more than 2 scans.

minor.allele.only

A logical indicating whether discordance should be calculated only between pairs of scans in which at least one scan has a genotype with the minor allele (i.e., exclude major allele homozygotes).

missing.fail

For duplicateDiscordanceAcrossDatasets, a 2-element logical vector indicating whether missing values in datasets 1 and 2, respectively, will be considered failures (discordances with called genotypes in the other dataset). For minorAlleleDetectionAccuracy, a single logical indicating whether missing values in dataset 2 will be considered false negatives (missing.fail=TRUE) or ignored (missing.fail=FALSE).

scan.exclude1

An integer vector containing the ids of scans to be excluded from the first dataset.

scan.exclude2

An integer vector containing the ids of scans to be excluded from the second dataset.

snp.exclude1

An integer vector containing the ids of snps to be excluded from the first dataset.

snp.exclude2

An integer vector containing the ids of snps to be excluded from the second dataset.

snp.include

List of SNPs to include in the comparison. Should match the contents of the columns referred to by snpName.cols. Only valid if match.snps.on includes "name".

verbose

Logical value specifying whether to show progress information.

Value

discordance.by.snp: data frame with 4 columns: 1. discordant (number of discordant pairs), 2. npair (number of pairs examined), 3. n.disc.subj (number of subjects with at least one discordance), 4. discord.rate (discordance rate i.e. discordant/npair). Row names are the common snp ID.
discordance.by.subject: a list of matrices (one for each subject) with the pair-wise discordance between the different genotyping instances of the subject
npair: number of sample pairs compared (non-missing in genoData1)
sensitivity: sensitivity
specificity: specificity
positivePredictiveValue: Positive predictive value
negativePredictiveValue: Negative predictive value
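
The columns of discordance.by.snp can be illustrated with a small base R sketch. The helper discordance_by_snp below is hypothetical and is not the package implementation; it assumes the genotypes of already-matched scan pairs are stored in two SNP-by-pair matrices, and it applies missing.fail as described under Arguments (a missing call counts as a discordance only against a called genotype in the other dataset).

```r
# Hypothetical helper, not part of the package: per-SNP discordance summary.
# g1, g2: matrices (SNPs x pairs); column j of g1 and of g2 form one scan pair.
# subj:   subject label for each pair (column).
discordance_by_snp <- function(g1, g2, subj, missing.fail = c(FALSE, FALSE)) {
  miss1 <- is.na(g1)
  miss2 <- is.na(g2)
  both  <- !miss1 & !miss2
  # Missing in one dataset, called in the other, and flagged as a failure.
  fail1 <- miss1 & !miss2 & missing.fail[1]
  fail2 <- miss2 & !miss1 & missing.fail[2]
  examined <- both | fail1 | fail2
  disc <- (both & g1 != g2) | fail1 | fail2
  # Number of distinct subjects with at least one discordant pair per SNP.
  n.disc.subj <- apply(disc, 1, function(d) length(unique(subj[d])))
  data.frame(discordant   = rowSums(disc),
             npair        = rowSums(examined),
             n.disc.subj  = n.disc.subj,
             discord.rate = rowSums(disc) / rowSums(examined))
}

g1 <- rbind(c(0, 1, 2), c(1, NA, 2))
g2 <- rbind(c(0, 2, 2), c(1, 1, NA))
subj <- c("A", "C", "C")

res_default <- discordance_by_snp(g1, g2, subj)
res_fail2   <- discordance_by_snp(g1, g2, subj, missing.fail = c(FALSE, TRUE))
```

In this sketch, a SNP with no examined pairs gets a discord.rate of NaN; how the package reports that case is not covered here.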

Details

duplicateDiscordanceAcrossDatasets calculates discordance metrics both by scan and by SNP. If one.pair.per.subj=TRUE (the default), each subject with more than two duplicate genotyping instances will have one scan from each dataset randomly selected for computing discordance. If one.pair.per.subj=FALSE, discordances will be calculated pair-wise for all possible cross-dataset pairs for each subject.
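
The effect of one.pair.per.subj can be sketched in base R. The helper cross_dataset_pairs is hypothetical and only illustrates the pairing logic described above; it assumes both scan annotations share a subjectID column, and it picks one cross-dataset pair at random per subject rather than reproducing the package's own selection code.

```r
# Hypothetical helper, not part of the package: list cross-dataset scan pairs.
cross_dataset_pairs <- function(scan1, scan2, one.pair.per.subj = TRUE) {
  # Every combination of a scan in dataset 1 and a scan in dataset 2
  # that share the same subjectID.
  scan_pairs <- merge(scan1, scan2, by = "subjectID", suffixes = c(".1", ".2"))
  if (one.pair.per.subj) {
    # Keep one randomly chosen row (pair) per subject.
    keep <- tapply(seq_len(nrow(scan_pairs)), scan_pairs$subjectID,
                   function(i) i[sample.int(length(i), 1)])
    scan_pairs <- scan_pairs[sort(as.vector(keep)), ]
  }
  scan_pairs
}

scan1 <- data.frame(scanID = 1:3, subjectID = c("A", "B", "C"),
                    stringsAsFactors = FALSE)
scan2 <- data.frame(scanID = 1:3, subjectID = c("A", "C", "C"),
                    stringsAsFactors = FALSE)

all_pairs <- cross_dataset_pairs(scan1, scan2, one.pair.per.subj = FALSE)
one_pair  <- cross_dataset_pairs(scan1, scan2, one.pair.per.subj = TRUE)
```

Subject "B" has no scan in the second dataset, so it contributes no pair in either mode; subject "C" contributes two pairs when one.pair.per.subj=FALSE and one otherwise.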

If minor.allele.only=TRUE, the allele frequency will be calculated in genoData1, using only samples common to both datasets. If snp.include=NULL (the default), discordances will be found for all SNPs common to both datasets.
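
As a rough illustration of minor.allele.only, the following base R sketch determines the minor allele from dataset 1 and then drops pairs in which both scans are major allele homozygotes. It assumes genotypes are coded as the number of "A" alleles (0, 1, 2), that the matrices already contain only matched pairs, and that there are no missing calls; the variable names are hypothetical.

```r
# Assumption: genotypes are counts of the "A" allele; column j of geno1 and
# geno2 is the same subject genotyped in dataset 1 and dataset 2.
geno1 <- rbind(snp1 = c(0, 0, 1, 0),
               snp2 = c(2, 2, 1, 2))
geno2 <- rbind(snp1 = c(0, 1, 1, 0),
               snp2 = c(2, 2, 2, 1))

# "A" allele frequency per SNP, computed in dataset 1 only.
freqA <- rowMeans(geno1, na.rm = TRUE) / 2
minor <- ifelse(freqA <= 0.5, "A", "B")

# Genotype code of the major allele homozygote for each SNP.
major_hom <- ifelse(minor == "A", 0, 2)

# Keep a pair unless both scans are major allele homozygotes
# (major_hom is recycled down the rows, so each SNP uses its own code).
keep <- !(geno1 == major_hom & geno2 == major_hom)
```

Ties at a frequency of exactly 0.5 are assigned to "A" here purely for the sake of the example.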

genoData1 and genoData2 should each have "alleleA" and "alleleB" defined in their SNP annotation. If allele coding cannot be found, the two datasets are assumed to have identical coding.
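
The following base R sketch shows one way to compare allele coding between two SNP annotations that have been matched on position. It is not the package's matching code: it assumes genotypes count the "A" allele, it ignores strand, and the "same"/"swapped"/"mismatch" labels are made up for the example.

```r
snp1 <- data.frame(position = c(101, 103, 105, 107),
                   alleleA = "A", alleleB = "G",
                   stringsAsFactors = FALSE)
snp2 <- data.frame(position = c(101, 103, 105, 107),
                   alleleA = c("A", "C", "G", "A"),
                   alleleB = c("G", "T", "A", "G"),
                   stringsAsFactors = FALSE)

# Match on position, keeping both sets of allele columns.
m <- merge(snp1, snp2, by = "position", suffixes = c(".1", ".2"))

same    <- m$alleleA.1 == m$alleleA.2 & m$alleleB.1 == m$alleleB.2
swapped <- m$alleleA.1 == m$alleleB.2 & m$alleleB.1 == m$alleleA.2
m$coding <- ifelse(same, "same", ifelse(swapped, "swapped", "mismatch"))

# Where A and B are swapped, a count of "A" alleles in dataset 2
# corresponds to 2 minus that count in dataset 1's coding.
g2 <- c(1, 2, 0, 1)
g2_recoded <- ifelse(swapped, 2 - g2, g2)
```

SNPs labelled "mismatch" would be dropped when match.snps.on includes "alleles"; the recoded value computed for them here is not meaningful.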

minorAlleleDetectionAccuracy summarizes the accuracy of minor allele detection in genoData2 with respect to genoData1 (the "gold standard"). TP=number of true positives, TN=number of true negatives, FP=number of false positives, and FN=number of false negatives. Accuracy is represented by four metrics:

sensitivity for each SNP as TP/(TP+FN)
specificity for each SNP as TN/(TN+FP)
positive predictive value for each SNP as TP/(TP+FP)
negative predictive value for each SNP as TN/(TN+FN).

TP, TN, FP, and FN are calculated as follows:

                            genoData1
                   mm           Mm           MM
            mm     2TP          1TP + 1FP    2FP
genoData2   Mm     1TP + 1FN    1TN + 1TP    1TN + 1FP
            MM     2FN          1FN + 1TN    2TN
            --     2FN          1FN

"M" is the major allele and "m" is the minor allele (as calculated in genoData1). "--" is a missing call in genoData2. Missing calls in genoData1 are ignored. If missing.fail=FALSE, missing calls in genoData2 (the last row of the table) are also ignored.

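The counting rules and the four metrics above can be sketched for a single SNP in base R. The helper minor_allele_accuracy is hypothetical and is not the package implementation; it assumes genotypes have already been converted to minor allele counts (mm = 2, Mm = 1, MM = 0), with NA for a missing call.

```r
# Hypothetical helper, not part of the package: allele-level accuracy for one SNP.
# m1, m2: minor allele counts (0, 1, 2, or NA) in genoData1 and genoData2,
#         one element per scan pair.
minor_allele_accuracy <- function(m1, m2, missing.fail = TRUE) {
  ok <- !is.na(m1)                 # missing calls in genoData1 are ignored
  m1 <- m1[ok]
  m2 <- m2[ok]
  miss2 <- is.na(m2)
  if (!missing.fail) {             # also ignore missing calls in genoData2
    m1 <- m1[!miss2]
    m2 <- m2[!miss2]
    miss2 <- miss2[!miss2]
  }
  called <- !miss2
  TP <- sum(pmin(m1, m2)[called])            # minor alleles found in both
  FP <- sum(pmax(m2 - m1, 0)[called])        # extra minor alleles in genoData2
  FN <- sum(pmax(m1 - m2, 0)[called]) +      # minor alleles not called ...
        sum(m1[miss2])                       # ... or lost to a missing call
  TN <- sum((2 - pmax(m1, m2))[called])      # major alleles found in both
  c(TP = TP, TN = TN, FP = FP, FN = FN,
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP),
    positivePredictiveValue = TP / (TP + FP),
    negativePredictiveValue = TN / (TN + FN))
}

m1 <- c(2, 1, 0, 1, NA, 2)
m2 <- c(2, 2, 0, NA, 1, 0)

acc_fail   <- minor_allele_accuracy(m1, m2, missing.fail = TRUE)
acc_ignore <- minor_allele_accuracy(m1, m2, missing.fail = FALSE)
```

Each genotype contributes two alleles, so every called pair adds exactly two counts across TP, TN, FP, and FN, as in the table.
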
Examples


library(GWASTools)

# first set
snp1 <- data.frame(snpID=1:10, chromosome=1L, position=101:110, 
                   rsID=paste("rs", 101:110, sep=""),
                   alleleA="A", alleleB="G", stringsAsFactors=FALSE)
scan1 <- data.frame(scanID=1:3, subjectID=c("A","B","C"), sex="F", stringsAsFactors=FALSE)
mgr <- MatrixGenotypeReader(genotype=matrix(c(0,1,2), ncol=3, nrow=10), snpID=snp1$snpID,
                            chromosome=snp1$chromosome, position=snp1$position, scanID=1:3)
genoData1 <- GenotypeData(mgr, snpAnnot=SnpAnnotationDataFrame(snp1),
                          scanAnnot=ScanAnnotationDataFrame(scan1))

# second set
snp2 <- data.frame(snpID=1:5, chromosome=1L, 
                   position=as.integer(c(101,103,105,107,107)), 
                   rsID=c("rs101", "rs103", "rs105", "rs107", "rsXXX"),
                   alleleA= c("A","C","G","A","A"),
                   alleleB=c("G","T","A","G","G"),
                   stringsAsFactors=FALSE)
scan2 <- data.frame(scanID=1:3, subjectID=c("A","C","C"), sex="F", stringsAsFactors=FALSE)
mgr <- MatrixGenotypeReader(genotype=matrix(c(1,2,0), ncol=3, nrow=5), snpID=snp2$snpID,
                            chromosome=snp2$chromosome, position=snp2$position, scanID=1:3)
genoData2 <- GenotypeData(mgr, snpAnnot=SnpAnnotationDataFrame(snp2),
                          scanAnnot=ScanAnnotationDataFrame(scan2))

duplicateDiscordanceAcrossDatasets(genoData1, genoData2, 
  match.snps.on="position",
  subjName.cols="subjectID")

duplicateDiscordanceAcrossDatasets(genoData1, genoData2, 
  match.snps.on=c("position", "alleles"),
  subjName.cols="subjectID")

duplicateDiscordanceAcrossDatasets(genoData1, genoData2, 
  match.snps.on=c("position", "alleles", "name"),
  subjName.cols="subjectID", 
  snpName.cols="rsID")

duplicateDiscordanceAcrossDatasets(genoData1, genoData2, 
  subjName.cols="subjectID", 
  one.pair.per.subj=FALSE)

minorAlleleDetectionAccuracy(genoData1, genoData2, 
  subjName.cols="subjectID")
