hapFabia: IBD segment extraction by FABIA

Description

hapFabia: R implementation of the hapFabia method.

hapFabia extracts short IBD segments tagged by rare variants from phased or unphased genotypes. hapFabia is designed for rare variants in very large sequencing data. The method is based on FABIA biclustering and utilizes the package fabia.

Usage

hapFabia(fileName,prefixPath="",sparseMatrixPostfix="_mat", annotPostfix="_annot.txt",individualsPostfix="_individuals.txt", labelsA=NULL,pRange="",individuals=0,lowerBP=0,upperBP=0.05, p=10,iter=40,quant=0.01,eps=1e-5,alpha=0.03,cyc=50,non_negative=1, write_file=0,norm=0,lap=100.0,IBDsegmentLength=50,Lt = 0.1, Zt = 0.2,thresCount=1e-5,mintagSNVsFactor=3/4,pMAF=0.03, haplotypes=FALSE,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3, simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)

Arguments

fileName

name of the file that contains the genotype matrix in sparse format.

prefixPath

path to the genotype file.

sparseMatrixPostfix

postfix string for the sparse matrix.

annotPostfix

postfix string for the SNV annotation file.

individualsPostfix

postfix string for the file containing the names of the individuals.

labelsA

annotation of the individuals as matrix individuals x 4 (individual ID, subpopulation, population, genotyping platform).

pRange

indicates which DNA interval is processed.

individuals

vector of individuals that are included into the analysis; default = 0 (all individuals).

lowerBP

lower bound on minor allele frequencies (MAF); however at least two occurrences are required to remove private SNVs.

upperBP

upper bound on minor allele frequencies (MAF) to extract rare variants.

number of biclusters per fabia iteration.

iter

number of fabia iterations.

quant

percentage of fabia loadings L that are removed after each iteration.

eps

lower bound on fabia variational parameter lapla; default 1e-5.

alpha

fabia sparseness of the loadings; default = 0.03.

cyc

number of cycles per fabia iteration; default 50.

non_negative

non-negative fabia factors and loadings if non_negative = 1; default = 1 (yes).

write_file

fabia results are written to files (L in sparse format), default = 0 (not written).

norm

fabia data normalization; default 1 (no normalization).

lap

minimal value of fabia's variational parameter; default 100.0.

IBDsegmentLength

expected typical IBD segment length in kbp.

percentage of largest fabia L values to consider for IBD segment extraction.

percentage of largest fabia Z values to consider for IBD segment extraction.

thresCount

p-value of random histogram hit; default 1e-5.

mintagSNVsFactor

percentage of the histogram tagSNVs count threshold (mintagSNVs in extractIBDsegments); used to define minimal overlap of individual intervals in an IBD segment; default 3/4.

pMAF

averaged and corrected (for non-uniform distributions) minor allele frequency.

haplotypes

haplotypes = TRUE indicates phased genotypes that is two chromosomes per individual otherwise unphased genotypes.

cut

cutoff for merging IBD segments after a hierarchical clustering; default 0.8.

procMinIndivids

Percentage of cluster individuals a tagSNV must tag to be considered as tagSNV for the IBD segment.

thresPrune

Threshold for pruning border tagSNVs based on an exponential distribution where border tagSNVs with large distances to the next tagSNV are pruned.

simv

Similarity measure for merging clusters: "minD" (percentage of smaller explained by larger set), "jaccard" (Jaccard index), "dice" (Dice index), or "maxD"; default "minD".

minTagSNVs

Minimum matching tagSNVs for cluster similarity; otherwise the similarity is set to zero.

minIndivid

Minimum matching individuals for cluster similarity; otherwise the similarity is set to zero.

avSNVsDist

average distance between SNVs in base pairs - used together with IBDsegmentLength to compute the number of SNVs in the histogram bins; default=100.

SNVclusterLength

if IBDsegmentLength=0 then the number of SNVs in the histogram bins can be given directly; default 100.

Value

mergedIBDsegmentList: an object of the class IBDsegmentList that contains the extracted IBD segments that were extracted from two histograms with different offset.
res: the result of FABIA.
sPF: individuals per loading of this FABIA result.
annot: annotation for the genotype data.
IBDsegmentList1: an object of the class IBDsegmentList that contains the result of IBD segment extraction from the first histogram.
IBDsegmentList2: an object of the class IBDsegmentList that contains the result of IBD segment extraction from the second histogram.
mergedIBDsegmentList1: an object of the class IBDsegmentList that contains the merged result of the first IBD segment extraction (redundancies removed).
mergedIBDsegmentList2: an object of the class IBDsegmentList that contains the merged result of the second IBD segment extraction (redundancies removed).

Details

This function uses a genotype matrix in sparse matrix format and extracts IBD segments by biclustering. First, it performs Fabia biclustering and then extracts local accumulations of loadings. Then it disentangles IBD segments and prunes off spurious correlated SNVs. Finally, it merges similar IBD segments to account for larger IBD segments that were broken during the analysis.

Annotation file ..._annot.txt for SNVs:

first line: number individuals;
second line: number SNVs;
for each SNV a line containing following field that are blank separated: "chromosome", "physical position", "snvNames", "snvMajor", "snvMinor", "quality", "pass", "info of vcf file", "fields in vcf file", "frequency", "0/1: 1 is changed if major allele is actually minor allele".

labelsA is a matrix ("number individuals" x 4), where for each individual following characteristics are given:

id;
subPopulation;
population;
platform.

The probability of observing $k$ or more correlated SNVs in a histogram bin is computed. The minimal $k$ which pushes the probability below the threshold thresCount is used to find accumulation of correlated SNVs that are assumed to belong to a IBD segment.

Let $p$ be the probability of a random minor allele match between $t$ individuals. The probability of observing $k$ or more matches for $n$ SNVs in a histogram bin is given by one minus the binomial distribution $F(k;n,p)$:

$$1 \ - \ F(k-1;n,p) \ = \ \Pr(K \geq k) \ = \ \sum_{i=k}^\infty \ {n\choose i} \ p^i \ (1-p)^{n-i}$$ If $q$ is the minor allele frequency (MAF) for one SNV, the probability $p$ of observing the minor allele of this SNV in all $t$ individuals is $p=power(q,t)$. The value $q$ is given by the parameter pMAF.

Implementation in R.

References

S. Hochreiter et al., ‘FABIA: Factor Analysis for Bicluster Acquisition’, Bioinformatics 26(12):1520-1527, 2010.

Examples

Run this code


old_dir <- getwd()
setwd(tempdir())

data(simu)

namesL <- simu[["namesL"]]
haploN <- simu[["haploN"]]
snvs <- simu[["snvs"]]
annot <- simu[["annot"]]
alleleIimp <- simu[["alleleIimp"]]
write.table(namesL,file="dataSim1fabia_individuals.txt",
   quote = FALSE,row.names = FALSE,col.names = FALSE)
write(as.integer(haploN),file="dataSim1fabia_annot.txt",
   ncolumns=100)
write(as.integer(snvs),file="dataSim1fabia_annot.txt",
   append=TRUE,ncolumns=100)
write.table(annot,file="dataSim1fabia_annot.txt",
   sep = " ",quote = FALSE,row.names = FALSE,
   col.names = FALSE,append=TRUE)
write(as.integer(haploN),file="dataSim1fabia_mat.txt",
   ncolumns=100)
write(as.integer(snvs),file="dataSim1fabia_mat.txt",
   append=TRUE,ncolumns=100)

for (i in 1:haploN) {

  a1 <- which(alleleIimp[i,]>0.01)

  al <- length(a1)
  b1 <- alleleIimp[i,a1]

  a1 <- a1 - 1
  dim(a1) <- c(1,al)
  b1 <- format(as.double(b1),nsmall=1)
  dim(b1) <- c(1,al)

  write.table(al,file="dataSim1fabia_mat.txt",
     sep = " ", quote = FALSE,row.names = FALSE,
     col.names = FALSE,append=TRUE)
  write.table(a1,file="dataSim1fabia_mat.txt",
     sep = " ", quote = FALSE,row.names = FALSE,
     col.names = FALSE,append=TRUE)
  write.table(b1,file="dataSim1fabia_mat.txt",
     sep = " ", quote = FALSE,row.names = FALSE,
     col.names = FALSE,append=TRUE)

}


hapRes <- hapFabia(fileName="dataSim1fabia",prefixPath="",
   sparseMatrixPostfix="_mat",
   annotPostfix="_annot.txt",individualsPostfix="_individuals.txt",
   labelsA=NULL,pRange="",individuals=0,lowerBP=0,upperBP=0.15,
   p=10,iter=1,quant=0.01,eps=1e-5,alpha=0.03,cyc=50,non_negative=1,
   write_file=0,norm=0,lap=100.0,IBDsegmentLength=10,Lt = 0.1,
   Zt = 0.2,thresCount=1e-5,mintagSNVsFactor=3/4,pMAF=0.1,
   haplotypes=FALSE,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3,
   simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)

summary(hapRes$mergedIBDsegmentList)

plot(hapRes$mergedIBDsegmentList[[1]],filename="dataSim1fabia_mat")


### Another Example
simulateIBDsegmentsFabia(fileprefix="dataSim",
   minruns=2,maxruns=2,snvs=1000,individualsN=100,
   avDistSnvs=100,avDistMinor=10,noImplanted=1,
   implanted=10,length=50,minors=30,mismatches=0,
   mismatchImplanted=0.5,overlap=50)

hapRes <- hapFabia(fileName="dataSim2fabia",prefixPath="",
   sparseMatrixPostfix="_mat",
   annotPostfix="_annot.txt",individualsPostfix="_individuals.txt",
   labelsA=NULL,pRange="",individuals=0,lowerBP=0,upperBP=0.15,
   p=10,iter=1,quant=0.01,eps=1e-5,alpha=0.03,cyc=50,non_negative=1,
   write_file=0,norm=0,lap=100.0,IBDsegmentLength=10,Lt = 0.1,
   Zt = 0.2,thresCount=1e-5,mintagSNVsFactor=3/4,pMAF=0.1,
   haplotypes=FALSE,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3,
   simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)

## Summary of the IBD segment list
summary(hapRes$mergedIBDsegmentList)

## Summary of the IBD segment 
summary(hapRes$mergedIBDsegmentList[[1]])


## Plot an IBD segment
plot(hapRes$mergedIBDsegmentList[[1]],filename="dataSim2fabia_mat")

## Not run: 
# ## It is interactive, thus dontrun!
# 
# ## Plot an IBD segment list
# plot(hapRes$mergedIBDsegmentList,filename="dataSim2fabia_mat")
# 
# ## End(Not run)

setwd(old_dir)

Run the code above in your browser using DataLab