hapFabia (version 1.14.0)

hapFabia: IBD segment extraction by FABIA

Description

hapFabia: R implementation of the hapFabia method.

hapFabia extracts short IBD segments tagged by rare variants from phased or unphased genotypes. hapFabia is designed for rare variants in very large sequencing data. The method is based on FABIA biclustering and utilizes the package fabia.

Usage

hapFabia(fileName,prefixPath="",sparseMatrixPostfix="_mat", annotPostfix="_annot.txt",individualsPostfix="_individuals.txt", labelsA=NULL,pRange="",individuals=0,lowerBP=0,upperBP=0.05, p=10,iter=40,quant=0.01,eps=1e-5,alpha=0.03,cyc=50,non_negative=1, write_file=0,norm=0,lap=100.0,IBDsegmentLength=50,Lt = 0.1, Zt = 0.2,thresCount=1e-5,mintagSNVsFactor=3/4,pMAF=0.03, haplotypes=FALSE,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3, simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)

Arguments

fileName
name of the file that contains the genotype matrix in sparse format.
prefixPath
path to the genotype file.
sparseMatrixPostfix
postfix string for the sparse matrix.
annotPostfix
postfix string for the SNV annotation file.
individualsPostfix
postfix string for the file containing the names of the individuals.
labelsA
annotation of the individuals as matrix individuals x 4 (individual ID, subpopulation, population, genotyping platform).
pRange
indicates which DNA interval is processed.
individuals
vector of individuals that are included into the analysis; default = 0 (all individuals).
lowerBP
lower bound on minor allele frequencies (MAF); however at least two occurrences are required to remove private SNVs.
upperBP
upper bound on minor allele frequencies (MAF) to extract rare variants.
p
number of biclusters per fabia iteration.
iter
number of fabia iterations.
quant
percentage of fabia loadings L that are removed after each iteration.
eps
lower bound on fabia variational parameter lapla; default 1e-5.
alpha
fabia sparseness of the loadings; default = 0.03.
cyc
number of cycles per fabia iteration; default 50.
non_negative
non-negative fabia factors and loadings if non_negative = 1; default = 1 (yes).
write_file
fabia results are written to files (L in sparse format), default = 0 (not written).
norm
fabia data normalization; default 1 (no normalization).
lap
minimal value of fabia's variational parameter; default 100.0.
IBDsegmentLength
expected typical IBD segment length in kbp.
Lt
percentage of largest fabia L values to consider for IBD segment extraction.
Zt
percentage of largest fabia Z values to consider for IBD segment extraction.
thresCount
p-value of random histogram hit; default 1e-5.
mintagSNVsFactor
percentage of the histogram tagSNVs count threshold (mintagSNVs in extractIBDsegments); used to define minimal overlap of individual intervals in an IBD segment; default 3/4.
pMAF
averaged and corrected (for non-uniform distributions) minor allele frequency.
haplotypes
haplotypes = TRUE indicates phased genotypes that is two chromosomes per individual otherwise unphased genotypes.
cut
cutoff for merging IBD segments after a hierarchical clustering; default 0.8.
procMinIndivids
Percentage of cluster individuals a tagSNV must tag to be considered as tagSNV for the IBD segment.
thresPrune
Threshold for pruning border tagSNVs based on an exponential distribution where border tagSNVs with large distances to the next tagSNV are pruned.
simv
Similarity measure for merging clusters: "minD" (percentage of smaller explained by larger set), "jaccard" (Jaccard index), "dice" (Dice index), or "maxD"; default "minD".
minTagSNVs
Minimum matching tagSNVs for cluster similarity; otherwise the similarity is set to zero.
minIndivid
Minimum matching individuals for cluster similarity; otherwise the similarity is set to zero.
avSNVsDist
average distance between SNVs in base pairs - used together with IBDsegmentLength to compute the number of SNVs in the histogram bins; default=100.
SNVclusterLength
if IBDsegmentLength=0 then the number of SNVs in the histogram bins can be given directly; default 100.

Value

List containing
mergedIBDsegmentList
an object of the class IBDsegmentList that contains the extracted IBD segments that were extracted from two histograms with different offset.
res
the result of FABIA.
sPF
individuals per loading of this FABIA result.
annot
annotation for the genotype data.
IBDsegmentList1
an object of the class IBDsegmentList that contains the result of IBD segment extraction from the first histogram.
IBDsegmentList2
an object of the class IBDsegmentList that contains the result of IBD segment extraction from the second histogram.
mergedIBDsegmentList1
an object of the class IBDsegmentList that contains the merged result of the first IBD segment extraction (redundancies removed).
mergedIBDsegmentList2
an object of the class IBDsegmentList that contains the merged result of the second IBD segment extraction (redundancies removed).

Details

This function uses a genotype matrix in sparse matrix format and extracts IBD segments by biclustering. First, it performs Fabia biclustering and then extracts local accumulations of loadings. Then it disentangles IBD segments and prunes off spurious correlated SNVs. Finally, it merges similar IBD segments to account for larger IBD segments that were broken during the analysis.

Annotation file ..._annot.txt for SNVs:

  1. first line: number individuals;
  2. second line: number SNVs;
  3. for each SNV a line containing following field that are blank separated: "chromosome", "physical position", "snvNames", "snvMajor", "snvMinor", "quality", "pass", "info of vcf file", "fields in vcf file", "frequency", "0/1: 1 is changed if major allele is actually minor allele".

labelsA is a matrix ("number individuals" x 4), where for each individual following characteristics are given:

  1. id;
  2. subPopulation;
  3. population;
  4. platform.

The probability of observing $k$ or more correlated SNVs in a histogram bin is computed. The minimal $k$ which pushes the probability below the threshold thresCount is used to find accumulation of correlated SNVs that are assumed to belong to a IBD segment.

Let $p$ be the probability of a random minor allele match between $t$ individuals. The probability of observing $k$ or more matches for $n$ SNVs in a histogram bin is given by one minus the binomial distribution $F(k;n,p)$:

$$1 \ - \ F(k-1;n,p) \ = \ \Pr(K \geq k) \ = \ \sum_{i=k}^\infty \ {n\choose i} \ p^i \ (1-p)^{n-i}$$ If $q$ is the minor allele frequency (MAF) for one SNV, the probability $p$ of observing the minor allele of this SNV in all $t$ individuals is $p=power(q,t)$. The value $q$ is given by the parameter pMAF.

Implementation in R.

References

S. Hochreiter et al., ‘FABIA: Factor Analysis for Bicluster Acquisition’, Bioinformatics 26(12):1520-1527, 2010.

See Also

IBDsegment-class, IBDsegmentList-class, analyzeIBDsegments, compareIBDsegmentLists, extractIBDsegments, findDenseRegions, hapFabia, hapFabiaVersion, hapRes, chr1ASW1000G, IBDsegmentList2excel, identifyDuplicates, iterateIntervals, makePipelineFile, matrixPlot, mergeIBDsegmentLists, mergedIBDsegmentList, plotIBDsegment, res, setAnnotation, setStatistics, sim, simu, simulateIBDsegmentsFabia, simulateIBDsegments, split_sparse_matrix, toolsFactorizationClass, vcftoFABIA

Examples

Run this code

old_dir <- getwd()
setwd(tempdir())

data(simu)

namesL <- simu[["namesL"]]
haploN <- simu[["haploN"]]
snvs <- simu[["snvs"]]
annot <- simu[["annot"]]
alleleIimp <- simu[["alleleIimp"]]
write.table(namesL,file="dataSim1fabia_individuals.txt",
   quote = FALSE,row.names = FALSE,col.names = FALSE)
write(as.integer(haploN),file="dataSim1fabia_annot.txt",
   ncolumns=100)
write(as.integer(snvs),file="dataSim1fabia_annot.txt",
   append=TRUE,ncolumns=100)
write.table(annot,file="dataSim1fabia_annot.txt",
   sep = " ",quote = FALSE,row.names = FALSE,
   col.names = FALSE,append=TRUE)
write(as.integer(haploN),file="dataSim1fabia_mat.txt",
   ncolumns=100)
write(as.integer(snvs),file="dataSim1fabia_mat.txt",
   append=TRUE,ncolumns=100)

for (i in 1:haploN) {

  a1 <- which(alleleIimp[i,]>0.01)

  al <- length(a1)
  b1 <- alleleIimp[i,a1]

  a1 <- a1 - 1
  dim(a1) <- c(1,al)
  b1 <- format(as.double(b1),nsmall=1)
  dim(b1) <- c(1,al)

  write.table(al,file="dataSim1fabia_mat.txt",
     sep = " ", quote = FALSE,row.names = FALSE,
     col.names = FALSE,append=TRUE)
  write.table(a1,file="dataSim1fabia_mat.txt",
     sep = " ", quote = FALSE,row.names = FALSE,
     col.names = FALSE,append=TRUE)
  write.table(b1,file="dataSim1fabia_mat.txt",
     sep = " ", quote = FALSE,row.names = FALSE,
     col.names = FALSE,append=TRUE)

}


hapRes <- hapFabia(fileName="dataSim1fabia",prefixPath="",
   sparseMatrixPostfix="_mat",
   annotPostfix="_annot.txt",individualsPostfix="_individuals.txt",
   labelsA=NULL,pRange="",individuals=0,lowerBP=0,upperBP=0.15,
   p=10,iter=1,quant=0.01,eps=1e-5,alpha=0.03,cyc=50,non_negative=1,
   write_file=0,norm=0,lap=100.0,IBDsegmentLength=10,Lt = 0.1,
   Zt = 0.2,thresCount=1e-5,mintagSNVsFactor=3/4,pMAF=0.1,
   haplotypes=FALSE,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3,
   simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)

summary(hapRes$mergedIBDsegmentList)

plot(hapRes$mergedIBDsegmentList[[1]],filename="dataSim1fabia_mat")


### Another Example
simulateIBDsegmentsFabia(fileprefix="dataSim",
   minruns=2,maxruns=2,snvs=1000,individualsN=100,
   avDistSnvs=100,avDistMinor=10,noImplanted=1,
   implanted=10,length=50,minors=30,mismatches=0,
   mismatchImplanted=0.5,overlap=50)

hapRes <- hapFabia(fileName="dataSim2fabia",prefixPath="",
   sparseMatrixPostfix="_mat",
   annotPostfix="_annot.txt",individualsPostfix="_individuals.txt",
   labelsA=NULL,pRange="",individuals=0,lowerBP=0,upperBP=0.15,
   p=10,iter=1,quant=0.01,eps=1e-5,alpha=0.03,cyc=50,non_negative=1,
   write_file=0,norm=0,lap=100.0,IBDsegmentLength=10,Lt = 0.1,
   Zt = 0.2,thresCount=1e-5,mintagSNVsFactor=3/4,pMAF=0.1,
   haplotypes=FALSE,cut=0.8,procMinIndivids=0.1,thresPrune=1e-3,
   simv="minD",minTagSNVs=6,minIndivid=2,avSNVsDist=100,SNVclusterLength=100)

## Summary of the IBD segment list
summary(hapRes$mergedIBDsegmentList)

## Summary of the IBD segment 
summary(hapRes$mergedIBDsegmentList[[1]])


## Plot an IBD segment
plot(hapRes$mergedIBDsegmentList[[1]],filename="dataSim2fabia_mat")

## Not run: 
# ## It is interactive, thus dontrun!
# 
# ## Plot an IBD segment list
# plot(hapRes$mergedIBDsegmentList,filename="dataSim2fabia_mat")
# 
# ## End(Not run)

setwd(old_dir)


Run the code above in your browser using DataLab