biomvRseg: Homogeneous segmentation of multi-sample genomic data

Description

Performs a two-stage segmentation of multi-sample genomic data from array experiments or high-throughput sequencing.

Usage

biomvRseg(x, maxk=NULL, maxbp=NULL, maxseg=NULL, xPos=NULL, xRange=NULL, usePos='start', family='norm', penalty='BIC', twoStep=TRUE, segDisp=FALSE, useMC=FALSE, useSum=TRUE, comVar=TRUE, maxgap=Inf, tol=1e-06, grp=NULL, cluster.m=NULL, avg.m='median', trim=0, na.rm=TRUE)

Arguments

x

input data matrix, or a GRanges object with the input stored in the meta DataFrame

maxk

maximum length of a segment

maxbp

maximum length of a segment in bp, given the positional information specified in xPos, xRange, or x

maxseg

maximum number of segments the function will try

xPos

a vector of positions for each x row

xRange

an IRanges/GRanges object, with the same length as the number of rows of x

usePos

character value to indicate whether the 'start', 'end' or 'mid' point position should be used

family

family of x distribution, only the following types are supported: 'norm', 'nbinom', 'pois'

penalty

penalty method applied to the likelihood to determine the optimal number of segments; possible values are 'none', 'AIC', 'AICc', 'BIC', 'SIC', 'HQIC', 'mBIC'

twoStep

TRUE if a second stage merging will be performed after the initial group segmentation

segDisp

TRUE if the dispersion parameter should be estimated segment-wise rather than from an overall estimate

useMC

TRUE if mclapply should be used to speed up the calculation for nbinom dispersion estimation

useSum

TRUE if the grand sum across samples (x columns) should be used, as in the tilingArray solution

comVar

TRUE if assuming common variance across samples (x columns)

maxgap

maximum distance between neighbouring features before a split is considered

tol

tolerance level of the likelihood change used to determine the termination of the EM run

grp

vector of group assignments for each sample, with the same length as the number of columns in the data matrix. Samples within each group are processed simultaneously if a multivariate emission distribution is available

cluster.m

clustering method for prior grouping, possible values are 'ward','single','complete','average','mcquitty','median','centroid'

avg.m

method used to calculate the average value of each segment: 'median' or 'mean' (possibly trimmed)

trim

the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.

na.rm

TRUE if NA values should be ignored
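To illustrate what a penalty such as 'BIC' does, the base R sketch below scores three candidate segmentations of a toy Gaussian signal. It is a minimal sketch under stated assumptions: the helpers `seg_loglik` and `bic`, the candidate breakpoints and the way parameters are counted are hypothetical and are not taken from the biomvRseg source, which may define its criteria differently.

```r
# Toy signal: two levels with a change at index 51.
set.seed(1)
y <- c(rnorm(50, mean = 0), rnorm(50, mean = 3))

# Gaussian log-likelihood of a piecewise-constant fit.
# `starts` holds the first index of every segment (hypothetical helper).
seg_loglik <- function(y, starts) {
  seg_id <- findInterval(seq_along(y), starts)  # segment label of each point
  fitted <- ave(y, seg_id)                      # per-segment means
  sigma2 <- mean((y - fitted)^2)                # common variance (MLE)
  sum(dnorm(y, mean = fitted, sd = sqrt(sigma2), log = TRUE))
}

# Assumed BIC form: -2 * logLik + (number of parameters) * log(n).
# Counted here: one mean per segment, one variance, k - 1 breakpoints.
bic <- function(y, starts) {
  k <- length(starts)
  n_par <- k + 1 + (k - 1)
  -2 * seg_loglik(y, starts) + n_par * log(length(y))
}

candidates <- list(one = 1, two = c(1, 51), four = c(1, 26, 51, 76))
scores <- vapply(candidates, bic, numeric(1), y = y)
best <- names(which.min(scores))
```

The likelihood alone never gets worse when segments are added, so the penalty term is what stops the model from splitting noise. Swapping `log(length(y))` for `2` would give an AIC-style score under the same assumptions.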

Value

x: Object of class "GRanges", with range information either from real positional data or just indices, and with the input data matrix stored in the meta columns.
res: Object of class "GRanges"; each range represents one continuous segment identified, with the sample name 'SAMPLE' and the segment mean 'MEAN' stored in the meta columns.
param: Object of class "list", listing all parameters used in the model run.

Details

A homogeneous segmentation algorithm that uses dynamic programming, as in tilingArray, but is also capable of handling count data from sequencing.
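The dynamic programming idea can be sketched in base R for the simplest case: one sample, Gaussian data and a least-squares cost. This is an illustration only; `dp_segment` is a hypothetical helper, not the code used by biomvRseg or tilingArray, and it ignores grouping, count families and positional gaps. Its `max_k` argument is a stand-in for a cap on segment length counted in data points.

```r
# Split y into exactly n_seg segments, each at most max_k points long,
# minimising the within-segment residual sum of squares.
dp_segment <- function(y, n_seg, max_k = length(y)) {
  n <- length(y)
  stopifnot(n_seg >= 1, n_seg <= n)
  s1 <- c(0, cumsum(y))      # prefix sums
  s2 <- c(0, cumsum(y^2))    # prefix sums of squares
  # Residual sum of squares of y[i:j] around its mean (vectorised over i).
  rss <- function(i, j) {
    m <- j - i + 1
    (s2[j + 1] - s2[i]) - (s1[j + 1] - s1[i])^2 / m
  }
  cost <- matrix(Inf, n_seg, n)          # cost[k, j]: best cost of y[1:j] in k segments
  back <- matrix(NA_integer_, n_seg, n)  # back[k, j]: start of the k-th segment
  for (j in seq_len(min(n, max_k))) cost[1, j] <- rss(1, j)
  back[1, ] <- 1L
  for (k in seq_len(n_seg)[-1]) {
    for (j in k:n) {
      cand  <- max(k, j - max_k + 1):j   # candidate starts of segment k
      total <- cost[k - 1, cand - 1] + rss(cand, j)
      best  <- which.min(total)
      cost[k, j] <- total[best]
      back[k, j] <- cand[best]
    }
  }
  if (!is.finite(cost[n_seg, n])) stop("no segmentation satisfies max_k")
  # Trace the segment starts back from the last point.
  starts <- integer(n_seg)
  j <- n
  for (k in n_seg:1) {
    starts[k] <- back[k, j]
    j <- starts[k] - 1L
  }
  starts
}

y <- c(rep(0, 10), rep(5, 10), rep(1, 10))
seg_starts <- dp_segment(y, n_seg = 3)
```

On this noise-free toy vector the recovered segment starts are 1, 11 and 21. Running the recursion for each number of segments up to a limit and comparing the penalised fits is one way to choose how many segments to keep.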

References

Piegorsch, W. W. (1990). Maximum likelihood estimation for the negative binomial dispersion parameter. Biometrics, 863-867.

Picard, F. et al. (2005). A statistical approach for array CGH data analysis. BMC Bioinformatics, 6, 27.

Huber, W. et al. (2006). Transcript mapping with high density oligonucleotide tiling arrays. Bioinformatics, 22, 1963-1970.

Zhang, N. R. and Siegmund, D. O. (2007). A Modified Bayes Information Criterion with Applications to the Analysis of Comparative Genomic Hybridization Data. Biometrics 63 22-32.

Robinson, M. D. and Smyth, G. K. (2008). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9, 321-332.

Examples

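	library(biomvRCNS)
	data(coriell)
	# build ranges from the name, chromosome and position columns (1 to 3)
	xgr <- GRanges(seqnames=paste('chr', coriell[,2], sep=''), IRanges(start=coriell[,3], width=1, names=coriell[,1]))
	# store the two data columns (4 and 5) as meta columns
	values(xgr) <- DataFrame(coriell[,4:5], row.names=NULL)
	xgr <- xgr[order(xgr)]
	# grp=c(1,2) assigns the two samples to separate groups
	resseg <- biomvRseg(x=xgr, maxbp=4E4, maxseg=10, family='norm', grp=c(1,2))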
