pooldata.subset: Create a subset of the pooldata object that contains Pool-Seq data as a function of pool and/or SNP indexes

Description

Create a subset of the pooldata object that contains Pool-Seq data as a function of pool and/or SNP indexes

Usage

pooldata.subset(
  pooldata,
  pool.index = 1:pooldata@npools,
  snp.index = 1:pooldata@nsnp,
  min.maf = -1,
  min.cov = 0,
  max.cov = 1e+09,
  cov.qthres = c(0, 1),
  min.cov.per.pool = -1,
  max.cov.per.pool = 1e+06,
  cov.qthres.per.pool = c(0, 1),
  return.snp.idx = FALSE,
  verbose = TRUE
)

Value

A pooldata object with 7 elements:

"refallele.readcount": a matrix with nsnp rows and npools columns containing read counts for the reference allele (chosen arbitrarily) in each pool
"readcoverage": a matrix with nsnp rows and npools columns containing read coverage in each pool
"snp.info": a matrix with nsnp rows and four columns containing respectively the contig (or chromosome) name (1st column) and position (2nd column) of the SNP; the allele in the reference assembly (3rd column); the allele taken as reference in the refallele matrix.readcount matrix (4th column); and the alternative allele (5th column)
"poolsizes": a vector of length npools containing the haploid pool sizes
"poolnames": a vector of length npools containing the names of the pools
"nsnp": a scalar corresponding to the number of SNPs
"npools": a scalar corresponding to the number of pools

Arguments

pooldata: A pooldata object containing Pool-Seq information
pool.index: Indexes of the pools (at least two), that should be selected to create the new pooldata object (default=all the pools)
snp.index: Indexes of the SNPs (at least two), that should be selected to create the new pooldata object (default=all the SNPs)
min.maf: Minimal allowed Minor Allele Frequency (computed from the ratio over all read counts for the reference allele over the read coverage)
min.cov: Minimal allowed read count (over all the pools).
max.cov: Maximal allowed read count (over all the pools).
cov.qthres: A two-elements vector containing the minimal (qmin) and maximal (qmax) quantile thresholds (0<=qmin<qmax<=1) for the overall coverage (i.e., summing over all pools). See details below
min.cov.per.pool: Minimal allowed read count (per pool). If at least one pool is not covered by at least min.cov.perpool reads, the position is discarded
max.cov.per.pool: Maximal allowed read count (per pool). If at least one pool is covered by more than min.cov.perpool reads, the position is discarded
cov.qthres.per.pool: A two-elements vector containing the minimal (qmin) and maximal (qmax) quantile coverage thresholds applied to each pools (0<=qmin<qmax<=1). See details below
return.snp.idx: If TRUE, the row.names of the snp.info slot of the returned pooldata object are named as "rsx" where x is the index of SNP in the initial pooldata object (default=FALSE)
verbose: If TRUE return some information

Details

This function subsets a pooldata object by selecting a subset of pools and/or SNPs (e.g., based on genomic position).

Additional SNP-level filtering can be applied to the resulting subset to remove poorly polymorphic SNPs using min.maf, or SNPs with low or excessively high coverage using min.cov, max.cov, and cov.qthres.

Coverage filtering can also be performed on a per-pool basis with min.cov.per.pool, max.cov.per.pool, and cov.qthres.per.pool.

For the cov.qthres and cov.qthres.per.pool arguments, empirical coverage quantiles are computed and used as filtering thresholds. SNPs with coverage outside the specified quantile range are discarded.

For example, if qmax = 0.95 in cov.qthres.per.pool, a SNP is removed in a given pool if its coverage exceeds the 95th percentile of the empirical coverage distribution for that pool (computed over SNPs selected by snp.index). Conversely, if qmin = 0.05, SNPs with coverage below the 5th percentile are discarded.

Quantile-based filtering is particularly useful when pools display heterogeneous sequencing depth.

Examples

Run this code

 make.example.files(writing.dir=tempdir())
 pooldata=popsync2pooldata(sync.file=paste0(tempdir(),"/ex.sync.gz"),poolsizes=rep(50,15))
 subset.by.pools=pooldata.subset(pooldata,pool.index=c(1,2))
 subset.by.snps=pooldata.subset(pooldata,snp.index=10:100)
 subset.by.pools.and.snps=pooldata.subset(pooldata,pool.index=c(1,2),snp.index=10:100)
 subset.by.pools.qcov.thr=pooldata.subset(pooldata,pool.index=1:8,cov.qthres.per.pool=c(0.05,0.95))