readVCF: Read SNP data in tabixed VCF format

Description

This function reads tabixed (tabix-indexed) VCF files, such as those distributed by the 1000 Genomes Project (human).

Usage

readVCF(filename, numcols, tid, frompos, topos,
        samplenames=NA, gffpath = FALSE, include.unknown=FALSE, approx=FALSE,
        out="", parallel=FALSE)

Arguments

filename

path to the tabixed VCF file

numcols

number of SNPs that should be read in as a chunk

tid

chromosome identifier, given as a character string (e.g. "1")

frompos

start of the region

topos

end of the region

samplenames

a vector with the names of the individuals to be considered in the analysis

gffpath

path to the corresponding GFF file

include.unknown

include positions with unknown/missing nucleotides

approx

if TRUE, apply a logical OR to the GT field (an approximation for diploid data); see Details

out

a folder suffix where the temporary files should be saved

parallel

parallel computation using mclapply

Value

The function creates an object of class "GENOME".

The following slots will be filled in the "GENOME" object:

Slot                  Description
1. n.sites            total number of sites
2. n.biallelic.sites  number of biallelic sites
3. region.data        some detailed information about the data read
4. region.names       names of regions

Details

The readVCF function expects a tabixed VCF file with a diploid GT field. For haploid data, the GT field has to be transformed into a pseudo-diploid field (such as 0 -> 0|0). An alternative is to use readData(..., format="VCF"), which can read non-tabixed haploid and any kind of polyploid VCFs directly.

When approx=TRUE, the algorithm applies a logical OR to the GT field: (0|0=0, 1|0=1, 0|1=1, 1|1=1). Note that this is an approximation for diploid data, which speeds up calculations. For haploid data, approx should be set to TRUE. If approx=FALSE, the full diploid information is considered.

The ff package, which PopGenome uses to store the SNP information, limits the total data size to individuals * (number of SNPs) <= .Machine$integer.max. For very large data sets, the bigmemory package will be used; this will slow down calculations (the bigmemory package has to be installed first!).

Use vcf_handle <- .Call("VCF_open", filename) to open a VCF file and .Call("VCF_getSampleNames", vcf_handle) to get the sample names and define the individuals that should be considered in the analysis. See also readData(..., format="VCF").
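The GT handling described above can be illustrated with a minimal base-R sketch. The helper names to_pseudo_diploid and collapse_gt_or are hypothetical and not part of PopGenome; they only mimic, on plain character vectors, the pseudo-diploid transformation and the logical OR applied when approx=TRUE. This is not how readVCF is implemented internally.

```r
# Illustrative helpers only (hypothetical names, not PopGenome functions).

# Turn a haploid GT call such as "0" into a pseudo-diploid call "0|0".
# Calls that already contain a separator ("|" or "/") are left unchanged.
to_pseudo_diploid <- function(gt) {
  haploid <- !grepl("[|/]", gt)
  gt[haploid] <- paste(gt[haploid], gt[haploid], sep = "|")
  gt
}

# Collapse a diploid GT call with a logical OR, as described for approx=TRUE:
# 0|0 -> 0, 1|0 -> 1, 0|1 -> 1, 1|1 -> 1.
# Missing calls (".") are not treated specially in this sketch.
collapse_gt_or <- function(gt) {
  alleles <- strsplit(gt, "[|/]")
  vapply(alleles, function(a) as.integer(any(a == "1")), integer(1))
}

haploid_calls <- c("0", "1", "0")
print(to_pseudo_diploid(haploid_calls))   # "0|0" "1|1" "0|0"

diploid_calls <- c("0|0", "1|0", "0|1", "1|1")
print(collapse_gt_or(diploid_calls))      # 0 1 1 1
```

The collapsed values record only whether an individual carries the alternative allele at all, which is why the text calls this an approximation for diploid data.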

Examples

# read positions 1 to 100000 of chromosome "1", 1000 SNPs per chunk:
# GENOME.class <- readVCF(".../chr1.vcf.gz", 1000, "1", 1, 100000)
# GENOME.class
# GENOME.class@region.names
# GENOME.class <- neutrality.stats(GENOME.class, FAST=TRUE)
# show the result:
# get.sum.data(GENOME.class)
# GENOME.class@region.data
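
The size limit mentioned in Details, individuals * (number of SNPs) <= .Machine$integer.max, can be checked before reading a region. The following is a small base-R sketch; fits_ff_limit is a hypothetical helper, not a PopGenome function, and the numbers passed to it are purely illustrative.

```r
# Hypothetical helper (not part of PopGenome): check whether
# individuals * (number of SNPs) stays within .Machine$integer.max.
fits_ff_limit <- function(n_individuals, n_snps) {
  # Convert to double so the product itself cannot overflow the integer range.
  as.numeric(n_individuals) * as.numeric(n_snps) <= .Machine$integer.max
}

print(fits_ff_limit(1000, 1e6))   # TRUE:  1e9 is below the limit
print(fits_ff_limit(3000, 1e6))   # FALSE: 3e9 exceeds the limit
```

If the check returns FALSE, the Details section indicates that the bigmemory package will be used, at the cost of slower calculations.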
