VCFloci: Information From VCF Files

Description

These functions help to extract information from VCF files and to select which loci to read with read.vcf.

Usage

VCFloci(file, what = "all", chunck.size = 1e9, quiet = FALSE)
# S3 method for VCFinfo
print(x, …)
VCFheader(file)
VCFlabels(file)
# S3 method for VCFinfo
is.snp(x)
rangePOS(x, from, to)
selectQUAL(x, threshold = 20)
getINFO(x, what = "DP", as.is = FALSE)

Arguments

file

file name of the VCF file.

what

a character specifying the information to be extracted (see details).

chunck.size

the size of data in bytes read at once.

quiet

a logical: if TRUE, the progress of the operation is not printed.

x

an object of class "VCFinfo".

from, to

integer values giving the range of position values.

threshold

a numerical value indicating the minimum value of quality for selecting loci.

as.is

a logical. By default, getINFO tries to convert its output to numeric; if too many NAs are produced, the output is returned as character. Use as.is = TRUE to force the output to be in character mode.

…

further arguments passed to or from other methods.

Value

VCFloci returns an object of class "VCFinfo" which is a data frame with a specific print method.

VCFheader returns a single character string which can be printed nicely with cat.

VCFlabels returns a vector of mode character.

is.snp returns a vector of mode logical.

rangePOS and selectQUAL return a vector of mode numeric.

getINFO returns a vector of mode character or numeric (see above).

Details

The variant call format (VCF) is described in detail in the References. Roughly, a VCF file is made of two parts: the header and the genotypes. The last line of the header gives the column labels: the first nine columns give information for each locus and are (always) "CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", and "FORMAT". The subsequent columns give the labels (identifiers) of the individuals; these may be missing if the file records only the variants. Note that the data are transposed relative to the usual arrangement: individuals are in columns and loci are in rows.
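As a rough illustration of this layout, the base-R sketch below splits a made-up label line into the nine fixed fields and the individual labels. The sample names ind1 and ind2 are hypothetical, and the code only shows the structure of the line; it is not how the package parses files.

```r
# A made-up label line, i.e. the last line of a VCF header (tab-separated).
label_line <- paste(c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                      "FILTER", "INFO", "FORMAT", "ind1", "ind2"),
                    collapse = "\t")

fields <- strsplit(label_line, "\t", fixed = TRUE)[[1]]

# The first nine fields describe each locus (the leading '#' is dropped) ...
locus_fields <- sub("^#", "", fields[1:9])
# ... and any remaining fields are the labels of the individuals.
individuals <- fields[-(1:9)]

print(locus_fields)
print(individuals)
```

If the file records only the variants, the label line stops after "FORMAT" and individuals is an empty character vector.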

VCFloci is the main function documented here: it reads the information relative to each locus. The option what specifies which column(s) to read. By default, all of them are read. If only the locus positions are of interest, use what = "POS".

Since VCF files can be very large, the data are read in portions of chunk.size bytes. The default (1e9 bytes, i.e., 1 GB) should be appropriate in most situations. This value should not exceed 2e9.
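The idea of reading a file in fixed-size portions can be sketched in base R as follows. This is only an illustration of chunked reading with readBin, not the package's own code; the temporary file, its content, and the small chunk size are made up for the demonstration.

```r
# Write a small throwaway text file (made-up content, 1000 lines).
tmp <- tempfile()
writeLines(sprintf("line %d", 1:1000), tmp)

chunk_size <- 4096L   # bytes per read; deliberately tiny for the demonstration
n_newlines <- 0L

con <- file(tmp, open = "rb")
repeat {
  chunk <- readBin(con, what = "raw", n = chunk_size)
  if (length(chunk) == 0L) break   # end of file reached
  # Count the line feeds (byte 0x0a) found in this chunk.
  n_newlines <- n_newlines + sum(chunk == as.raw(10L))
}
close(con)
unlink(tmp)

print(n_newlines)
```

Only one chunk is held in memory at a time, which is why the chunk size bounds memory use regardless of the file size.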

VCFheader returns the header of the VCF file (excluding the line of labels). VCFlabels returns the individual labels.

The output of VCFloci is a data frame with as many rows as there are loci in the VCF file and storing the requested information. The other functions help to extract specific information from this data frame: their outputs may then be used to select which loci to read with read.vcf.
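A minimal sketch of this workflow is given below, assuming the pegas package is installed. The VCF content is made up (five loci, two individuals named ind1 and ind2) and written to a temporary file; the position range and quality threshold are arbitrary. The helper tab() is hypothetical and only builds tab-separated lines.

```r
library(pegas)

# Hypothetical helper: join fields into one tab-separated line.
tab <- function(...) paste(c(...), collapse = "\t")

# A tiny, made-up VCF file: five loci, two individuals.
vcf_lines <- c(
  "##fileformat=VCFv4.1",
  "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Total Depth\">",
  "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
  tab("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
      "FORMAT", "ind1", "ind2"),
  tab("1", "100", "v1", "A",  "G", "50", "PASS", "DP=14", "GT", "0|1", "1|1"),
  tab("1", "200", "v2", "AT", "A", "10", "PASS", "DP=11", "GT", "0|0", "0|1"),
  tab("1", "300", "v3", "C",  "T", "99", "PASS", "DP=20", "GT", "1|1", "0|1"),
  tab("1", "400", "v4", "G",  "A", "45", "PASS", "DP=17", "GT", "0|1", "0|0"),
  tab("1", "500", "v5", "T",  "C", "12", "PASS", "DP=9",  "GT", "0|0", "1|1")
)
tmp <- tempfile(fileext = ".vcf")
writeLines(vcf_lines, tmp)

# Read the per-locus information (all nine columns by default).
info <- VCFloci(tmp, quiet = TRUE)

# Build index vectors from the three selection helpers ...
snp   <- which(is.snp(info))               # is.snp returns a logical vector
inpos <- rangePOS(info, 150, 450)          # row numbers within the range
goodq <- selectQUAL(info, threshold = 30)  # row numbers passing the threshold

# ... and keep only the loci satisfying all three conditions.
keep <- Reduce(intersect, list(snp, inpos, goodq))

# Read the genotypes of the selected loci only.
geno <- read.vcf(tmp, which.loci = keep, quiet = TRUE)
print(geno)

unlink(tmp)
```

Combining the index vectors with intersect before calling read.vcf means that only the genotypes of interest are read, which matters for large files.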

is.snp tests whether each locus is a SNP (i.e., both the reference allele, REF, and the alternative allele, ALT, are a single character). It returns a logical vector with as many values as there are loci. Note that some VCF files have the information VT (variant type) in the INFO column.
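The rule stated above can be written in a few lines of base R. The sketch below applies it to made-up REF and ALT values; it illustrates the criterion only and is not necessarily how is.snp is implemented.

```r
# Made-up REF and ALT alleles for four loci.
loci <- data.frame(REF = c("A", "AT", "C", "G"),
                   ALT = c("G", "A", "T", "GTC"),
                   stringsAsFactors = FALSE)

# A locus counts as a SNP here when both alleles are a single character.
snp <- nchar(loci$REF) == 1 & nchar(loci$ALT) == 1
print(snp)
```

The second and fourth loci (a deletion and an insertion) are the ones flagged FALSE.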

rangePOS and selectQUAL select some loci with respect to values of position or quality. They return the indices (i.e., row numbers) of the loci satisfying the conditions.
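Returning row numbers rather than values makes the results easy to combine. The base-R sketch below mimics this on made-up positions and qualities; treating the range bounds and the threshold as inclusive is an assumption of the sketch, not a statement about the package's code.

```r
# Made-up positions and qualities for five loci.
pos  <- c(100, 200, 300, 400, 500)
qual <- c(50, 10, 99, 45, 12)

# Row numbers of the loci within a position range (bounds assumed inclusive).
in_range <- which(pos >= 150 & pos <= 450)

# Row numbers of the loci reaching a minimum quality.
good_qual <- which(qual >= 30)

# Loci satisfying both conditions.
both <- intersect(in_range, good_qual)
print(both)
```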

getINFO extracts a specific piece of information from the INFO column. By default, this is the total depth (DP); another field can be selected with the option what. The meaning of this information should be described in the header of the VCF file.
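To make the structure of the INFO column concrete, the base-R sketch below pulls one key out of made-up INFO strings, which are semicolon-separated key=value pairs. The function extract_key is a hypothetical helper written for this illustration; it is not part of the package and may differ from what getINFO does internally.

```r
# Made-up INFO strings; the last one has no DP entry.
info_col <- c("DP=14;AF=0.5;VT=SNP",
              "DP=11;AF=0.25;VT=INDEL",
              "AF=0.1;VT=SNP;DP=20",
              "AF=0.3")

# Hypothetical helper: value of 'key' in each INFO string, NA if absent.
extract_key <- function(info, key) {
  # The key must start the string or directly follow a semicolon.
  pattern <- paste0("^(?:.*;)?", key, "=([^;]*).*$")
  out <- rep(NA_character_, length(info))
  hit <- grepl(pattern, info, perl = TRUE)
  out[hit] <- sub(pattern, "\\1", info[hit], perl = TRUE)
  out
}

dp <- as.numeric(extract_key(info_col, "DP"))  # numeric field
vt <- extract_key(info_col, "VT")              # character field

print(dp)
print(vt)
```

Fields such as VT cannot be converted to numbers, which is the situation where character output (as with as.is = TRUE) is the sensible choice.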

References

http://www.1000genomes.org/node/101

https://github.com/samtools/hts-specs

Examples

# NOT RUN {
## see ?read.vcf
# }
